Status: New
Owner: ----

New issue 781 by [email protected]: Stalled hbal after long running replace-disks
http://code.google.com/p/ganeti/issues/detail?id=781

# gnt-cluster --version
gnt-cluster (ganeti v2.9.3) 2.9.3

# gnt-cluster version
Software version: 2.9.3
Internode protocol: 2090000
Configuration format: 2090000
OS api version: 20
Export interface: 0
VCS version: v2.9.3

# hspace --version
hspace (ganeti) version v2.9.3
compiled with ghc 7.4
running on linux x86_64

What distribution are you using? Debian Wheezy

What steps will reproduce the problem?
1. hbal -L -X (that leads to replace-disks)
2. gnt-cluster verify (might not be needed/relevant)
3. gnt-cluster verify (might not be needed/relevant)

What is the expected output? What do you see instead?
hbal gets stalled after some long-running replace-disks command. hbal should be able to proceed and queue further jobs.

Please provide any additional information below.
relevant gnt-job list output:
148259 success INSTANCE_MIGRATE(vm1.gr)
148260 success INSTANCE_REPLACE_DISKS(vm2),INSTANCE_MIGRATE(vm2)
148261 success INSTANCE_MIGRATE(vm3.gr)
148264 success INSTANCE_REPLACE_DISKS(vm4),INSTANCE_MIGRATE(vm4)
148265 success INSTANCE_MIGRATE(vm6)
148268 success INSTANCE_MIGRATE(vm5.gr),INSTANCE_REPLACE_DISKS(vm5.gr)
148270 success CLUSTER_VERIFY
148271 success CLUSTER_VERIFY_CONFIG
148272 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148273 success CLUSTER_VERIFY
148274 success CLUSTER_VERIFY_CONFIG
148275 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148277 success INSTANCE_REPLACE_DISKS(vm7),INSTANCE_MIGRATE(vm7)
148285 success CLUSTER_VERIFY
148286 success CLUSTER_VERIFY_CONFIG
148287 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148300 success CLUSTER_VERIFY
148302 success CLUSTER_VERIFY_CONFIG
148303 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)

148268
<snip>
    OP_INSTANCE_MIGRATE
      Status: success
      Processing start: 2014-03-27 12:21:28.209916
      Execution start:  2014-03-27 12:21:49.471006
      Processing end:   2014-03-27 12:22:25.294181
<snip>
    OP_INSTANCE_REPLACE_DISKS
      Status: success
      Processing start: 2014-03-27 12:22:25.464814
      Execution start:  2014-03-27 12:22:25.593059
      Processing end:   2014-03-27 12:31:05.106991

148272
    OP_CLUSTER_VERIFY_GROUP
      Status: success
      Processing start: 2014-03-27 12:25:05.371858
      Execution start:  2014-03-27 12:31:06.099407
      Processing end:   2014-03-27 12:31:13.275953

148275
    OP_CLUSTER_VERIFY_GROUP
      Status: success
      Processing start: 2014-03-27 12:30:05.412300
      Execution start:  2014-03-27 12:31:06.081994
      Processing end:   2014-03-27 12:31:13.283875

148277 was created by hbal and has finished properly:

<snip>
    OP_INSTANCE_REPLACE_DISKS
      Status: success
      Processing start: 2014-03-27 12:31:14.552945
      Execution start:  2014-03-27 12:31:29.330166
      Processing end:   2014-03-27 12:44:14.963871
<snip>
    OP_INSTANCE_MIGRATE
      Status: success
      Processing start: 2014-03-27 12:44:15.152519
      Execution start:  2014-03-27 12:44:15.434923
      Processing end:   2014-03-27 12:44:45.168525

148285
    OP_CLUSTER_VERIFY
      Status: success
      Processing start: 2014-03-27 13:00:04.963330
      Execution start:  2014-03-27 13:00:05.100860
      Processing end:   2014-03-27 13:00:06.092148

148287
    OP_CLUSTER_VERIFY_GROUP
      Status: success
      Processing start: 2014-03-27 13:00:06.090845
      Execution start:  2014-03-27 13:00:06.943139
      Processing end:   2014-03-27 13:00:16.951912

in hbal output I can see:
Cluster score improved from 7.24502595 to 1.73841313
Solution length=12
Executing jobset for instances vm1.gr
Got job IDs148259
Executing jobset for instances vm2,vm3.gr
Got job IDs148260,148261
Executing jobset for instances vm4,vm6
Got job IDs148264,148265
Executing jobset for instances vm5.gr
Got job IDs148268
Executing jobset for instances vm7
Got job IDs148277
(after waiting >45 minutes I press Ctrl+C)
^CCancel request registered, will exit at the end of the current job set...
^CCancel request registered, will exit at the end of the current job set...
^CCancel request registered, will exit at the end of the current job set...
(hbal needs to be killed here)

It's been >45 minutes since hbal sent the last job (148277) to the queue.
Maybe hbal hit some timeout and doesn't move on?
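For context on where such a stall can occur: hbal's waiting behaviour boils down to "block until every job in the current set reaches a final status". The sketch below is a generic, self-contained illustration of that pattern, not Ganeti's actual Luxi code; all names here (JobStatus, waitForJobs, the query action) are hypothetical. A variant that blocks on a single server response with no timeout or re-query is exactly the kind of loop that can hang if that response is lost.

```haskell
-- Illustration only: an hbal-style "wait for the job set" poll loop.
-- None of these names are Ganeti's real Luxi API.
import Control.Concurrent (threadDelay)
import Data.IORef

data JobStatus = Queued | Running | Success | Errored
  deriving (Eq, Show)

isFinal :: JobStatus -> Bool
isFinal s = s == Success || s == Errored

-- Re-query the statuses on every iteration until all jobs are final.
-- Because each round issues a fresh query, a single lost response
-- cannot stall the client indefinitely.
waitForJobs :: IO [JobStatus] -> IO [JobStatus]
waitForJobs query = do
  sts <- query
  if all isFinal sts
    then return sts
    else threadDelay 100000 >> waitForJobs query

main :: IO ()
main = do
  ref <- newIORef [Running, Running]
  let query = do
        sts <- readIORef ref
        -- simulate the jobs finishing one poll later
        writeIORef ref (map (const Success) sts)
        return sts
  final <- waitForJobs query
  print final
```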

Oh, and please add a space after "Got job IDs":
Got job IDs148264,148265 -> Got job IDs 148264,148265

--- src/Ganeti/HTools/Program/Hbal.hs   2014-03-27 13:18:35.000000000 +0200
+++ src/Ganeti/HTools/Program/Hbal.hs   2014-03-27 13:18:50.000000000 +0200
@@ -201,7 +201,7 @@
   let jobs = map (\(_, idx, move, _) ->
                     map anno $ Cluster.iMoveToJob nl il idx move) js
       descr = map (\(_, idx, _, _) -> Container.nameOf il idx) js
-      logfn = putStrLn . ("Got job IDs" ++) . commaJoin . map (show . fromJobId)
+      logfn = putStrLn . ("Got job IDs " ++) . commaJoin . map (show . fromJobId)
   putStrLn $ "Executing jobset for instances " ++ commaJoin descr
   jrs <- bracket (L.getClient master) L.closeClient $
          Jobs.execJobsWait jobs logfn
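The one-character fix above can be checked standalone. A minimal sketch, using plain Ints in place of Ganeti's JobId type and re-implementing commaJoin via intercalate (so this compiles without any Ganeti modules):

```haskell
-- Standalone demo of the corrected log line; Ints stand in for JobId.
import Data.List (intercalate)

commaJoin :: [String] -> String
commaJoin = intercalate ","

-- The corrected formatter, with the space after "IDs".
logLine :: [Int] -> String
logLine = ("Got job IDs " ++) . commaJoin . map show

main :: IO ()
main = putStrLn (logLine [148264, 148265])
-- prints: Got job IDs 148264,148265
```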


