Status: New
Owner: ----
New issue 781 by [email protected]: Stalled hbal after long running replace-disks
http://code.google.com/p/ganeti/issues/detail?id=781
# gnt-cluster --version
gnt-cluster (ganeti v2.9.3) 2.9.3
# gnt-cluster version
Software version: 2.9.3
Internode protocol: 2090000
Configuration format: 2090000
OS api version: 20
Export interface: 0
VCS version: v2.9.3
# hspace --version
hspace (ganeti) version v2.9.3
compiled with ghc 7.4
running on linux x86_64
What distribution are you using? Debian Wheezy
What steps will reproduce the problem?
1. hbal -L -X (that leads to replace-disks)
2. gnt-cluster verify (might not be needed/relevant)
3. gnt-cluster verify (might not be needed/relevant)
What is the expected output? What do you see instead?
hbal gets stalled after a long-running replace-disks command. hbal
should be able to proceed and re-queue jobs.
Please provide any additional information below.
Relevant gnt-job list output:
148259 success INSTANCE_MIGRATE(vm1.gr)
148260 success INSTANCE_REPLACE_DISKS(vm2),INSTANCE_MIGRATE(vm2)
148261 success INSTANCE_MIGRATE(vm3.gr)
148264 success INSTANCE_REPLACE_DISKS(vm4),INSTANCE_MIGRATE(vm4)
148265 success INSTANCE_MIGRATE(vm6)
148268 success INSTANCE_MIGRATE(vm5.gr),INSTANCE_REPLACE_DISKS(vm5.gr)
148270 success CLUSTER_VERIFY
148271 success CLUSTER_VERIFY_CONFIG
148272 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148273 success CLUSTER_VERIFY
148274 success CLUSTER_VERIFY_CONFIG
148275 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148277 success INSTANCE_REPLACE_DISKS(vm7),INSTANCE_MIGRATE(vm7)
148285 success CLUSTER_VERIFY
148286 success CLUSTER_VERIFY_CONFIG
148287 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148300 success CLUSTER_VERIFY
148302 success CLUSTER_VERIFY_CONFIG
148303 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
148268
<snip>
OP_INSTANCE_MIGRATE
Status: success
Processing start: 2014-03-27 12:21:28.209916
Execution start: 2014-03-27 12:21:49.471006
Processing end: 2014-03-27 12:22:25.294181
<snip>
OP_INSTANCE_REPLACE_DISKS
Status: success
Processing start: 2014-03-27 12:22:25.464814
Execution start: 2014-03-27 12:22:25.593059
Processing end: 2014-03-27 12:31:05.106991
148272
OP_CLUSTER_VERIFY_GROUP
Status: success
Processing start: 2014-03-27 12:25:05.371858
Execution start: 2014-03-27 12:31:06.099407
Processing end: 2014-03-27 12:31:13.275953
148275
OP_CLUSTER_VERIFY_GROUP
Status: success
Processing start: 2014-03-27 12:30:05.412300
Execution start: 2014-03-27 12:31:06.081994
Processing end: 2014-03-27 12:31:13.283875
148277 was created by hbal and has finished properly:
<snip>
OP_INSTANCE_REPLACE_DISKS
Status: success
Processing start: 2014-03-27 12:31:14.552945
Execution start: 2014-03-27 12:31:29.330166
Processing end: 2014-03-27 12:44:14.963871
<snip>
OP_INSTANCE_MIGRATE
Status: success
Processing start: 2014-03-27 12:44:15.152519
Execution start: 2014-03-27 12:44:15.434923
Processing end: 2014-03-27 12:44:45.168525
148285
OP_CLUSTER_VERIFY
Status: success
Processing start: 2014-03-27 13:00:04.963330
Execution start: 2014-03-27 13:00:05.100860
Processing end: 2014-03-27 13:00:06.092148
148287
OP_CLUSTER_VERIFY_GROUP
Status: success
Processing start: 2014-03-27 13:00:06.090845
Execution start: 2014-03-27 13:00:06.943139
Processing end: 2014-03-27 13:00:16.951912
In the hbal output I can see:
Cluster score improved from 7.24502595 to 1.73841313
Solution length=12
Executing jobset for instances vm1.gr
Got job IDs148259
Executing jobset for instances vm2,vm3.gr
Got job IDs148260,148261
Executing jobset for instances vm4,vm6
Got job IDs148264,148265
Executing jobset for instances vm5.gr
Got job IDs148268
Executing jobset for instances vm7
Got job IDs148277
(after waiting >45 minutes I press Ctrl+C)
^CCancel request registered, will exit at the end of the current job set...
^CCancel request registered, will exit at the end of the current job set...
^CCancel request registered, will exit at the end of the current job set...
(hbal needs to be killed here)
It's been >45 minutes since hbal sent the last job (148277) to the queue.
Maybe hbal hit some timeout and doesn't move on?
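For context, hbal blocks in Jobs.execJobsWait until the submitted jobset finishes, so if that wait never returns, hbal hangs exactly as seen above. A minimal, self-contained sketch of a poll-with-retry loop that would avoid an indefinite stall (hypothetical names and states; this is not the actual Ganeti Luxi API) might look like:

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (newIORef, readIORef, writeIORef)

-- Hypothetical job states; the real client queries these over Luxi.
data JobStatus = Running | Success | Failed
  deriving (Eq, Show)

-- Poll a status query until the job reaches a final state, giving up
-- (and returning the last observed status) after maxPolls attempts
-- instead of blocking forever on a single wait call.
waitForJob :: IO JobStatus -> Int -> IO JobStatus
waitForJob query maxPolls = go maxPolls
  where
    go n = do
      st <- query
      if st /= Running || n <= 0
        then return st
        else threadDelay 10000 >> go (n - 1)  -- 10 ms between polls

main :: IO ()
main = do
  -- Simulate a job that needs three polls before it succeeds.
  ref <- newIORef (0 :: Int)
  let query = do
        n <- readIORef ref
        writeIORef ref (n + 1)
        return (if n >= 3 then Success else Running)
  st <- waitForJob query 10
  print st  -- prints "Success"
```

Whether the real fix belongs in execJobsWait or in the Luxi layer is for the developers to judge; the sketch only illustrates that a bounded wait with re-query would let hbal notice that job 148277 already finished.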
Oh, and please add a space after "Got job IDs":
Got job IDs148264,148265 -> Got job IDs 148264,148265
--- src/Ganeti/HTools/Program/Hbal.hs	2014-03-27 13:18:35.000000000 +0200
+++ src/Ganeti/HTools/Program/Hbal.hs	2014-03-27 13:18:50.000000000 +0200
@@ -201,7 +201,7 @@
   let jobs = map (\(_, idx, move, _) ->
                     map anno $ Cluster.iMoveToJob nl il idx move) js
       descr = map (\(_, idx, _, _) -> Container.nameOf il idx) js
-      logfn = putStrLn . ("Got job IDs" ++) . commaJoin . map (show . fromJobId)
+      logfn = putStrLn . ("Got job IDs " ++) . commaJoin . map (show . fromJobId)
   putStrLn $ "Executing jobset for instances " ++ commaJoin descr
   jrs <- bracket (L.getClient master) L.closeClient $
          Jobs.execJobsWait jobs logfn
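For reference, the changed logfn is plain function composition, so the one-character patch only affects the printed prefix. A standalone equivalent (using Data.List.intercalate as a stand-in for Ganeti's commaJoin, and plain Ints as a stand-in for job IDs) behaves like:

```haskell
import Data.List (intercalate)

-- Stand-in for Ganeti's commaJoin helper.
commaJoin :: [String] -> String
commaJoin = intercalate ","

-- Patched log line: note the trailing space inside the prefix.
logLine :: [Int] -> String
logLine = ("Got job IDs " ++) . commaJoin . map show

main :: IO ()
main = putStrLn (logLine [148264, 148265])  -- prints "Got job IDs 148264,148265"
```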