Comment #1 on issue 781 by [email protected]: Stalled hbal after long running replace-disks
https://code.google.com/p/ganeti/issues/detail?id=781

Hi, I ran into the same issue. Here are some details for context:

# gnt-cluster --version
gnt-cluster (ganeti v2.9.3) 2.9.3

# gnt-cluster version
Software version: 2.9.3
Internode protocol: 2090000
Configuration format: 2090000
OS api version: 20
Export interface: 0
VCS version: v2.9.3

# hspace --version
hspace (ganeti) version v2.9.3
compiled with ghc 7.4
running on linux x86_64

The cluster's nodes are running Debian Wheezy 7.8.


What steps will reproduce the problem?

Run 'hbal -L -X' when the computed series of jobs includes a 'replace-disks' job. In my case, that job affected an instance with a 200GB disk.

What is the expected output? What do you see instead?

I expected hbal to execute the whole series of jobs it calculated to rebalance the cluster. Instead, hbal stalls after successfully executing the first job, which replaces the disks of, and then migrates, an instance with a 200GB disk.

Please provide any additional information below:

775710 success INSTANCE_REPLACE_DISKS(problematic.instance),INSTANCE_MIGRATE(problematic.instance)
775711 success INSTANCE_QUERY_DATA
775714 success CLUSTER_VERIFY
775715 success CLUSTER_VERIFY_CONFIG
775716 success CLUSTER_VERIFY_GROUP(5d3aed89-4f19-4a87-8d0d-cff6159a6926)
775717 success CLUSTER_VERIFY_GROUP(e4c3ade3-f126-4d5f-aebe-0d114c9c5006)
775718 success INSTANCE_QUERY_DATA

# gnt-job info 775710
Job ID: 775710
  Status: success
  Received:         2015-06-22 13:12:17.127187
  Processing start: 2015-06-22 13:12:17.262588 (delta 0.135401s)
  Processing end:   2015-06-22 13:39:40.079557 (delta 1642.816969s)
  Total processing time: 1642.952370 seconds
  Opcodes:
    OP_INSTANCE_REPLACE_DISKS
      Status: success
      Processing start: 2015-06-22 13:12:17.262588
      Execution start:  2015-06-22 13:12:17.431253
      Processing end:   2015-06-22 13:37:31.807807
    OP_INSTANCE_MIGRATE
      Status: success
      Processing start: 2015-06-22 13:37:32.058913
      Execution start:  2015-06-22 13:38:17.707905
      Processing end:   2015-06-22 13:39:40.079539


I had to manually kill the hbal process at 2015-06-22 14:26 and then run hbal again to execute the rest of the jobs.
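
To put numbers on the stall, here is a rough Python sketch that computes the durations from the timestamps quoted above. The variable names are my own, and the kill time is only known to the minute, so the idle figure is an approximation.

```python
from datetime import datetime

# Timestamp format used in the 'gnt-job info' output above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def ts(text):
    """Parse one timestamp copied from the job info output."""
    return datetime.strptime(text, FMT)


# Values copied from job 775710.
replace_start = ts("2015-06-22 13:12:17.262588")
replace_end = ts("2015-06-22 13:37:31.807807")
migrate_start = ts("2015-06-22 13:37:32.058913")
migrate_end = ts("2015-06-22 13:39:40.079539")
job_end = ts("2015-06-22 13:39:40.079557")

# Assumption: hbal was killed at 14:26:00; only the minute was recorded.
killed_at = datetime(2015, 6, 22, 14, 26)

replace_seconds = (replace_end - replace_start).total_seconds()
migrate_seconds = (migrate_end - migrate_start).total_seconds()
idle_seconds = (killed_at - job_end).total_seconds()

print("replace-disks: %.1f s" % replace_seconds)
print("migrate:       %.1f s" % migrate_seconds)
print("hbal idle after job success: %.1f min" % (idle_seconds / 60))
```

By this calculation the replace-disks opcode took about 25 minutes, the migration about 2 minutes, and hbal then sat idle for roughly 46 minutes after the job had already reported success.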

