Hi guys.

I've been experiencing weird handling of a VirtualDomain resource by the cluster. The cluster sometimes fails to report the real state of the VM, which occasionally causes trouble: for example, when the cluster thinks the VM is not running while it actually is, it starts it on another node, which corrupts the qcow image. Right now I'm looking at the opposite case: the cluster reports the VM as up and okay while it is not running on any node (the VM powered itself off from the inside).
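For context, the real state vs. the cluster's idea of it can be compared on each node with something like this (the hypervisor URI is the one from the resource config below):

-> $ pcs status resources | grep c8kubermaster1
-> $ virsh -c qemu:///system list --all | grep c8kubermaster1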
So I:

-> $ pcs resource refresh c8kubermaster1
Cleaned up c8kubermaster1 on swir
Cleaned up c8kubermaster1 on dzien
Waiting for 2 replies from the controller
... got reply
... got reply (done)
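As far as I understand, 'refresh' re-probes the resource on all nodes (on older pcs versions this was part of 'pcs resource cleanup'); if I'm not mistaken, the lower-level equivalent would be:

-> $ crm_resource --refresh --resource c8kubermaster1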

In the logs on the node where, according to the cluster, the VM is supposed to be running:
..
 notice: Requesting local execution of probe operation for c8kubermaster1 on swir
 notice: Result of probe operation for c8kubermaster1 on swir: ok
 notice: Requesting local execution of monitor operation for c8kubermaster1 on swir
 notice: Result of monitor operation for c8kubermaster1 on swir: ok

On the second node (this is a two-node cluster), the logs show:
..
 notice: State transition S_IDLE -> S_POLICY_ENGINE
 notice: Ignoring expired c8kubernode1_migrate_to_0 failure on dzien
 notice:  * Start      c8kubermaster1     (          swir )
 notice: Calculated transition 42, saving inputs in /var/lib/pacemaker/pengine/pe-input-2655.bz2
 notice: Initiating monitor operation c8kubermaster1_monitor_0 on swir
 notice: Initiating monitor operation c8kubermaster1_monitor_0 locally on dzien
 notice: Requesting local execution of probe operation for c8kubermaster1 on dzien
 notice: Result of probe operation for c8kubermaster1 on dzien: not running
 notice: Transition 42 aborted by operation c8kubermaster1_monitor_0 'modify' on swir: Event failed
 notice: Transition 42 action 11 (c8kubermaster1_monitor_0 on swir): expected 'not running' but got 'ok'
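To see what the resource agent itself returns outside of Pacemaker, one can run the monitor action by hand on each node. A minimal sketch, assuming the stock heartbeat agent path and the attributes from my config below:

-> $ OCF_ROOT=/usr/lib/ocf \
   OCF_RESKEY_config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml \
   OCF_RESKEY_hypervisor=qemu:///system \
   /usr/lib/ocf/resource.d/heartbeat/VirtualDomain monitor; echo $?

Exit code 0 should mean running, 7 not running (OCF_SUCCESS / OCF_NOT_RUNNING).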

-> $ pcs resource config c8kubermaster1
 Resource: c8kubermaster1 (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml hypervisor=qemu:///system migration_transport=ssh
  Meta Attrs: allow-migrate=true failure-timeout=120s
  Operations: migrate_from interval=0s timeout=180s (c8kubermaster1-migrate_from-interval-0s)
              migrate_to interval=0s timeout=180s (c8kubermaster1-migrate_to-interval-0s)
              monitor interval=30s (c8kubermaster1-monitor-interval-30s)
              start interval=0s timeout=90s (c8kubermaster1-start-interval-0s)
              stop interval=0s timeout=90s (c8kubermaster1-stop-interval-0s)
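Side note: I was also wondering whether tightening the monitor would at least catch the self-poweroff sooner, something like this (untested sketch):

-> $ pcs resource update c8kubermaster1 op monitor interval=10s timeout=60s

though that would not explain the stale 'ok' status above.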

Disabling and re-enabling the resource "fixes" the glitch, but naturally the obvious question is: why is this allowed to happen at all?
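For reference, the workaround amounts to:

-> $ pcs resource disable c8kubermaster1
-> $ pcs resource enable c8kubermaster1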
many thanks, L.