Hi all,

here is a list of issues found while testing a setup with 2 cluster nodes, 8 remote nodes, and around 450 resources. I hope it is useful for some polishing before the 1.1.15 release. The Pacemaker version tested is quite close to 1.1.15-rc1.

* templates are not supported for ocf:pacemaker:remote
* fencing events may be lost due to long transition run time (already discussed)
* the CIB becomes unresponsive when uploading many changes, which leads to sbd fencing (if sbd is enabled)
* node-action-limit seems to work on a per-cluster-node basis, so it limits the number of operations run on all remote nodes connected through a given cluster node
* changing many node attributes during a transition run may lead to a transition-recalculation storm (found with a resource agent that changes dozens of attributes)
* "notice: Relying on watchdog integration for fencing" - this probably needs to be reworded/downgraded
* application of a big enough CIB diff results in monitor failures - CPU hog? CIB hang?
* crmd[9834]: crit: GLib: g_hash_table_lookup: assertion 'hash_table != NULL' failed - hope to catch this again next week, as the coredump is lost
* pacemaker loses a resource's exit from a pending state (Starting/Stopping/Migrating): the change is visible in the logs of the local node (or of the crmd managing a given remote node) but is not propagated to the CIB
* crmd crash discovered after moving the DC node to standby: segfault in crmd's remote-related code (lrmd client) - hope to catch this again next week
* failcounts for resources on remote nodes are not properly cleaned up (related to pending states being enabled???)
* many "warning: No reason to expect node XXX to be down" messages when deleting attributes on remote nodes
* "error: Query resulted in an error: Timer expired" when adding attributes on remote nodes
* the same when uploading a CIB patch
* attrd[23798]: notice: Update error (unknown peer uuid, retry will be attempted once uuid is discovered): <node>[<attribute>]=(null) failed (host=0x2921ae0) - needs to be reinvestigated
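For context on the node-action-limit item above: it is a cluster option stored in the crm_config section of the CIB, intended to cap the number of actions run concurrently on a node. A minimal CIB fragment setting it might look like the sketch below (the value 4 and the nvpair id are illustrative, not from the setup described here):

```xml
<cluster_property_set id="cib-bootstrap-options">
  <!-- Cap concurrent actions per node; value 4 is an illustrative choice -->
  <nvpair id="opt-node-action-limit" name="node-action-limit" value="4"/>
</cluster_property_set>
```

The issue reported above is that this limit appears to be applied per cluster node rather than per remote node, so all remote nodes whose connections are hosted by one cluster node share a single budget.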

If there is any interest in additional information, I can gather it next week when I have access to the hardware again.

Hope this could be useful,

Vladislav

_______________________________________________
Developers mailing list
[email protected]
http://clusterlabs.org/mailman/listinfo/developers