Hi Andrew,

I just found another problem with dlm_controld.pcmk (with your latest patch from github applied, plus my fixes to actually build it; they are included in the message this one references).

A node which has just requested fencing of another one gets stuck printing, every second, the message where you print ctime() in fence_node_time() (pacemaker.c, near line 293). No other messages appear, although fence_node_time() is called only from check_fencing_done() (cpg.c, near line 444).

So both (last_fenced_time >= node->fail_time) and (!node->fence_queries || node->fence_time != last_fenced_time) must be false; otherwise one of the messages for those cases would be shown. Then fence_node_time() seems to return 0 at "if (wait_count) return 0;" (wait_count is incremented when (last_fenced_time >= node->fail_time) is false), so it never reaches the check_fencing_done() call and never returns the expected 1.

The offending node was actually fenced, but dlm_controld never noticed that.
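
To make the path I am describing concrete, here is a minimal sketch of the wait logic as I read it. It is a simplified model, not the actual dlm_controld code: struct node_model and fencing_wait_model() are hypothetical names, and the updates to fence_queries and fence_time inside the wait branch are my assumptions about what makes the second condition go false after the first pass.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, stripped-down stand-in for the per-node record. */
struct node_model {
    int      check_fencing;  /* 1 while we still wait for this node */
    int      fence_queries;  /* how many times the wait was logged */
    uint64_t fail_time;      /* when the node was seen to fail */
    uint64_t fence_time;     /* last fenced time seen on a previous pass */
};

/* Returns 1 when no node is still waiting for fencing, 0 otherwise. */
static int fencing_wait_model(struct node_model *nodes, int count,
                              uint64_t last_fenced_time)
{
    int wait_count = 0;

    assert(nodes != NULL || count == 0);

    for (int i = 0; i < count; i++) {
        struct node_model *node = &nodes[i];

        if (!node->check_fencing)
            continue;

        if (last_fenced_time >= node->fail_time) {
            /* First condition true: fencing is considered done. */
            printf("node %d: fencing done\n", i);
            node->check_fencing = 0;
            node->fence_queries = 0;
        } else {
            /* Second condition: log only on the first query or when
               the reported fence time changed (assumed bookkeeping). */
            if (!node->fence_queries ||
                node->fence_time != last_fenced_time) {
                printf("node %d: waiting for fencing\n", i);
                node->fence_queries++;
                node->fence_time = last_fenced_time;
            }
            wait_count++;
        }
    }

    if (wait_count)
        return 0;   /* the early return I keep hitting */

    return 1;
}
```

In this model, if the reported last fenced time never catches up with fail_time, every pass after the first one waits silently and returns 0, which matches what I see. It only returns 1 once a fence time at or after fail_time is reported.
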
May I ask you to help me a bit with all that logic (as you have already dived into the dlm_controld sources again)? I seem to be so near to success... :|

By the way, I can't find which source your dlm repo was forked from. Maybe you remember?

Best,
Vladislav

28.09.2011 17:41, Vladislav Bogdanov wrote:
> Hi Andrew,
>
>>> All the more reason to start using the stonith api directly.
>>> I was playing around last night with the dlm_controld.pcmk code:
>>>
>>> https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>
>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>> my build. Then it doesn't compile without attached patch.
>> It may need to be rebased a bit against your tree.
>>
>> Now I have package built and am building node images. Will try shortly.
>
> Fencing from within dlm_controld.pcmk still did not work with your first
> patch against that _no_mainloop function (expected).
>
> So I did my best to build packages from the current git tree.
>
> Voila! I got failed node correctly fenced!
> I'll do some more extensive testing next days, but I believe everything
> should be much better now.
>
> I knew you're genius he-he ;)
>
> So, here are steps to get DLM handle CPG NODEDOWN events correctly with
> pacemaker using openais stack:
>
> 1. Build pacemaker (as of 2011-09-28) from git.
> 2. Apply attached patches to cluster-3.0.17 source tree.
> 3. Build dlm_controld.pcmk
>
> One note - gfs2_controld probably needs to be fixed too (FIXME).
>
> Best regards,
> Vladislav
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker