Re: [Pacemaker] pacemaker/dlm problems
On Mon, Oct 3, 2011 at 7:29 PM, Vladislav Bogdanov wrote:
> 03.10.2011 10:56, Andrew Beekhof wrote:
>> On Mon, Oct 3, 2011 at 3:34 PM, Vladislav Bogdanov wrote:
>>> 03.10.2011 04:41, Andrew Beekhof wrote:
>>> [...]
>>> If pacemaker fully finished processing of one membership change - elected a new DC on the quorate partition, and did not try to take over the DC role (or release it) on a non-quorate partition if a quorate one exists - that problem could be gone.
>>
>> Non-quorate partitions still have a DC.
>> They're just not supposed to do anything (depending on the value of no-quorum-policy).
>
> I actually meant "do not try to take over the DC role in a rejoined cluster (or release that role) if it was running on a non-quorate partition before the rejoin, if a quorate one existed".

All existing DCs give up the role and a new one is elected when two partitions join.
So I'm unsure what you're referring to here :-)

> Sorry for the confusion. Not very natural wording again, but it should be better.
>
> Maybe the DC from the non-quorate partition should just have a lower priority to become DC when the cluster rejoins and a new election happens (does it?)?

There is no bias towards past DCs in the election.

>>> From what I understand, the election result depends highly on the nodes' (pacemaker processes') uptime. And DC.old has a great chance to win an election, just because it won before and nothing changed in the election parameters after that. Please correct me.
>>
>> Correct. But it's not getting an advantage because it was DC.
>
> But it could have one because it e.g. has a greater uptime (and that actually was the reason it won previous elections, before the split-brain).
> And then it can drop all CIB modifications which happened in the quorate partition during the split-brain. At least some messages in the logs (you should have them) make me think so. If it is possible to avoid this, it would be great.
> So, from my PoV, one of two things should happen:
> * DC.old does not win
> * DC.old wins and replaces its CIB with the copy from DC.new
>
> Am I wrong here?

The CIB which is used depends not on which node was DC, but on which node had CIB.latest.

> Vladislav

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] pacemaker/dlm problems
03.10.2011 10:56, Andrew Beekhof wrote:
> On Mon, Oct 3, 2011 at 3:34 PM, Vladislav Bogdanov wrote:
>> 03.10.2011 04:41, Andrew Beekhof wrote:
>> [...]
>>>>>> If pacemaker fully finished processing of one membership change - elected a new DC on the quorate partition, and did not try to take over the DC role (or release it) on a non-quorate partition if a quorate one exists - that problem could be gone.
>>>>>
>>>>> Non-quorate partitions still have a DC.
>>>>> They're just not supposed to do anything (depending on the value of no-quorum-policy).
>>>>
>>>> I actually meant "do not try to take over the DC role in a rejoined cluster (or release that role) if it was running on a non-quorate partition before the rejoin, if a quorate one existed".
>>>
>>> All existing DCs give up the role and a new one is elected when two partitions join.
>>> So I'm unsure what you're referring to here :-)
>>>
>>>> Sorry for the confusion. Not very natural wording again, but it should be better.
>>>>
>>>> Maybe the DC from the non-quorate partition should just have a lower priority to become DC when the cluster rejoins and a new election happens (does it?)?
>>>
>>> There is no bias towards past DCs in the election.
>>
>> From what I understand, the election result depends highly on the nodes' (pacemaker processes') uptime. And DC.old has a great chance to win an election, just because it won before and nothing changed in the election parameters after that. Please correct me.
>
> Correct. But it's not getting an advantage because it was DC.

But it could have one because it e.g. has a greater uptime (and that actually was the reason it won previous elections, before the split-brain).
And then it can drop all CIB modifications which happened in the quorate partition during the split-brain. At least some messages in the logs (you should have them) make me think so. If it is possible to avoid this, it would be great.

So, from my PoV, one of two things should happen:
* DC.old does not win
* DC.old wins and replaces its CIB with the copy from DC.new

Am I wrong here?
Vladislav
Re: [Pacemaker] pacemaker/dlm problems
On Mon, Oct 3, 2011 at 3:34 PM, Vladislav Bogdanov wrote:
> 03.10.2011 04:41, Andrew Beekhof wrote:
> [...]
>>>>> If pacemaker fully finished processing of one membership change - elected a new DC on the quorate partition, and did not try to take over the DC role (or release it) on a non-quorate partition if a quorate one exists - that problem could be gone.
>>>>
>>>> Non-quorate partitions still have a DC.
>>>> They're just not supposed to do anything (depending on the value of no-quorum-policy).
>>>
>>> I actually meant "do not try to take over the DC role in a rejoined cluster (or release that role) if it was running on a non-quorate partition before the rejoin, if a quorate one existed".
>>
>> All existing DCs give up the role and a new one is elected when two partitions join.
>> So I'm unsure what you're referring to here :-)
>>
>>> Sorry for the confusion. Not very natural wording again, but it should be better.
>>>
>>> Maybe the DC from the non-quorate partition should just have a lower priority to become DC when the cluster rejoins and a new election happens (does it?)?
>>
>> There is no bias towards past DCs in the election.
>
> From what I understand, the election result depends highly on the nodes' (pacemaker processes') uptime. And DC.old has a great chance to win an election, just because it won before and nothing changed in the election parameters after that. Please correct me.

Correct. But it's not getting an advantage because it was DC.
Re: [Pacemaker] pacemaker/dlm problems
03.10.2011 04:41, Andrew Beekhof wrote:
[...]
>>>> If pacemaker fully finished processing of one membership change - elected a new DC on the quorate partition, and did not try to take over the DC role (or release it) on a non-quorate partition if a quorate one exists - that problem could be gone.
>>>
>>> Non-quorate partitions still have a DC.
>>> They're just not supposed to do anything (depending on the value of no-quorum-policy).
>>
>> I actually meant "do not try to take over the DC role in a rejoined cluster (or release that role) if it was running on a non-quorate partition before the rejoin, if a quorate one existed".
>
> All existing DCs give up the role and a new one is elected when two partitions join.
> So I'm unsure what you're referring to here :-)
>
>> Sorry for the confusion. Not very natural wording again, but it should be better.
>>
>> Maybe the DC from the non-quorate partition should just have a lower priority to become DC when the cluster rejoins and a new election happens (does it?)?
>
> There is no bias towards past DCs in the election.

From what I understand, the election result depends highly on the nodes' (pacemaker processes') uptime. And DC.old has a great chance to win an election, just because it won before and nothing changed in the election parameters after that. Please correct me.

Best,
Vladislav
Re: [Pacemaker] pacemaker/dlm problems
On Tue, Sep 27, 2011 at 6:24 PM, Vladislav Bogdanov wrote:
> 27.09.2011 10:56, Andrew Beekhof wrote:
>> On Tue, Sep 27, 2011 at 5:07 PM, Vladislav Bogdanov wrote:
>>> 27.09.2011 08:59, Andrew Beekhof wrote:
>>> [snip]
>>>>>> I agree with Jiaju (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html) that this could be solely a pacemaker problem, because it probably should originate the fencing itself in such a situation, I think.
>>>>>>
>>>>>> So, using pacemaker/dlm with the openais stack is currently risky due to possible hangs of dlm lockspaces.
>>>>>
>>>>> It shouldn't be, failing to connect to attrd is very unusual.
>>>>
>>>> By the way, one of the underlying problems, which actually made me notice all this, is that a pacemaker cluster does not fence its DC if it leaves the cluster for a very short time. That is what Jiaju said in his notes. And I can confirm that.
>>>
>>> That's highly surprising. Do the logs you sent display this behaviour?
>>
>> They do. The rest of the cluster begins the election, but then accepts the returned DC back (I write this from memory, I looked at the logs Sep 5-6, so I may mix something up).

Actually, this might be possible - if DC.old came back before DC.new had a chance to get elected, run the PE and initiate fencing, then there would be no need to fence.

>>> (text below is for pacemaker on top of the openais stack, not for cman)
>>>
>>> Except dlm lockspaces are in kern_stop state, so the whole dlm-related part is frozen :( - clvmd in my case, but I expect the same from gfs2 and ocfs2.
>>> And fencing requests originated on a CPG NODEDOWN event by dlm_controld (with my patch to dlm_controld and your patch for crm_terminate_member_common()) on the quorate partition are lost. DC.old doesn't accept CIB updates from other nodes, so those fencing requests are discarded.
>>
>> All the more reason to start using the stonith API directly.
>> I was playing around last night with the dlm_controld.pcmk code:
>>
>> https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>
> Wow, I'll try it!
>
> Btw (offtopic), don't you think that it could be interesting to have stack support in dlopened modules there? From what I see in that code, it could be achieved almost easily. One just needs to create a module API structure, enumerate the functions in each stack, add module loading to the dlm_controld core and change the calls to module functions.

I'm sure it's possible. Just up to David if he wants to support it.

>>> I think the problem is that membership changes are handled in a non-transactional way (?).
>>
>> Sounds more like the dlm/etc is being dumb - if the host is back and healthy, why would we want to shoot it?
>
> A. No comments from me on this ;)
>
> But, anyway, something needs to be done on either side...
>
>>> If pacemaker fully finished processing of one membership change - elected a new DC on the quorate partition, and did not try to take over the DC role (or release it) on a non-quorate partition if a quorate one exists - that problem could be gone.
>>
>> Non-quorate partitions still have a DC.
>> They're just not supposed to do anything (depending on the value of no-quorum-policy).
>
> I actually meant "do not try to take over the DC role in a rejoined cluster (or release that role) if it was running on a non-quorate partition before the rejoin, if a quorate one existed".

All existing DCs give up the role and a new one is elected when two partitions join.
So I'm unsure what you're referring to here :-)

> Sorry for the confusion. Not very natural wording again, but it should be better.
>
> Maybe the DC from the non-quorate partition should just have a lower priority to become DC when the cluster rejoins and a new election happens (does it?)?

There is no bias towards past DCs in the election.
Re: [Pacemaker] pacemaker/dlm problems
Hi,

27.09.2011 10:56, Andrew Beekhof wrote:
[snip]
> All the more reason to start using the stonith API directly.
> I was playing around last night with the dlm_controld.pcmk code:
>
> https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787

It doesn't seem to apply to 3.0.17, so I rebased that commit against it for my build. Then it doesn't compile without the attached patch. It may need to be rebased a bit against your tree.

Now I have the package built and am building node images. Will try shortly.

[snip]

Best,
Vladislav

--- cluster-3.0.17/group/dlm_controld/Makefile.orig	2010-10-04 12:24:34.0 +
+++ cluster-3.0.17/group/dlm_controld/Makefile	2011-09-28 09:01:23.453252437 +
@@ -54,7 +54,7 @@
 LDFLAGS += -L${libdir}
 LDDEPS += ../lib/libgroup.a
-PCMK_LDFLAGS += -lcib -lcrmcommon -lcrmcluster -ltotem_pg
+PCMK_LDFLAGS += -lcib -lcrmcommon -lcrmcluster -ltotem_pg -lplumb -lstonithd
 PCMK_LDFLAGS += `pkg-config glib-2.0 --libs`
 PCMK_LDFLAGS += `xml2-config --libs`
--- cluster-3.0.17/group/dlm_controld/pacemaker.c.orig	2011-09-28 08:49:00.0 +
+++ cluster-3.0.17/group/dlm_controld/pacemaker.c	2011-09-28 08:59:50.678375731 +
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #define COMMS_DIR "/sys/kernel/config/dlm/cluster/comms"
@@ -249,16 +250,17 @@ int fence_in_progress(int *in_progress)
 int fence_node_time(int nodeid, uint64_t *last_fenced_time)
 {
     int rc = 0;
-    const char *uname = NULL;
-    crm_node_t *node = crm_get_peer(nodeid, uname);
+    const char *node_uname = NULL;
+    crm_node_t *node = crm_get_peer(nodeid, node_uname);
     stonith_history_t *history, *hp = NULL;
+    stonith_t *st = NULL;
     if(last_fenced_time) {
         *last_fenced_time = 0;
     }
     if (node && node->uname) {
-        uname = node->uname;
+        node_uname = node->uname;
         st = stonith_api_new();
     } else {
@@ -271,7 +273,7 @@ int fence_node_time(int nodeid, uint64_t
     }
     if(rc == stonith_ok) {
-        st->cmds->history(st, st_opt_sync_call, uname, &history, 120);
+        st->cmds->history(st, st_opt_sync_call, node_uname, &history, 120);
         for(hp = history; hp; hp = hp->next) {
             if(hp->state == st_done) {
                 *last_fenced_time = hp->completed;
             }
         }
@@ -280,9 +282,9 @@ int fence_node_time(int nodeid, uint64_t
     }
     if(*last_fenced_time != 0) {
-        log_debug("Node %d/%s was last shot at: %s", nodeid, ctime(*last_fenced_time));
+        log_debug("Node %d/%s was last shot at: %s", nodeid, node_uname, ctime(last_fenced_time));
     } else {
-        log_debug("It does not appear node %d/%s has been shot", nodeid, uname);
+        log_debug("It does not appear node %d/%s has been shot", nodeid, node_uname);
     }
     if(st) {
Re: [Pacemaker] pacemaker/dlm problems
27.09.2011 10:56, Andrew Beekhof wrote:
> On Tue, Sep 27, 2011 at 5:07 PM, Vladislav Bogdanov wrote:
>> 27.09.2011 08:59, Andrew Beekhof wrote:
>> [snip]
>>>>>> I agree with Jiaju (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html) that this could be solely a pacemaker problem, because it probably should originate the fencing itself in such a situation, I think.
>>>>>>
>>>>>> So, using pacemaker/dlm with the openais stack is currently risky due to possible hangs of dlm lockspaces.
>>>>>
>>>>> It shouldn't be, failing to connect to attrd is very unusual.
>>>>
>>>> By the way, one of the underlying problems, which actually made me notice all this, is that a pacemaker cluster does not fence its DC if it leaves the cluster for a very short time. That is what Jiaju said in his notes. And I can confirm that.
>>>
>>> That's highly surprising. Do the logs you sent display this behaviour?
>>
>> They do. The rest of the cluster begins the election, but then accepts the returned DC back (I write this from memory, I looked at the logs Sep 5-6, so I may mix something up).
>
> Actually, this might be possible - if DC.old came back before DC.new had a chance to get elected, run the PE and initiate fencing, then there would be no need to fence.
>
>> (text below is for pacemaker on top of the openais stack, not for cman)
>>
>> Except dlm lockspaces are in kern_stop state, so the whole dlm-related part is frozen :( - clvmd in my case, but I expect the same from gfs2 and ocfs2.
>> And fencing requests originated on a CPG NODEDOWN event by dlm_controld (with my patch to dlm_controld and your patch for crm_terminate_member_common()) on the quorate partition are lost. DC.old doesn't accept CIB updates from other nodes, so those fencing requests are discarded.
>
> All the more reason to start using the stonith API directly.
> I was playing around last night with the dlm_controld.pcmk code:
>
> https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787

Wow, I'll try it!

Btw (offtopic), don't you think that it could be interesting to have stack support in dlopened modules there? From what I see in that code, it could be achieved almost easily. One just needs to create a module API structure, enumerate the functions in each stack, add module loading to the dlm_controld core and change the calls to module functions.

>> I think the problem is that membership changes are handled in a non-transactional way (?).
>
> Sounds more like the dlm/etc is being dumb - if the host is back and healthy, why would we want to shoot it?

A. No comments from me on this ;)

But, anyway, something needs to be done on either side...

>> If pacemaker fully finished processing of one membership change - elected a new DC on the quorate partition, and did not try to take over the DC role (or release it) on a non-quorate partition if a quorate one exists - that problem could be gone.
>
> Non-quorate partitions still have a DC.
> They're just not supposed to do anything (depending on the value of no-quorum-policy).

I actually meant "do not try to take over the DC role in a rejoined cluster (or release that role) if it was running on a non-quorate partition before the rejoin, if a quorate one existed". Sorry for the confusion. Not very natural wording again, but it should be better.

Maybe the DC from the non-quorate partition should just have a lower priority to become DC when the cluster rejoins and a new election happens (does it?)?

>> I didn't dig into the code so much, so all of the above is just my deduction, which may be completely wrong.
>> And of course the real logic could (should) be much more complicated, with handling of just-rebooted members, etc.
>>
>> (end of openais-specific part)
>> [snip]
>> Although it took 25 seconds instead of 3 to break the cluster (I understand, it is almost impossible to load a host so much, but anyway), then I got a real nightmare: two nodes of the 3-node cluster had cman stopped (and pacemaker too, because of the cman connection loss) - they asked kick_node_from_cluster() for each other, and that succeeded. But fencing didn't happen (I still need to look at why, but this is cman-specific).
>>
>> Btw this part is tricky for me to understand the underlying logic:
>> * cman just stops cman processes on remote nodes, disregarding the quorum. I hope that could be fixed in corosync, if I understand one of the latest threads there right.
>> * But cman does not do fencing of those nodes, and they still run resources. And this could be extremely dangerous under some circumstances. And cman does not do fencing even if it has fence devices configured in cluster.conf (I verified that).
>>
>> Remaining node had pacemaker hanged, it didn't even notice the cluster infrastructure change, down nodes were listed as online, one of them was a DC, all resources were marked as started on all (down too) nodes.
Re: [Pacemaker] pacemaker/dlm problems
On Tue, Sep 27, 2011 at 5:07 PM, Vladislav Bogdanov wrote:
> 27.09.2011 08:59, Andrew Beekhof wrote:
> [snip]
>>>>>>> I agree with Jiaju (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html) that this could be solely a pacemaker problem, because it probably should originate the fencing itself in such a situation, I think.
>>>>>>>
>>>>>>> So, using pacemaker/dlm with the openais stack is currently risky due to possible hangs of dlm lockspaces.
>>>>>>
>>>>>> It shouldn't be, failing to connect to attrd is very unusual.
>>>>>
>>>>> By the way, one of the underlying problems, which actually made me notice all this, is that a pacemaker cluster does not fence its DC if it leaves the cluster for a very short time. That is what Jiaju said in his notes. And I can confirm that.
>>>>
>>>> That's highly surprising. Do the logs you sent display this behaviour?
>>>
>>> They do. The rest of the cluster begins the election, but then accepts the returned DC back (I write this from memory, I looked at the logs Sep 5-6, so I may mix something up).
>>
>> Actually, this might be possible - if DC.old came back before DC.new had a chance to get elected, run the PE and initiate fencing, then there would be no need to fence.
>
> (text below is for pacemaker on top of the openais stack, not for cman)
>
> Except dlm lockspaces are in kern_stop state, so the whole dlm-related part is frozen :( - clvmd in my case, but I expect the same from gfs2 and ocfs2.
> And fencing requests originated on a CPG NODEDOWN event by dlm_controld (with my patch to dlm_controld and your patch for crm_terminate_member_common()) on the quorate partition are lost. DC.old doesn't accept CIB updates from other nodes, so those fencing requests are discarded.

All the more reason to start using the stonith API directly.
I was playing around last night with the dlm_controld.pcmk code:

https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787

> I think the problem is that membership changes are handled in a non-transactional way (?).

Sounds more like the dlm/etc is being dumb - if the host is back and healthy, why would we want to shoot it?

> If pacemaker fully finished processing of one membership change - elected a new DC on the quorate partition, and did not try to take over the DC role (or release it) on a non-quorate partition if a quorate one exists - that problem could be gone.

Non-quorate partitions still have a DC.
They're just not supposed to do anything (depending on the value of no-quorum-policy).

> I didn't dig into the code so much, so all of the above is just my deduction, which may be completely wrong.
> And of course the real logic could (should) be much more complicated, with handling of just-rebooted members, etc.
>
> (end of openais-specific part)
>
>>> [snip]
>>> Although it took 25 seconds instead of 3 to break the cluster (I understand, it is almost impossible to load a host so much, but anyway), then I got a real nightmare: two nodes of the 3-node cluster had cman stopped (and pacemaker too, because of the cman connection loss) - they asked kick_node_from_cluster() for each other, and that succeeded. But fencing didn't happen (I still need to look at why, but this is cman-specific).
>>>
>>> Btw this part is tricky for me to understand the underlying logic:
>>> * cman just stops cman processes on remote nodes, disregarding the quorum. I hope that could be fixed in corosync, if I understand one of the latest threads there right.
>>> * But cman does not do fencing of those nodes, and they still run resources. And this could be extremely dangerous under some circumstances. And cman does not do fencing even if it has fence devices configured in cluster.conf (I verified that).
>>>
>>> Remaining node had pacemaker hanged, it didn't even notice the cluster infrastructure change, down nodes were listed as online, one of them was a DC, all resources were marked as started on all (down too) nodes. No log entries from pacemaker at all.
>>
>> Well I can't see any logs from anyone, so it's hard for me to comment.
>
> Logs are sent privately.
Re: [Pacemaker] pacemaker/dlm problems
27.09.2011 08:59, Andrew Beekhof wrote:
[snip]
>>>>>> I agree with Jiaju (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html) that this could be solely a pacemaker problem, because it probably should originate the fencing itself in such a situation, I think.
>>>>>>
>>>>>> So, using pacemaker/dlm with the openais stack is currently risky due to possible hangs of dlm lockspaces.
>>>>>
>>>>> It shouldn't be, failing to connect to attrd is very unusual.
>>>>
>>>> By the way, one of the underlying problems, which actually made me notice all this, is that a pacemaker cluster does not fence its DC if it leaves the cluster for a very short time. That is what Jiaju said in his notes. And I can confirm that.
>>>
>>> That's highly surprising. Do the logs you sent display this behaviour?
>>
>> They do. The rest of the cluster begins the election, but then accepts the returned DC back (I write this from memory, I looked at the logs Sep 5-6, so I may mix something up).
>
> Actually, this might be possible - if DC.old came back before DC.new had a chance to get elected, run the PE and initiate fencing, then there would be no need to fence.

(text below is for pacemaker on top of the openais stack, not for cman)

Except dlm lockspaces are in kern_stop state, so the whole dlm-related part is frozen :( - clvmd in my case, but I expect the same from gfs2 and ocfs2.
And fencing requests originated on a CPG NODEDOWN event by dlm_controld (with my patch to dlm_controld and your patch for crm_terminate_member_common()) on the quorate partition are lost. DC.old doesn't accept CIB updates from other nodes, so those fencing requests are discarded.

I think the problem is that membership changes are handled in a non-transactional way (?).
If pacemaker fully finished processing of one membership change - elected a new DC on the quorate partition, and did not try to take over the DC role (or release it) on a non-quorate partition if a quorate one exists - that problem could be gone.

I didn't dig into the code so much, so all of the above is just my deduction, which may be completely wrong.
And of course the real logic could (should) be much more complicated, with handling of just-rebooted members, etc.

(end of openais-specific part)

>> [snip]
>> Although it took 25 seconds instead of 3 to break the cluster (I understand, it is almost impossible to load a host so much, but anyway), then I got a real nightmare: two nodes of the 3-node cluster had cman stopped (and pacemaker too, because of the cman connection loss) - they asked kick_node_from_cluster() for each other, and that succeeded. But fencing didn't happen (I still need to look at why, but this is cman-specific).
>>
>> Btw this part is tricky for me to understand the underlying logic:
>> * cman just stops cman processes on remote nodes, disregarding the quorum. I hope that could be fixed in corosync, if I understand one of the latest threads there right.
>> * But cman does not do fencing of those nodes, and they still run resources. And this could be extremely dangerous under some circumstances. And cman does not do fencing even if it has fence devices configured in cluster.conf (I verified that).
>>
>> Remaining node had pacemaker hanged, it didn't even notice the cluster infrastructure change, down nodes were listed as online, one of them was a DC, all resources were marked as started on all (down too) nodes. No log entries from pacemaker at all.
>
> Well I can't see any logs from anyone, so it's hard for me to comment.

Logs are sent privately.
> Vladislav
Re: [Pacemaker] pacemaker/dlm problems
On Mon, Sep 26, 2011 at 6:41 PM, Vladislav Bogdanov wrote:
> 26.09.2011 11:16, Andrew Beekhof wrote:
> [snip]
>>>> Regardless, for 1.1.6 the dlm would be better off making a call like:
>>>>
>>>> rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
>>>>
>>>> from fencing/admin.c
>>>>
>>>> That would talk directly to the fencing daemon, bypassing attrd, crmd and the PE - and thus be more reliable.
>>>>
>>>> This is what the cman plugin will be doing soon too.
>>>
>>> Great to know, I'll try that in the near future. Thank you very much for the pointer.
>>
>> 1.1.7 will actually make use of this API regardless of any *_controld changes - I'm in the middle of updating the two library functions they use (crm_terminate_member and crm_terminate_member_no_mainloop).
>
> Ah, I'll then try your patch and wait for that to be resolved.
>
>>>>>> I agree with Jiaju (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html) that this could be solely a pacemaker problem, because it probably should originate the fencing itself in such a situation, I think.
>>>>>>
>>>>>> So, using pacemaker/dlm with the openais stack is currently risky due to possible hangs of dlm lockspaces.
>>>>>
>>>>> It shouldn't be, failing to connect to attrd is very unusual.
>>>>
>>>> By the way, one of the underlying problems, which actually made me notice all this, is that a pacemaker cluster does not fence its DC if it leaves the cluster for a very short time. That is what Jiaju said in his notes. And I can confirm that.
>>>
>>> That's highly surprising. Do the logs you sent display this behaviour?
>>
>> They do. The rest of the cluster begins the election, but then accepts the returned DC back (I write this from memory, I looked at the logs Sep 5-6, so I may mix something up).

Actually, this might be possible - if DC.old came back before DC.new had a chance to get elected, run the PE and initiate fencing, then there would be no need to fence.

>>> [snip]
>>> Although it took 25 seconds instead of 3 to break the cluster (I understand, it is almost impossible to load a host so much, but anyway), then I got a real nightmare: two nodes of the 3-node cluster had cman stopped (and pacemaker too, because of the cman connection loss) - they asked kick_node_from_cluster() for each other, and that succeeded. But fencing didn't happen (I still need to look at why, but this is cman-specific).
>>>
>>> Btw this part is tricky for me to understand the underlying logic:
>>> * cman just stops cman processes on remote nodes, disregarding the quorum. I hope that could be fixed in corosync, if I understand one of the latest threads there right.
>>> * But cman does not do fencing of those nodes, and they still run resources. And this could be extremely dangerous under some circumstances. And cman does not do fencing even if it has fence devices configured in cluster.conf (I verified that).
>>>
>>> Remaining node had pacemaker hanged, it didn't even notice the cluster infrastructure change, down nodes were listed as online, one of them was a DC, all resources were marked as started on all (down too) nodes. No log entries from pacemaker at all.
>>
>> Well I can't see any logs from anyone, so it's hard for me to comment.
>
> Logs are sent privately.
Re: [Pacemaker] pacemaker/dlm problems
26.09.2011 11:16, Andrew Beekhof wrote:
[snip]
>>> Regardless, for 1.1.6 the dlm would be better off making a call like:
>>>
>>>     rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
>>>
>>> from fencing/admin.c. That would talk directly to the fencing daemon,
>>> bypassing attrd, crmd and the PE - and thus be more reliable. This is
>>> what the cman plugin will be doing soon too.
>>
>> Great to know, I'll try that in the near future. Thank you very much
>> for the pointer.
>
> 1.1.7 will actually make use of this API regardless of any *_controld
> changes - I'm in the middle of updating the two library functions they
> use (crm_terminate_member and crm_terminate_member_no_mainloop).

Ah, I'll then try your patch and wait for that to be resolved.

>>>> I agree with Jiaju
>>>> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html)
>>>> that this could be solely a pacemaker problem, because it probably
>>>> should originate fencing itself in such a situation, I think.
>>>>
>>>> So, using pacemaker/dlm with the openais stack is currently risky due
>>>> to possible hangs of dlm lockspaces.
>>>
>>> It shouldn't be; failing to connect to attrd is very unusual.
>>
>> By the way, one of the underlying problems, which actually made me
>> notice all this, is that a pacemaker cluster does not fence its DC if
>> it leaves the cluster for a very short time. That is what Jiaju said
>> in his notes, and I can confirm that.
>
> That's highly surprising. Do the logs you sent display this behaviour?

They do. The rest of the cluster begins the election, but then accepts
the returned DC back (I write this from memory - I looked at the logs on
Sep 5-6, so I may mix something up).
[snip]
>>>> Although it took 25 seconds instead of 3 to break the cluster (I
>>>> understand, it is almost impossible to load a host that much, but
>>>> anyway), I then got a real nightmare: two nodes of the 3-node cluster
>>>> had cman stopped (and pacemaker too, because of the cman connection
>>>> loss) - they asked to kick_node_from_cluster() each other, and that
>>>> succeeded. But fencing didn't happen (I still need to look at why,
>>>> but this is cman specific).
>>>>
>>>> Btw, this part of the underlying logic is tricky for me to understand:
>>>> * cman just stops the cman processes on the remote nodes,
>>>>   disregarding quorum. I hope that could be fixed in corosync, if I
>>>>   understand one of the latest threads there right.
>>>> * But cman does not fence those nodes, and they still run resources.
>>>>   This could be extremely dangerous under some circumstances. And
>>>>   cman does not fence them even if it has fence devices configured
>>>>   in cluster.conf (I verified that).
>>>>
>>>> The remaining node had pacemaker hung; it didn't even notice the
>>>> cluster infrastructure change: the down nodes were listed as online,
>>>> one of them was the DC, and all resources were marked as started on
>>>> all (down too) nodes. No log entries from pacemaker at all.
>>>
>>> Well, I can't see any logs from anyone, so it's hard for me to comment.
>>
>> Logs are sent privately.

Vladislav
Re: [Pacemaker] pacemaker/dlm problems
On Mon, Sep 26, 2011 at 5:38 PM, Vladislav Bogdanov wrote:
> Hi Andrew,
>
> 26.09.2011 10:10, Andrew Beekhof wrote:
>> On Tue, Sep 6, 2011 at 5:27 PM, Vladislav Bogdanov wrote:
>>> Hi Andrew, hi all,
>>>
>>> I'm further investigating the dlm lockspace hangs I described in
>>> https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
>>> and in the thread starting from
>>> https://lists.linux-foundation.org/pipermail/openais/2011-September/016701.html
>>> .
>>>
>>> What I described there is a setup which involves pacemaker-1.1.6 with
>>> corosync-1.4.1 and dlm_controld.pcmk from cluster-3.0.17 (without
>>> cman). I use the openais stack for pacemaker.
>>>
>>> I found that it is possible to reproduce the dlm kern_stop state
>>> across a whole cluster with iptables on just one node; it is
>>> sufficient to block all (or just corosync-specific) incoming/outgoing
>>> UDP for several seconds (that time probably depends on corosync
>>> settings). In my case I reproduced the hang with a 3-second traffic
>>> block:
>>>
>>>     iptables -I INPUT 1 -p udp -j REJECT; \
>>>     iptables -I OUTPUT 1 -p udp -j REJECT; \
>>>     sleep 3; \
>>>     iptables -D INPUT 1; \
>>>     iptables -D OUTPUT 1
>>>
>>> I tried to make dlm_controld schedule fencing on the
>>> CPG_REASON_NODEDOWN event (just to see if it helps with the problems
>>> I described in the posts referenced above), but without much success;
>>> the following code does not work:
>>>
>>>     int fd = pcmk_cluster_fd;
>>>     int rc = crm_terminate_member_no_mainloop(nodeid, NULL, &fd);
>>>
>>> I get a "Could not kick node XXX from the cluster" message accompanied
>>> by "No connection to the cluster". That means that
>>> attrd_update_no_mainloop() fails.
>>>
>>> Andrew, could you please give some pointers on why it may fail? I'd
>>> then try to fix dlm_controld. I do not see any other uses of that
>>> function except in dlm_controld.pcmk.
>>
>> I can't think of anything except that attrd might not be running. Is it?
>
> Will recheck.
>> Regardless, for 1.1.6 the dlm would be better off making a call like:
>>
>>     rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
>>
>> from fencing/admin.c. That would talk directly to the fencing daemon,
>> bypassing attrd, crmd and the PE - and thus be more reliable. This is
>> what the cman plugin will be doing soon too.
>
> Great to know, I'll try that in the near future. Thank you very much
> for the pointer.

1.1.7 will actually make use of this API regardless of any *_controld
changes - I'm in the middle of updating the two library functions they
use (crm_terminate_member and crm_terminate_member_no_mainloop).

>>> I agree with Jiaju
>>> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html)
>>> that this could be solely a pacemaker problem, because it probably
>>> should originate fencing itself in such a situation, I think.
>>>
>>> So, using pacemaker/dlm with the openais stack is currently risky due
>>> to possible hangs of dlm lockspaces.
>>
>> It shouldn't be; failing to connect to attrd is very unusual.
>
> By the way, one of the underlying problems, which actually made me
> notice all this, is that a pacemaker cluster does not fence its DC if
> it leaves the cluster for a very short time. That is what Jiaju said
> in his notes, and I can confirm that.

That's highly surprising. Do the logs you sent display this behaviour?

>>> Originally I got it due to heavy load on one of the cluster nodes
>>> (actually on a host which has that cluster node running as a virtual
>>> guest).
>>>
>>> Ok, I switched to cman to see if it helps. Fencing is configured in
>>> pacemaker, not in cluster.conf.
>>>
>>> Things became even worse ;( .
>>> Although it took 25 seconds instead of 3 to break the cluster (I
>>> understand, it is almost impossible to load a host that much, but
>>> anyway), I then got a real nightmare: two nodes of the 3-node cluster
>>> had cman stopped (and pacemaker too, because of the cman connection
>>> loss) - they asked to kick_node_from_cluster() each other, and that
>>> succeeded. But fencing didn't happen (I still need to look at why,
>>> but this is cman specific).
>>>
>>> The remaining node had pacemaker hung; it didn't even notice the
>>> cluster infrastructure change: the down nodes were listed as online,
>>> one of them was the DC, and all resources were marked as started on
>>> all (down too) nodes. No log entries from pacemaker at all.
>>
>> Well, I can't see any logs from anyone, so it's hard for me to comment.
>
> Logs are sent privately.
>
>>> So, from my PoV cman+pacemaker is not currently suitable for HA tasks
>>> either.
>>>
>>> That means that both possible alternatives are currently unusable if
>>> one needs a self-repairing pacemaker cluster with dlm support ;( That
>>> is really regrettable.
>>>
>>> I can provide all needed information and really hope that it is
>>> possible to fix both issues:
>>> * dlm blockage with openais, and
>>> * pacemaker lock with cman and no fencing from within dlm_controld
>>>
>>> I think both issues are really high priority, because it is
>>> definitely not acceptable when problems with load on one cluster node
>>> (or with the link to that node) lead to a total cluster lock or even
>>> crash.
Re: [Pacemaker] pacemaker/dlm problems
Hi Andrew,

26.09.2011 10:10, Andrew Beekhof wrote:
> On Tue, Sep 6, 2011 at 5:27 PM, Vladislav Bogdanov wrote:
>> Hi Andrew, hi all,
>>
>> I'm further investigating the dlm lockspace hangs I described in
>> https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
>> and in the thread starting from
>> https://lists.linux-foundation.org/pipermail/openais/2011-September/016701.html
>> .
>>
>> What I described there is a setup which involves pacemaker-1.1.6 with
>> corosync-1.4.1 and dlm_controld.pcmk from cluster-3.0.17 (without
>> cman). I use the openais stack for pacemaker.
>>
>> I found that it is possible to reproduce the dlm kern_stop state
>> across a whole cluster with iptables on just one node; it is
>> sufficient to block all (or just corosync-specific) incoming/outgoing
>> UDP for several seconds (that time probably depends on corosync
>> settings). In my case I reproduced the hang with a 3-second traffic
>> block:
>>
>>     iptables -I INPUT 1 -p udp -j REJECT; \
>>     iptables -I OUTPUT 1 -p udp -j REJECT; \
>>     sleep 3; \
>>     iptables -D INPUT 1; \
>>     iptables -D OUTPUT 1
>>
>> I tried to make dlm_controld schedule fencing on the
>> CPG_REASON_NODEDOWN event (just to see if it helps with the problems I
>> described in the posts referenced above), but without much success;
>> the following code does not work:
>>
>>     int fd = pcmk_cluster_fd;
>>     int rc = crm_terminate_member_no_mainloop(nodeid, NULL, &fd);
>>
>> I get a "Could not kick node XXX from the cluster" message accompanied
>> by "No connection to the cluster". That means that
>> attrd_update_no_mainloop() fails.
>>
>> Andrew, could you please give some pointers on why it may fail? I'd
>> then try to fix dlm_controld. I do not see any other uses of that
>> function except in dlm_controld.pcmk.
>
> I can't think of anything except that attrd might not be running. Is it?

Will recheck.
> Regardless, for 1.1.6 the dlm would be better off making a call like:
>
>     rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
>
> from fencing/admin.c. That would talk directly to the fencing daemon,
> bypassing attrd, crmd and the PE - and thus be more reliable. This is
> what the cman plugin will be doing soon too.

Great to know, I'll try that in the near future. Thank you very much for
the pointer.

>> I agree with Jiaju
>> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html)
>> that this could be solely a pacemaker problem, because it probably
>> should originate fencing itself in such a situation, I think.
>>
>> So, using pacemaker/dlm with the openais stack is currently risky due
>> to possible hangs of dlm lockspaces.
>
> It shouldn't be; failing to connect to attrd is very unusual.

By the way, one of the underlying problems, which actually made me
notice all this, is that a pacemaker cluster does not fence its DC if it
leaves the cluster for a very short time. That is what Jiaju said in his
notes, and I can confirm that.

>> Originally I got it due to heavy load on one of the cluster nodes
>> (actually on a host which has that cluster node running as a virtual
>> guest).
>>
>> Ok, I switched to cman to see if it helps. Fencing is configured in
>> pacemaker, not in cluster.conf.
>>
>> Things became even worse ;( .
>>
>> Although it took 25 seconds instead of 3 to break the cluster (I
>> understand, it is almost impossible to load a host that much, but
>> anyway), I then got a real nightmare: two nodes of the 3-node cluster
>> had cman stopped (and pacemaker too, because of the cman connection
>> loss) - they asked to kick_node_from_cluster() each other, and that
>> succeeded. But fencing didn't happen (I still need to look at why, but
>> this is cman specific).
>> The remaining node had pacemaker hung; it didn't even notice the
>> cluster infrastructure change: the down nodes were listed as online,
>> one of them was the DC, and all resources were marked as started on
>> all (down too) nodes. No log entries from pacemaker at all.
>
> Well, I can't see any logs from anyone, so it's hard for me to comment.

Logs are sent privately.

>> So, from my PoV cman+pacemaker is not currently suitable for HA tasks
>> either.
>>
>> That means that both possible alternatives are currently unusable if
>> one needs a self-repairing pacemaker cluster with dlm support ;( That
>> is really regrettable.
>>
>> I can provide all needed information and really hope that it is
>> possible to fix both issues:
>> * dlm blockage with openais, and
>> * pacemaker lock with cman and no fencing from within dlm_controld
>>
>> I think both issues are really high priority, because it is definitely
>> not acceptable when problems with load on one cluster node (or with
>> the link to that node) lead to a total cluster lock or even crash.
>>
>> I also offer any possible assistance from my side (e.g. patch trials)
>> to get this all fixed. I can run either openais or cman and can
>> quickly switch between the stacks.
>>
>> Sorry for not being brief,
>>
>> Best regards,
>> Vladislav
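[Editor's note: the quoted iptables one-liner leaves the REJECT rules in place if the shell is interrupted mid-sleep, partitioning the node for good. A small hedged sketch for anyone replaying the reproduction - the function name and default are illustrative, not from the thread; it assumes root, iptables, and a disposable test node:]

```shell
# Sketch of the UDP-blackout reproduction quoted above, with cleanup on
# interrupt. Assumes root privileges and iptables; test nodes only.
block_udp() {
    secs="${1:-3}"
    iptables -I INPUT 1 -p udp -j REJECT
    iptables -I OUTPUT 1 -p udp -j REJECT
    # If the sleep is interrupted, still remove the rules so the node
    # is not left partitioned by accident.
    trap 'iptables -D INPUT -p udp -j REJECT; iptables -D OUTPUT -p udp -j REJECT' INT TERM
    sleep "$secs"
    iptables -D INPUT -p udp -j REJECT
    iptables -D OUTPUT -p udp -j REJECT
    trap - INT TERM
}
```

`block_udp 3` reproduces the 3-second block from the post; larger values probe the corosync token timeout.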
Re: [Pacemaker] pacemaker/dlm problems
On Tue, Sep 6, 2011 at 5:27 PM, Vladislav Bogdanov wrote:
> Hi Andrew, hi all,
>
> I'm further investigating the dlm lockspace hangs I described in
> https://www.redhat.com/archives/cluster-devel/2011-August/msg00133.html
> and in the thread starting from
> https://lists.linux-foundation.org/pipermail/openais/2011-September/016701.html
> .
>
> What I described there is a setup which involves pacemaker-1.1.6 with
> corosync-1.4.1 and dlm_controld.pcmk from cluster-3.0.17 (without cman).
> I use the openais stack for pacemaker.
>
> I found that it is possible to reproduce the dlm kern_stop state across
> a whole cluster with iptables on just one node; it is sufficient to
> block all (or just corosync-specific) incoming/outgoing UDP for several
> seconds (that time probably depends on corosync settings). In my case I
> reproduced the hang with a 3-second traffic block:
>
>     iptables -I INPUT 1 -p udp -j REJECT; \
>     iptables -I OUTPUT 1 -p udp -j REJECT; \
>     sleep 3; \
>     iptables -D INPUT 1; \
>     iptables -D OUTPUT 1
>
> I tried to make dlm_controld schedule fencing on the
> CPG_REASON_NODEDOWN event (just to see if it helps with the problems I
> described in the posts referenced above), but without much success; the
> following code does not work:
>
>     int fd = pcmk_cluster_fd;
>     int rc = crm_terminate_member_no_mainloop(nodeid, NULL, &fd);
>
> I get a "Could not kick node XXX from the cluster" message accompanied
> by "No connection to the cluster". That means that
> attrd_update_no_mainloop() fails.
>
> Andrew, could you please give some pointers on why it may fail? I'd
> then try to fix dlm_controld. I do not see any other uses of that
> function except in dlm_controld.pcmk.

I can't think of anything except that attrd might not be running. Is it?

Regardless, for 1.1.6 the dlm would be better off making a call like:

    rc = st->cmds->fence(st, st_opts, target, "reboot", 120);

from fencing/admin.c. That would talk directly to the fencing daemon,
bypassing attrd, crmd and the PE - and thus be more reliable.
This is what the cman plugin will be doing soon too.

> I agree with Jiaju
> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html)
> that this could be solely a pacemaker problem, because it probably
> should originate fencing itself in such a situation, I think.
>
> So, using pacemaker/dlm with the openais stack is currently risky due
> to possible hangs of dlm lockspaces.

It shouldn't be; failing to connect to attrd is very unusual.

> Originally I got it due to heavy load on one of the cluster nodes
> (actually on a host which has that cluster node running as a virtual
> guest).
>
> Ok, I switched to cman to see if it helps. Fencing is configured in
> pacemaker, not in cluster.conf.
>
> Things became even worse ;( .
>
> Although it took 25 seconds instead of 3 to break the cluster (I
> understand, it is almost impossible to load a host that much, but
> anyway), I then got a real nightmare: two nodes of the 3-node cluster
> had cman stopped (and pacemaker too, because of the cman connection
> loss) - they asked to kick_node_from_cluster() each other, and that
> succeeded. But fencing didn't happen (I still need to look at why, but
> this is cman specific).
>
> The remaining node had pacemaker hung; it didn't even notice the
> cluster infrastructure change: the down nodes were listed as online,
> one of them was the DC, and all resources were marked as started on
> all (down too) nodes. No log entries from pacemaker at all.

Well, I can't see any logs from anyone, so it's hard for me to comment.

> So, from my PoV cman+pacemaker is not currently suitable for HA tasks
> either.
>
> That means that both possible alternatives are currently unusable if
> one needs a self-repairing pacemaker cluster with dlm support ;( That
> is really regrettable.
> I can provide all needed information and really hope that it is
> possible to fix both issues:
> * dlm blockage with openais, and
> * pacemaker lock with cman and no fencing from within dlm_controld
>
> I think both issues are really high priority, because it is definitely
> not acceptable when problems with load on one cluster node (or with
> the link to that node) lead to a total cluster lock or even crash.
>
> I also offer any possible assistance from my side (e.g. patch trials)
> to get this all fixed. I can run either openais or cman and can
> quickly switch between the stacks.
>
> Sorry for not being brief,
>
> Best regards,
> Vladislav
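[Editor's note: for readers who want the full shape of the direct stonith call Andrew quotes, here is a hedged C sketch modelled loosely on fencing/admin.c from the pacemaker 1.1 series. The header path, client name, and option value are assumptions from that era's API - verify them against the pacemaker-devel headers you actually build against:]

```c
/* Hedged sketch: ask the fencing daemon (stonithd) directly to reboot a
 * node, bypassing attrd, crmd and the PE as suggested in the thread.
 * Modelled on fencing/admin.c; the client name and the st_opts value of
 * 0 are illustrative assumptions, not confirmed by the original post. */
#include <crm/stonith-ng.h>

int fence_node_directly(const char *target)
{
    stonith_t *st = stonith_api_new();
    int rc = -1;

    if (st != NULL) {
        /* Connect to the fencing daemon as a named client. */
        rc = st->cmds->connect(st, "dlm_controld", NULL);
        if (rc == 0) {
            /* Reboot the target, waiting up to 120s for a result. */
            rc = st->cmds->fence(st, 0 /* st_opts */, target, "reboot", 120);
            st->cmds->disconnect(st);
        }
        stonith_api_delete(st);
    }
    return rc;
}
```

This is only the call-shape from the quoted `st->cmds->fence(st, st_opts, target, "reboot", 120)` line wrapped in plausible setup/teardown; it needs a running cluster and the pacemaker fencing headers to compile and run.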