On 07/06/18 18:32, Prasad Nagaraj wrote:
> Hi Christine - Got it :)
>
> I have collected a few seconds of debug logs from all nodes after startup.
> Please find them attached.
> Please let me know if this will help us identify the root cause.
>

The problem is on the node coro.4. It never gets out of the JOIN process:

  Jun 07 16:55:37 corosync [TOTEM ] entering GATHER state from 11.

Something is wrong on that node: either a rogue routing table entry,
a dangling iptables rule, or even a broken NIC.

Chrissie

> Thanks!
>
> On Thu, Jun 7, 2018 at 8:43 PM, Christine Caulfield <[email protected]> wrote:
>
> > On 07/06/18 15:53, Prasad Nagaraj wrote:
> > > Hi - As you can see in the corosync.conf details, I have already set
> > > debug: on
> > >
> >
> > But only in the (disabled) AMF subsystem, not for corosync as a whole :)
> >
> >     logger_subsys {
> >         subsys: AMF
> >         debug: on
> >     }
> >
> >
> > Chrissie
> >
> > > On Thu, 7 Jun 2018, 8:03 pm Christine Caulfield, <[email protected]> wrote:
> > >
> > > > On 07/06/18 15:24, Prasad Nagaraj wrote:
> > > > >
> > > > > No iptables rules or other firewalls are set up on these nodes.
> > > > >
> > > > > One observation is that each node sends messages with its own ring
> > > > > sequence number, and these numbers are not converging. I have seen
> > > > > that in a good cluster, when nodes respond with the same sequence
> > > > > number, the membership is formed automatically. In our case, that
> > > > > is not happening.
> > > > >
> > > >
> > > > That's just a side-effect of the cluster not forming. It's not causing
> > > > it. Can you enable full corosync debugging (just add debug: on to the
> > > > end of the logging {} stanza) and see if that gives any more useful
> > > > information? (I only need the corosync bits, not the pcmk ones.)
> > > >
> > > > Chrissie
> > > >
> > > > > Example: we can see that one node sends
> > > > > Jun 07 07:55:04 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71084: memb=1, new=0, lost=0
> > > > > .....
> > > > > Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71096: memb=1, new=0, lost=0
> > > > > Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71096: memb=1, new=0, lost=0
> > > > >
> > > > > The other node sends messages with its own numbers:
> > > > > Jun 07 07:55:12 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71088: memb=1, new=0, lost=0
> > > > > Jun 07 07:55:12 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71088: memb=1, new=0, lost=0
> > > > > .......
> > > > > Jun 07 07:55:24 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 71100: memb=1, new=0, lost=0
> > > > > Jun 07 07:55:24 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 71100: memb=1, new=0, lost=0
> > > > >
> > > > > Any idea why this happens, and why the sequence numbers from
> > > > > different nodes are not converging?
> > > > >
> > > > > Thanks!
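The ring numbers in the pcmk_peer_update lines quoted above are easier to compare per node. A minimal sketch, assuming each node's log is saved as plain text with one entry per line (not wrapped the way the mail wraps it); the function names and node names are hypothetical. It extracts the Stable membership events and lists the nodes that only ever formed a one-member ring (memb=1):

```python
import re
from typing import Dict, List, Tuple

# Matches pcmk_peer_update lines such as:
#   ... pcmk_peer_update: Stable membership event on ring 71096: memb=1, new=0, lost=0
# Assumption: each log entry sits on a single line (no mail-style wrapping).
PEER_UPDATE = re.compile(
    r"pcmk_peer_update:\s+(Transitional|Stable)\s+membership event on ring (\d+):"
    r"\s+memb=(\d+), new=(\d+), lost=(\d+)"
)


def stable_rings(log_text: str) -> List[Tuple[int, int]]:
    """Return (ring, member_count) for every Stable membership event."""
    return [
        (int(ring), int(memb))
        for kind, ring, memb, _new, _lost in PEER_UPDATE.findall(log_text)
        if kind == "Stable"
    ]


def singleton_nodes(logs: Dict[str, str]) -> List[str]:
    """Nodes whose Stable events all report memb=1, i.e. a ring of one."""
    result = []
    for node, text in sorted(logs.items()):
        rings = stable_rings(text)
        if rings and all(memb == 1 for _ring, memb in rings):
            result.append(node)
    return result


if __name__ == "__main__":
    # Sample lines copied from this thread; the node names are hypothetical.
    logs = {
        "node-a": "Jun 07 07:55:16 corosync [pcmk ] notice: pcmk_peer_update: "
                  "Stable membership event on ring 71096: memb=1, new=0, lost=0\n",
        "node-b": "Jun 07 07:55:24 corosync [pcmk ] notice: pcmk_peer_update: "
                  "Stable membership event on ring 71100: memb=1, new=0, lost=0\n",
    }
    for node in singleton_nodes(logs):
        print(node, stable_rings(logs[node]))
```

Every node settling on memb=1 with its own ring number is what a cluster that never merges looks like; as said earlier in the thread, the diverging numbers are a side-effect of the cluster not forming, not the cause.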
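The coro.4 diagnosis at the top comes down to spotting which node never leaves GATHER. A minimal sketch of that check, assuming one plain-text log per node and assuming that a node which completes the join logs a TOTEM "entering OPERATIONAL state." line; the helper names and every node name other than coro.4 are hypothetical. It reports nodes whose last TOTEM transition is anything other than OPERATIONAL, and how many times each entered GATHER:

```python
import re
from typing import Dict, List, Optional

# Matches TOTEM state transitions such as:
#   Jun 07 16:55:37 corosync [TOTEM ] entering GATHER state from 11.
# Assumption: a node that finishes joining logs "entering OPERATIONAL state."
TOTEM_STATE = re.compile(r"\[TOTEM\s*\]\s+entering (\w+) state")


def last_totem_state(log_text: str) -> Optional[str]:
    """Most recent TOTEM state the node reported entering, or None."""
    states = TOTEM_STATE.findall(log_text)
    return states[-1] if states else None


def gather_entries(log_text: str) -> int:
    """How many times the node entered GATHER in this log."""
    return TOTEM_STATE.findall(log_text).count("GATHER")


def nodes_not_operational(logs: Dict[str, str]) -> List[str]:
    """Nodes whose last reported TOTEM state is not OPERATIONAL."""
    return sorted(
        node for node, text in logs.items()
        if last_totem_state(text) != "OPERATIONAL"
    )


if __name__ == "__main__":
    logs = {
        # Line quoted from coro.4's log in this thread.
        "coro.4": "Jun 07 16:55:37 corosync [TOTEM ] entering GATHER state from 11.\n",
        # Illustrative lines for a hypothetical node that completes the join.
        "node-x": "[TOTEM ] entering GATHER state from 11.\n"
                  "[TOTEM ] entering OPERATIONAL state.\n",
    }
    for node in nodes_not_operational(logs):
        print(node, last_totem_state(logs[node]), gather_entries(logs[node]))
```

A node flagged this way still needs the manual checks on that host (routing table, iptables rules, NIC); the script only narrows down where to look.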

_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
