Ok, I've uploaded data to S3. Links below. There shouldn't have been any splits. We haven't had any network interruption that I am aware of. I bounced corosync on the 10.20.0.127 node and everything cleared up.
As this occurred in our development environment, there is a ton of background noise, so I'm unable to pinpoint exactly when the issue started. But I noticed it around 2014-02-07 01:00 GMT.

blackbox: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync-blackbox.10.20.0.127.gz
core: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.core.26622.gz
log: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.log.gz

Thanks

-Patrick

------------------------------------------------------------------------
*From: *Jan Friesse <[email protected]>
*Sent: * 2014-02-07 03:24:36 E
*To: *Patrick Hemmer <[email protected]>, [email protected]
*Subject: *Re: [corosync] CPG reporting group member that doesn't exist

> Patrick,
> blackbox may be useful. Also log may help us trace what happened. This
> looks like some kind of problem when corosync nodes split and then join
> again... anyway, it's weird and looks like a bug. Another helpful thing
> may be coredump of corosync from affected node (so 10.20.0.127) to
> ensure it is not memory corruption problem.
>
> Regards,
> Honza
>
>
> Patrick Hemmer wrote:
>> I've currently got a 3 node cluster with several processes on each box
>> using CPG. CPG on one of the boxes is reporting a member of a group that
>> isn't there.
>>
>> # 10.20.2.124
>> # corosync-cpgtool
>> Group Name         PID     Node ID
>> r53clip
>>                  17891   169083092 (10.20.0.212)
>>                  21792   169083516 (10.20.2.124)
>> hapi
>>                  17837   169083092 (10.20.0.212)
>>                  21717   169083516 (10.20.2.124)
>> arbiter
>>                  21590   169083007 (10.20.0.127)
>>                  31886   169083516 (10.20.2.124)
>>                   3137   169083092 (10.20.0.212)
>>
>>
>> # 10.20.0.212
>> # corosync-cpgtool
>> Group Name         PID     Node ID
>> r53clip
>>                  17891   169083092 (10.20.0.212)
>>                  21792   169083516 (10.20.2.124)
>> hapi
>>                  17837   169083092 (10.20.0.212)
>>                  21717   169083516 (10.20.2.124)
>> arbiter
>>                  21590   169083007 (10.20.0.127)
>>                  31886   169083516 (10.20.2.124)
>>                   3137   169083092 (10.20.0.212)
>>
>>
>> # 10.20.0.127
>> # corosync-cpgtool
>> Group Name         PID     Node ID
>> r53clip
>>                  17891   169083092 (10.20.0.212)
>>                  21792   169083516 (10.20.2.124)
>> hapi
>>                   7036   169083092 (10.20.0.212)
>>                  21717   169083516 (10.20.2.124)
>>                  17837   169083092 (10.20.0.212)
>> arbiter
>>                  21590   169083007 (10.20.0.127)
>>                  31886   169083516 (10.20.2.124)
>>                   3137   169083092 (10.20.0.212)
>>
>> Notice the first 2 nodes report the same info, but the third node is
>> reporting PID 7036 on 169083092. Logging into that box, there is no such
>> process running.
>>
>> I have a capture of the corosync-blackbox data from all 3 nodes. Can
>> provide if needed.
>>
>> corosync 2.3.2
>> libqb 0.16.0
>>
>> I'll leave the nodes like this for a few hours if anyone responds and
>> wants additional information. After that I'm going to bounce corosync to
>> get everything running again.
>>
>> -Patrick
>>
>>
>>
>> _______________________________________________
>> discuss mailing list
>> [email protected]
>> http://lists.corosync.org/mailman/listinfo/discuss
>>
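
For anyone wanting to spot this kind of mismatch without eyeballing three listings, here is a rough sketch of how the per-node corosync-cpgtool output quoted above could be compared. It assumes the text layout shown in this thread (a group name on its own line, followed by "PID NodeID (IP)" rows). The helper names parse_cpgtool, extra_members and nodeid_to_ip are made up for illustration and are not part of corosync. The sample data is only the hapi group from two of the nodes above. In this output each Node ID also happens to equal the 32-bit integer form of the IP printed next to it, which the last helper shows.

```python
import ipaddress
import re

# Matches a member row such as "17891 169083092 (10.20.0.212)".
MEMBER_RE = re.compile(r"^\s*(\d+)\s+(\d+)(?:\s+\(([^)]*)\))?\s*$")


def parse_cpgtool(text):
    """Parse cpgtool-style text into {group: {(pid, nodeid), ...}}."""
    groups = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("Group Name"):
            continue  # skip blank lines and the column header
        match = MEMBER_RE.match(line)
        if match:
            if current is not None:
                groups[current].add((int(match.group(1)), int(match.group(2))))
        else:
            # Anything that is not a member row is treated as a group name.
            current = stripped
            groups.setdefault(current, set())
    return groups


def extra_members(views):
    """Return {node: {group: members}} for members no other node reports."""
    result = {}
    for node, view in views.items():
        for group, members in view.items():
            seen_elsewhere = set()
            for other_node, other_view in views.items():
                if other_node != node:
                    seen_elsewhere |= other_view.get(group, set())
            only_here = members - seen_elsewhere
            if only_here:
                result.setdefault(node, {})[group] = only_here
    return result


def nodeid_to_ip(nodeid):
    """Interpret a node ID as a 32-bit IPv4 address (true for the IDs in this thread)."""
    return str(ipaddress.IPv4Address(nodeid))


# Sample data: the hapi group as reported by two of the nodes in this thread.
VIEW_124 = """Group Name PID Node ID
hapi
 17837 169083092 (10.20.0.212)
 21717 169083516 (10.20.2.124)
"""

VIEW_127 = """Group Name PID Node ID
hapi
 7036 169083092 (10.20.0.212)
 21717 169083516 (10.20.2.124)
 17837 169083092 (10.20.0.212)
"""

views = {
    "10.20.2.124": parse_cpgtool(VIEW_124),
    "10.20.0.127": parse_cpgtool(VIEW_127),
}

for node, groups in extra_members(views).items():
    for group, members in groups.items():
        for pid, nodeid in sorted(members):
            print(node, "reports extra member in", group, ":", pid, nodeid_to_ip(nodeid))
```

This only flags entries that a single node reports; it cannot tell which view is actually correct, so the PID still has to be checked on the box itself, as was done above.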
