OK, I've uploaded the data to S3; links are below.
There shouldn't have been any splits. We haven't had any network
interruptions that I am aware of.
I bounced corosync on the 10.20.0.127 node and everything cleared up.
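
For anyone who wants to check for the same discrepancy, here is a rough
sketch of how the per-node corosync-cpgtool output (quoted further down)
could be compared. It is only an illustration: parse_cpgtool,
extra_members and nodeid_to_ip are hypothetical helper names, the parser
assumes the three-column layout shown in this thread, and the node ID to
IP conversion assumes the node ID is the 32-bit form of the IPv4 address,
which is what the IDs in this thread happen to be.

```python
import ipaddress
import re

# A member line, once stripped, looks like: "17891     169083092 (10.20.0.212)"
MEMBER_RE = re.compile(r"^(\d+)\s+(\d+)\s+\(([^)]*)\)$")


def parse_cpgtool(text):
    """Parse corosync-cpgtool output into {group_name: {(pid, nodeid), ...}}."""
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, the column header, and "# host # command" prompt lines.
        if not line or line.startswith("Group Name") or line.startswith("#"):
            continue
        match = MEMBER_RE.match(line)
        if match:
            if current is not None:
                groups[current].add((int(match.group(1)), int(match.group(2))))
        else:
            # Anything else is treated as a group name (assumption).
            current = line
            groups.setdefault(current, set())
    return groups


def extra_members(reference, suspect):
    """Return members that `suspect` lists but `reference` does not, per group."""
    diff = {}
    for group, members in suspect.items():
        extra = members - reference.get(group, set())
        if extra:
            diff[group] = extra
    return diff


def nodeid_to_ip(nodeid):
    """Render a node ID as dotted-quad, assuming it encodes an IPv4 address."""
    return str(ipaddress.IPv4Address(nodeid))


# Sample data copied from the "hapi" group in this thread.
HEALTHY = """Group Name           PID       Node ID
hapi
             17837     169083092 (10.20.0.212)
             21717     169083516 (10.20.2.124)
"""

SUSPECT = """Group Name           PID       Node ID
hapi
              7036     169083092 (10.20.0.212)
             21717     169083516 (10.20.2.124)
             17837     169083092 (10.20.0.212)
"""
```

With the sample data above, extra_members(parse_cpgtool(HEALTHY),
parse_cpgtool(SUSPECT)) singles out PID 7036 on node 169083092, and
nodeid_to_ip(169083092) gives 10.20.0.212, matching what cpgtool prints
in parentheses.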

As this occurred in our development environment, there is a ton of
background noise, so I'm unable to pinpoint exactly when the issue
started, but I noticed it around 2014-02-07 01:00 GMT.

blackbox:
https://s3.amazonaws.com/cloudcom-cliff-misc/corosync-blackbox.10.20.0.127.gz
core:
https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.core.26622.gz
log:
https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.log.gz

Thanks

-Patrick

------------------------------------------------------------------------
*From: *Jan Friesse <[email protected]>
*Sent: * 2014-02-07 03:24:36 E
*To: *Patrick Hemmer <[email protected]>, [email protected]
*Subject: *Re: [corosync] CPG reporting group member that doesn't exist

> Patrick,
> The blackbox may be useful, and the log may also help us trace what
> happened. This looks like some kind of problem when corosync nodes split
> and then join again... anyway, it's weird and looks like a bug. Another
> helpful thing would be a coredump of corosync from the affected node
> (so 10.20.0.127) to ensure it is not a memory corruption problem.
>
> Regards,
>   Honza
>
>
> Patrick Hemmer wrote:
>> I've currently got a 3-node cluster with several processes on each box
>> using CPG. CPG on one of the boxes is reporting a group member that
>> doesn't exist.
>>
>> # 10.20.2.124 # corosync-cpgtool
>> Group Name           PID       Node ID
>> r53clip
>>              17891     169083092 (10.20.0.212)
>>              21792     169083516 (10.20.2.124)
>> hapi
>>              17837     169083092 (10.20.0.212)
>>              21717     169083516 (10.20.2.124)
>> arbiter
>>              21590     169083007 (10.20.0.127)
>>              31886     169083516 (10.20.2.124)
>>               3137     169083092 (10.20.0.212)
>>
>>
>> # 10.20.0.212 # corosync-cpgtool
>> Group Name           PID       Node ID
>> r53clip
>>              17891     169083092 (10.20.0.212)
>>              21792     169083516 (10.20.2.124)
>> hapi
>>              17837     169083092 (10.20.0.212)
>>              21717     169083516 (10.20.2.124)
>> arbiter
>>              21590     169083007 (10.20.0.127)
>>              31886     169083516 (10.20.2.124)
>>               3137     169083092 (10.20.0.212)
>>
>>
>> # 10.20.0.127 # corosync-cpgtool
>> Group Name           PID       Node ID
>> r53clip
>>              17891     169083092 (10.20.0.212)
>>              21792     169083516 (10.20.2.124)
>> hapi
>>               7036     169083092 (10.20.0.212)
>>              21717     169083516 (10.20.2.124)
>>              17837     169083092 (10.20.0.212)
>> arbiter
>>              21590     169083007 (10.20.0.127)
>>              31886     169083516 (10.20.2.124)
>>               3137     169083092 (10.20.0.212)
>>
>> Notice that the first two nodes report the same info, but the third node
>> (10.20.0.127) reports an extra member of the hapi group: PID 7036 on
>> node 169083092 (10.20.0.212). When I log into that box, there is no
>> such process running.
>>
>> I have a capture of the corosync-blackbox data from all 3 nodes, and
>> can provide it if needed.
>>
>> corosync 2.3.2
>> libqb 0.16.0
>>
>> I'll leave the nodes like this for a few hours if anyone responds and
>> wants additional information. After that I'm going to bounce corosync to
>> get everything running again.
>>
>> -Patrick
>>
>>
>>
>> _______________________________________________
>> discuss mailing list
>> [email protected]
>> http://lists.corosync.org/mailman/listinfo/discuss
>>
