I have not been able to reproduce the problem so far. I even introduced a delay
in sending an internal message from one of the nodes in the Totem config.
change callback to increase the window of opportunity for the message
misdelivery to occur, but even with that it doesn't seem to happen.
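The delay itself was trivial: a sleep in the Totem config. change handler
before the internal broadcast, something like this (simplified; this assumes
the CPG_MODEL_V1 totem confchg callback, and local_nodeid, delayed_nodeid and
broadcast_internal_message() stand in for our code):

    /* needs <corosync/cpg.h> and <unistd.h> */
    static void totem_confchg_cb(cpg_handle_t handle,
                                 struct cpg_ring_id ring_id,
                                 uint32_t member_list_entries,
                                 const uint32_t *member_list)
    {
            if (local_nodeid == delayed_nodeid)
                    sleep(2);       /* widen the suspected race window */
            broadcast_internal_message(handle);
    }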
However, I did run into another problem. I noticed that there are times when
you get a CPG config. change callback without a Totem config. change callback.
When multiple nodes join or leave a cluster, you get one CPG callback per
member that left or joined. The way we determine whether all CPG callbacks are
in is to compare the membership list from CPG to the one returned by the Totem
callback; only when they match do we conclude that there has been a change in
cluster membership. If we miss a Totem config. change here, the CPG callbacks
are essentially ignored, making a subsequent config. change appear spurious.
And since this happens on one set of nodes and not the other when a cluster
splits into two, the two halves get out of sync.
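The check itself is simple. In sketch form (simplified from our code; it
assumes one CPG member per node and both lists kept sorted by nodeid):

    /* needs <corosync/cpg.h> */
    /* Called from the CPG confchg callback: compare the CPG member list
     * against the node list from the most recent Totem confchg callback.
     * Only when the two agree do we treat the membership change as
     * complete; until then the CPG callbacks just accumulate. */
    static int membership_settled(const struct cpg_address *cpg_members,
                                  size_t cpg_count,
                                  const uint32_t *totem_members,
                                  size_t totem_count)
    {
            size_t i;

            if (cpg_count != totem_count)
                    return 0;
            for (i = 0; i < cpg_count; i++) {
                    if (cpg_members[i].nodeid != totem_members[i])
                            return 0;
            }
            return 1;
    }

If the Totem callback never arrives, totem_members still holds the old list
and the comparison never succeeds, which is how the two halves drift apart.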
For this one, I do have the fplay records from all nodes in the cluster. If
someone wants to look at them, I can upload them.
Thanks,
Sathya
On Thursday, October 24, 2013 9:11 PM, sathya bettadapura
<[email protected]> wrote:
I'll try to reproduce it again tomorrow. It happened somewhat accidentally when
I first noticed it. But now that I know what to look for in the logs when it
happens, hopefully, I'll have better luck reproducing it.
Thanks,
Sathya
On Thursday, October 24, 2013 8:55 PM, Steven Dake <[email protected]> wrote:
On 10/24/2013 06:57 PM, sathya bettadapura wrote:
>Any state change in the application happens only upon receipt of a message (or
>a config. change), never merely on queueing it via cpg_mcast(). Upon receipt
>of a config. change, internal messages are broadcast via cpg_mcast() as the
>last thing the callback handler does.
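>In sketch form (simplified; apply_state_change() and build_state_message()
>stand in for our real code):
>
>    /* needs <corosync/cpg.h> and <sys/uio.h> */
>
>    /* Deliver callback: the only place application state is changed. */
>    static void deliver_cb(cpg_handle_t handle,
>                           const struct cpg_name *group_name,
>                           uint32_t nodeid, uint32_t pid,
>                           void *msg, size_t msg_len)
>    {
>            apply_state_change(msg, msg_len);   /* act on receipt */
>    }
>
>    /* CPG confchg callback: no direct state change here; we originate a
>     * message, and the cpg_mcast_joined() is the last thing we do. */
>    static void confchg_cb(cpg_handle_t handle,
>                           const struct cpg_name *group_name,
>                           const struct cpg_address *members, size_t n_members,
>                           const struct cpg_address *left, size_t n_left,
>                           const struct cpg_address *joined, size_t n_joined)
>    {
>            struct iovec iov;
>
>            build_state_message(&iov, n_members);
>            cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
>    }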
>
>Thanks,
>
> Sathya
>
Well that blows my theory. Do you have a test case you can share?
Regards
-steve
On Thursday, October 24, 2013 5:51 PM, Steven Dake <[email protected]> wrote:
>
>On 10/24/2013 03:31 PM, sathya bettadapura wrote:
>
>Hi All,
>
>
>I am noticing what appears to be an anomaly, so I am just posting here for a
>sanity check.
>
>
>We have a stress test that does frequent network partitioning/reunification to
>exercise code related to node fail-back. We're based on version 1.4.6. We were
>based on 2.x.x until libqb made its way into the core of corosync; as our
>company policy precludes us from using anything but BSD-style licensed
>third-party source code, we had to either rewrite libqb or go back.
>
>
>Let's say we have four nodes A, B, C and D. A and B are on one side of a
>network segment and C and D are on the other. The network can be partitioned
>by pulling a cable connecting the two segments.
>
>
>When there is a configuration change, we need to re-compute application state
>by sending messages to the new members. Such a message identifies the
>originating node and the size of the cluster at the time, and it is logged in
>the application log.
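>In essence the payload is just two fields, something like (field names
>simplified):
>
>    struct state_msg {
>            uint32_t origin_nodeid;   /* node that originated it */
>            uint32_t cluster_size;    /* membership size at origination */
>    };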
>
>When the cluster goes from (A, B, C, D) to (A, B) and (C, D), on the A-B side
>we see a message from A that says "From A, cluster size is 2". Immediately
>thereafter there's another config. change that takes the cluster back to
>(A, B, C, D). Now we see messages from A, C and D saying that the cluster size
>is 4. But we see two messages from B: the first says the cluster size is 2 and
>the second says it's 4. It appears that the message from B sent when the
>cluster size was 2 could not be delivered, as there was a config. change right
>on its heels, and it is being delivered to a configuration different from the
>one in which it originated. Is this expected behaviour?
>
>
>Messages are originated by the totem protocol and ordered according to EVS
>when they are taken off the new message queue and transmitted into the
>network. This is different than queueing a message (via cpg), which is not
>origination. Are you sure you're not confusing origination with cpg_mcast?
>
>Generally the correct way for an application to behave according to EVS is to
>originate all state change messages via the protocol, and act on them when
>received. Some devs tend to change state when they use cpg_mcast rather than
>when a message is delivered. This would result in your example behavior.
>
>Just to clarify, your application only changes state on delivery of a message
>to the cpg application (not on queueing via cpg_mcast)?
>
>Regards
>-steve
>
> Sathya
>
>
>_______________________________________________
>discuss mailing list
>[email protected]
>http://lists.corosync.org/mailman/listinfo/discuss