Jonathan Davies wrote:


On 12/10/17 11:54, Jan Friesse wrote:
I'm on corosync-2.3.4 plus my patch

Finally noticed ^^^ 2.3.4 is really old and, unless it is a patched version, I
wouldn't recommend using it. Can you give current needle a try?

I was mistaken in thinking I was on 2.3.4. Actually I am on the version
from CentOS 7.4, which is 2.4.0 plus patches.

Ok, so this is basically like needle.


I will try to reproduce it with needle.

But often at this point, cluster1's disappearance is not reflected in
the votequorum info on cluster2:

... Is this permanent (= until a new node joins/leaves), or will it fix
itself over a (short) time? If it is permanent, it's a bug. If it fixes
itself, it's a result of votequorum not being virtually synchronous.

Yes, it's permanent. After several minutes of waiting, votequorum still
reports "total votes: 2" even though there's only one member.


That's bad. I've tried the following setup:

- Both nodes with current needle
- Your config
- Second node is just running corosync
- First node is running the following command:
   while true; do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1; done

I've been running it for quite a while and I'm unable to reproduce the bug.
Sadly, I can't reproduce it even with 2.3.4. Do you think that reproducer
is correct?

Yes, that's similar enough to what I'm doing. The bug happens about 50%
of the time for me, so I trigger it manually rather than needing a loop.
I'm not sure why you can't reproduce it.

That's really strange. Chrissie is also unable to reproduce it, so we may be missing some part of the puzzle.


I've done a bit of digging and am getting closer to the root cause of
the race.

We rely on votequorum_sync_init being called twice -- once when node 1
joins (with member_list_entries=2) and once when node 1 leaves (with
member_list_entries=1). This is important because votequorum_sync_init
marks nodes as NODESTATE_DEAD if they are not in quorum_members[] -- so
it needs to have seen the node appear and then disappear. This matters
because get_total_votes only counts votes from nodes in state
NODESTATE_MEMBER.
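
To make that interaction concrete, here is a minimal standalone sketch of
the logic as I understand it (this is not the actual corosync code; the
struct layout, the enum values and the helper names
mark_missing_nodes_dead/total_member_votes are simplified assumptions for
illustration only):

#include <stddef.h>

enum node_state { NODESTATE_MEMBER = 1, NODESTATE_DEAD = 2 }; /* values illustrative */

struct cluster_node {
        unsigned int node_id;
        enum node_state state;
        unsigned int votes;
};

/* Sketch of what votequorum_sync_init effectively does: anything that is
 * not in the new member list gets marked NODESTATE_DEAD.  If sync_init is
 * never called with the 2-node list followed by the 1-node list, this
 * marking never happens for the departed node. */
static void mark_missing_nodes_dead(struct cluster_node *nodes, size_t n_nodes,
                                    const unsigned int *member_list,
                                    size_t member_list_entries)
{
        size_t i, j;

        for (i = 0; i < n_nodes; i++) {
                int found = 0;

                for (j = 0; j < member_list_entries; j++) {
                        if (nodes[i].node_id == member_list[j]) {
                                found = 1;
                                break;
                        }
                }
                if (!found) {
                        nodes[i].state = NODESTATE_DEAD;
                }
        }
}

/* Sketch of the get_total_votes side: only NODESTATE_MEMBER nodes count,
 * so a node that was never marked DEAD keeps contributing its vote. */
static unsigned int total_member_votes(const struct cluster_node *nodes, size_t n_nodes)
{
        size_t i;
        unsigned int total = 0;

        for (i = 0; i < n_nodes; i++) {
                if (nodes[i].state == NODESTATE_MEMBER) {
                        total += nodes[i].votes;
                }
        }
        return total;
}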

So there are basically two problems.

Actually, the first (main) problem is that votequorum_sync_init is called at all when that node joins. It really shouldn't be. And the problem is simply that calling api->shutdown_request() is not enough. Can you try replacing it with exit(1) (for testing) and see if you can still reproduce the problem? I'm pretty sure the problem disappears.
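
To be clear, the experiment I mean is just something like this at the place
where the shutdown is requested (illustrative fragment only -- the exact
call site depends on your tree, and exit() needs <stdlib.h>):

/* before (what the code does today) */
api->shutdown_request();

/* after (for the test only): die immediately instead of requesting an
 * orderly shutdown */
exit(1);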


When it goes wrong, I see that votequorum_sync_init is only called
*once* (with member_list_entries=1) -- after node 1 has joined and left.
So it never sees node 1 in member_list, hence never marks it as
NODESTATE_DEAD. But message_handler_req_exec_votequorum_nodeinfo has
independently marked the node as NODESTATE_MEMBER, hence get_total_votes
counts it and quorate is set to 1.

This is the second problem, and it also has to be fixed. You have to be pretty lucky to reproduce it so often.

Anyway, here is a theory:
- node 1 calls api->shutdown_request() and continues processing (this is problem 1)
- both nodes get to a state where they should call votequorum_sync_init, but node 2 is not scheduled at that moment (as I said, this takes quite a bit of luck)
- node 1 calls votequorum_sync_init and votequorum_sync_process (sending nodeinfo) and then shuts down
- node 2 gets the nodeinfo
- node 2 sees node 1 shut down
- node 2 calls votequorum_sync_init and votequorum_sync_process

If this theory is true, we probably have to fix sync.c to have 2 or ideally 3-4 barriers instead of 1.
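
Just to illustrate what I mean by more barriers, here is a rough schematic
(this is not sync.c code; the phase names and the
wait_for_all_members_at_barrier()/call_*_for_all_services() helpers are
purely hypothetical placeholders):

enum sync_phase {
        SYNC_PHASE_ENTER,       /* all members agree the new membership exists     */
        SYNC_PHASE_INIT,        /* all members have run the sync_init callbacks    */
        SYNC_PHASE_PROCESS,     /* all members have run the sync_process callbacks */
        SYNC_PHASE_ACTIVATE     /* all members activate the new configuration      */
};

/* Hypothetical helpers, declared only to make the sketch self-contained. */
void wait_for_all_members_at_barrier(enum sync_phase phase);
void call_sync_init_for_all_services(void);
void call_sync_process_for_all_services(void);
void call_sync_activate_for_all_services(void);

/* With a barrier after each phase, a node cannot run its sync_init and
 * sync_process and then disappear before its peers have even entered the
 * same phase -- which is exactly the window the theory above describes. */
static void run_sync(void)
{
        wait_for_all_members_at_barrier(SYNC_PHASE_ENTER);

        call_sync_init_for_all_services();
        wait_for_all_members_at_barrier(SYNC_PHASE_INIT);

        call_sync_process_for_all_services();
        wait_for_all_members_at_barrier(SYNC_PHASE_PROCESS);

        call_sync_activate_for_all_services();
        wait_for_all_members_at_barrier(SYNC_PHASE_ACTIVATE);
}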


So why is votequorum_sync_init sometimes only called once? It looks like
it's all down to whether we manage to iterate through all the calls to
schedwrk_processor before entering the OPERATIONAL state. I haven't yet
looked into exactly what controls the timing of these two things.

Adding the following patch helps me to demonstrate the problem more
clearly:

diff --git a/exec/sync.c b/exec/sync.c
index e7b71bd..a2fb06d 100644
--- a/exec/sync.c
+++ b/exec/sync.c
@@ -544,6 +545,7 @@ static int schedwrk_processor (const void *context)
                 }

                 if (my_sync_callbacks_retrieve(my_service_list[my_processing_idx].service_id, NULL) != -1) {
+                       log_printf(LOGSYS_LEVEL_NOTICE, "calling sync_init on service '%s' (%d) with my_member_list_entries = %d", my_service_list[my_processing_idx].name, my_processing_idx, my_member_list_entries);
                         my_service_list[my_processing_idx].sync_init (my_trans_list,
                                 my_trans_list_entries, my_member_list,
                                 my_member_list_entries,
diff --git a/exec/votequorum.c b/exec/votequorum.c
index d5f06c1..aab6c15 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -2336,6 +2353,8 @@ static void votequorum_sync_init (
         int left_nodes;
         struct cluster_node *node;

+       log_printf(LOGSYS_LEVEL_NOTICE, "votequorum_sync_init has %d member_list_entries", member_list_entries);
+
         ENTER();

         sync_in_progress = 1;

When it works correctly I see the following (selected log lines):

notice  [TOTEM ] A new membership (10.71.218.17:2016) was formed. Members joined: 1
notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
notice  [SYNC  ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 2
notice  [VOTEQ ] votequorum_sync_init has 2 member_list_entries
notice  [TOTEM ] A new membership (10.71.218.18:2020) was formed. Members left: 1
notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 1
notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 1
notice  [SYNC  ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
notice  [VOTEQ ] votequorum_sync_init has 1 member_list_entries

  -- Notice that votequorum_sync_init is called once with 2 members and
once with 1 member.

When it goes wrong I see the following (selected log lines):

notice  [TOTEM ] A new membership (10.71.218.17:2004) was formed. Members joined: 1
notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 2
notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 2
notice  [TOTEM ] A new membership (10.71.218.18:2008) was formed. Members left: 1
notice  [SYNC  ] calling sync_init on service 'corosync configuration map access' (0) with my_member_list_entries = 1
notice  [SYNC  ] calling sync_init on service 'corosync cluster closed process group service v1.01' (1) with my_member_list_entries = 1
notice  [SYNC  ] calling sync_init on service 'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
notice  [VOTEQ ] votequorum_sync_init has 1 member_list_entries

  -- Notice the value of my_member_list_entries in the different
sync_init calls, and that votequorum_sync_init is only called once.

Does this help explain the issue?

It's definitely helpful. I will try to think a little more deeply about what may be happening. Can you please add debug info similar to what you have for sync_init to the other sync_ callbacks as well, and send me the logs from both nodes, both when everything works and when it does not?
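
Something along these lines would be enough (again just a sketch of
fragments to paste in, not a ready patch; the exact hunks depend on your
tree, and I'm assuming the other votequorum sync_ handlers are
votequorum_sync_process and votequorum_sync_activate):

/* In exec/sync.c, a message similar to the existing sync_init one, at the
 * places where the sync_process and sync_activate callbacks are invoked,
 * for example: */
log_printf(LOGSYS_LEVEL_NOTICE, "calling sync_process on service '%s' (%d)",
        my_service_list[my_processing_idx].name, my_processing_idx);

/* In exec/votequorum.c, at the top of the other sync_ handlers: */
log_printf(LOGSYS_LEVEL_NOTICE, "votequorum_sync_process called");
log_printf(LOGSYS_LEVEL_NOTICE, "votequorum_sync_activate called");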

Honza


Thanks,
Jonathan

