Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

Jan Friesse Mon, 16 Oct 2017 08:00:59 -0700

Jonathan,



On 13/10/17 17:24, Jan Friesse wrote:

I've done a bit of digging and am getting closer to the root cause of
the race.

We rely on having votequorum_sync_init called twice -- once when node 1
joins (with member_list_entries=2) and once when node 1 leaves (with
member_list_entries=1). This is important because votequorum_sync_init
marks nodes as NODESTATE_DEAD if they are not in quorum_members[] -- so
it needs to have seen the node appear then disappear. This is important
because get_total_votes only counts votes from nodes in state
NODESTATE_MEMBER.


So there are basically two problems.

Actually first (main) problem is that votequorum_sync_init is ever
called when that node joins. It really shouldn't. And problem is
simply because calling api->shutdown_request() is not enough. Can you
try replace it with exit(1) (for testing) and reproduce the problem?
I'm pretty sure problem disappears.


No, the problem still happens :-(


Not good.


I am using the following patch:

diff --git a/exec/cmap.c b/exec/cmap.c
index de730d2..1125cef 100644
--- a/exec/cmap.c
+++ b/exec/cmap.c
@@ -406,7 +406,7 @@ static void cmap_sync_activate (void)
                 log_printf(LOGSYS_LEVEL_ERROR,
                     "Received config version (%"PRIu64") is different
than my config version (%"PRIu64")! Exiting",
                     cmap_highest_config_version_received,
cmap_my_config_version);
-               api->shutdown_request();
+               exit(1);
                 return ;
         }
  }
diff --git a/exec/main.c b/exec/main.c
index b0d5639..4fd3e68 100644
--- a/exec/main.c
+++ b/exec/main.c
@@ -627,6 +627,7 @@ static void deliver_fn (
                         ((void *)msg);
         }

+       log_printf(LOGSYS_LEVEL_NOTICE, "executing '%s' exec_handler_fn
%p for node %d (fn %d)", corosync_service[service]->name,
corosync_service[service]->exec_engine[fn_id].exec_handler_fn, nodeid,
fn_id);
         corosync_service[service]->exec_engine[fn_id].exec_handler_fn
                 (msg, nodeid);
  }
diff --git a/exec/votequorum.c b/exec/votequorum.c
index 1a97c6d..7c0f34f 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -2099,6 +2100,7 @@ static void
message_handler_req_exec_votequorum_nodeinfo (
         node->flags = req_exec_quorum_nodeinfo->flags;
         node->votes = req_exec_quorum_nodeinfo->votes;
         node->state = NODESTATE_MEMBER;
+       log_printf(LOGSYS_LEVEL_NOTICE,
"message_handler_req_exec_votequorum_nodeinfo (%p) marking node %d as
MEMBER", message_handler_req_exec_votequorum_nodeinfo, nodeid);

         if (node->flags & NODE_FLAGS_LEAVING) {
                 node->state = NODESTATE_LEAVING;

When it's working correctly I see this:

1508151960.072927 notice  [TOTEM ] A new membership (10.71.218.17:2304)
was formed. Members joined: 1
1508151960.073082 notice  [SYNC  ] calling sync_init on service
'corosync configuration map access' (0) with my_member_list_entries = 2
1508151960.073150 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x55b5eb504ca0 for node 1 (fn 0)
1508151960.073197 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x55b5eb504ca0 for node 2 (fn 0)
1508151960.073238 notice  [SYNC  ] calling sync_init on service
'corosync cluster closed process group service v1.01' (1) with
my_member_list_entries = 2
1508151961.073033 notice  [TOTEM ] A processor failed, forming new
configuration.

When it's not working correctly I see this:

1508151908.447584 notice  [TOTEM ] A new membership (10.71.218.17:2292)
was formed. Members joined: 1
1508151908.447757 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x558b39fbbaa0 for node 1 (fn 0)
1508151908.447866 notice  [VOTEQ ]
message_handler_req_exec_votequorum_nodeinfo (0x558b39fbbaa0) marking
node 1 as MEMBER
1508151908.447972 notice  [VOTEQ ] get_total_votes: node 1 is a MEMBER
so counting vote
1508151908.448045 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER
so counting vote
1508151908.448091 notice  [QUORUM] This node is within the primary
component and will provide service.
1508151908.448134 notice  [QUORUM] Members[1]: 2
1508151908.448175 notice  [SYNC  ] calling sync_init on service
'corosync configuration map access' (0) with my_member_list_entries = 2
1508151908.448205 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x558b39fb3ca0 for node 1 (fn 0)
1508151908.448247 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x558b39fb3ca0 for node 2 (fn 0)
1508151908.448307 notice  [SYNC  ] calling sync_init on service
'corosync cluster closed process group service v1.01' (1) with
my_member_list_entries = 2
1508151909.447182 notice  [TOTEM ] A processor failed, forming new
configuration.

... and at that point I already see "Total votes: 2" in the
corosync-quorumtool output.

The key difference seems to be whether votequorum's nodeinfo
exec_handler_fn is called. If it is called, it marks the node as
NODESTATE_MEMBER, which is sufficient to trigger the bug because it is
never subsequently marked as NODESTATE_DEAD.

(Note that in both cases there is now a one-second delay before the "A
processor failed" message. This is due to calling exit(1) rather than
api->shutdown_request().)

When it goes wrong, I see that votequorum_sync_init is only called
*once* (with member_list_entries=1) -- after node 1 has joined and left.
So it never sees node 1 in member_list, hence never marks it as
NODESTATE_DEAD. But message_handler_req_exec_votequorum_nodeinfo has
indepedently marked the node as NODESTATE_MEMBER, hence get_total_votes
counts it and quorate is set to 1.


This is second problem and it has to be also fixed. You have to be
pretty lucky to reproduce it so often.

Anyway, here is theory:
- node 1 calls api->shutdown_request() and continue processing (this
is problem 1)
- both nodes gets to a state where they should call
votequorum_sync_init, but node 2 is now not scheduled (as I said, it
must be pretty luck)
- node 1 calls votequorum_sync_init and votequorum_sync_process
(sending nodeinfo) and it's shutdown
- node 2 gets nodeinfo
- node 2 sees node 1 shutdown
- node 2 calls votequorum_sync_init and votequorum_sync_process

If this theory is true, we must probably fix the sync.c to have 2 or
ideally 3-4 barriers instead of 1.


So why is votequorum_sync_init sometimes only called once? It looks like
it's all down to whether we manage to iterate through all the calls to
schedwrk_processor before entering the OPERATIONAL state. I haven't yet
looked into exactly what controls the timing of these two things.

Adding the following patch helps me to demonstrate the problem more
clearly:

diff --git a/exec/sync.c b/exec/sync.c
index e7b71bd..a2fb06d 100644
--- a/exec/sync.c
+++ b/exec/sync.c
@@ -544,6 +545,7 @@ static int schedwrk_processor (const void *context)
                 }

                 if
(my_sync_callbacks_retrieve(my_service_list[my_processing_idx].service_id,

NULL) != -1) {
+                       log_printf(LOGSYS_LEVEL_NOTICE, "calling
sync_init on service '%s' (%d) with my_member_list_entries = %d",
my_service_list[my_processing_idx].name, my_processing_idx,
my_member_list_entries);
                         my_service_list[my_processing_idx].sync_init
(my_trans_list,
                                 my_trans_list_entries, my_member_list,
                                 my_member_list_entries,
diff --git a/exec/votequorum.c b/exec/votequorum.c
index d5f06c1..aab6c15 100644
--- a/exec/votequorum.c
+++ b/exec/votequorum.c
@@ -2336,6 +2353,8 @@ static void votequorum_sync_init (
         int left_nodes;
         struct cluster_node *node;

+       log_printf(LOGSYS_LEVEL_NOTICE, "votequorum_sync_init has %d
member_list_entries", member_list_entries);
+
         ENTER();

         sync_in_progress = 1;

When it works correctly I see the following (selected log lines):

notice  [TOTEM ] A new membership (10.71.218.17:2016) was formed.
Members joined: 1
notice  [SYNC  ] calling sync_init on service 'corosync configuration
map access' (0) with my_member_list_entries = 2
notice  [SYNC  ] calling sync_init on service 'corosync cluster closed
process group service v1.01' (1) with my_member_list_entries = 2
notice  [SYNC  ] calling sync_init on service 'corosync vote quorum
service v1.0' (2) with my_member_list_entries = 2
notice  [VOTEQ ] votequorum_sync_init has 2 member_list_entries
notice  [TOTEM ] A new membership (10.71.218.18:2020) was formed.
Members left: 1
notice  [SYNC  ] calling sync_init on service 'corosync configuration
map access' (0) with my_member_list_entries = 1
notice  [SYNC  ] calling sync_init on service 'corosync cluster closed
process group service v1.01' (1) with my_member_list_entries = 1
notice  [SYNC  ] calling sync_init on service 'corosync vote quorum
service v1.0' (2) with my_member_list_entries = 1
notice  [VOTEQ ] votequorum_sync_init has 1 member_list_entries

  -- Notice that votequorum_sync_init is called once with 2 members and
once with 1 member.

When it goes wrong I see the following (selected log lines):

notice  [TOTEM ] A new membership (10.71.218.17:2004) was formed.
Members joined: 1
notice  [SYNC  ] calling sync_init on service 'corosync configuration
map access' (0) with my_member_list_entries = 2
notice  [SYNC  ] calling sync_init on service 'corosync cluster closed
process group service v1.01' (1) with my_member_list_entries = 2
notice  [TOTEM ] A new membership (10.71.218.18:2008) was formed.
Members left: 1
notice  [SYNC  ] calling sync_init on service 'corosync configuration
map access' (0) with my_member_list_entries = 1
notice  [SYNC  ] calling sync_init on service 'corosync cluster closed
process group service v1.01' (1) with my_member_list_entries = 1
notice  [SYNC  ] calling sync_init on service 'corosync vote quorum
service v1.0' (2) with my_member_list_entries = 1
notice  [VOTEQ ] votequorum_sync_init has 1 member_list_entries

  -- Notice the value of my_member_list_entries in the different
sync_init calls, and that votequorum_sync_init is only called once.

Does this help explain the issue?


It's definitively helpful. I will try to thing about what may be
happening a little deeper. Can you please add similar debug info as
you have for sync_init also to other sync_ and send me the log of both
nodes when everything works and when do not?


Okay. I've rolled back to make the call to api->shutdown_request() on
config_version mismatch, and have added debug lines in sync.c just
before calls to sync_activate, sync_abort and sync_process.

When it's working correctly:

1508158482.991782 notice  [TOTEM ] A new membership (10.71.218.17:2384)
was formed. Members joined: 1
1508158482.991930 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 1 (fn 0)
1508158482.992029 notice  [VOTEQ ]
message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
node 1 as MEMBER
1508158482.992122 notice  [VOTEQ ] get_total_votes: node 1 is a MEMBER
so counting vote
1508158482.992206 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER
so counting vote
1508158482.992263 notice  [QUORUM] This node is within the primary
component and will provide service.
1508158482.992307 notice  [QUORUM] Members[1]: 2
1508158482.992351 notice  [SYNC  ] calling sync_init on service
'corosync configuration map access' (0) with my_member_list_entries = 2
1508158482.992382 notice  [SYNC  ] calling sync_process on service
'corosync configuration map access' (0)
1508158482.992425 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x56492b0cedf0 for node 1 (fn 0)
1508158482.992456 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x56492b0cedf0 for node 2 (fn 0)
1508158482.992498 notice  [SYNC  ] calling sync_activate on service
'corosync configuration map access' (0)
1508158482.992529 notice  [SYNC  ] calling sync_init on service
'corosync cluster closed process group service v1.01' (1) with
my_member_list_entries = 2
1508158482.992572 notice  [SYNC  ] calling sync_process on service
'corosync cluster closed process group service v1.01' (1)
1508158482.992603 notice  [MAIN  ] executing 'corosync cluster closed
process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 2
(fn 5)
1508158482.992646 notice  [MAIN  ] executing 'corosync cluster closed
process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 1
(fn 5)
1508158482.992677 notice  [SYNC  ] calling sync_activate on service
'corosync cluster closed process group service v1.01' (1)
1508158482.992720 notice  [SYNC  ] calling sync_init on service
'corosync vote quorum service v1.0' (2) with my_member_list_entries = 2
1508158482.992750 notice  [VOTEQ ] votequorum_sync_init has 2
member_list_entries
1508158482.992792 notice  [SYNC  ] calling sync_process on service
'corosync vote quorum service v1.0' (2)
1508158482.992850 notice  [TOTEM ] A new membership (10.71.218.18:2388)
was formed. Members left: 1
1508158482.992895 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
1508158482.992938 notice  [VOTEQ ]
message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
node 2 as MEMBER
1508158482.992968 notice  [VOTEQ ] get_total_votes: node 1 is a MEMBER
so counting vote
1508158482.993011 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER
so counting vote
1508158482.993042 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
1508158482.993084 notice  [SYNC  ] calling sync_init on service
'corosync configuration map access' (0) with my_member_list_entries = 1
1508158482.993115 notice  [SYNC  ] calling sync_process on service
'corosync configuration map access' (0)
1508158482.993157 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x56492b0cedf0 for node 2 (fn 0)
1508158482.993188 notice  [SYNC  ] calling sync_activate on service
'corosync configuration map access' (0)
1508158482.993231 notice  [SYNC  ] calling sync_init on service
'corosync cluster closed process group service v1.01' (1) with
my_member_list_entries = 1
1508158482.993261 notice  [SYNC  ] calling sync_process on service
'corosync cluster closed process group service v1.01' (1)
1508158482.993304 notice  [MAIN  ] executing 'corosync cluster closed
process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 2
(fn 5)
1508158482.993335 notice  [SYNC  ] calling sync_activate on service
'corosync cluster closed process group service v1.01' (1)
1508158482.993378 notice  [SYNC  ] calling sync_init on service
'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
1508158482.993409 notice  [VOTEQ ] votequorum_sync_init has 1
member_list_entries
1508158482.993452 notice  [VOTEQ ] votequorum_sync_init marking node 1
as DEAD
1508158482.993482 notice  [SYNC  ] calling sync_process on service
'corosync vote quorum service v1.0' (2)
1508158482.993525 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
1508158482.993555 notice  [VOTEQ ]
message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
node 2 as MEMBER
1508158482.993598 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER
so counting vote
1508158482.993628 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
1508158482.993673 notice  [SYNC  ] calling sync_activate on service
'corosync vote quorum service v1.0' (2)
1508158482.993703 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER
so counting vote
1508158482.993746 notice  [QUORUM] This node is within the non-primary
component and will NOT provide any services.
1508158482.993776 notice  [QUORUM] Members[1]: 2
1508158482.993818 notice  [MAIN  ] Completed service synchronization,
ready to provide service.

When it's working incorrectly:

1508158493.905246 notice  [TOTEM ] A new membership (10.71.218.17:2392)
was formed. Members joined: 1
1508158493.905392 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 1 (fn 0)
1508158493.905522 notice  [VOTEQ ]
message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
node 1 as MEMBER
1508158493.905623 notice  [VOTEQ ] get_total_votes: node 1 is a MEMBER
so counting vote
1508158493.905697 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER
so counting vote
1508158493.905742 notice  [QUORUM] This node is within the primary
component and will provide service.
1508158493.905773 notice  [QUORUM] Members[1]: 2
1508158493.905815 notice  [SYNC  ] calling sync_init on service
'corosync configuration map access' (0) with my_member_list_entries = 2
1508158493.905889 notice  [SYNC  ] calling sync_process on service
'corosync configuration map access' (0)
1508158493.905921 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x56492b0cedf0 for node 1 (fn 0)
1508158493.905951 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x56492b0cedf0 for node 2 (fn 0)
1508158493.905993 notice  [SYNC  ] calling sync_activate on service
'corosync configuration map access' (0)
1508158493.906023 notice  [SYNC  ] calling sync_init on service
'corosync cluster closed process group service v1.01' (1) with
my_member_list_entries = 2
1508158493.906066 notice  [SYNC  ] calling sync_process on service
'corosync cluster closed process group service v1.01' (1)
1508158493.906096 notice  [MAIN  ] executing 'corosync cluster closed
process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 2
(fn 5)
1508158493.906138 notice  [TOTEM ] A new membership (10.71.218.18:2396)
was formed. Members left: 1
1508158493.906168 notice  [SYNC  ] calling sync_init on service
'corosync configuration map access' (0) with my_member_list_entries = 1
1508158493.906210 notice  [SYNC  ] calling sync_process on service
'corosync configuration map access' (0)
1508158493.906240 notice  [MAIN  ] executing 'corosync configuration map
access' exec_handler_fn 0x56492b0cedf0 for node 2 (fn 0)
1508158493.906282 notice  [SYNC  ] calling sync_activate on service
'corosync configuration map access' (0)
1508158493.906312 notice  [SYNC  ] calling sync_init on service
'corosync cluster closed process group service v1.01' (1) with
my_member_list_entries = 1
1508158493.906354 notice  [SYNC  ] calling sync_process on service
'corosync cluster closed process group service v1.01' (1)
1508158493.906384 notice  [MAIN  ] executing 'corosync cluster closed
process group service v1.01' exec_handler_fn 0x56492b0d2940 for node 2
(fn 5)
1508158493.906427 notice  [SYNC  ] calling sync_activate on service
'corosync cluster closed process group service v1.01' (1)
1508158493.906457 notice  [SYNC  ] calling sync_init on service
'corosync vote quorum service v1.0' (2) with my_member_list_entries = 1
1508158493.906499 notice  [VOTEQ ] votequorum_sync_init has 1
member_list_entries
1508158493.906528 notice  [SYNC  ] calling sync_process on service
'corosync vote quorum service v1.0' (2)
1508158493.906570 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
1508158493.906600 notice  [VOTEQ ]
message_handler_req_exec_votequorum_nodeinfo (0x56492b0d6bf0) marking
node 2 as MEMBER
1508158493.906642 notice  [VOTEQ ] get_total_votes: node 1 is a MEMBER
so counting vote
1508158493.906672 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER
so counting vote
1508158493.906714 notice  [MAIN  ] executing 'corosync vote quorum
service v1.0' exec_handler_fn 0x56492b0d6bf0 for node 2 (fn 0)
1508158493.906743 notice  [SYNC  ] calling sync_activate on service
'corosync vote quorum service v1.0' (2)
1508158493.906785 notice  [VOTEQ ] get_total_votes: node 1 is a MEMBER
so counting vote
1508158493.906815 notice  [VOTEQ ] get_total_votes: node 2 is a MEMBER
so counting vote
1508158493.906856 notice  [QUORUM] Members[1]: 2
1508158493.906886 notice  [MAIN  ] Completed service synchronization,
ready to provide service.

Now it seems that sync calls votequorum's exec_handler_fn in both
working and non-working cases. So now it relies on votequorum_sync_init
being called once with both members and once without node 1 in order to
set the node as NODESTATE_DEAD.


votequorum's exec_handler_fn is probably executed because of icmap events.

Can you please try to set debug to trace (so config file will look like

logging {
  to_syslog: yes
  to_stderr: yes
  debug: trace
}

and send full log (so including "notice [MAIN ] Corosync ClusterEngine ('2.4.2.45-2fbb'): started and ready to provide service.") fromboth nodes with description which one is which (ip + nodeid) for failingcase?


Thanks,
  Honza


Thanks,
Jonathan



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

Reply via email to