Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-11-15 Thread Jan Friesse
On 13/11/17 17:06, Jan Friesse wrote: Jonathan, I've finished (I hope) proper fix for problem you've seen, so can you please try to test https://github.com/corosync/corosync/pull/280 Thanks, Honza Hi Honza, Hi Jonathan, Thanks very much for putting this fix together. I'm happy to

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-11-14 Thread Jonathan Davies
On 13/11/17 17:06, Jan Friesse wrote: Jonathan, I've finished (I hope) proper fix for problem you've seen, so can you please try to test https://github.com/corosync/corosync/pull/280 Thanks, Honza Hi Honza, Thanks very much for putting this fix together. I'm happy to confirm that I

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-11-13 Thread Jan Friesse
Jonathan, I've finished (I hope) proper fix for problem you've seen, so can you please try to test https://github.com/corosync/corosync/pull/280 Thanks, Honza On 31/10/17 10:41, Jan Friesse wrote: Did you get a chance to confirm whether the workaround to remove the final call to

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-11-01 Thread Jonathan Davies
On 31/10/17 10:41, Jan Friesse wrote: Did you get a chance to confirm whether the workaround to remove the final call to votequorum_exec_send_nodeinfo from votequorum_exec_init_fn is safe? I didn't have time to find out what exactly is happening, but I can confirm that the workaround is

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-31 Thread Jan Friesse
Jonathan, Hi Honza, On 19/10/17 17:05, Jonathan Davies wrote: On 19/10/17 16:56, Jan Friesse wrote: Jonathan, On 18/10/17 16:18, Jan Friesse wrote: Jonathan, On 18/10/17 14:38, Jan Friesse wrote: Can you please try to remove "votequorum_exec_send_nodeinfo(us->node_id);" line from

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-31 Thread Jonathan Davies
Hi Honza, On 19/10/17 17:05, Jonathan Davies wrote: On 19/10/17 16:56, Jan Friesse wrote: Jonathan, On 18/10/17 16:18, Jan Friesse wrote: Jonathan, On 18/10/17 14:38, Jan Friesse wrote: Can you please try to remove "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-19 Thread Jonathan Davies
On 19/10/17 16:56, Jan Friesse wrote: Jonathan, On 18/10/17 16:18, Jan Friesse wrote: Jonathan, On 18/10/17 14:38, Jan Friesse wrote: Can you please try to remove "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c in the votequorum_exec_init_fn function (around line

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-19 Thread Jan Friesse
Jonathan, On 18/10/17 16:18, Jan Friesse wrote: Jonathan, On 18/10/17 14:38, Jan Friesse wrote: Can you please try to remove "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c in the votequorum_exec_init_fn function (around line 2306) and let me know if problem

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-19 Thread Jonathan Davies
On 18/10/17 16:18, Jan Friesse wrote: Jonathan, On 18/10/17 14:38, Jan Friesse wrote: Can you please try to remove "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c in the votequorum_exec_init_fn function (around line 2306) and let me know if problem persists? Wow!

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-18 Thread Jan Friesse
Jonathan, On 18/10/17 14:38, Jan Friesse wrote: Can you please try to remove "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c in the votequorum_exec_init_fn function (around line 2306) and let me know if problem persists? Wow! With that change, I'm pleased to say that

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-18 Thread Jonathan Davies
On 18/10/17 14:38, Jan Friesse wrote: Can you please try to remove "votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c in the votequorum_exec_init_fn function (around line 2306) and let me know if problem persists? Wow! With that change, I'm pleased to say that I'm not able

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-18 Thread Jan Friesse
Jonathan, On 16/10/17 15:58, Jan Friesse wrote: Jonathan, On 13/10/17 17:24, Jan Friesse wrote: I've done a bit of digging and am getting closer to the root cause of the race. We rely on having votequorum_sync_init called twice -- once when node 1 joins (with member_list_entries=2) and

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-16 Thread Jonathan Davies
On 16/10/17 15:58, Jan Friesse wrote: Jonathan, On 13/10/17 17:24, Jan Friesse wrote: I've done a bit of digging and am getting closer to the root cause of the race. We rely on having votequorum_sync_init called twice -- once when node 1 joins (with member_list_entries=2) and once when

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-16 Thread Jan Friesse
Jonathan, On 13/10/17 17:24, Jan Friesse wrote: I've done a bit of digging and am getting closer to the root cause of the race. We rely on having votequorum_sync_init called twice -- once when node 1 joins (with member_list_entries=2) and once when node 1 leaves (with

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-16 Thread Jonathan Davies
On 13/10/17 17:24, Jan Friesse wrote: I've done a bit of digging and am getting closer to the root cause of the race. We rely on having votequorum_sync_init called twice -- once when node 1 joins (with member_list_entries=2) and once when node 1 leaves (with member_list_entries=1). This is

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-13 Thread Jan Friesse
Jonathan Davies wrote: On 12/10/17 11:54, Jan Friesse wrote: I'm on corosync-2.3.4 plus my patch Finally noticed ^^^ 2.3.4 is really old and as long as it is not some patched version, I wouldn't recommend to use it. Can you give a try to current needle? I was mistaken to think I was

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-13 Thread Jonathan Davies
On 13/10/17 15:05, Jonathan Davies wrote: I'm on corosync-2.3.4 plus my patch Finally noticed ^^^ 2.3.4 is really old and as long as it is not some patched version, I wouldn't recommend to use it. Can you give a try to current needle? I was mistaken to think I was on 2.3.4. Actually I am

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-13 Thread Jonathan Davies
On 12/10/17 11:54, Jan Friesse wrote: I'm on corosync-2.3.4 plus my patch Finally noticed ^^^ 2.3.4 is really old and as long as it is not some patched version, I wouldn't recommend to use it. Can you give a try to current needle? I was mistaken to think I was on 2.3.4. Actually I am on

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-12 Thread Christine Caulfield
On 12/10/17 11:54, Jan Friesse wrote: Jonathan, On 12/10/17 07:48, Jan Friesse wrote: Jonathan, I believe main "problem" is votequorum ability to work during sync phase (votequorum is only one service with this ability, see votequorum_overview.8 section VIRTUAL

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-12 Thread Jonathan Davies
On 12/10/17 07:48, Jan Friesse wrote: Jonathan, I believe main "problem" is votequorum ability to work during sync phase (votequorum is only one service with this ability, see votequorum_overview.8 section VIRTUAL SYNCHRONY)... Hi ClusterLabs, I'm seeing a race condition in corosync

Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-12 Thread Jan Friesse
Jonathan, I believe main "problem" is votequorum ability to work during sync phase (votequorum is only one service with this ability, see votequorum_overview.8 section VIRTUAL SYNCHRONY)... Hi ClusterLabs, I'm seeing a race condition in corosync where votequorum can have incorrect

[ClusterLabs] corosync race condition when node leaves immediately after joining

2017-10-11 Thread Jonathan Davies
Hi ClusterLabs, I'm seeing a race condition in corosync where votequorum can have incorrect membership info when a node joins the cluster then leaves very soon after. I'm on corosync-2.3.4 plus my patch https://github.com/corosync/corosync/pull/248. That patch makes the problem readily