Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

Jan Friesse Tue, 31 Oct 2017 03:44:54 -0700

Jonathan,

Hi Honza,


On 19/10/17 17:05, Jonathan Davies wrote:

On 19/10/17 16:56, Jan Friesse wrote:

Jonathan,



On 18/10/17 16:18, Jan Friesse wrote:

Jonathan,


On 18/10/17 14:38, Jan Friesse wrote:

Can you please try removing the
"votequorum_exec_send_nodeinfo(us->node_id);" line from votequorum.c
in the votequorum_exec_init_fn function (around line 2306) and let me
know if the problem persists?


Wow! With that change, I'm pleased to say that I'm not able to
reproduce the problem at all!


Sounds good.


Is this a legitimate fix, or do we still need the call to
votequorum_exec_send_nodeinfo for other reasons?


That is a good question. Calling votequorum_exec_send_nodeinfo should
not be needed, because it is called by sync_process only slightly later.

But to mark this as a legitimate fix, I would like to find out why
this is happening and whether it is legal or not. Basically, because I'm
not able to reproduce the bug at all (and I really tried, also with
various usleeps/packet loss/...), I would like to have more information
about notworking_cluster1.log. Because tracing doesn't work, we need
to try blackbox. Could you please add

icmap_set_string("runtime.blackbox.dump_flight_data", "yes");

line before api->shutdown_request(); in cmap.c?

It should trigger dumping the blackbox in /var/lib/corosync. When you
reproduce the nonworking_cluster1, could you please either:
- compress the file pointed to by the /var/lib/corosync/fdata symlink
- or execute corosync-blackbox
- or execute qb-blackbox "/var/lib/corosync/fdata"

and send it?
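
The three options above could be scripted roughly as follows. This is only
a sketch: pack_fdata is a hypothetical helper name, the /tmp output paths
are assumptions, and readlink -f is assumed to be available (GNU
coreutils). Only the first option is wrapped in a function; the other two
are shown as commented commands taken from the list above.

```shell
#!/bin/sh
# Hypothetical helper (not part of corosync): compress the file that a
# symlink such as /var/lib/corosync/fdata points to.
# Usage: pack_fdata LINK OUTPUT
pack_fdata() {
    link=$1
    out=$2
    # Resolve the symlink to the real flight-data file.
    target=$(readlink -f "$link") || return 1
    # Refuse to continue if the resolved path is not a regular file.
    [ -f "$target" ] || return 1
    # Write a compressed copy and leave the original untouched.
    gzip -c "$target" > "$out"
}

# Option 1: compress the file behind the fdata symlink.
#   pack_fdata /var/lib/corosync/fdata /tmp/fdata.gz
# Option 2: let corosync dump and decode the blackbox itself.
#   corosync-blackbox > /tmp/blackbox.txt
# Option 3: decode an existing dump with libqb's tool.
#   qb-blackbox "/var/lib/corosync/fdata" > /tmp/blackbox.txt
```

Compressing the raw file keeps the binary dump intact for whoever decodes
it later, while the other two options produce readable text directly.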


Attached, along with the "debug: trace" log from cluster2.


Thanks a lot for the logs. I'm - finally! - able to reproduce the bug
(with the 2 artificial pauses - included at the end of the mail).
I'll try to fix the main bug (which may take some time, even though I
have some idea of what is happening) and let you know.


Glad to hear that the logs are useful and you're able to reproduce the
problem! I look forward to hearing what you come up with, and am happy
to test out patches if that would help.


Did you get a chance to confirm whether the workaround to remove the
final call to votequorum_exec_send_nodeinfo from votequorum_exec_init_fn
is safe?

I didn't have time to find out what exactly is happening, but I can
confirm that the workaround is safe. It's just not a full fix, and there
can still be situations in which the bug appears.


The patch works well in our testing, but I'm keen to hear whether you
think this is likely to be safe for use in production.


It's safe but it's just a workaround.


Regards,
  Honza


Thanks,
Jonathan



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
