On 12/10/17 11:54, Jan Friesse wrote:
> Jonathan,
>
>>
>>
>> On 12/10/17 07:48, Jan Friesse wrote:
>>> Jonathan,
>>> I believe main "problem" is votequorum ability to work during sync
>>> phase (votequorum is only one service with this ability, see
>>> votequorum_overview.8 section VIRTUAL SYNCHRONY)...
>>>
>>>> Hi ClusterLabs,
>>>>
>>>> I'm seeing a race condition in corosync where votequorum can have
>>>> incorrect membership info when a node joins the cluster then leaves
>>>> very soon after.
>>>>
>>>> I'm on corosync-2.3.4 plus my patch
>
> Finally noticed ^^^ 2.3.4 is really old and as long as it is not some
> patched version, I wouldn't recommend to use it. Can you give a try to
> current needle?
>
>>>> https://github.com/corosync/corosync/pull/248. That patch makes the
>>>> problem readily reproducible but the bug was already present.
>>>>
>>>> Here's the scenario. I have two hosts, cluster1 and cluster2. The
>>>> corosync.conf on cluster2 is:
>>>>
>>>>     totem {
>>>>       version: 2
>>>>       cluster_name: test
>>>>       config_version: 2
>>>>       transport: udpu
>>>>     }
>>>>     nodelist {
>>>>       node {
>>>>         nodeid: 1
>>>>         ring0_addr: cluster1
>>>>       }
>>>>       node {
>>>>         nodeid: 2
>>>>         ring0_addr: cluster2
>>>>       }
>>>>     }
>>>>     quorum {
>>>>       provider: corosync_votequorum
>>>>       auto_tie_breaker: 1
>>>>     }
>>>>     logging {
>>>>       to_syslog: yes
>>>>     }
>>>>
>>>> The corosync.conf on cluster1 is the same except with
>>>> "config_version: 1".
>>>>
>>>> I start corosync on cluster2. When I start corosync on cluster1, it
>>>> joins and then immediately leaves due to the lower config_version.
>>>> (Previously corosync on cluster2 would also exit but with
>>>> https://github.com/corosync/corosync/pull/248 it remains alive.)
>>>>
>>>> But often at this point, cluster1's disappearance is not reflected in
>>>> the votequorum info on cluster2:
>>>
>>> ... Is this permanent (= until new node join/leave it , or it will fix
>>> itself over (short) time? If this is permanent, it's a bug. If it
>>> fixes itself it's result of votequorum not being virtual synchronous.
>>
>> Yes, it's permanent. After several minutes of waiting, votequorum still
>> reports "total votes: 2" even though there's only one member.
>
>
> That's bad. I've tried following setup:
>
> - Both nodes with current needle
> - Your config
> - Second node is just running corosync
> - First node is running following command:
>   while true;do corosync -f; ssh node2 'corosync-quorumtool | grep Total | grep 1' || exit 1;done
>
> Running it for quite a while and I'm unable to reproduce the bug. Sadly
> I'm unable to reproduce the bug even with 2.3.4. Do you think that
> reproducer is correct?
>
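
As a side note on the check in the quoted loop: `grep Total | grep 1` accepts any line that contains "Total" and a "1" anywhere, so it would also pass on a value such as 10. A stricter comparison could look roughly like the sketch below. It assumes that corosync-quorumtool prints a line of the form "Total votes: N"; the function name total_votes is made up for illustration and is not part of corosync.

```shell
#!/bin/sh
# Sketch only: extract the number from the "Total votes:" line of
# corosync-quorumtool output read on stdin.
# Assumption: the output contains a line like "Total votes:      1".
# total_votes is a hypothetical helper name, not a corosync command.
total_votes() {
    awk -F: '/^[[:space:]]*Total votes:/ {
        gsub(/[[:space:]]/, "", $2)   # strip padding around the number
        print $2
        exit                          # only the first matching line counts
    }'
}

# Hypothetical use inside the reproducer loop (node2 as in the quoted command):
#   votes=$(ssh node2 corosync-quorumtool | total_votes)
#   [ "$votes" = "1" ] || exit 1
```

With this, the loop would stop only when the reported total is something other than exactly 1, which is the condition the reproducer is trying to catch.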
I can't reproduce it either.

Chrissie

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org