ping

echo reply ;)


On 10/14/2015 02:10 PM, Thomas Lamprecht wrote:
Hi,

On 10/08/2015 10:57 AM, Jan Friesse wrote:
Hi,

Thomas Lamprecht wrote:
[snip]
Hello,

we are using corosync version needle (2.3.5) for our cluster filesystem
(pmxcfs).
The situation is the following: first we start up pmxcfs, which is a
FUSE filesystem, and if there is a cluster configuration we also start
corosync. This allows the filesystem to exist on one-node 'clusters' or
to be forced into a local mode. We use CPG to send our messages to all
members; the filesystem lives in RAM and all fs operations are sent
'over the wire'.

The problem is now the following:
When we restart all (in my test case 3) nodes at the same time, in
roughly 1 out of 10 cases I get only CS_ERR_BAD_HANDLE back when calling

I'm really unsure I understand what you are doing. You are restarting
all nodes and get CS_ERR_BAD_HANDLE? I mean, if you are restarting all
nodes, which node returns CS_ERR_BAD_HANDLE? Or are you restarting just
pmxcfs? Or just corosync?
Clarification, sorry, I was a bit unspecific. I can see the error
behaviour in two cases:
1) I restart three physical hosts (= nodes) at the same time. One of
them - normally the last one coming up again - successfully joins the
corosync cluster, the filesystem (pmxcfs) notices that, but then
cpg_mcast_joined receives only CS_ERR_BAD_HANDLE errors.
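For context, the failing call is of this general shape (a simplified,
hypothetical sketch of such a send path, not the actual pmxcfs code):

#include <sys/uio.h>
#include <unistd.h>
#include <corosync/cpg.h>

/* hypothetical helper, only to illustrate where the error shows up */
static cs_error_t send_to_group(cpg_handle_t handle, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    cs_error_t err;

    do {
        err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
        if (err == CS_ERR_TRY_AGAIN)
            usleep(10000);          /* flow control, just retry */
    } while (err == CS_ERR_TRY_AGAIN);

    /* CS_ERR_BAD_HANDLE (9) means the handle itself is unusable */
    return err;
}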

Ok, that is weird. Are you able to reproduce the same behavior by
restarting pmxcfs? Or is a membership change (= restart of a node)
really needed? Also, are you sure the network interface is up when
corosync starts?
No, I tried quite a few times to restart pmxcfs but that didn't trigger
the problem yet. But I could trigger it once by restarting only one
node, so restarting all of them only makes the problem worse but isn't
needed in the first place.

Ok. So let's assume a change of membership comes into play.

Do you think you can try to test (in a cycle):
- start corosync
- start pmxcfs
- stop pmxcfs
- stop corosync

on one node? Because if the problem appears, we will at least have a
reproducer.

Hmm, yeah, I can do that, but note that we start pmxcfs _before_
corosync, as we want to access the data when quorum is lost or when
it's only a one-node cluster and thus corosync is not running.

So the cycle to replicate this problem would be:
- start pmxcfs
- start corosync
- stop corosync
- stop pmxcfs

if I'm not mistaken.


Ok, I know nothing about pmxcfs, but what we want to find is an easier reproducer (easier than restarting a node).




The corosync.log of the failing node may be interesting.

My nodes' hostnames are [one, two, three]; this time they came up in
the order they're named. This time it was on two nodes, the first and
second nodes coming up again.
The corosync log seems normal, although I didn't have debug mode
enabled; I don't know what difference that makes when no errors show up
in the normal log.
Oct 07 09:06:36 [1335] two corosync notice [MAIN  ] Corosync Cluster
Engine ('2.3.5'): started and ready to provide service.
Oct 07 09:06:36 [1335] two corosync info    [MAIN  ] Corosync built-in
features: augeas systemd pie relro bindnow
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing
transport (UDP/IP Multicast).
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing
transmit/receive security (NSS) crypto: aes256 hash: sha1
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] The network
interface [10.10.1.152] is now up.
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync configuration map access [0]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cmap
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync configuration service [1]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cfg
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync cluster closed process group service v1.01 [2]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cpg
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync profile loading service [4]
Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Using quorum
provider corosync_votequorum
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync vote quorum service v1.0 [5]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name:
votequorum
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync cluster quorum service v0.1 [3]
Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name:
quorum
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] A new membership
(10.10.1.152:92) was formed. Members joined: 2
Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Members[1]: 2
Oct 07 09:06:36 [1335] two corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.

Looks good

Then pmxcfs results in:
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
failed: 9
Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
failed: 9
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
failed: 9
Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
failed: 9

Just to be sure, "Bad handle 0" is logged by pmxcfs?

Yes, exactly. This is from additional debug code I inserted, which just
outputs the value of the handle.



After that the third node joins, but the CS_ERR_BAD_HANDLE errors persist.

Oct 07 09:06:41 [1335] two corosync notice [TOTEM ] A new membership
(10.10.1.151:100) was formed. Members joined: 1
Oct 07 09:06:41 [1335] two corosync notice  [QUORUM] This node is
within the primary component and will provide service.
Oct 07 09:06:41 [1335] two corosync notice  [QUORUM] Members[2]: 1 2
Oct 07 09:06:41 [1335] two corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.
Oct 07 09:06:52 [1335] two corosync notice  [TOTEM ] A new membership
(10.10.1.151:104) was formed. Members joined: 3
Oct 07 09:06:52 [1335] two corosync notice  [QUORUM] Members[3]: 1 2 3
Oct 07 09:06:52 [1335] two corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.


Again, looks good.

Funnily enough, the third node also has some initial problems, but is
able to recover from them:
Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
failed: 9
Oct 07 09:06:53 three pmxcfs[943]: [status] notice: Bad handle 0
Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
failed: 9
Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
failed: 9
Oct 07 09:06:53 three pmxcfs[943]: [status] notice: Bad handle 0
Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
failed: 9
[...]
Oct 07 09:06:56 three pmxcfs[943]: [status] notice: update cluster
info (cluster name  teachings, version = 15)
Oct 07 09:06:56 three pmxcfs[943]: [status] notice: node has quorum
Oct 07 09:06:56 three pmxcfs[943]: [dcdb] notice: members: 3/943
Oct 07 09:06:56 three pmxcfs[943]: [dcdb] notice: all data is up to
date
Oct 07 09:06:56 three pmxcfs[943]: [status] notice: members: 3/943
Oct 07 09:06:56 three pmxcfs[943]: [status] notice: all data is up to
date


Even weirder. I mean ... CS_ERR_BAD_HANDLE is not the kind of error
which fixes itself automatically.

Yes, weird is a good word for that.






2) I disconnect the network interface on which corosync runs and
reconnect it a bit later. This triggers the same behaviour as above,
but again not every time.

Just to make sure: don't do ifdown. Corosync reacts to ifdown pretty
badly. Also, NetworkManager does ifdown on cable unplug if not
configured in server mode. If you want to test a network split, either
use iptables (make sure to block all traffic needed by corosync, so if
you are using multicast make sure to block both unicast and multicast
packets on input and output -
https://github.com/jfriesse/csts/blob/master/tests/inc/fw.sh), or use
blocking on the switch.

No, I'm aware of the ifdown problems and we do not use NetworkManager
in Proxmox VE. I often used a 'virtual' cluster, i.e. three virtual
machines (nested virtualization) in KVM/qemu, and then 'disconnected'
the network interface through qemu, which should be equivalent to a
physical unplug.

Sounds good.

Corosync also recovers when 'connecting' the interface again through
qemu.
Thanks for the script hint; I will do some more physical tests
involving the network.

Currently I'm trying to get a somewhat reproducible test and to try it
also on bigger setups and with other possible causes; I need to do a
bit more homework here and will report back later.

Actually, smaller clusters are better for debugging, but yes, a larger
setup may show the problem faster.
Ok, good to know.


cpg_mcast_joined to send out the data, but only on one node.
corosync-quorumtool shows that we have quorum, and the logs also show a
healthy connection to the corosync cluster. The failing handle is
initialized once at the initialization of our filesystem. Should it be
reinitialized on every reconnect?

Again, I'm unsure what you mean by reconnect. On Corosync shutdown you
have to reconnect (I believe this is not the case because you are
getting the error only with 10% probability).
Yes, we reconnect to Corosync, and it's not a corosync shutdown; the
whole host reboots, or the network interface goes down and comes back
up a bit later. The probability is just an estimate, but the main
problem is that I cannot reproduce it all the time.

Restarting the filesystem solves this problem; the strange thing is
that it isn't clearly reproducible and often everything works just fine.

Are there any known problems or things we should look for?

Hard to tell, but generally:
- Make sure cpg_init really returns CS_OK. If not, the returned handle
is invalid (see the sketch after this list).
- Make sure there is no memory corruption and the handle is really
valid (valgrind may be helpful).
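A minimal sketch of that first check (a hypothetical wrapper, not taken
from pmxcfs): treat the handle as valid only when cpg_initialize()
returns CS_OK.

#include <corosync/cpg.h>

static cpg_handle_t cpg_handle;    /* 0 = "no valid handle" */

/* hypothetical connect helper */
static int cpg_connect(cpg_callbacks_t *callbacks)
{
    cpg_handle_t h = 0;
    cs_error_t err = cpg_initialize(&h, callbacks);

    if (err != CS_OK) {
        /* whatever ended up in 'h' is not usable - do not keep it */
        cpg_handle = 0;
        return -1;
    }
    cpg_handle = h;
    return 0;
}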
The cpg_init checks are in place and should be OK.
Yes, I will use Valgrind, but one question ahead:

Can the handle get lost somehow? Is there a need to reinitialize CPG
with cpg_initialize/cpg_model_initialize after we have left and later
rejoined the cluster?

I'm still unsure what you mean by "left and later rejoined". As long as
corosync is running, the client application "doesn't need to care
about" membership changes; that's corosync's problem. So if a network
split happens, you don't have to call cpg_initialize. The only places
where cpg_initialize is needed are the initial connection and
reconnection after the corosync main process exits.
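A minimal sketch of a dispatch loop built around that rule (hypothetical
helper, and assuming the lost connection surfaces as an error such as
CS_ERR_LIBRARY from cpg_dispatch): membership changes need no action
here, only a dead corosync connection triggers finalize plus a fresh
initialize and join.

#include <unistd.h>
#include <corosync/cpg.h>

static void dispatch_loop(cpg_handle_t *handle, cpg_callbacks_t *callbacks,
                          const struct cpg_name *group)
{
    for (;;) {
        cs_error_t err = cpg_dispatch(*handle, CS_DISPATCH_ONE);

        if (err == CS_OK || err == CS_ERR_TRY_AGAIN)
            continue;               /* normal operation, nothing to do */

        /* corosync main process is gone: this handle is dead for good */
        cpg_finalize(*handle);
        *handle = 0;
        for (;;) {
            if (cpg_initialize(handle, callbacks) != CS_OK) {
                sleep(1);           /* wait until corosync is back up */
                continue;
            }
            if (cpg_join(*handle, group) == CS_OK)
                break;
            cpg_finalize(*handle);  /* join failed, start over */
            sleep(1);
        }
    }
}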
What I mean is leaving the CPG group after quorum was lost and regained
(unplugging and re-plugging the NIC).

I'm a bit confused by two things:

-) Restarting pmxcfs, and thus making a clean new initialization,
solves the problem, but reinitializing isn't intended, as you also
stated. If I'm not mistaken, corosync even has problems when
cpg_initialize is called twice from the same PID with a finalize in
between.

It shouldn't. cpg_join may be a problem, because of the following
scenario:
- the client calls cpg_finalize
- it is delivered to the corosync main process; corosync needs to
inform other clients on both the local node and the other cluster nodes
- cpg_finalize in the client API returns without error
- the client calls a new cpg_init + cpg_join
- the server hasn't yet had time to process the leave message

-> an error is returned because there are two processes in the same
group with the same PID.
Ok, is there then a way to ensure that the leave message was processed?


Yes. Take a look at the corosync git tree; there is test/testcpg.c. If you comment out lines 409 and 413 (do_exit = 1), you have quite a nice example of how to do it.
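The pattern looks roughly like this (an illustrative sketch, not copied
from testcpg.c): leave the group, dispatch until the confchg callback
reports that our own nodeid/pid has left, and only then rejoin.

#include <stdint.h>
#include <unistd.h>
#include <corosync/cpg.h>

static volatile int left_seen;     /* set once our own leave is delivered */

/* registered as cpg_confchg_fn in cpg_callbacks_t at cpg_initialize time */
static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
    unsigned int our_nodeid = 0;

    cpg_local_get(handle, &our_nodeid);
    for (size_t i = 0; i < n_left; i++) {
        if (left[i].nodeid == our_nodeid &&
            left[i].pid == (uint32_t)getpid())
            left_seen = 1;          /* server has processed our leave */
    }
}

/* leave, wait until the leave is confirmed, then it is safe to rejoin */
static cs_error_t leave_and_rejoin(cpg_handle_t handle,
                                   const struct cpg_name *group)
{
    left_seen = 0;
    cs_error_t err = cpg_leave(handle, group);
    if (err != CS_OK)
        return err;

    while (!left_seen) {
        err = cpg_dispatch(handle, CS_DISPATCH_ONE);
        if (err != CS_OK)
            return err;
    }
    return cpg_join(handle, group);
}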

I'll look into that, thanks for the pointer; I'm currently checking
some things with gdb.




-) The handle can only be changed by cpg_initialize, as that's the only
place where it is passed by reference, so where does it get set to
zero? I assume it
This is a question I'm asking myself too. I have 3 possible explanations:

That's what I suspect but, naturally, don't hope for :)

- bug in corosync libraries
- bug in libqb

Just to be sure: are you using the newest corosync 2.3.5 and the newest
libqb 0.17.2?

Yes.

wasn't zero from the beginning, as then cpg_initialize should fail?

I'm not exactly sure what you meant. cpg_initialize doesn't really care
what value was in the handle before it was called; it just overwrites
it.


Is there an intended way to handle a CS_ERR_BAD_HANDLE?


No. CS_ERR_BAD_HANDLE is the kind of error which simply shouldn't
happen as long as the handle is not changed in the caller app.

Ok, another question regarding cpg_initialize. The situation is: pmxcfs
and corosync are initialized, then I kill corosync; now pmxcfs calls
cpg_finalize and then tries to call cpg_initialize again. I naturally
receive an error (CS_ERR_LIBRARY) but get a valid-looking handle (e.g.
4578767460630528000). Should this handle be acted upon (finalized or
the like), or is it merely a random value?

It is an invalid and already destroyed handle (so more like a random value). Do not use it.


Further, I can raise the chance of hitting this bug when I unplug the
corosync interface: after 60 seconds without quorum the watchdog
triggers, and when the node comes up again the "cpg_send_message
failed: 9" error message is thrown constantly. The same behaviour also
shows up when reconnecting the interface (corosync "recovers" and
successfully becomes quorate, while pmxcfs stays in this error state
until restarted). It's not always reproducible, but it definitely
happens more often this way.

Hard to say. The important thing is not to ifdown the interface (so if the interface is eth0, it must still exist and still have the same IP). You can also look at corosync.log to see if there is something interesting.

Btw. do you have any quick tutorial on how to compile pmxcfs so I can try it myself?

Honza



Regards,
  Honza



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



