ping
On 10/14/2015 02:10 PM, Thomas Lamprecht wrote:
Hi,
On 10/08/2015 10:57 AM, Jan Friesse wrote:
Hi,
Thomas Lamprecht wrote:
[snip]
Hello,
we are using corosync version needle (2.3.5) for our cluster filesystem
(pmxcfs).
The situation is the following. First we start up pmxcfs, which is a
FUSE filesystem. If there is a cluster configuration, we also start
corosync. This allows the filesystem to exist on a single-node
'cluster' or to be forced into a local mode. We use CPG to send our
messages to all members; the filesystem lives in RAM and all fs
operations are sent 'over the wire'.
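For context, the call pattern we use looks roughly like this (a minimal
sketch, not the actual pmxcfs code; the group name "example" and the
function names are made up, and the cpg_dispatch() loop is omitted):

#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* Sketch only: connect to corosync CPG, join our group, and broadcast
 * a buffer to all members. A real client must also run cpg_dispatch()
 * so the deliver callback fires. */
static void deliver_cb(cpg_handle_t h, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid,
                       void *msg, size_t msg_len)
{
    /* a real client applies the replicated fs operation here */
}

static cpg_callbacks_t callbacks = {
    .cpg_deliver_fn = deliver_cb,
    .cpg_confchg_fn = NULL, /* membership changes ignored in this sketch */
};

static cpg_handle_t handle;
static struct cpg_name group = { .length = 7, .value = "example" };

int cpg_setup_and_send(const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

    if (cpg_initialize(&handle, &callbacks) != CS_OK ||
        cpg_join(handle, &group) != CS_OK)
        return -1;

    /* this is the call that fails with CS_ERR_BAD_HANDLE for us */
    return cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1) == CS_OK
        ? 0 : -1;
}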
The problem is now the following:
When we're restarting all (in my test case 3) nodes at the same time,
in 1 out of 10 cases I get only CS_ERR_BAD_HANDLE back when calling
cpg_mcast_joined.
I'm really unsure I understand what you are doing. You are restarting
all nodes and get CS_ERR_BAD_HANDLE? I mean, if you are restarting all
nodes, which node returns CS_ERR_BAD_HANDLE? Or are you restarting
just pmxcfs? Or just corosync?
Clarification, sorry, I was a bit unspecific. I can see the error
behaviour in two cases:
1) I restart three physical hosts (= nodes) at the same time; one of
them - normally the last one coming up again - successfully joins the
corosync cluster, the filesystem (pmxcfs) notices that, but then
cpg_mcast_joined returns only CS_ERR_BAD_HANDLE errors.
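At the call site this looks roughly like the following (again only a
sketch; the wrapper name is ours, not from pmxcfs):

#include <stddef.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

/* Sketch of the failing call site: on the affected node every
 * cpg_mcast_joined() returns CS_ERR_BAD_HANDLE (9), i.e. libcpg no
 * longer recognizes the handle, even though cpg_initialize() and
 * cpg_join() succeeded earlier. */
static int broadcast(cpg_handle_t handle, void *data, size_t len)
{
    struct iovec iov = { .iov_base = data, .iov_len = len };
    cs_error_t err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

    if (err == CS_ERR_BAD_HANDLE) /* the error discussed in this thread */
        return -1;

    return (err == CS_OK) ? 0 : -1;
}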
Ok, that is weird. Are you able to reproduce the same behavior by
restarting pmxcfs? Or is a membership change (= restart of a node)
really needed?
Also, are you sure the network interface is up when corosync starts?
No, I tried quite a few times to restart pmxcfs, but that didn't
trigger the problem yet. But I could trigger it once by restarting
only one node, so restarting all of them only makes the problem worse
but isn't needed in the first place.
Ok. So let's assume a change of membership comes into play.
Do you think you can try to test (in a cycle):
- start corosync
- start pmxcfs
- stop pmxcfs
- stop corosync
on one node? Because if the problem appears, we will at least have a
reproducer.
Hmm, yeah, I can do that, but you have to know that we start pmxcfs
_before_ corosync, as we want to access the data when quorum is lost
or when it's only a one-node cluster, and thus corosync is not running.
So the cycle to replicate this problem would be:
- start pmxcfs
- start corosync
- stop corosync
- stop pmxcfs
if I'm not mistaken.
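At the library level this ordering means cpg_initialize() has to be
retried until corosync's IPC is up; a sketch of that idea (the retry
loop and the exact error codes treated as "corosync not up yet" are
our assumption, not the pmxcfs code):

#include <unistd.h>
#include <corosync/cpg.h>

/* Sketch: because pmxcfs starts before corosync, the CPG connection
 * must be (re)tried until the corosync IPC becomes available. */
static cs_error_t wait_for_corosync(cpg_handle_t *handle,
                                    cpg_callbacks_t *cb)
{
    for (;;) {
        cs_error_t err = cpg_initialize(handle, cb);

        if (err == CS_OK)
            return CS_OK;
        if (err != CS_ERR_LIBRARY && err != CS_ERR_TRY_AGAIN)
            return err; /* a real error, not just "corosync not up yet" */

        sleep(1);       /* corosync not running yet, retry */
    }
}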
The corosync.log of the failing node may be interesting.
My nodes' hostnames are [one, two, three]; this time they came up in
the order they're named.
This time it happened on two nodes, the first and the second node
coming up again.
The corosync log seems normal, although I didn't have debug mode
enabled; I don't know what difference that makes when no error shows
up in the normal log.
Oct 07 09:06:36 [1335] two corosync notice [MAIN  ] Corosync Cluster Engine ('2.3.5'): started and ready to provide service.
Oct 07 09:06:36 [1335] two corosync info   [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
Oct 07 09:06:36 [1335] two corosync notice [TOTEM ] Initializing transport (UDP/IP Multicast).
Oct 07 09:06:36 [1335] two corosync notice [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha1
Oct 07 09:06:36 [1335] two corosync notice [TOTEM ] The network interface [10.10.1.152] is now up.
Oct 07 09:06:36 [1335] two corosync notice [SERV  ] Service engine loaded: corosync configuration map access [0]
Oct 07 09:06:36 [1335] two corosync info   [QB    ] server name: cmap
Oct 07 09:06:36 [1335] two corosync notice [SERV  ] Service engine loaded: corosync configuration service [1]
Oct 07 09:06:36 [1335] two corosync info   [QB    ] server name: cfg
Oct 07 09:06:36 [1335] two corosync notice [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Oct 07 09:06:36 [1335] two corosync info   [QB    ] server name: cpg
Oct 07 09:06:36 [1335] two corosync notice [SERV  ] Service engine loaded: corosync profile loading service [4]
Oct 07 09:06:36 [1335] two corosync notice [QUORUM] Using quorum provider corosync_votequorum
Oct 07 09:06:36 [1335] two corosync notice [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Oct 07 09:06:36 [1335] two corosync info   [QB    ] server name: votequorum
Oct 07 09:06:36 [1335] two corosync notice [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Oct 07 09:06:36 [1335] two corosync info   [QB    ] server name: quorum
Oct 07 09:06:36 [1335] two corosync notice [TOTEM ] A new membership (10.10.1.152:92) was formed. Members joined: 2
Oct 07 09:06:36 [1335] two corosync notice [QUORUM] Members[1]: 2
Oct 07 09:06:36 [1335] two corosync notice [MAIN  ] Completed service synchronization, ready to provide service.
Looks good
Then pmxcfs results in:
Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message failed: 9
Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
Oct 07 09:06:38 two
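For reference, error 9 is CS_ERR_BAD_HANDLE from corosync/corotypes.h.
A sketch of the obvious recovery one could try once this state is
reached, tearing down the old handle and building a fresh CPG
connection (function and parameter names are ours; this is not taken
from pmxcfs):

#include <corosync/cpg.h>

/* Sketch of a possible recovery path: once CS_ERR_BAD_HANDLE (9)
 * shows up, discard the old handle, build a fresh CPG connection,
 * and re-join the process group. */
static cs_error_t reconnect_cpg(cpg_handle_t *handle,
                                cpg_callbacks_t *cb,
                                const struct cpg_name *group)
{
    cs_error_t err;

    cpg_finalize(*handle); /* result ignored; the handle is dead anyway */

    err = cpg_initialize(handle, cb);
    if (err != CS_OK)
        return err;

    return cpg_join(*handle, group);
}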