Re: [ClusterLabs] corosync - CS_ERR_BAD_HANDLE when multiple nodes are starting up

2015-10-23 Thread Thomas Lamprecht

ping

On 10/14/2015 02:10 PM, Thomas Lamprecht wrote:

Hi,

On 10/08/2015 10:57 AM, Jan Friesse wrote:

Hi,

Thomas Lamprecht wrote:

[snip]

Hello,

we are using corosync version needle (2.3.5) for our cluster
filesystem (pmxcfs).
The situation is the following: first we start pmxcfs, which is a
FUSE filesystem, and if a cluster configuration exists we also start
corosync.
This allows the filesystem to exist on a one-node 'cluster' or to be
forced into a local mode. We use CPG to send our messages to all
members; the filesystem lives in RAM and all fs operations are sent
'over the wire'.

The problem is now the following:
When we restart all nodes (in my test case, 3) at the same time, in
about 1 out of 10 cases I only get CS_ERR_BAD_HANDLE back when
calling cpg_mcast_joined.

I'm really unsure how to understand what you are doing. You are
restarting all nodes and get CS_ERR_BAD_HANDLE? I mean, if you are
restarting all nodes, which node returns CS_ERR_BAD_HANDLE? Or are
you restarting just pmxcfs? Or just corosync?
Clarification, sorry, I was a bit unspecific. I can see the error
behaviour in two cases:
1) I restart three physical hosts (= nodes) at the same time. One of
them - normally the last one coming up again - successfully joins the
corosync cluster; the filesystem (pmxcfs) notices that, but then
cpg_mcast_joined returns only CS_ERR_BAD_HANDLE errors.
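
[Editorial note: for reference, the send path where that error surfaces
looks roughly like the sketch below. This is not pmxcfs code, just the
plain corosync CPG calls; the handle is assumed to come from an earlier
cpg_initialize()/cpg_join().]

/* Minimal sketch (not pmxcfs code): where CS_ERR_BAD_HANDLE shows up
 * on the send path. */
#include <sys/uio.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

static cs_error_t send_to_group(cpg_handle_t handle,
                                const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    cs_error_t err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);

    if (err == CS_ERR_TRY_AGAIN) {
        /* transient flow control: retry the send later */
    } else if (err == CS_ERR_BAD_HANDLE) {
        /* the handle is stale; the CPG connection has to be rebuilt
         * (cpg_finalize + cpg_initialize + cpg_join) before sending */
    }
    return err;
}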


Ok, that is weird. Are you able to reproduce the same behavior by
restarting pmxcfs? Or is a membership change (= restart of a node)
really needed? Also, are you sure the network interface is up when
corosync starts?

No, I tried quite a few times to restart pmxcfs, but that hasn't
triggered the problem yet. I could trigger it once by restarting only
one node, though, so restarting all of them only makes the problem
worse but isn't needed in the first place.


Ok. So let's assume a change of membership comes into play.

Do you think you can try to test (in a cycle):
- start corosync
- start pmxcfs
- stop pmxcfs
- stop corosync

on one node? Because if the problem appears, we will at least have a
reproducer.


Hmm, yeah, I can do that, but you have to know that we start pmxcfs 
_before_ corosync, as we want to access the data when quorum is lost 
or when it's only a one-node cluster, and thus corosync is not running.


So the cycle to replicate this problem would be:
- start pmxcfs
- start corosync
- stop corosync
- stop pmxcfs

if I'm not mistaken.





The corosync.log of the failing node may be interesting.


My nodes' hostnames are [one, two, three]; this time they came up in
the order they are named.
This time it happened on two nodes, the first and the second one
coming up again.

The corosync log seems normal, although I didn't have debug mode
enabled; I don't know what difference that makes when no errors show
up in the normal log.

Oct 07 09:06:36 [1335] two corosync notice [MAIN  ] Corosync Cluster
Engine ('2.3.5'): started and ready to provide service.
Oct 07 09:06:36 [1335] two corosync info[MAIN  ] Corosync built-in
features: augeas systemd pie relro bindnow
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing
transport (UDP/IP Multicast).
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing
transmit/receive security (NSS) crypto: aes256 hash: sha1
Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] The network
interface [10.10.1.152] is now up.
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync configuration map access [0]
Oct 07 09:06:36 [1335] two corosync info[QB] server name: cmap
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync configuration service [1]
Oct 07 09:06:36 [1335] two corosync info[QB] server name: cfg
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync cluster closed process group service v1.01 [2]
Oct 07 09:06:36 [1335] two corosync info[QB] server name: cpg
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync profile loading service [4]
Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Using quorum
provider corosync_votequorum
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync vote quorum service v1.0 [5]
Oct 07 09:06:36 [1335] two corosync info[QB] server name:
votequorum
Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
loaded: corosync cluster quorum service v0.1 [3]
Oct 07 09:06:36 [1335] two corosync info[QB] server name: quorum

Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] A new membership
(10.10.1.152:92) was formed. Members joined: 2
Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Members[1]: 2
Oct 07 09:06:36 [1335] two corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.


Looks good


Then the pmxcfs log shows:

Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
failed: 9
Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
Oct 07 09:06:38 two 
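
[Editorial note: error 9 here is CS_ERR_BAD_HANDLE from
corosync/corotypes.h, i.e. the call was made on a handle that is no
longer usable. A possible recovery sketch (the group name and callbacks
are placeholders, not necessarily what pmxcfs does) would be to drop the
stale handle and build a new one:]

/* Hedged sketch: recover from a persistent CS_ERR_BAD_HANDLE (9) by
 * re-establishing the CPG connection and rejoining the group. */
#include <string.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

static cs_error_t reconnect_cpg(cpg_handle_t *handle,
                                cpg_callbacks_t *callbacks,
                                const char *group)
{
    struct cpg_name name;
    cs_error_t err;

    if (strlen(group) >= CPG_MAX_NAME_LENGTH)
        return CS_ERR_INVALID_PARAM;

    cpg_finalize(*handle);               /* discard the stale handle */

    err = cpg_initialize(handle, callbacks);
    if (err != CS_OK)
        return err;

    memset(&name, 0, sizeof(name));
    name.length = strlen(group);
    memcpy(name.value, group, name.length);

    return cpg_join(*handle, &name);     /* rejoin the same group */
}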

Re: [ClusterLabs] Cluster monitoring

2015-10-23 Thread Arjun Pandey
Will have a look.

Thanks
Arjun

On Wed, Oct 21, 2015 at 8:26 PM, Ken Gaillot  wrote:

> On 10/21/2015 08:24 AM, Michael Schwartzkopff wrote:
> > On Wednesday, 21 October 2015, 18:50:15, Arjun Pandey wrote:
> >> Hi folks
> >>
> >> I had a question about monitoring of cluster events. Based on the
> >> documentation it seems that cluster monitor is the only method
> >> of monitoring cluster events. Also, since it seems to poll at the
> >> configured interval, it might miss some events. Is that the case?
> >
> > No. The cluster is event-based, so it won't miss any event. If you
> > use the cluster's tools, they see the events. If you monitor the
> > events, you won't miss any either.
>
> FYI, Pacemaker 1.1.14 will have built-in handling of notification
> scripts, without needing a ClusterMon resource. These will be
> event-driven. Andrew Beekhof did a recent blog post about it:
> http://blog.clusterlabs.org/blog/2015/reliable-notifications/
>
> Pacemaker's monitors are polling, at the interval specified when
> configuring the monitor operation. Pacemaker relies on the resource
> agent to return status for monitors, so technically it's up to the
> resource agent whether it can "miss" brief outages that occur between
> polls. All the ones I've looked at would miss them, but generally
> that's considered acceptable if the service is once again fully
> working when the monitor runs (because it implies it recovered itself).
>
> Some people use an external monitoring system (nagios, icinga, zabbix,
> etc.) in addition to Pacemaker's monitors. They can complement each
> other, as the external system can check system parameters outside
> Pacemaker's view and can alert administrators for some early warning
> signs before a resource gets to the point of needing recovery. Of
> course such monitoring systems are also polling at configured intervals.
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] resource-agents iSCSILogicalUnit

2015-10-23 Thread Zhuangzeqiang
Hi, when I use ocf:heartbeat:iSCSILogicalUnit.

Configuration:
primitive lunit ocf:heartbeat:iSCSILogicalUnit \
   params implementation="tgt" target_iqn="zzq" lun="2" path="/dev/sdb" \
   op monitor interval="10s"

That runs successfully.

Then I change it to:
primitive lunit ocf:heartbeat:iSCSILogicalUnit \
   params implementation="tgt" target_iqn="zzq" lun="2" path="rbd/foo" tgt_bstype="rbd" \
   op monitor interval="1s"

It fails to run, and corosync.log shows:
 lrmd:   notice: operation_finished:lunit_start_0:2898:stderr [ 
2015/10/23_20:11:57 ERROR: -t argument value '' invalid Try `tgtadm --help' for 
more information

Why does the tid change to ''?
Can somebody figure it out? Thanks.
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org