Re: [Pacemaker] Node fails to rejoin cluster

2013-02-18 Thread Ales Nosek
We are experiencing the same issue. Did the build from latest source
resolve it? Thanks for letting us know.
Ales


On Thu, Feb 7, 2013 at 10:05 PM, Tal Yalon wrote:

> Thanks Andrew for all your help, will do!
> On Feb 8, 2013 3:00 AM, "Andrew Beekhof" wrote:
>
>> On Thu, Feb 7, 2013 at 7:06 PM, Tal Yalon wrote:
>> > Thanks for replying Andrew.
>> >
>> > Here's the other node's log (the one that fenced the non-responsive node) -
>> > please let me know if there's any other information that may help. It's a
>> > bit long, but it captures the moment node-1 finds out that node-2 is
>> > non-responsive, then fences it and then gets stuck in an endless election
>> > loop.
>>
>> This log:
>>
>> Feb  6 01:40:59 node-1 crmd[22715]: info: join_make_offer: Peer
>> process on node-2 is not active (yet?): 0001 2
>>
>> Suggests it's a bug that got fixed recently.  Keep an eye out for
>> 1.1.9 in the next week or so (or you could try building from source if
>> you're in a hurry).
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>


Re: [Pacemaker] Node fails to rejoin cluster

2013-02-14 Thread Andrew Beekhof
On Thu, Feb 14, 2013 at 9:34 PM, Proskurin Kirill wrote:
> On 02/08/2013 04:59 AM, Andrew Beekhof wrote:
>
>> Suggests it's a bug that got fixed recently.  Keep an eye out for
>> 1.1.9 in the next week or so (or you could try building from source if
>> you're in a hurry).
>
>
> Will 1.1.9 be CentOS 5.x friendly?

Yep



Re: [Pacemaker] Node fails to rejoin cluster

2013-02-14 Thread Proskurin Kirill

On 02/08/2013 04:59 AM, Andrew Beekhof wrote:

> Suggests it's a bug that got fixed recently.  Keep an eye out for
> 1.1.9 in the next week or so (or you could try building from source if
> you're in a hurry).

Will 1.1.9 be CentOS 5.x friendly?

--
Best regards,
Proskurin Kirill



Re: [Pacemaker] Node fails to rejoin cluster

2013-02-07 Thread Tal Yalon
Thanks Andrew for all your help, will do!
On Feb 8, 2013 3:00 AM, "Andrew Beekhof" wrote:

> On Thu, Feb 7, 2013 at 7:06 PM, Tal Yalon wrote:
> > Thanks for replying Andrew.
> >
> > Here's the other node's log (the one that fenced the non-responsive node) -
> > please let me know if there's any other information that may help. It's a
> > bit long, but it captures the moment node-1 finds out that node-2 is
> > non-responsive, then fences it and then gets stuck in an endless election
> > loop.
>
> This log:
>
> Feb  6 01:40:59 node-1 crmd[22715]: info: join_make_offer: Peer
> process on node-2 is not active (yet?): 0001 2
>
> Suggests it's a bug that got fixed recently.  Keep an eye out for
> 1.1.9 in the next week or so (or you could try building from source if
> you're in a hurry).
>


Re: [Pacemaker] Node fails to rejoin cluster

2013-02-07 Thread Andrew Beekhof
On Thu, Feb 7, 2013 at 7:06 PM, Tal Yalon wrote:
> Thanks for replying Andrew.
>
> Here's the other node's log (the one that fenced the non-responsive node) -
> please let me know if there's any other information that may help. It's a
> bit long, but it captures the moment node-1 finds out that node-2 is
> non-responsive, then fences it and then gets stuck in an endless election
> loop.

This log:

Feb  6 01:40:59 node-1 crmd[22715]: info: join_make_offer: Peer
process on node-2 is not active (yet?): 0001 2

Suggests it's a bug that got fixed recently.  Keep an eye out for
1.1.9 in the next week or so (or you could try building from source if
you're in a hurry).
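
For anyone who wants to check their own logs for the same symptom, here is a
minimal sketch in Python. It is not part of Pacemaker. The pattern is an
assumption based only on the single crmd line quoted above (which is
line-wrapped by the mail client here but would be one line in the log file),
so other Pacemaker versions may word the message differently. The function
names count_stalled_offers and scan_log are made up for this example.

```python
import re
from collections import Counter

# Pattern assumed from the one crmd line quoted in this thread; it is not
# taken from the Pacemaker source and may not match other versions.
JOIN_STALL = re.compile(
    r"crmd\[\d+\]:\s+info:\s+join_make_offer:\s+"
    r"Peer\s+process\s+on\s+(\S+)\s+is\s+not\s+active"
)


def count_stalled_offers(lines):
    """Count 'Peer process on <node> is not active' messages per peer node."""
    counts = Counter()
    for line in lines:
        match = JOIN_STALL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_log(path):
    """Run count_stalled_offers over a log file (hypothetical helper)."""
    with open(path, errors="replace") as handle:
        return count_stalled_offers(handle)
```

Calling scan_log on the syslog file of the surviving node (the path depends
on your syslog configuration) returns a per-peer count. Many hits for the
same peer would be consistent with the loop described in this thread, but
they do not by themselves prove it is the same bug.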



Re: [Pacemaker] Node fails to rejoin cluster

2013-02-06 Thread Andrew Beekhof
On Wed, Feb 6, 2013 at 9:11 PM, Tal Yalon wrote:
> Hi all,
>
> I have a 2-node cluster, where node-2 got fenced and now after reboot tries
> to rejoin the cluster but fails and gets stuck in a loop for hours and never
> joins back.
>
> After another reboot it managed to join, and there was no time difference
> between the nodes.
>
> Below is corosync/pacemaker log of node-2 (the one that was stuck in the
> loop).

Unfortunately we need the other one.

> Any help would be appreciated, since I have no clue as to what
> happened.
>
> Thanks,
> Tal

[Pacemaker] Node fails to rejoin cluster

2013-02-06 Thread Tal Yalon
Hi all,

I have a 2-node cluster, where node-2 got fenced and now after reboot tries
to rejoin the cluster but fails and gets stuck in a loop for hours and
never joins back.

After another reboot it managed to join, and there was no time difference
between the nodes.

Below is corosync/pacemaker log of node-2 (the one that was stuck in the
loop). Any help would be appreciated, since I have no clue as to what
happened.

Thanks,
Tal


Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Corosync built-in features: nss dbus rdma snmp
Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] The network interface [9.151.142.20] is now up.
Feb  6 01:39:32 node-2 corosync[27428]:   [pcmk  ] Logging: Initialized pcmk_startup
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded: corosync extended virtual synchrony service
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded: corosync configuration service
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded: corosync cluster config database access v1.01
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded: corosync profile loading service
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1
Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Compatibility mode set to whitetank.  Using V1 and V2 of the synchronization engine.
Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb  6 01:39:32 node-2 corosync[27428]:   [CPG   ] chosen downlist: sender r(0) ip(9.151.142.20) ; members(old:0 left:0)
Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
Feb  6 01:39:37 node-2 pacemakerd[27466]:   notice: main: Starting Pacemaker 1.1.7-6.el6 (Build: 148fccfd5985c5590cc601123c6c16e966b85d14): generated-manpages agent-manpages ascii-docs publican-docs ncurses trace-logging libqb  corosync-plugin cman
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: main: Maximum core file size is: 18446744073709551615
Feb  6 01:39:37 node-2 pacemakerd[27466]:   notice: update_node_processes: 0xb31fe0 Node 2 now known as node-2, was:
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27470 for process cib
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27471 for process stonith-ng
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27472 for process lrmd
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27473 for process attrd
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27474 for process pengine
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27475 for process crmd
Feb  6 01:39:37 node-2 pacemakerd[27466]: info: main: Starting mainloop
Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Feb  6 01:39:37 node-2 stonith-ng[27471]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
Feb  6 01:39:37 node-2 stonith-ng[27471]: info: get_cluster_type: Cluster type is: 'openais'
Feb  6 01:39:37 node-2 stonith-ng[27471]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Feb  6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
Feb  6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_classic: AIS connection established
Feb  6 01:39:37 node-2 stonith-ng[27471]: info: get_ais_nodeid: Server details: id=2 uname=node-2 cname=pcmk
Feb  6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
Feb  6 01:39:37 node-2 stonith-ng[27471]: info: crm_new_peer: Node node-2 now has id: 2
Feb  6 01:39:37 node-2 stonith-ng[27471]: info: crm_new_peer: Node 2 is now known as node-2
Feb  6 01:39:37 node-2 crmd[27475]: info: crm_log_init_worker: Changed
active