Re: [Linux-HA] node ignored after reboot

2009-04-03 Thread Andrew Beekhof
Sorry, I've had to ignore Heartbeat-based clusters for the last few weeks...

There may have been a problem with 1.0.2; I never tested it with
Heartbeat, but my testing this week indicates the current code should
work, so you might want to consider updating...

This looks suspicious though:
  heartbeat[1831]: 2009/03/18_14:18:03 WARN: Message hist queue is filling up (377 messages in queue)
and would seem to indicate some sort of communications problem.
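One way to see whether that warning keeps recurring is to grep syslog for it and pull out the queue depth; depths that keep climbing suggest the peer really is not consuming messages. A minimal sketch (the excerpt file and its contents below are illustrative; on a real node you would grep /var/log/syslog or wherever ha.cf sends log output):

```shell
# Illustrative syslog excerpt (stand-in for the real log file).
cat > /tmp/hb-warn.log <<'EOF'
heartbeat[1831]: 2009/03/18_14:18:03 WARN: Message hist queue is filling up (377 messages in queue)
heartbeat[1831]: 2009/03/18_14:19:03 WARN: Message hist queue is filling up (412 messages in queue)
EOF

# Count the warnings...
grep -c 'Message hist queue is filling up' /tmp/hb-warn.log

# ...and extract the queue depths; a steadily growing depth points at a
# persistent communications problem rather than a one-off hiccup.
sed -n 's/.*(\([0-9]*\) messages in queue).*/\1/p' /tmp/hb-warn.log
```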

I'd suggest grabbing the latest Pacemaker code and submitting a bug if
you find it happens again.

Andrew

On Wed, Mar 18, 2009 at 18:29, Juha Heinanen  wrote:
> i set up the example apache cluster from the document
>
> http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0
>
> but used a mysql server instead of an apache server.  the crm configuration
> of my test cluster looks like this:
>
> node $id="8df8447f-6ecf-41a7-a131-c89fd59a120d" lenny1
> node $id="f13aff7b-6c94-43ac-9a24-b118e62d5325" lenny2
> primitive drbd0 ocf:heartbeat:drbd \
>        params drbd_resource="drbd0" \
>        op monitor interval="59s" role="Master" timeout="30s" \
>        op monitor interval="60s" role="Slave" timeout="30s"
> primitive fs0 ocf:heartbeat:Filesystem \
>        params ftype="ext3" directory="/var/lib/mysql" device="/dev/drbd0" \
>        meta target-role="Started"
> primitive mysql-server lsb:mysql \
>        op monitor interval="10s" timeout="30s" start-delay="10s"
> primitive virtual-ip ocf:heartbeat:IPaddr2 \
>        params ip="192.98.102.10" broadcast="192.98.102.255" nic="eth1" \
>        cidr_netmask="24" \
>        op monitor interval="21s" timeout="5s"
> group mysql-group fs0 mysql-server virtual-ip
> ms ms-drbd0 drbd0 \
>        meta clone-max="2" notify="true" globally-unique="false" \
>        target-role="Started"
> colocation mysql-group-on-ms-drbd0 inf: mysql-group ms-drbd0:Master
> order ms-drbd0-before-mysql-group inf: ms-drbd0:promote mysql-group:start
> property $id="cib-bootstrap-options" \
>        dc-version="1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160" \
>        default-resource-stickiness="1"
>
> initially both nodes were online, lenny2 being the master.  then i tested
> what happens when i reboot lenny1.  while lenny1 was powered off, the
> cluster correctly looked like this:
>
> # crm_mon -1
>
> 
> Last updated: Wed Mar 18 14:12:09 2009
> Current DC: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325)
> Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
> 2 Nodes configured.
> 2 Resources configured.
> 
>
> Node: lenny1 (8df8447f-6ecf-41a7-a131-c89fd59a120d): OFFLINE
> Node: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325): online
>
> Master/Slave Set: ms-drbd0
>    drbd0:0     (ocf::heartbeat:drbd):  Stopped
>    drbd0:1     (ocf::heartbeat:drbd):  Master lenny2
> Resource Group: mysql-group
>    fs0 (ocf::heartbeat:Filesystem):    Started lenny2
>    mysql-server        (lsb:mysql):    Started lenny2
>    virtual-ip  (ocf::heartbeat:IPaddr2):       Started lenny2
>
> when i powered lenny1 on again, i expected it to become online again, but
> it was totally ignored.
>
> the log is below.  the software versions are heartbeat 2.99.2 and
> pacemaker 1.0.2.
>
> any clues as to why lenny1 was ignored and why my very first test to
> achieve high availability with heartbeat/pacemaker failed?  people on the
> pacemaker list suspected ccm, which is part of heartbeat.
>
> -- juha
>
> --
>
> this came to syslog when lenny1 was powered off:
>
> r...@lenny2:~# heartbeat[1831]: 2009/03/18_14:12:32 WARN: node lenny1: is dead
> heartbeat[1831]: 2009/03/18_14:12:32 info: Link lenny1:eth1 dead.
> crmd[1923]: 2009/03/18_14:12:32 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [dead] (DC=true)
> crmd[1923]: 2009/03/18_14:12:32 info: crm_update_peer_proc: lenny1.ais is now offline
> crmd[1923]: 2009/03/18_14:12:32 info: te_graph_trigger: Transition 12 is now complete
> crmd[1923]: 2009/03/18_14:12:32 info: notify_crmd: Transition 12 status: done -
>
>
> and this when it was powered on again:
>
> heartbeat[1831]: 2009/03/18_14:12:56 info: Heartbeat restart on node lenny1
> heartbeat[1831]: 2009/03/18_14:12:56 info: Link lenny1:eth1 up.
> heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status init
> heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status up
> crmd[1923]: 2009/03/18_14:12:56 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [init] (DC=true)
> crmd[1923]: 2009/03/18_14:12:56 info: crm_update_peer_proc: lenny1.ais is now online
> crmd[1923]: 2009/03/18_14:12:56 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [up] (DC=true)
> heartbeat[1831]: 2009/03/18_14:13:26 info: Status update for node lenny1: status active
> crmd[1923]: 2009/03/18_14:13:26 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [active] (DC=true)
> cib[1919]: 2009/03/18_14:13:26 info: cib_client_status_callback: Status update: Client lenny1/cib now has status [join]

[Linux-HA] node ignored after reboot

2009-03-18 Thread Juha Heinanen
i set up the example apache cluster from the document

http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0

but used a mysql server instead of an apache server.  the crm configuration
of my test cluster looks like this:

node $id="8df8447f-6ecf-41a7-a131-c89fd59a120d" lenny1
node $id="f13aff7b-6c94-43ac-9a24-b118e62d5325" lenny2
primitive drbd0 ocf:heartbeat:drbd \
        params drbd_resource="drbd0" \
        op monitor interval="59s" role="Master" timeout="30s" \
        op monitor interval="60s" role="Slave" timeout="30s"
primitive fs0 ocf:heartbeat:Filesystem \
        params ftype="ext3" directory="/var/lib/mysql" device="/dev/drbd0" \
        meta target-role="Started"
primitive mysql-server lsb:mysql \
        op monitor interval="10s" timeout="30s" start-delay="10s"
primitive virtual-ip ocf:heartbeat:IPaddr2 \
        params ip="192.98.102.10" broadcast="192.98.102.255" nic="eth1" \
        cidr_netmask="24" \
        op monitor interval="21s" timeout="5s"
group mysql-group fs0 mysql-server virtual-ip
ms ms-drbd0 drbd0 \
        meta clone-max="2" notify="true" globally-unique="false" \
        target-role="Started"
colocation mysql-group-on-ms-drbd0 inf: mysql-group ms-drbd0:Master
order ms-drbd0-before-mysql-group inf: ms-drbd0:promote mysql-group:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160" \
        default-resource-stickiness="1"
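(editor's note: a configuration like the above can be sanity-checked with the stock Pacemaker 1.0 tools.  a sketch, guarded so it is harmless on a machine without the tools installed; the live checks obviously assume a running cluster:)

```shell
# Guarded sanity checks: validate the live CIB and take a one-shot
# status snapshot. Only meaningful on a node with a running cluster.
if command -v crm_verify >/dev/null 2>&1; then
  if crm_verify -LV; then cib_ok=yes; else cib_ok=no; fi
  echo "live CIB valid: $cib_ok"
  crm_mon -1 || true    # one-shot status, as used elsewhere in this thread
else
  cib_ok=skipped
  echo "pacemaker tools not installed; skipping live checks"
fi
```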

initially both nodes were online, lenny2 being the master.  then i tested
what happens when i reboot lenny1.  while lenny1 was powered off, the
cluster correctly looked like this:

# crm_mon -1


Last updated: Wed Mar 18 14:12:09 2009
Current DC: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325)
Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
2 Nodes configured.
2 Resources configured.


Node: lenny1 (8df8447f-6ecf-41a7-a131-c89fd59a120d): OFFLINE
Node: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325): online

Master/Slave Set: ms-drbd0
    drbd0:0     (ocf::heartbeat:drbd):  Stopped
    drbd0:1     (ocf::heartbeat:drbd):  Master lenny2
Resource Group: mysql-group
    fs0 (ocf::heartbeat:Filesystem):    Started lenny2
    mysql-server        (lsb:mysql):    Started lenny2
    virtual-ip  (ocf::heartbeat:IPaddr2):       Started lenny2

when i powered lenny1 on again, i expected it to become online again, but
it was totally ignored.

the log is below.  the software versions are heartbeat 2.99.2 and
pacemaker 1.0.2.

any clues as to why lenny1 was ignored and why my very first test to
achieve high availability with heartbeat/pacemaker failed?  people on the
pacemaker list suspected ccm, which is part of heartbeat.

-- juha

--

this came to syslog when lenny1 was powered off:

r...@lenny2:~# heartbeat[1831]: 2009/03/18_14:12:32 WARN: node lenny1: is dead
heartbeat[1831]: 2009/03/18_14:12:32 info: Link lenny1:eth1 dead.
crmd[1923]: 2009/03/18_14:12:32 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [dead] (DC=true)
crmd[1923]: 2009/03/18_14:12:32 info: crm_update_peer_proc: lenny1.ais is now offline
crmd[1923]: 2009/03/18_14:12:32 info: te_graph_trigger: Transition 12 is now complete
crmd[1923]: 2009/03/18_14:12:32 info: notify_crmd: Transition 12 status: done -


and this when it was powered on again:

heartbeat[1831]: 2009/03/18_14:12:56 info: Heartbeat restart on node lenny1
heartbeat[1831]: 2009/03/18_14:12:56 info: Link lenny1:eth1 up.
heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status init
heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status up
crmd[1923]: 2009/03/18_14:12:56 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [init] (DC=true)
crmd[1923]: 2009/03/18_14:12:56 info: crm_update_peer_proc: lenny1.ais is now online
crmd[1923]: 2009/03/18_14:12:56 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [up] (DC=true)
heartbeat[1831]: 2009/03/18_14:13:26 info: Status update for node lenny1: status active
crmd[1923]: 2009/03/18_14:13:26 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [active] (DC=true)
cib[1919]: 2009/03/18_14:13:26 info: cib_client_status_callback: Status update: Client lenny1/cib now has status [join]
cib[1919]: 2009/03/18_14:13:26 info: crm_update_peer_proc: lenny1.cib is now online
heartbeat[1831]: 2009/03/18_14:13:30 WARN: 1 lost packet(s) for [lenny1] [55:57]
heartbeat[1831]: 2009/03/18_14:13:30 info: No pkts missing from lenny1!
crmd[1923]: 2009/03/18_14:13:30 notice: crmd_client_status_callback: Status update: Client lenny1/crmd now has status [online] (DC=true)
crmd[1923]: 2009/03/18_14:13:30 info: crm_update_peer_proc: lenny1.crmd is now online
heartbeat[1831]: 2009/03/18_14:13:31 WARN: 1 lost packet(s) for [lenny1] [59:61]
heartbeat[1831]: 2009/03/18_14:13:31 info: No pkts missing from lenny1!
crmd[1923]: 2009/03/18_14:13:33 WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce)
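(editor's note: the rejoin sequence in this log can be checked mechanically: heartbeat should walk lenny1 through init, up, and active, after which the DC's crmd should process the node's join.  a sketch that extracts the transitions from a saved excerpt; the file name and its contents are illustrative, taken from the log above:)

```shell
# Illustrative excerpt of the rejoin log quoted above.
cat > /tmp/rejoin.log <<'EOF'
heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status init
heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status up
heartbeat[1831]: 2009/03/18_14:13:26 info: Status update for node lenny1: status active
crmd[1923]: 2009/03/18_14:13:33 WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce)
EOF

# Membership transitions as seen by heartbeat: prints init, up, active,
# so the membership layer itself looks healthy.
sed -n 's/.*Status update for node lenny1: status \([a-z]*\).*/\1/p' /tmp/rejoin.log

# The suspicious part: the DC's crmd discards the rejoining node's
# join request instead of integrating it.
grep -c 'Ignoring HA message (op=join_announce)' /tmp/rejoin.log
```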