Re: [Linux-HA] node ignored after reboot

Sorry, I've had to ignore Heartbeat-based clusters for the last few weeks. There may have been a problem with 1.0.2 (I never tested it with Heartbeat), but my testing this week indicates the current code should work, so you might want to consider updating.

This looks suspicious, though:

  heartbeat[1831]: 2009/03/18_14:18:03 WARN: Message hist queue is filling up (377 messages in queue)

and would seem to indicate some sort of communications problem. I'd suggest grabbing the latest Pacemaker code and submitting a bug if you find it happens again.

Andrew

On Wed, Mar 18, 2009 at 18:29, Juha Heinanen wrote:
> [...]
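The "Message hist queue is filling up" warning points at a communications problem, and the logs in the original post below show only a single heartbeat link (Link lenny1:eth1). One common mitigation in Heartbeat setups is a redundant communication path. A hypothetical ha.cf fragment, not taken from this thread; the second interface name and the timing values are illustrative assumptions:

```
# hypothetical ha.cf sketch for a two-node heartbeat 2.x cluster;
# interface names and timings are assumptions, not from the thread
keepalive 2          # seconds between heartbeat packets
warntime 10          # warn if no heartbeat seen for this long
deadtime 30          # declare the peer dead after this long
initdead 60          # extra grace period at startup
udpport 694
bcast eth1           # the existing link seen in the logs
bcast eth0           # assumed second LAN link for redundancy
node lenny1
node lenny2
crm respawn          # hand resource management to the CRM (Pacemaker)
```

With two bcast lines, heartbeat keeps membership alive over either link, so losing one interface no longer looks like node death to the cluster.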
[Linux-HA] node ignored after reboot
i set up the example apache cluster of document

  http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0

but used mysql server instead of apache server. crm of my test cluster looks like this:

node $id="8df8447f-6ecf-41a7-a131-c89fd59a120d" lenny1
node $id="f13aff7b-6c94-43ac-9a24-b118e62d5325" lenny2
primitive drbd0 ocf:heartbeat:drbd \
    params drbd_resource="drbd0" \
    op monitor interval="59s" role="Master" timeout="30s" \
    op monitor interval="60s" role="Slave" timeout="30s"
primitive fs0 ocf:heartbeat:Filesystem \
    params ftype="ext3" directory="/var/lib/mysql" device="/dev/drbd0" \
    meta target-role="Started"
primitive mysql-server lsb:mysql \
    op monitor interval="10s" timeout="30s" start-delay="10s"
primitive virtual-ip ocf:heartbeat:IPaddr2 \
    params ip="192.98.102.10" broadcast="192.98.102.255" nic="eth1" cidr_netmask="24" \
    op monitor interval="21s" timeout="5s"
group mysql-group fs0 mysql-server virtual-ip
ms ms-drbd0 drbd0 \
    meta clone-max="2" notify="true" globally-unique="false" target-role="Started"
colocation mysql-group-on-ms-drbd0 inf: mysql-group ms-drbd0:Master
order ms-drbd0-before-mysql-group inf: ms-drbd0:promote mysql-group:start
property $id="cib-bootstrap-options" \
    dc-version="1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160" \
    default-resource-stickiness="1"

initially both nodes were online, lenny2 being the master. then i tried what happens when i reboot lenny1. when lenny1 was powered off, the cluster correctly looked like this:

# crm_mon -1

Last updated: Wed Mar 18 14:12:09 2009
Current DC: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325)
Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
2 Nodes configured.
2 Resources configured.

Node: lenny1 (8df8447f-6ecf-41a7-a131-c89fd59a120d): OFFLINE
Node: lenny2 (f13aff7b-6c94-43ac-9a24-b118e62d5325): online

Master/Slave Set: ms-drbd0
    drbd0:0 (ocf::heartbeat:drbd): Stopped
    drbd0:1 (ocf::heartbeat:drbd): Master lenny2
Resource Group: mysql-group
    fs0 (ocf::heartbeat:Filesystem): Started lenny2
    mysql-server (lsb:mysql): Started lenny2
    virtual-ip (ocf::heartbeat:IPaddr2): Started lenny2

when i powered lenny1 on again, i expected that it becomes online again, but it was totally ignored. the log is below. versions of software are heartbeat 2.99.2 and pacemaker 1.0.2.

any clues why lenny1 was ignored and my very first test to achieve high availability with heartbeat/pacemaker failed? people on the pacemaker list suspected ccm, which is part of heartbeat.

-- juha

--

this came to syslog when lenny1 was powered off:

r...@lenny2:~#
heartbeat[1831]: 2009/03/18_14:12:32 WARN: node lenny1: is dead
heartbeat[1831]: 2009/03/18_14:12:32 info: Link lenny1:eth1 dead.
crmd[1923]: 2009/03/18_14:12:32 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [dead] (DC=true)
crmd[1923]: 2009/03/18_14:12:32 info: crm_update_peer_proc: lenny1.ais is now offline
crmd[1923]: 2009/03/18_14:12:32 info: te_graph_trigger: Transition 12 is now complete
crmd[1923]: 2009/03/18_14:12:32 info: notify_crmd: Transition 12 status: done

and this when it was powered on again:

heartbeat[1831]: 2009/03/18_14:12:56 info: Heartbeat restart on node lenny1
heartbeat[1831]: 2009/03/18_14:12:56 info: Link lenny1:eth1 up.
heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status init
heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status up
crmd[1923]: 2009/03/18_14:12:56 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [init] (DC=true)
crmd[1923]: 2009/03/18_14:12:56 info: crm_update_peer_proc: lenny1.ais is now online
crmd[1923]: 2009/03/18_14:12:56 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [up] (DC=true)
heartbeat[1831]: 2009/03/18_14:13:26 info: Status update for node lenny1: status active
crmd[1923]: 2009/03/18_14:13:26 notice: crmd_ha_status_callback: Status update: Node lenny1 now has status [active] (DC=true)
cib[1919]: 2009/03/18_14:13:26 info: cib_client_status_callback: Status update: Client lenny1/cib now has status [join]
cib[1919]: 2009/03/18_14:13:26 info: crm_update_peer_proc: lenny1.cib is now online
heartbeat[1831]: 2009/03/18_14:13:30 WARN: 1 lost packet(s) for [lenny1] [55:57]
heartbeat[1831]: 2009/03/18_14:13:30 info: No pkts missing from lenny1!
crmd[1923]: 2009/03/18_14:13:30 notice: crmd_client_status_callback: Status update: Client lenny1/crmd now has status [online] (DC=true)
crmd[1923]: 2009/03/18_14:13:30 info: crm_update_peer_proc: lenny1.crmd is now online
heartbeat[1831]: 2009/03/18_14:13:31 WARN: 1 lost packet(s) for [lenny1] [59:61]
heartbeat[1831]: 2009/03/18_14:13:31 info: No pkts missing from lenny1!
crmd[1923]: 2009/03/18_14:13:33 WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce)
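For readers triaging similar logs: the key facts in the excerpt above are that heartbeat saw lenny1 go through init, up, and active, yet the DC's crmd then dropped the node's join_announce. A minimal sketch (not part of the thread) that pulls those status transitions and the dropped join out of a few representative lines from the excerpt:

```shell
# Sample lines copied from the syslog excerpt in this post.
log='heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status init
heartbeat[1831]: 2009/03/18_14:12:56 info: Status update for node lenny1: status up
heartbeat[1831]: 2009/03/18_14:13:26 info: Status update for node lenny1: status active
crmd[1923]: 2009/03/18_14:13:33 WARN: crmd_ha_msg_callback: Ignoring HA message (op=join_announce)'

# Membership transitions heartbeat reported for lenny1, in order;
# splitting on the lowercase "status " leaves the state in field 2.
printf '%s\n' "$log" | awk -F'status ' '/Status update for node lenny1/ { print $2 }'
# prints: init, up, active (one per line)

# How many join announcements the DC dropped in this excerpt.
printf '%s\n' "$log" | grep -c 'Ignoring HA message (op=join_announce)'
# prints: 1
```

If the transitions show the node reaching "active" while join_announce messages are still being ignored, the membership layer (ccm) and the CRM disagree, which matches the suspicion voiced on the pacemaker list.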