----- Original Message -----
> Hello,
>
> I have two active-passive failover setups built on corosync and DRBD.
> One uses two Debian servers and the other uses two Ubuntu servers.
> The Debian pair provides web server failover and the Ubuntu pair
> provides database server failover.
>
> I applied the same Pacemaker configuration to both. Everything works
> fine: failover completes cleanly and the file systems stay synchronized.
> On the Ubuntu pair, however, an error always shows up after a couple of
> weeks or months. Pacemaker on ubuntu1 had a different status than on
> ubuntu2: ubuntu1 assumed that ubuntu2 was down, while ubuntu2 assumed
> that something had happened to ubuntu1 but considered itself still alive
> and took over the resources. As a result the DRBD resource could not be
> taken over, so no failover happened and we had to restart the server
> manually, because restarting pacemaker and corosync didn't help. I have
> changed the Pacemaker configuration a couple of times, but the problem
> still exists.
>
> Has anyone experienced this? I use Ubuntu 14.04.1 LTS.
>
> I got this error in apport.log:
>
> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable:
> /usr/lib/pacemaker/lrmd (command line "/usr/lib/pacemaker/lrmd")

wow, it looks like the lrmd is crashing on you. I haven't seen this occur
in the wild before. Without a backtrace it will be nearly impossible to
determine what is happening. Do you have the ability to upgrade pacemaker
to a newer version?
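In the meantime, you may be able to pull a backtrace out of the crash
report apport already wrote for you. A rough sketch (the debug symbol
package name "pacemaker-dbg" is my guess for 14.04, adjust if it differs
on your system):

    # unpack the crash report apport wrote (path from your apport.log)
    sudo apport-unpack /var/crash/_usr_lib_pacemaker_lrmd.0.crash /tmp/lrmd-crash

    # install gdb plus the pacemaker debug symbols
    sudo apt-get install gdb pacemaker-dbg

    # pull a full backtrace of all threads out of the extracted core dump
    gdb /usr/lib/pacemaker/lrmd /tmp/lrmd-crash/CoreDump \
        -batch -ex 'thread apply all bt full'

Posting that output here would help a lot in figuring out what the lrmd
is dying on.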
-- Vossel

> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: is_closing_session():
> no DBUS_SESSION_BUS_ADDRESS in environment
> ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: wrote report
> /var/crash/_usr_lib_pacemaker_lrmd.0.crash
>
> my pacemaker configuration:
>
> node $id="1" db \
>         attributes standby="off"
> node $id="2" db2 \
>         attributes standby="off"
> primitive ClusterIP ocf:heartbeat:IPaddr2 \
>         params ip="192.168.0.100" cidr_netmask="24" \
>         op monitor interval="30s"
> primitive DBase ocf:heartbeat:mysql \
>         meta target-role="Started" \
>         op start timeout="120s" interval="0" \
>         op stop timeout="120s" interval="0" \
>         op monitor interval="20s" timeout="30s"
> primitive DbFS ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/sync" fstype="ext4" \
>         op start timeout="60s" interval="0" \
>         op stop timeout="180s" interval="0" \
>         op monitor interval="60s" timeout="60s"
> primitive Links lsb:drbdlinks
> primitive r0 ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="29s" role="Master" \
>         op start timeout="240s" interval="0" \
>         op stop timeout="180s" interval="0" \
>         op promote timeout="180s" interval="0" \
>         op demote timeout="180s" interval="0" \
>         op monitor interval="30s" role="Slave"
> group DbServer ClusterIP DbFS Links DBase
> ms ms_r0 r0 \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true" target-role="Master"
> location prefer-db DbServer 50: db
> colocation DbServer-with-ms_ro inf: DbServer ms_r0:Master
> order DbServer-after-ms_ro inf: ms_r0:promote DbServer:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.10-42f2063" \
>         cluster-infrastructure="corosync" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore" \
>         last-lrm-refresh="1363370585"
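One thing that jumps out of the configuration itself: with
stonith-enabled="false" and no-quorum-policy="ignore", neither node can
ever fence the other, so the "each node thinks the other failed" state
you describe has nothing to stop it. That's not what is crashing the
lrmd, but it is what turns a crash into a stuck cluster. If the hardware
has IPMI, fencing would look something like this in crm configure syntax
(sketch only; the agent choice and all device parameters are placeholders
you would need to fill in for your environment):

    primitive st-db stonith:external/ipmi \
        params hostname="db" ipaddr="<bmc-ip-of-db>" \
            userid="<user>" passwd="<pass>"
    primitive st-db2 stonith:external/ipmi \
        params hostname="db2" ipaddr="<bmc-ip-of-db2>" \
            userid="<user>" passwd="<pass>"
    # never run a node's own fencing device on itself
    location st-db-not-on-db st-db -inf: db
    location st-db2-not-on-db2 st-db2 -inf: db2
    property stonith-enabled="true"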
> my corosync config:
>
> totem {
>         version: 2
>         token: 3000
>         token_retransmits_before_loss_const: 10
>         join: 60
>         consensus: 3600
>         vsftype: none
>         max_messages: 20
>         clear_node_high_bit: yes
>         secauth: off
>         threads: 0
>         rrp_mode: none
>         transport: udpu
>         cluster_name: Dbcluster
> }
>
> nodelist {
>         node {
>                 ring0_addr: db
>                 nodeid: 1
>         }
>         node {
>                 ring0_addr: db2
>                 nodeid: 2
>         }
> }
>
> quorum {
>         provider: corosync_votequorum
> }
>
> amf {
>         mode: disabled
> }
>
> service {
>         ver: 0
>         name: pacemaker
> }
>
> aisexec {
>         user: root
>         group: root
> }
>
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: yes
>         logfile: /var/log/corosync/corosync.log
>         to_syslog: no
>         syslog_facility: daemon
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
> }
>
> my drbd.conf:
>
> global {
>         usage-count no;
> }
>
> common {
>         protocol C;
>
>         handlers {
>                 pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
>                     /usr/lib/drbd/notify-emergency-reboot.sh;
>                     echo b > /proc/sysrq-trigger ; reboot -f";
>                 pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
>                     /usr/lib/drbd/notify-emergency-reboot.sh;
>                     echo b > /proc/sysrq-trigger ; reboot -f";
>                 local-io-error "/usr/lib/drbd/notify-io-error.sh;
>                     /usr/lib/drbd/notify-emergency-shutdown.sh;
>                     echo o > /proc/sysrq-trigger ; halt -f";
>         }
>
>         startup {
>                 degr-wfc-timeout 120;
>         }
>
>         disk {
>                 on-io-error detach;
>         }
>
>         syncer {
>                 rate 100M;
>                 al-extents 257;
>         }
> }
>
> resource r0 {
>         protocol C;
>         flexible-meta-disk internal;
>
>         on db2 {
>                 address 192.168.0.10:7801;
>                 device /dev/drbd0 minor 0;
>                 disk /dev/sdb1;
>         }
>         on db {
>                 device /dev/drbd0 minor 0;
>                 disk /dev/db/sync;
>                 address 192.168.0.20:7801;
>         }
>         handlers {
>                 split-brain "/usr/lib/drbd/notify-split-brain.sh root";
>         }
>         net {
>                 after-sb-0pri discard-younger-primary; # discard-zero-changes;
>                 after-sb-1pri discard-secondary;
>                 after-sb-2pri call-pri-lost-after-sb;
>         }
> }
>
> I have no idea how to solve this problem. Maybe someone can help me.
>
> best regards,
>
> ariee
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
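Also, "the drbd resource cannot be taken over" is what a DRBD split-brain
tends to look like from pacemaker's side. Check /proc/drbd on both nodes
and root's mail; your notify-split-brain.sh handler should have sent a
notification when it happened. If that is what you find, the usual manual
recovery (assuming here that db2 is the node whose changes you decide to
throw away; pick the victim carefully) goes roughly like this:

    # on the node whose data you are discarding (here: db2)
    drbdadm secondary r0
    drbdadm -- --discard-my-data connect r0

    # on the surviving node, if /proc/drbd shows it StandAlone
    drbdadm connect r0

That only repairs DRBD after the fact, though. Without working fencing,
the same situation will eventually come back.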