Thanks. I missed

    drbdadm -- --force create-md vlv

on node2; that was the cause of the problem. Because of it there was no
drbd-minor-0.lkbd file in /var/lib/drbd on node2, so the node did not work
properly (e.g. "drbdadm adjust" failed, and that killed pacemaker).

BTW, to speed up the initial resync, one could execute:

    drbdadm -- --force create-md vlv
    drbdadm up vlv
    drbdadm -- --clear-bitmap new-current-uuid vlv/0
    drbdadm primary vlv
    mkfs.xfs -d agcount=8 /dev/drbd0

P.S. I played with strace and the sources a bit and discovered that
"service drbd start" only appeared to work because it executes "drbdadm
adjust-with-progress", while pacemaker's script executes "drbdadm -v
adjust". Contrary to what the docs say, adjust and adjust-with-progress
differ by more than their debug output: it turns out there is a hack in
the drbdadm code which forces them to behave differently:

    for (stage = CFG_PREREQ; stage < __CFG_LAST; stage++) {
        r = _run_deferred_cmds(stage);
        if (r) {
            if (!adjust_with_progress)
                return 1; /* FIXME r? */
            ret = 1;
        }
    }

The "return 1" marked FIXME is reached in "adjust" mode but NOT when
"adjust-with-progress" is active. So with "adjust-with-progress", drbdadm
silently skips failed steps and continues with the next ones, while
"adjust" fails on the first failed step.

On Sat, Jan 3, 2015 at 10:27 PM, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:

> 03.01.2015 20:35, Dmitry Koterov wrote:
>
>> Hello.
>>
>> Ubuntu 14.04, corosync 2.3.3, pacemaker 1.1.10. The cluster consists of
>> 2 nodes (node1 and node2). When I run "crm node standby node2" and then,
>> a minute later, "crm node online node2", the DRBD secondary on node2
>> does not start. Logs say that "drbdadm -c /etc/drbd.conf check-resize
>> vlv" fails with the error message "No valid meta data found" on the
>> onlining node. And, surprisingly, after I run "service drbd start" on
>> node2 manually, everything becomes fine.
>>
>> Maybe something is broken in /usr/lib/ocf/resource.d/linbit/drbd; why
>> can't it start DRBD?
>> Or am I misconfigured somehow? Could you please
>> give advice on what to do?
>
> Please see inline in the drbdadm output.
>
>> I have the following configuration (drbd + mount + postgresql, but
>> postgresql is innocent here, so just ignore it):
>>
>> *root@node2:/var/log#* crm configure show
>>
>> node $id="1017525950" node2 attributes standby="off"
>> node $id="1760315215" node1
>> primitive drbd ocf:linbit:drbd \
>>     params drbd_resource="vlv" \
>>     op start interval="0" timeout="240" \
>>     op stop interval="0" timeout="120"
>> primitive fs ocf:heartbeat:Filesystem \
>>     params device="/dev/drbd0" directory="/var/lib/vlv.drbd/root"
>>     options="noatime,nodiratime" fstype="xfs" \
>>     op start interval="0" timeout="300" \
>>     op stop interval="0" timeout="300"
>> primitive postgresql lsb:postgresql \
>>     op monitor interval="4" timeout="60" \
>>     op start interval="0" timeout="60" \
>>     op stop interval="0" timeout="60"
>> group pgserver fs postgresql
>> ms ms_drbd drbd \
>>     meta master-max="1" master-node-max="1" clone-max="2"
>>     clone-node-max="1" notify="true"
>> location cli-prefer-pgserver pgserver inf: node1
>> colocation col_pgserver inf: pgserver ms_drbd:Master
>> order ord_pgserver inf: ms_drbd:promote pgserver:start
>> property $id="cib-bootstrap-options" dc-version="1.1.10-42f2063"
>>     cluster-infrastructure="corosync"
>>     stonith-enabled="false" no-quorum-policy="ignore"
>>     last-lrm-refresh="1420304078"
>> rsc_defaults $id="rsc-options" \
>>     resource-stickiness="100"
>>
>> The cluster and DRBD statuses on node2 look healthy:
>>
>> *root@node2:/var/log#* crm status
>> ...
>> Online: [ node1 node2 ]
>> Master/Slave Set: ms_drbd [drbd]
>>     Masters: [ node1 ]
>>     Slaves: [ node2 ]
>> Resource Group: pgserver
>>     fs (ocf::heartbeat:Filesystem): Started node1
>>     postgresql (lsb:postgresql): Started node1
>>
>> *root@node2:/var/log#* cat /proc/drbd
>> version: 8.4.3 (api:1/proto:86-101)
>> srcversion: F97798065516C94BE0F27DC
>> 0: cs:Connected ro:Secondary/Primary ds:Diskless/UpToDate C r-----
>
>                                          ^^^^^^^^
> that is your problem
>
>> ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>>
>> Now I switch node2 to standby and verify that DRBD on it has really shut
>> down:
>>
>> *root@node1:/etc/rc2.d#* crm node standby node2
>> *root@node2:/var/log#* cat /proc/drbd
>> version: 8.4.3 (api:1/proto:86-101)
>> srcversion: F97798065516C94BE0F27DC
>> *root@node2:/var/log#*
>>
>> Then I switch node2 back online and see that DRBD has not been
>> initialized and reattached again!
>>
>> *root@node2:/var/log#* >syslog
>> *root@node1:/etc#* crm node online node2
>> *root@node2:/var/log#* crm status
>> ...
>> Online: [ node1 node2 ]
>> Master/Slave Set: ms_drbd [drbd]
>>     Masters: [ node1 ]
>>     Stopped: [ node2 ]
>> Resource Group: pgserver
>>     fs (ocf::heartbeat:Filesystem): Started node1
>>     postgresql (lsb:postgresql): Started node1
>>
>> Failed actions:
>>     drbd_start_0 (node=node2, call=81, rc=1, status=complete,
>>     last-rc-change=Sat Jan 3 12:05:32 2015, queued=1118ms, exec=0ms):
>>     unknown error
>>
>> *root@node2:/var/log#* cat syslog | head -n 30
>>
>> Jan 3 12:05:31 node2 crmd[918]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Jan 3 12:05:31 node2 cib[913]: notice: cib:diff: Diff: --- 0.29.3
>> Jan 3 12:05:31 node2 cib[913]: notice: cib:diff: Diff: +++ 0.30.1 027344551b46745123e4a52562e55974
>> Jan 3 12:05:31 node2 pengine[917]: notice: unpack_config: On loss of CCM Quorum: Ignore
>> Jan 3 12:05:31 node2 pengine[917]: notice: LogActions: Start drbd:1#011(node2)
>> Jan 3 12:05:31 node2 crmd[918]: notice: te_rsc_command: Initiating action 46: notify drbd_pre_notify_start_0 on node1
>> Jan 3 12:05:31 node2 pengine[917]: notice: process_pe_message: Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-11.bz2
>> Jan 3 12:05:32 node2 crmd[918]: notice: te_rsc_command: Initiating action 10: start drbd:1_start_0 on node2 (local)
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c /etc/drbd.conf check-resize vlv
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c /etc/drbd.conf --peer node1 attach vlv
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
>> Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c /etc/drbd.conf --peer node1 attach vlv
>> Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
>> Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
>> Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished: drbd_start_0:1931:stderr [ no suitable meta data found :( ]
>> Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished: drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal check-resize' terminated with exit code 255 ]
>> Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished: drbd_start_0:1931:stderr [ No valid meta data found ]
>> Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished: drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal apply-al' terminated with exit code 255 ]
>> Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished: drbd_start_0:1931:stderr [ No valid meta data found ]
>> Jan 3 12:05:33 node2 lrmd[915]: notice: operation_finished: drbd_start_0:1931:stderr [ Command 'drbdmeta 0 v08 /dev/loop0 internal apply-al' terminated with exit code 255 ]
>> Jan 3 12:05:33 node2 crmd[918]: notice: process_lrm_event: LRM operation drbd_start_0 (call=81, rc=1, cib-update=79, confirmed=true) unknown error
>> Jan 3 12:05:33 node2 crmd[918]: notice: process_lrm_event: node2-drbd_start_0:81 [ \n\n\n\n\n\n\n ]
>> Jan 3 12:05:33 node2 crmd[918]: warning: status_from_rc: Action 10 (drbd:1_start_0) on node2 failed (target: 0 vs. rc: 1): Error
>> Jan 3 12:05:33 node2 crmd[918]: warning: update_failcount: Updating failcount for drbd on node2 after failed start: rc=1 (update=INFINITY, time=1420304733)
>> Jan 3 12:05:33 node2 attrd[916]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-drbd (INFINITY)
>> Jan 3 12:05:33 node2 attrd[916]: notice: attrd_perform_update: Sent update 60: fail-count-drbd=INFINITY
>> Jan 3 12:05:33 node2 attrd[916]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-drbd (1420304733)
>>
>> *root@node2:/var/log#* cat syslog | grep ERROR | head -n 30
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c /etc/drbd.conf check-resize vlv
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c /etc/drbd.conf --peer node1 attach vlv
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
>> Jan 3 12:05:32 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
>> Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Called drbdadm -c /etc/drbd.conf --peer node1 attach vlv
>> Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Exit code 255
>> Jan 3 12:05:33 node2 drbd(drbd)[1931]: ERROR: vlv: Command output:
>> Jan 3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Called drbdadm -c /etc/drbd.conf --peer node1 -v adjust vlv
>> Jan 3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Exit code 1
>> Jan 3 12:05:33 node2 drbd(drbd)[2021]: ERROR: vlv: Command output: drbdmeta 0 v08 /dev/loop0 internal apply-al
>>
>> Here's the DRBD config file:
>>
>> *root@node2:/var/log#* cat /etc/drbd.d/drbd.res
>> resource vlv {
>>     device /dev/drbd0;
>>     disk /dev/loop0;
>>     meta-disk internal;
>>     syncer { rate 150M; verify-alg md5; }
>>     on node1 { address x.x.x.x:7788; }
>>     on node2 { address y.y.y.y:7788; }
>> }
>>
>> *root@node2:/var/log#* losetup -a
>> /dev/loop0: [fd01]:1314858 (/var/lib/vlv.drbd/vlv.img)
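P.P.S. To make the adjust vs. adjust-with-progress difference above concrete, here is a small standalone shell sketch of the two control flows. The stage names and the run_stage function are hypothetical stand-ins for drbdadm's _run_deferred_cmds() stages, not actual drbdadm code:

```shell
#!/bin/sh
# Hypothetical illustration of the control-flow difference, NOT drbdadm code.
# run_stage stands in for _run_deferred_cmds(); here the "disk" stage fails.
run_stage() {
    echo "running stage: $1"
    [ "$1" != "disk" ]          # simulate a failure in the "disk" stage
}

adjust() {
    # "adjust" behavior: bail out at the first failed stage
    for stage in prereq disk net; do
        run_stage "$stage" || return 1
    done
}

adjust_with_progress() {
    # "adjust-with-progress" behavior: remember the failure but keep going
    ret=0
    for stage in prereq disk net; do
        run_stage "$stage" || ret=1
    done
    return $ret
}

echo "--- adjust ---"
adjust || echo "failed, remaining stages skipped"
echo "--- adjust-with-progress ---"
adjust_with_progress || echo "failed, but all stages were attempted"
```

Both functions end up returning 1, but only adjust_with_progress goes on to run the "net" stage after "disk" fails, which is exactly why a broken resource can look fine under "service drbd start" yet fail under pacemaker's "drbdadm -v adjust".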
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org