Hi, sorry for my late answer. I activated debug mode, and this is the output while shutting down the node:
Dec 13 00:45:56 san1 kernel: [ 361.100051] INFO: task clvmd:3291 blocked for more than 120 seconds.
Dec 13 00:45:56 san1 kernel: [ 361.100405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 13 00:45:56 san1 kernel: [ 361.100758] clvmd D ffff88043e51b170 0 3291 1 0x00000000
Dec 13 00:45:56 san1 kernel: [ 361.100763] ffff88043e51b170 0000000000000082 000000033e1b2118 ffff88000fa157e8
Dec 13 00:45:56 san1 kernel: [ 361.100766] ffff88043d7b0e58 ffff88000fa15780 000000000000f9e0 ffff88042469dfd8
Dec 13 00:45:56 san1 kernel: [ 361.100770] 0000000000015780 0000000000015780 ffff88043d7b0e20 ffff88043d7b1118
Dec 13 00:45:56 san1 kernel: [ 361.100773] Call Trace:
Dec 13 00:45:56 san1 kernel: [ 361.100784] [<ffffffff810482ed>] ? finish_task_switch+0x3a/0xaf
Dec 13 00:45:56 san1 kernel: [ 361.100790] [<ffffffff812fb73d>] ? schedule_timeout+0x2e/0xdd
Dec 13 00:45:56 san1 kernel: [ 361.100797] [<ffffffff810e5d79>] ? virt_to_head_page+0x9/0x2a
Dec 13 00:45:56 san1 kernel: [ 361.100800] [<ffffffff812fb5f4>] ? wait_for_common+0xde/0x15b
Dec 13 00:45:56 san1 kernel: [ 361.100803] [<ffffffff8104a4cc>] ? default_wake_function+0x0/0x9
Dec 13 00:45:56 san1 kernel: [ 361.100818] [<ffffffffa020096c>] ? dlm_new_lockspace+0x783/0x8da [dlm]
Dec 13 00:45:56 san1 kernel: [ 361.100822] [<ffffffff810e81a5>] ? __kmalloc+0x12f/0x141
Dec 13 00:45:56 san1 kernel: [ 361.100829] [<ffffffffa0206c67>] ? device_write+0x420/0x5f2 [dlm]
Dec 13 00:45:56 san1 kernel: [ 361.100834] [<ffffffff810cd128>] ? handle_mm_fault+0x7aa/0x80f
Dec 13 00:45:56 san1 kernel: [ 361.100839] [<ffffffff810ef544>] ? vfs_write+0xa9/0x102
Dec 13 00:45:56 san1 kernel: [ 361.100842] [<ffffffff810ef659>] ? sys_write+0x45/0x6e
Dec 13 00:45:56 san1 kernel: [ 361.100845] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Dec 13 00:46:06 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4716)
Dec 13 00:46:06 san1 lrmd: [4716]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:06 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4716 exited with return code 0
Dec 13 00:46:16 san1 corosync[2571]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=2600, seq=6) to terminate...
Dec 13 00:46:16 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4724)
Dec 13 00:46:16 san1 lrmd: [4724]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:16 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4724 exited with return code 0
Dec 13 00:46:26 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4732)
Dec 13 00:46:26 san1 lrmd: [4732]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:26 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4732 exited with return code 0
Dec 13 00:46:36 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4740)
Dec 13 00:46:36 san1 lrmd: [4740]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:36 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4740 exited with return code 0
Dec 13 00:46:46 san1 corosync[2571]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=2600, seq=6) to terminate...
Dec 13 00:46:46 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4748)
Dec 13 00:46:46 san1 lrmd: [4748]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:46 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4748 exited with return code 0
Dec 13 00:46:56 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4759)
Dec 13 00:46:56 san1 lrmd: [4759]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:56 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4759 exited with return code 0
Dec 13 00:47:06 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4767)
Dec 13 00:47:06 san1 lrmd: [4767]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:47:06 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4767 exited with return code 0
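
A note on the hung-task warning at the top: it only means clvmd has been in uninterruptible sleep for longer than the kernel's watchdog threshold; the trace shows it blocked inside dlm_new_lockspace, i.e. waiting on the DLM. The threshold itself is an ordinary sysctl; a quick sketch, using the stock proc interface the message itself points at:

    # show the current threshold (120 s by default, matching the message above)
    sysctl kernel.hung_task_timeout_secs
    # 0 only silences the warning; clvmd stays blocked on the DLM either way
    echo 0 > /proc/sys/kernel/hung_task_timeout_secs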
Regards,
Erik

----- Original Message -----
From: "Andrew Beekhof" <and...@beekhof.net>
To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
Sent: Thursday, 8 December 2011 04:15:46
Subject: Re: [Pacemaker] pcmk_shutdown: Still waiting for crmd

On Wed, Dec 7, 2011 at 9:44 PM, Erik Schwalbe <erik.schwa...@canoo.com> wrote:
> Hi,
>
> I think there is never an attempt to demote it.
> kern.log:
>
> 11:35:58 ubuntu1 kernel: Kernel logging (proc) stopped.
>
> corosync.log:
> 11:35:59 ubuntu1 crmd: [839]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
> 11:35:59 ubuntu1 crmd: [839]: info: crm_shutdown: Requesting shutdown
> 11:35:59 ubuntu1 crmd: [839]: notice: crm_shutdown: Forcing shutdown in: 1200000ms
>
> after that, there is no entry in kern.log

Sounds strange. Can you file a bug and attach a crm_report dating from the time you initiated shutdown, please?
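
For reference, crm_report can be bounded to that window with its -f/-t switches; the timestamps and report name below are only placeholders matching the logs above:

    # collect logs, CIB and cluster config from the shutdown window into a tarball
    crm_report -f "2011-12-13 00:45:00" -t "2011-12-13 00:50:00" /tmp/shutdown-report

Incidentally, the 1200000ms in the corosync.log above is the crmd shutdown-escalation timeout: 20 minutes before the shutdown is forced.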
>
> Regards,
> Erik
>
> ----- Original Message -----
> From: "Andreas Kurz" <andr...@hastexo.com>
> To: pacemaker@oss.clusterlabs.org
> Sent: Wednesday, 7 December 2011 11:06:14
> Subject: Re: [Pacemaker] pcmk_shutdown: Still waiting for crmd
>
> On 12/07/2011 10:27 AM, Erik Schwalbe wrote:
>> Hi,
>>
>> I built a test cluster with 2 nodes.
>> Ubuntu 10.04.3 LTS with *ppa:ubuntu-ha-maintainers/ppa*
>>
>> corosync 1.4.2
>> pacemaker 1.1.6
>>
>> primitive clvm ocf:lvm2:clvmd \
>>   params daemon_timeout="30" \
>>   operations $id="clvm-operations" \
>>   op start interval="0" timeout="90" \
>>   op stop interval="0" timeout="100" \
>>   op monitor interval="0" timeout="20" start-delay="0" \
>>   meta target-role="started"
>> primitive data ocf:heartbeat:LVM \
>>   params volgrpname="data" \
>>   operations $id="data-operations" \
>>   op start interval="0" timeout="30" \
>>   op stop interval="0" timeout="30" \
>>   op monitor interval="10" timeout="120" start-delay="0" \
>>   op methods interval="0" timeout="5" \
>>   meta target-role="started"
>> primitive dlm ocf:pacemaker:controld \
>>   operations $id="dlm-operations" \
>>   op start interval="0" timeout="90" \
>>   op stop interval="0" timeout="100" \
>>   op monitor interval="10" timeout="20" start-delay="0" \
>>   meta target-role="started"
>> primitive fs ocf:heartbeat:Filesystem \
>>   params device="/dev/data/test" directory="/data/test" fstype="ocfs2" \
>>   operations $id="fs-operations" \
>>   op start interval="0" timeout="60" \
>>   op stop interval="0" timeout="60" \
>>   op monitor interval="120" timeout="40" start-delay="0" \
>>   op notify interval="0" timeout="60" \
>>   meta target-role="started"
>> primitive o2cb ocf:pacemaker:o2cb \
>>   operations $id="o2cb-operations" \
>>   op start interval="0" timeout="90" \
>>   op stop interval="0" timeout="100" \
>>   op monitor interval="0" timeout="20" start-delay="0" \
>>   meta target-role="started"
>> primitive res_DRBD ocf:linbit:drbd \
>>   params drbd_resource="r0" \
>>   operations $id="res_DRBD-operations" \
>>   op start interval="0" timeout="240" \
>>   op promote interval="0" timeout="90" \
>>   op demote interval="0" timeout="90" \
>>   op stop interval="0" timeout="100" \
>>   op monitor interval="30" timeout="20" start-delay="1min" \
>>   op notify interval="0" timeout="90" \
>>   meta target-role="started"
>> group dlm-clvm dlm clvm
>> ms ms_DRBD res_DRBD \
>>   meta master-max="2" clone-max="2" notify="true" interleave="true"
>> clone clone_data data \
>>   meta clone-max="2" ordered="true" interleave="true"
>> clone dlm-clvm-clone dlm-clvm \
>>   meta interleave="true" ordered="true"
>> clone fs-clone fs \
>>   meta clone-max="2" ordered="true" interleave="true"
>> clone o2cb-clone o2cb \
>>   meta clone-max="2" interleave="true"
>> colocation col_data_clvm-dlm-clone inf: clone_data dlm-clvm-clone
>> colocation col_fs_o2cb inf: fs-clone o2cb-clone
>> colocation col_ms_DRBD_dlm-clvm-clone inf: dlm-clvm-clone ms_DRBD:Master
>> colocation col_o2cb_dlm-clvm inf: o2cb-clone dlm-clvm-clone
>> order ord_data_after_clvm-dlm-clone inf: dlm-clvm-clone clone_data
>> order ord_ms_DRBD_dlm-clvm-clone inf: ms_DRBD:promote dlm-clvm-clone:start
>> order ord_o2cb_after_dlm-clvm 0: dlm-clvm-clone o2cb-clone
>> order ord_o2cb_fs inf: o2cb-clone fs-clone
>> property $id="cib-bootstrap-options" \
>>   dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>   cluster-infrastructure="openais" \
>>   expected-quorum-votes="2" \
>>   stonith-enabled="false" \
>>   no-quorum-policy="ignore" \
>>   last-lrm-refresh="1323246238" \
>>   default-resource-stickiness="1000"
>>
>> The problem occurs when restarting corosync or rebooting a cluster node. All resources are stopped except for the drbd resource; then the system hangs for a long time.
>
> Is there a timeout on stopping/demoting DRBD, and do you see kernel messages from DRBD about being unable to demote because it is in use, or is there never an attempt to demote it?
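>
> One way to check that by hand (a sketch, using the resource name r0 from the config above; the exact kernel wording may vary by DRBD version):
>
>     # role on this node; still "Primary" after the stop attempt means no demote happened
>     drbdadm role r0
>     # try the demote manually, then look for DRBD complaints along the lines of
>     # "State change failed: Device is held open by someone"
>     drbdadm secondary r0
>     dmesg | grep -i drbd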
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>> corosync.log:
>>
>> ubuntu0 crmd: [926]: info: do_state_transition: (Re)Issuing shutdown request now that we are the DC
>> ubuntu0 crmd: [926]: info: do_state_transition: Starting PEngine Recheck Timer
>> ubuntu0 crmd: [926]: info: do_shutdown_req: Sending shutdown request to DC: ubuntu0
>> ubuntu0 crmd: [926]: info: handle_shutdown_request: Creating shutdown request for ubuntu0 (state=S_IDLE)
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=926, seq=6) to terminate...
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=926, seq=6) to terminate...
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=926, seq=6) to terminate...
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=926, seq=6) to terminate...
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=926, seq=6) to terminate...
>>
>> I tested the same config with a Debian 6.0.3. The reboot works; the behaviour there is that the drbd resource first demotes to secondary and then goes down.
>>
>> Is this a known problem?
>>
>> Thank you for your help.
>>
>> Regards,
>> Erik

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org