22.07.2020 12:46, Хиль Эдуард пишет: > > Hey, Andrei! Thanx for ur time! > A-a-and there is no chance to do something? :( > The pacemaker’s log below. >
Resource was started: ... > Jul 22 12:38:36 node2.local pacemaker-execd [1721] (log_execute) > info: executing - rsc:dummy.service action:start call_id:76 > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.131.4 2 > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.131.5 (null) > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=5 > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_perform_op) > info: + > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']: > @operation_key=dummy.service_start_0, @operation=start, > @transition-key=164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, > @transition-magic=-1:193;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, > @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1595410716, > @last-run=1595410716, @e > Jul 22 12:38:36 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_modify operation for section status: OK (rc=0, > origin=node2.local/crmd/62, version=0.131.5) > Jul 22 12:38:36 node2.local pacemaker-execd [1721] (systemd_exec_result) > info: Call to start passed: /org/freedesktop/systemd1/job/703 > Jul 22 12:38:38 node2.local pacemaker-controld [1724] (process_lrm_event) > notice: Result of start operation for dummy.service on node2.local: 0 (ok) | > call=76 key=dummy.service_start_0 confirmed=true cib-update=63 So start operation at least was successfully completed. > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_process_request) > info: Forwarding cib_modify operation for section status to all > (origin=local/crmd/63) > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.131.5 2 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.131.6 (null) > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=6 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: + > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']: > @transition-magic=0:0;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, > @call-id=76, @rc-code=0, @op-status=0, @last-rc-change=1986, @last-run=1986, > @exec-time=-587720, @queue-time=59 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_modify operation for section status: OK (rc=0, > origin=node2.local/crmd/63, version=0.131.6) > Jul 22 12:38:38 node2.local pacemaker-controld [1724] (do_lrm_rsc_op) > info: Performing key=165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a > op=dummy.service_monitor_60000 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_process_request) > info: Forwarding cib_modify operation for section status to all > (origin=local/crmd/64) > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.131.6 2 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.131.7 (null) > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=7 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: ++ > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']: > <lrm_rsc_op id="dummy.service_monitor_60000" > operation_key="dummy.service_monitor_60000" operation="monitor" > crm-debug-origin="do_update_resource" crm_feature_set="3.2.0" > transition-key="165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a" > transition-magic="-1:193;165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a" > exit-reason="" on_ > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_modify operation for section status: OK (rc=0, > origin=node2.local/crmd/64, version=0.131.7) > Jul 22 12:38:38 node2.local pacemaker-controld [1724] (process_lrm_event) > notice: Result of monitor operation for dummy.service on node2.local: 0 (ok) > | call=77 key=dummy.service_monitor_60000 confirmed=false cib-update=65 And monitor confirmed that it was started > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_process_request) > info: Forwarding cib_modify operation for section status to all > (origin=local/crmd/65) > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.131.7 2 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.131.8 (null) > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=8 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_perform_op) > info: + > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_monitor_60000']: > @transition-magic=0:0;165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, > @call-id=77, @rc-code=0, @op-status=0, @last-rc-change=1989, @exec-time=1542, > @queue-time=63 > Jul 22 12:38:38 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_modify operation for section status: OK (rc=0, > origin=node2.local/crmd/65, version=0.131.8) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Forwarding cib_apply_diff operation for section 'all' to all > (origin=local/cibadmin/2) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.131.8 2 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.132.0 (null) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @epoch=132, @num_updates=0 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: ++ /cib/configuration/resources/primitive[@id='dummy.service']: > <meta_attributes id="dummy.service-meta_attributes"/> > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: ++ > <nvpair id="dummy.service-meta_attributes-target-role" name="target-role" > value="Stopped"/> Something requested stop of this resource. Logs from DC at this point may provide more clues why it happened. > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: ++ > </meta_attributes> > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_apply_diff operation for section 'all': OK (rc=0, > origin=node2.local/cibadmin/2, version=0.132.0) > Jul 22 12:38:42 node2.local pacemaker-fenced [1720] > (update_cib_stonith_devices_v2) info: Updating device list from the cib: > create primitive[@id='dummy.service'] > Jul 22 12:38:42 node2.local pacemaker-fenced [1720] (cib_devices_update) > info: Updating devices to version 0.132.0 > Jul 22 12:38:42 node2.local pacemaker-fenced [1720] (cib_device_update) > info: Device mfs4.stonith has been disabled on node2.local: score=-INFINITY > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_file_backup) > info: Archived previous version as /var/lib/pacemaker/cib/cib-21.raw > Jul 22 12:38:42 node2.local pacemaker-based [1719] > (cib_file_write_with_digest) info: Wrote version 0.132.0 of the CIB to > disk (digest: 408cdd7e93bae2d8e2899d800ab82c61) > Jul 22 12:38:42 node2.local pacemaker-based [1719] > (cib_file_write_with_digest) info: Reading cluster configuration file > /var/lib/pacemaker/cib/cib.oVovE6 (digest: /var/lib/pacemaker/cib/cib.SiUE36) > Jul 22 12:38:42 node2.local pacemaker-execd [1721] > (cancel_recurring_action) info: Cancelling systemd operation > dummy.service_status_60000 > Jul 22 12:38:42 node2.local pacemaker-controld [1724] (do_lrm_rsc_op) > info: Performing key=163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a > op=dummy.service_stop_0 > Jul 22 12:38:42 node2.local pacemaker-execd [1721] (log_execute) > info: executing - rsc:dummy.service action:stop call_id:79 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Forwarding cib_modify operation for section status to all > (origin=local/crmd/66) > Jul 22 12:38:42 node2.local pacemaker-controld [1724] (process_lrm_event) > info: Result of monitor operation for dummy.service on node2.local: > Cancelled | call=77 key=dummy.service_monitor_60000 confirmed=true > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.132.0 2 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.132.1 (null) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=1 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']: > @operation_key=dummy.service_stop_0, @operation=stop, > @transition-key=163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a, > @transition-magic=-1:193;163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a, > @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1595410722, > @last-run=1595410722, @exe > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_modify operation for section status: OK (rc=0, > origin=node2.local/crmd/66, version=0.132.1) > Jul 22 12:38:42 node2.local pacemaker-execd [1721] (systemd_exec_result) > info: Call to stop passed: /org/freedesktop/systemd1/job/798 > Jul 22 12:38:42 node2.local pacemaker-execd [1721] (action_complete) > notice: Giving up on dummy.service stop (rc=0): timeout (elapsed=299980ms, > remaining=-199980ms) Well, times are absolutely nonsensical. Operation was requested less than a second ago. That sounds like the same problem as reported in https://lists.clusterlabs.org/pipermail/users/2020-June/027212.html You need to investigate what tries to stop you resource immediately after start. > Jul 22 12:38:42 node2.local pacemaker-controld [1724] (process_lrm_event) > error: Result of stop operation for dummy.service on node2.local: Timed Out > | call=79 key=dummy.service_stop_0 timeout=100000ms > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Forwarding cib_modify operation for section status to all > (origin=local/crmd/67) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.132.1 2 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.132.2 (null) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=2 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']: > @transition-magic=2:198;163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a, > @call-id=79, @rc-code=198, @op-status=2, @last-rc-change=1992, > @last-run=1992, @exec-time=300031, @queue-time=57 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: ++ > /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']: > <lrm_rsc_op id="dummy.service_last_failure_0" > operation_key="dummy.service_stop_0" operation="stop" > crm-debug-origin="do_update_resource" crm_feature_set="3.2.0" > transition-key="163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a" > transition-magic="2:198;163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a" > exit-reason="" on_node="mfs4 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_modify operation for section status: OK (rc=0, > origin=node2.local/crmd/67, version=0.132.2) > Jul 22 12:38:42 node2.local pacemaker-attrd [1722] (attrd_peer_update) > notice: Setting fail-count-dummy.service#stop_0[node2.local]: (unset) -> > INFINITY | from node1.local > Jul 22 12:38:42 node2.local pacemaker-attrd [1722] (attrd_peer_update) > notice: Setting last-failure-dummy.service#stop_0[node2.local]: (unset) -> > 1595410722 | from node1.local > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.132.2 2 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.132.3 (null) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=3 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: ++ > /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']: > <nvpair id="status-2-fail-count-dummy.service.stop_0" > name="fail-count-dummy.service#stop_0" value="INFINITY"/> > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_modify operation for section status: OK (rc=0, > origin=node1.local/attrd/34, version=0.132.3) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.132.3 2 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.132.4 (null) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=4 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: ++ > /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']: > <nvpair id="status-2-last-failure-dummy.service.stop_0" > name="last-failure-dummy.service#stop_0" value="1595410722"/> > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_modify operation for section status: OK (rc=0, > origin=node1.local/attrd/35, version=0.132.4) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.132.4 2 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.132.5 (null) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: -- > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='drbd_docker1']/lrm_rsc_op[@id='drbd_docker1_monitor_60000'] > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=5 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_delete operation for section status: OK (rc=0, > origin=node1.local/crmd/246, version=0.132.5) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: --- 0.132.5 2 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: Diff: +++ 0.132.6 (null) > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: -- > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='drbd_docker2']/lrm_rsc_op[@id='drbd_docker2_monitor_60000'] > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_perform_op) > info: + /cib: @num_updates=6 > Jul 22 12:38:42 node2.local pacemaker-based [1719] (cib_process_request) > info: Completed cib_delete operation for section status: OK (rc=0, > origin=node1.local/crmd/247, version=0.132.6) > Jul 22 12:38:47 node2.local pacemaker-based [1719] (cib_process_ping) > info: Reporting our current digest to node1.local: > 6406f330dbe96e8c3e732c29f91a5865 for 0.132.6 (0x55a3b5b7a640 0) > >> Среда, 22 июля 2020, 14:17 +05:00 от Andrei Borzenkov <arvidj...@gmail.com>: >> >> >> On Wed, Jul 22, 2020 at 10:59 AM Хиль Эдуард < i.am.t...@mail.ru > wrote: >>> Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on ubuntu 20 >>> + 1 qdevice. I want to define new resource as systemd unit dummy.service : >>> >>> [Unit] >>> Description=Dummy >>> [Service] >>> Restart=on-failure >>> StartLimitInterval=20 >>> StartLimitBurst=5 >>> TimeoutStartSec=0 >>> RestartSec=5 >>> Environment="HOME=/root" >>> SyslogIdentifier=dummy >>> ExecStart=/usr/local/sbin/dummy.sh >>> [Install] >>> WantedBy=multi-user.target >>> >>> and /usr/local/sbin/dummy.sh : >>> >>> #!/bin/bash >>> CNT=0 >>> while true; do >>> let CNT++ >>> echo "hello world $CNT" >>> sleep 5 >>> done >>> >>> and then i try to define it with: pcs resource create dummy.service >>> systemd:dummy op monitor interval="10s" timeout="15s" >>> after 2 seconds node2 reboot. >> >> Node reboots because stop operation failed, no start. >> >>> In logs i see pacemaker in 2 seconds tried to start this unit, and it >>> started, but pacemaker somehow think he is «Timed Out» . What i am doing >>> wrong? Logs below. >>> >>> >>> Jul 21 15:53:41 node2.local pacemaker-controld[1813]: notice: Result of >>> probe operation for dummy.service on node2.local: 7 (not running) >>> Jul 21 15:53:41 node2.local systemd[1]: Reloading. >>> Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5: >>> ListenStream= references a path below legacy directory /var/run/, updating >>> /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please >>> update the unit file accordingly. >>> Jul 21 15:53:42 node2.local systemd[1]: >>> /lib/systemd/system/docker.socket:6: ListenStream= references a path below >>> legacy directory /var/run/, updating /var/run/docker.sock → >>> /run/docker.sock; please update the unit file accordingly. >>> Jul 21 15:53:42 node2.local pacemaker-execd[1808]: notice: Giving up on >>> dummy.service start (rc=0): timeout (elapsed=259719ms, remaining=-159719ms) >>> Jul 21 15:53:42 node2.local pacemaker-controld[1813]: error: Result of >>> start operation for dummy.service on node2.local: Timed Out >>> Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled dummy. >>> Jul 21 15:53:42 node2.local dummy[9330]: hello world 1 >>> Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface >>> NamePolicy= disabled on kernel command line, ignoring. >>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]: notice: Setting >>> fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY >>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]: notice: Setting >>> last-failure-dummy.service#start_0[node2.local]: (unset) -> 1595336022 >>> Jul 21 15:53:42 node2.local systemd[1]: Reloading. >>> Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5: >>> ListenStream= references a path below legacy directory /var/run/, updating >>> /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please >>> update the unit file accordingly. >>> Jul 21 15:53:42 node2.local systemd[1]: >>> /lib/systemd/system/docker.socket:6: ListenStream= references a path below >>> legacy directory /var/run/, updating /var/run/docker.sock → >>> /run/docker.sock; please update the unit file accordingly. >>> Jul 21 15:53:42 node2.local pacemaker-execd[1808]: notice: Giving up on >>> dummy.service stop (rc=0): timeout (elapsed=317181ms, remaining=-217181ms) >> >> 317181ms == 5 minutes. Barring pacemaker bug, you need to show pacemaker log >> since the very first start operation so we can see proper timing. Seeing >> that systemd was reloaded in between, it is quite possible that systemd lost >> track of pending job so any client waiting for confirmation hangs forever. >> Such problems were known, not sure what current status is (if it ever was >> fixed). >> >>> Jul 21 15:53:42 node2.local pacemaker-controld[1813]: error: Result of >>> stop operation for dummy.service on node2.local: Timed Out >>> Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy... >>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]: notice: Setting >>> fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY >>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]: notice: Setting >>> last-failure-dummy.service#stop_0[node2.local]: (unset) -> 1595336022 >>> Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded. >>> Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy. >>> ... lost connection (node rebooting) >>> >>> _______________________________________________ >>> Manage your subscription: >>> https://lists.clusterlabs.org/mailman/listinfo/users >>> >>> ClusterLabs home: https://www.clusterlabs.org/ > > > -- > Хиль Эдуард > > _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/