On 02.12.2017 16:30, Jan Pokorný wrote:
>
> In race-condition free situation, such a BindsTo-incurred stopping (or
> at least scheduled to since 235?) of the service is then not a subject
> of auto-restarting, from what I've observed, and documentation agrees:
>
> Restart= [...] When the death of the process is a result of systemd
> operation (e.g. service stop or restart), the service will not be
> restarted
>

Yes, if systemd has a chance to explicitly queue the Stop action, that's correct.

>>>> (FTR, I tried with systemd 235).
>>>>
>>
>> Well ... what we have here is race condition. We have two events -
>> corosync.service and pacemaker.service *independent* failures
>> and two (re-)actions - stop pacemaker.service in response to the
>> former (due to BindsTo) and restart pacemaker.service in response to
>> the latter (due to Restart=on-failure). The final result depends on
>> the order in which systemd gets those events and schedules actions
>> (and relative timing when those actions complete) and this is not
>> deterministic.
>
> Coming to similar conclusion.
>

To illustrate, here are two logs from the same system, taken from two consecutive runs of "systemctl start pacemaker; pkill -9 corosync".

Number 1:

Dec 02 20:03:17 ha1 sbd[1462]: cluster: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Dec 02 20:03:17 ha1 systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
Dec 02 20:03:17 ha1 sbd[1462]: cluster: warning: sbd_membership_destroy: Lost connection to corosync
Dec 02 20:03:17 ha1 systemd[1]: corosync.service: Unit entered failed state.
Dec 02 20:03:17 ha1 sbd[1462]: cluster: error: set_servant_health: Cluster connection terminated
Dec 02 20:03:17 ha1 systemd[1]: corosync.service: Failed with result 'signal'.
Dec 02 20:03:17 ha1 sbd[1462]: cluster: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Dec 02 20:03:17 ha1 systemd[1]: Stopping Pacemaker High Availability Cluster Manager...
Dec 02 20:03:17 ha1 sbd[1455]: warning: inquisitor_child: cluster health check: UNHEALTHY
Dec 02 20:03:17 ha1 sbd[1455]: warning: inquisitor_child: Servant cluster is outdated (age: 170)
Dec 02 20:03:17 ha1 cib[1568]: error: Connection to the CPG API failed: Library error (2)
Dec 02 20:03:17 ha1 attrd[1571]: error: Connection to the CPG API failed: Library error (2)
Dec 02 20:03:17 ha1 attrd[1571]: notice: Disconnecting client 0x55590e1d1190, pid=1573...
Dec 02 20:03:17 ha1 stonith-ng[1569]: error: Connection to the CPG API failed: Library error (2)
Dec 02 20:03:17 ha1 lrmd[1570]: error: Connection to stonith-ng failed
Dec 02 20:03:17 ha1 lrmd[1570]: error: Connection to stonith-ng[0x558bf889f300] closed (I/O condition=17)
Dec 02 20:03:17 ha1 pacemakerd[1566]: notice: Caught 'Terminated' signal
Dec 02 20:03:17 ha1 pacemakerd[1566]: error: Connection to the CPG API failed: Library error (2)
Dec 02 20:03:17 ha1 systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a
Dec 02 20:03:17 ha1 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.

The key line is "Stopping Pacemaker", which indicates a voluntary action on the systemd side.

Number 2:

Dec 02 20:07:33 ha1 sbd[1462]: cluster: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Dec 02 20:07:33 ha1 systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
Dec 02 20:07:33 ha1 sbd[1462]: cluster: warning: sbd_membership_destroy: Lost connection to corosync
Dec 02 20:07:33 ha1 systemd[1]: corosync.service: Unit entered failed state.
Dec 02 20:07:33 ha1 sbd[1462]: cluster: error: set_servant_health: Cluster connection terminated
Dec 02 20:07:33 ha1 systemd[1]: corosync.service: Failed with result 'signal'.
Dec 02 20:07:33 ha1 sbd[1462]: cluster: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Dec 02 20:07:33 ha1 systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a
Dec 02 20:07:33 ha1 sbd[1455]: warning: inquisitor_child: cluster health check: UNHEALTHY
Dec 02 20:07:33 ha1 systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
...
Dec 02 20:07:33 ha1 systemd[1]: pacemaker.service: Service hold-off time over, scheduling restart.

Here there is no "Stopping Pacemaker" line; from systemd's point of view the service failed and should be restarted.

Note that it is quite possible that in the second case systemd still attempts to stop pacemaker due to the BindsTo directive, but this job is dropped as redundant, so we never actually see it. And as soon as you enable debug output, the timing is skewed and you cannot reproduce it anymore.
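
If someone wants to tell the two outcomes apart over many runs without enabling debug output, a rough Python sketch along these lines might help. It only encodes the heuristic described above (did systemd log "Stopping Pacemaker" before pacemaker's main process exited?). The function name `classify` and the result labels are made up for illustration, and the matched strings are copied from the systemd 235 excerpts above, so other versions may word them differently.

```python
# Message fragments copied from the two journal excerpts above (systemd 235).
STOPPING = "Stopping Pacemaker High Availability Cluster Manager"
MAIN_EXITED = "pacemaker.service: Main process exited"


def classify(journal_lines):
    """Guess how systemd saw pacemaker.service go down.

    "stop-job": systemd logged "Stopping Pacemaker ..." before the main
                process exited (the BindsTo stop was queued first).
    "failure":  the main process exited with no preceding "Stopping" line
                (systemd treats the exit as a failure and may restart).
    "unknown":  no pacemaker.service exit was found in the input.
    """
    stopping_seen = False
    for line in journal_lines:
        # Only messages logged by systemd itself (PID 1) matter here.
        if " systemd[1]: " not in line:
            continue
        if STOPPING in line:
            stopping_seen = True
        elif MAIN_EXITED in line:
            return "stop-job" if stopping_seen else "failure"
    return "unknown"


# Trimmed-down versions of the two runs shown above.
run1 = [
    "Dec 02 20:03:17 ha1 systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL",
    "Dec 02 20:03:17 ha1 systemd[1]: Stopping Pacemaker High Availability Cluster Manager...",
    "Dec 02 20:03:17 ha1 systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a",
]
run2 = [
    "Dec 02 20:07:33 ha1 systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL",
    "Dec 02 20:07:33 ha1 systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a",
    "Dec 02 20:07:33 ha1 systemd[1]: pacemaker.service: Service hold-off time over, scheduling restart.",
]

print(classify(run1))  # stop-job
print(classify(run2))  # failure
```

This looks only at the order of messages in the journal, so it says nothing about a stop job that was queued and then dropped as redundant; that case would still be reported as "failure".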
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org