Re: [ClusterLabs] restarting pacemakerd
On 06/18/2016 05:15 AM, Ferenc Wágner wrote:
> Hi,
>
> Could somebody please elaborate a little why the pacemaker systemd
> service file contains "Restart=on-failure"? I mean that a failed node
> gets fenced anyway, so most of the time this would be a futile effort.
> On the other hand, one could argue that restarting failed services
> should be the default behavior of systemd (or any init system). Still,
> it is not. I'd be grateful for some insight into the matter.

To clarify one point, the configuration mentioned here is systemd configuration, not part of pacemaker configuration or operation. Systemd monitors the processes it launches. With "Restart=on-failure", systemd will re-launch pacemaker in situations systemd considers "failure" (exiting nonzero, exiting with core dump, etc.).

Systemd does have various rate-limiting options, which we leave as default in the pacemaker unit file. Perhaps one day we could try to come up with ideal values, but it should be a rare situation, and admins can always tune them as desired for their system using an override file.

The goal of restart is of course to have a slightly better shot at recovery. You're right, if fencing is configured and quorum is retained, the node will almost certainly get fenced anyway, but those conditions aren't always true.

Systemd upstream recommends Restart=on-failure or Restart=on-abnormal for all long-running services. on-abnormal would probably be better for pacemaker, but it's not supported in older systemd versions.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
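For illustration, the override file mentioned in the reply above might look roughly like the sketch below. This is only a sketch under stated assumptions: the drop-in path is the conventional systemd location for a unit named pacemaker.service, and every numeric value is a hypothetical placeholder, not a recommendation from the pacemaker developers. The directive names are those used by newer systemd releases; older releases spell the interval setting StartLimitInterval= and expect both start-limit settings in [Service], so check systemd.unit(5) and systemd.service(5) for the installed version.

```ini
# /etc/systemd/system/pacemaker.service.d/override.conf
# Hypothetical values for illustration only; tune them for your own cluster.

[Unit]
# Permit at most 3 start attempts within any 60-second window; after that,
# systemd stops trying to start the unit until the window has passed.
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
# Wait 5 seconds before each automatic restart instead of the default delay.
RestartSec=5
```

On systemd versions that provide it, "systemctl edit pacemaker" creates such a drop-in and reloads the unit definitions; if the file is written by hand, "systemctl daemon-reload" is needed before the change takes effect. The Restart=on-failure setting itself comes from the packaged unit file and does not need to be repeated in the override.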
Re: [ClusterLabs] restarting pacemakerd
On 19/06/16 01:59 AM, Andrei Borzenkov wrote:
> 18.06.2016 22:04, Dmitri Maziuk wrote:
>> On 2016-06-18 05:15, Ferenc Wágner wrote:
>> ...
>>> On the other hand, one could argue that restarting failed services
>>> should be the default behavior of systemd (or any init system). Still,
>>> it is not.
>>
>> As an off-topic snide comment, I never understood the thinking behind
>> that: restarting without removing the cause of the failure will just
>> make it fail again. If at first you don't succeed, then try, try, try
>> again?
>>
>
> Some problems are transient and restarting may succeed (most obvious
> example is program crash which includes OS kernel crash). What is needed
> here is rate limiting so restart is not attempted indefinitely.

Rgmanager offers this via "max_restarts". I'd be shocked if there wasn't a version of this in pacemaker already, given that it has far more flexibility than rgmanager.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] restarting pacemakerd
18.06.2016 22:04, Dmitri Maziuk wrote:
> On 2016-06-18 05:15, Ferenc Wágner wrote:
> ...
>> On the other hand, one could argue that restarting failed services
>> should be the default behavior of systemd (or any init system). Still,
>> it is not.
>
> As an off-topic snide comment, I never understood the thinking behind
> that: restarting without removing the cause of the failure will just
> make it fail again. If at first you don't succeed, then try, try, try
> again?
>

Some problems are transient and restarting may succeed (most obvious example is program crash which includes OS kernel crash). What is needed here is rate limiting so restart is not attempted indefinitely.
Re: [ClusterLabs] restarting pacemakerd
On 06/18/2016 02:15 PM, Digimer wrote:
> When your focus is availability, restarting makes sense. What you want
> to do is alert an admin that a restart was needed, so that he or she can
> investigate the cause. Pacemaker 1.1.15 allows for this alerting now.

When your focus is availability, restarting on the node that doesn't have the error makes sense. As does alerting the admin there's a problem. Like, e.g., drbd's

    handlers {
        split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    }

The difference between "restarting and failing" and "restarting and generating alert and failing" is the alert flood.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
Re: [ClusterLabs] restarting pacemakerd
On 18/06/16 03:04 PM, Dmitri Maziuk wrote:
> On 2016-06-18 05:15, Ferenc Wágner wrote:
> ...
>> On the other hand, one could argue that restarting failed services
>> should be the default behavior of systemd (or any init system). Still,
>> it is not.
>
> As an off-topic snide comment, I never understood the thinking behind
> that: restarting without removing the cause of the failure will just
> make it fail again. If at first you don't succeed, then try, try, try
> again?
>
> Dimitri

When your focus is availability, restarting makes sense. What you want to do is alert an admin that a restart was needed, so that he or she can investigate the cause. Pacemaker 1.1.15 allows for this alerting now.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] restarting pacemakerd
On 2016-06-18 05:15, Ferenc Wágner wrote:
...
> On the other hand, one could argue that restarting failed services
> should be the default behavior of systemd (or any init system). Still,
> it is not.

As an off-topic snide comment, I never understood the thinking behind that: restarting without removing the cause of the failure will just make it fail again. If at first you don't succeed, then try, try, try again?

Dimitri
[ClusterLabs] restarting pacemakerd
Hi,

Could somebody please elaborate a little why the pacemaker systemd service file contains "Restart=on-failure"? I mean that a failed node gets fenced anyway, so most of the time this would be a futile effort. On the other hand, one could argue that restarting failed services should be the default behavior of systemd (or any init system). Still, it is not. I'd be grateful for some insight into the matter.

--
Thanks,
Feri