Re: [ClusterLabs] restarting pacemakerd

2016-06-20 Thread Ken Gaillot
On 06/18/2016 05:15 AM, Ferenc Wágner wrote:
> Hi,
> 
> Could somebody please elaborate a little why the pacemaker systemd
> service file contains "Restart=on-failure"?  I mean that a failed node
> gets fenced anyway, so most of the time this would be a futile effort.
> On the other hand, one could argue that restarting failed services
> should be the default behavior of systemd (or any init system).  Still,
> it is not.  I'd be grateful for some insight into the matter.

To clarify one point, the configuration mentioned here is systemd
configuration, not part of pacemaker configuration or operation. Systemd
monitors the processes it launches. With "Restart=on-failure", systemd
will re-launch pacemaker in situations systemd considers "failure"
(exiting nonzero, exiting with core dump, etc.).

Systemd does have various rate-limiting options, which we leave as
default in the pacemaker unit file. Perhaps one day we could try to come
up with ideal values, but it should be a rare situation, and admins can
always tune them as desired for their system using an override file.
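
As a hedged illustration of such an override (not something the
pacemaker unit file ships), a drop-in created with "systemctl edit
pacemaker" might look like the sketch below. The values are arbitrary
examples, not recommendations. The directive names assume a recent
systemd, where StartLimitIntervalSec= and StartLimitBurst= live in
[Unit]; older versions use StartLimitInterval= and StartLimitBurst= in
[Service] instead.

```ini
# /etc/systemd/system/pacemaker.service.d/override.conf
# Illustrative values only; tune for your own system.

[Unit]
# Allow at most 3 start attempts within any 60-second window;
# once that is exceeded, systemd stops restarting the service.
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
# Wait 5 seconds before each automatic restart.
RestartSec=5
```

If the file is edited by hand rather than through "systemctl edit", run
"systemctl daemon-reload" afterwards so systemd picks up the change.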

The goal of restart is of course to have a slightly better shot at
recovery. You're right, if fencing is configured and quorum is retained,
the node will almost certainly get fenced anyway, but those conditions
aren't always true.

Systemd upstream recommends Restart=on-failure or Restart=on-abnormal
for all long-running services. on-abnormal would probably be better for
pacemaker, but it's not supported in older systemd versions.
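
For reference, on-failure also restarts after a nonzero exit status,
whereas on-abnormal restarts only after an unclean signal (including a
core dump), a timeout, or a watchdog expiry. On a systemd new enough to
support it, an admin who prefers that behavior could switch with a
drop-in along these lines. This is only a sketch, and the file name is
a hypothetical choice in the conventional override directory:

```ini
# /etc/systemd/system/pacemaker.service.d/restart.conf  (hypothetical file name)

[Service]
# Restart after crashes, timeouts and watchdog expiry,
# but not after an exit with a nonzero status.
Restart=on-abnormal
```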

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] restarting pacemakerd

2016-06-19 Thread Digimer
On 19/06/16 01:59 AM, Andrei Borzenkov wrote:
> 18.06.2016 22:04, Dmitri Maziuk wrote:
>> On 2016-06-18 05:15, Ferenc Wágner wrote:
>> ...
>>> On the other hand, one could argue that restarting failed services
>>> should be the default behavior of systemd (or any init system).  Still,
>>> it is not.
>>
>> As an off-topic snide comment, I never understood the thinking behind
>> that: restarting without removing the cause of the failure will just
>> make it fail again. If at first you don't succeed, then try, try, try
>> again?
>>
> 
> Some problems are transient and restarting may succeed (most obvious
> example is program crash which includes OS kernel crash). What is needed
> here is rate limiting so restart is not attempted indefinitely.

Rgmanager offers this via "max_restarts". I'd be shocked if there wasn't
a version of this in pacemaker already, given that it has far more
flexibility than rgmanager.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] restarting pacemakerd

2016-06-19 Thread Andrei Borzenkov
18.06.2016 22:04, Dmitri Maziuk wrote:
> On 2016-06-18 05:15, Ferenc Wágner wrote:
> ...
>> On the other hand, one could argue that restarting failed services
>> should be the default behavior of systemd (or any init system).  Still,
>> it is not.
> 
> As an off-topic snide comment, I never understood the thinking behind
> that: restarting without removing the cause of the failure will just
> make it fail again. If at first you don't succeed, then try, try, try
> again?
> 

Some problems are transient and restarting may succeed (most obvious
example is program crash which includes OS kernel crash). What is needed
here is rate limiting so restart is not attempted indefinitely.



Re: [ClusterLabs] restarting pacemakerd

2016-06-18 Thread Dimitri Maziuk
On 06/18/2016 02:15 PM, Digimer wrote:

> When your focus is availability, restarting makes sense. What you want
> to do is alert an admin that a restart was needed, so that he or she can
> investigate the cause. Pacemaker 1.1.15 allows for this alerting now.

When your focus is availability, restarting on the node that doesn't
have the error makes sense. As does alerting the admin there's a
problem. Like e.g. drbd's
handlers {
  split-brain "/usr/lib/drbd/notify-split-brain.sh root";
}

The difference between "restarting and failing" and "restarting and
generating alert and failing" is the alert flood.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ClusterLabs] restarting pacemakerd

2016-06-18 Thread Digimer
On 18/06/16 03:04 PM, Dmitri Maziuk wrote:
> On 2016-06-18 05:15, Ferenc Wágner wrote:
> ...
>> On the other hand, one could argue that restarting failed services
>> should be the default behavior of systemd (or any init system).  Still,
>> it is not.
> 
> As an off-topic snide comment, I never understood the thinking behind
> that: restarting without removing the cause of the failure will just
> make it fail again. If at first you don't succeed, then try, try, try
> again?
> 
> Dimitri

When your focus is availability, restarting makes sense. What you want
to do is alert an admin that a restart was needed, so that he or she can
investigate the cause. Pacemaker 1.1.15 allows for this alerting now.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] restarting pacemakerd

2016-06-18 Thread Dmitri Maziuk

On 2016-06-18 05:15, Ferenc Wágner wrote:
...

> On the other hand, one could argue that restarting failed services
> should be the default behavior of systemd (or any init system).  Still,
> it is not.


As an off-topic snide comment, I never understood the thinking behind 
that: restarting without removing the cause of the failure will just 
make it fail again. If at first you don't succeed, then try, try, try again?


Dimitri




[ClusterLabs] restarting pacemakerd

2016-06-18 Thread Ferenc Wágner
Hi,

Could somebody please elaborate a little why the pacemaker systemd
service file contains "Restart=on-failure"?  I mean that a failed node
gets fenced anyway, so most of the time this would be a futile effort.
On the other hand, one could argue that restarting failed services
should be the default behavior of systemd (or any init system).  Still,
it is not.  I'd be grateful for some insight into the matter.
-- 
Thanks,
Feri
