Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-30 Thread Andrei Borzenkov
On Thu, Nov 30, 2017 at 12:42 AM, Jan Pokorný wrote:
> On 29/11/17 22:00 +0100, Jan Pokorný wrote:
>> On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
>>> 28.11.2017 13:01, Jan Pokorný wrote:
>>>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>>>> Sent from iPhone
>>>>>
>>>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>>>>>
>>>>>> Andrei Borzenkov writes:
>>>>>>
>>>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>>>>
>>>>>>>> In one of the guides, the suggested procedure to simulate split brain was to
>>>>>>>> kill the corosync process. It actually worked on one cluster, but on another
>>>>>>>> the corosync process was restarted after being killed without the cluster
>>>>>>>> noticing anything. Except after several attempts pacemaker died with
>>>>>>>> stopping resources ... :)
>>>>>>>>
>>>>>>>> This is SLES12 SP2; I do not see any Restart= in the service definition, so
>>>>>>>> it is probably not systemd.
>>>>>>>>
>>>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail and be
>>>>>>> restarted by systemd.
>>>>>>
>>>>>> And starting corosync via a Requires dependency?
>>>>>
>>>>> Exactly.
>>>>
>>>> From my testing it looks like we should change
>>>> "Requires=corosync.service" to "BindsTo=corosync.service"
>>>> in pacemaker.service.
>>>>
>>>> Could you give it a try?
>>>>
>>>
>>> I'm not sure what the expected outcome is, but pacemaker.service is still
>>> restarted (due to Restart=on-failure).
>>
>> Expected outcome is that pacemaker.service will become
>> "inactive (dead)" after killing corosync (as a result of being
>> "bound" to corosync).  Have you indeed issued "systemctl
>> daemon-reload" after updating the pacemaker unit file?
>>

Of course. I even rebooted ... :)

ha1:~ # systemctl cat pacemaker.service  | grep corosync
After=corosync.service
BindsTo=corosync.service
# ExecStopPost=/bin/sh -c 'pidof crmd || killall -TERM corosync'
ha1:~ #

Nov 30 10:41:14 ha1 sbd[1743]:cluster:error:
pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Main process exited,
code=killed, status=9/KILL
Nov 30 10:41:14 ha1 sbd[1743]:cluster:  warning:
sbd_membership_destroy: Lost connection to corosync
Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Main process
exited, code=exited, status=107/n/a
Nov 30 10:41:14 ha1 sbd[1743]:cluster:error:
set_servant_health: Cluster connection terminated
Nov 30 10:41:14 ha1 systemd[1]: Stopped Pacemaker High Availability
Cluster Manager.
Nov 30 10:41:14 ha1 sbd[1743]:cluster:error:
cluster_connect_cpg: Could not connect to the Cluster Process Group
API: 2
Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Unit entered failed state.
Nov 30 10:41:14 ha1 sbd[1739]:  warning: inquisitor_child: cluster
health check: UNHEALTHY
Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Failed with result
'exit-code'.
...
Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Unit entered failed state.
Nov 30 10:41:14 ha1 systemd[1]: corosync.service: Failed with result 'signal'.
Nov 30 10:41:14 ha1 systemd[1]: pacemaker.service: Service hold-off
time over, scheduling restart.
Nov 30 10:41:14 ha1 systemd[1]: Stopped Pacemaker High Availability
Cluster Manager.
Nov 30 10:41:14 ha1 systemd[1]: Starting Corosync Cluster Engine...

Do you mean you get different results? Do not forget that the only
thing BindsTo does is stop the service if its dependency fails; it does
*not* affect the decision whether to restart the service in any way (at
least directly).


>> (FTR, I tried with systemd 235).
>>

Well ... what we have here is a race condition. We have two events, the
*independent* failures of corosync.service and pacemaker.service, and two
(re-)actions: stopping pacemaker.service in response to the former (due
to BindsTo) and restarting pacemaker.service in response to the latter
(due to Restart=on-failure). The final result depends on the order in
which systemd gets those events and schedules the actions (and on the
relative timing of when those actions complete), and this is not
deterministic.
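
One way to see which of the two reactions won on a given run is to scan the journal for the restart message. The following is a minimal sketch, not something from the thread: restart_outcome is a hypothetical helper name, and it assumes systemd logs a line containing "scheduling restart" for the unit, as in the log excerpt above; other systemd versions may word this differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of pacemaker or systemd).
# Reads journal text on stdin and reports whether systemd logged a restart
# being scheduled for the unit named in $1.
# Assumption: the message contains "scheduling restart", as in the log above.
restart_outcome() {
    if grep -F "$1: " | grep "scheduling restart" >/dev/null; then
        echo "restart-scheduled"
    else
        echo "no-restart-logged"
    fi
}

# Example invocation on a systemd host (not run here):
#   journalctl -b -u pacemaker.service -o short-precise | restart_outcome pacemaker.service
```

Repeating the kill test several times and comparing the outcomes would show whether the ordering really varies from run to run.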

Now 235 includes some changes to the restart logic which refuse to
restart if another action (like stop) is currently being scheduled. I am
not sure what happens if the restart is scheduled first, though (such
"implementation details" tend not to be documented in the systemd world).
I have been doing systemd troubleshooting long enough to know that
even if you observe a specific sequence of events, another system may
exhibit a completely different sequence.

Anyway, I will try to install a system with 235 on the same platform to
see how it behaves.

>>> If the intention is to unconditionally stop it when corosync dies,
>>> pacemaker should probably exit with a unique code and the unit file
>>> should have RestartPreventExitStatus set to it.
>>
>> That would be an elaborate way to reach the same.
>>

This is the *only* way to reach the same. You cannot both tell service
manager to restart s

Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-29 Thread Jan Pokorný
On 29/11/17 22:00 +0100, Jan Pokorný wrote:
> On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
>> 28.11.2017 13:01, Jan Pokorný wrote:
>>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>>> Sent from iPhone
>>>>
>>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>>>>
>>>>> Andrei Borzenkov writes:
>>>>>
>>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>>>
>>>>>>> In one of the guides, the suggested procedure to simulate split brain was to
>>>>>>> kill the corosync process. It actually worked on one cluster, but on another
>>>>>>> the corosync process was restarted after being killed without the cluster
>>>>>>> noticing anything. Except after several attempts pacemaker died with
>>>>>>> stopping resources ... :)
>>>>>>>
>>>>>>> This is SLES12 SP2; I do not see any Restart= in the service definition, so
>>>>>>> it is probably not systemd.
>>>>>>>
>>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail and be
>>>>>> restarted by systemd.
>>>>>
>>>>> And starting corosync via a Requires dependency?
>>>>
>>>> Exactly.
>>>
>>> From my testing it looks like we should change
>>> "Requires=corosync.service" to "BindsTo=corosync.service"
>>> in pacemaker.service.
>>>
>>> Could you give it a try?
>>>
>>
>> I'm not sure what the expected outcome is, but pacemaker.service is still
>> restarted (due to Restart=on-failure).
>
> Expected outcome is that pacemaker.service will become
> "inactive (dead)" after killing corosync (as a result of being
> "bound" to corosync).  Have you indeed issued "systemctl
> daemon-reload" after updating the pacemaker unit file?
>
> (FTR, I tried with systemd 235).
>
>> If the intention is to unconditionally stop it when corosync dies,
>> pacemaker should probably exit with a unique code and the unit file
>> should have RestartPreventExitStatus set to it.
>
> That would be an elaborate way to reach the same.
>
> But good point in questioning what's the "best intention" around these
> scenarios -- normally, fencing would happen, but as you note, the node
> had actually survived by being fast enough to put corosync back to
> life, and from there, whether it adds any value to have pacemaker
> restarted on non-clean terminations at all.  I don't know.
>
> Would it make more sense to have FailureAction=reboot-immediate to
> at least in part emulate the fencing instead?

Although the restart may also be blazingly fast in some cases,
making little difference except for forcibly taking down all the
previously running resources as an extra step, which may be
either good or bad.

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-29 Thread Jan Pokorný
On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
> 28.11.2017 13:01, Jan Pokorný wrote:
>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>> Sent from iPhone
>>>
>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>>>
>>>> Andrei Borzenkov writes:
>>>>
>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>>
>>>>>> In one of the guides, the suggested procedure to simulate split brain was to
>>>>>> kill the corosync process. It actually worked on one cluster, but on another
>>>>>> the corosync process was restarted after being killed without the cluster
>>>>>> noticing anything. Except after several attempts pacemaker died with
>>>>>> stopping resources ... :)
>>>>>>
>>>>>> This is SLES12 SP2; I do not see any Restart= in the service definition, so
>>>>>> it is probably not systemd.
>>>>>>
>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail and be
>>>>> restarted by systemd.
>>>>
>>>> And starting corosync via a Requires dependency?
>>>
>>> Exactly.
>>
>> From my testing it looks like we should change
>> "Requires=corosync.service" to "BindsTo=corosync.service"
>> in pacemaker.service.
>>
>> Could you give it a try?
>>
>
> I'm not sure what the expected outcome is, but pacemaker.service is still
> restarted (due to Restart=on-failure).

Expected outcome is that pacemaker.service will become
"inactive (dead)" after killing corosync (as a result of being
"bound" to corosync).  Have you indeed issued "systemctl
daemon-reload" after updating the pacemaker unit file?

(FTR, I tried with systemd 235).

> If the intention is to unconditionally stop it when corosync dies,
> pacemaker should probably exit with a unique code and the unit file
> should have RestartPreventExitStatus set to it.

That would be an elaborate way to reach the same.

But good point in questioning what's the "best intention" around these
scenarios -- normally, fencing would happen, but as you note, the node
had actually survived by being fast enough to put corosync back to
life, and from there, whether it adds any value to have pacemaker
restarted on non-clean terminations at all.  I don't know.

Would it make more sense to have FailureAction=reboot-immediate to
at least in part emulate the fencing instead?

-- 
Jan (Poki)




Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-28 Thread Andrei Borzenkov
28.11.2017 13:01, Jan Pokorný wrote:
> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>> Sent from iPhone
>>
>>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>>
>>> Andrei Borzenkov writes:
>>>
>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>
>>>>> In one of the guides, the suggested procedure to simulate split brain was to
>>>>> kill the corosync process. It actually worked on one cluster, but on another
>>>>> the corosync process was restarted after being killed without the cluster
>>>>> noticing anything. Except after several attempts pacemaker died with
>>>>> stopping resources ... :)
>>>>>
>>>>> This is SLES12 SP2; I do not see any Restart= in the service definition, so
>>>>> it is probably not systemd.
>>>>>
>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>> Restart=on-failure, so killing corosync caused pacemaker to fail and be
>>>> restarted by systemd.
>>>
>>> And starting corosync via a Requires dependency?
>>
>> Exactly.
>
> From my testing it looks like we should change
> "Requires=corosync.service" to "BindsTo=corosync.service"
> in pacemaker.service.
>
> Could you give it a try?
>

I'm not sure what the expected outcome is, but pacemaker.service is still
restarted (due to Restart=on-failure). If the intention is to
unconditionally stop it when corosync dies, pacemaker should probably
exit with a unique code and the unit file should have
RestartPreventExitStatus set to it.
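
As a rough sketch of the unit-file half of this idea, a drop-in could set RestartPreventExitStatus= for pacemaker.service. Everything specific here is an assumption for illustration: the helper name write_no_restart_dropin, the file name 50-no-restart.conf, and above all the exit status 100, since the thread does not say which status pacemaker would (or could) reserve for "corosync is gone". The script writes to a scratch directory by default, so it changes nothing on the system.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that tells systemd not to restart the
# service when it exits with one specific status.
#   $1 = target directory (on a real host this would be
#        /etc/systemd/system/pacemaker.service.d)
#   $2 = exit status that should NOT trigger a restart
write_no_restart_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/50-no-restart.conf" <<EOF
[Service]
RestartPreventExitStatus=$2
EOF
}

# Placeholder status 100 is made up; it is not a documented pacemaker code.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
write_no_restart_dropin "$DROPIN_DIR" "${EXIT_CODE:-100}"
cat "$DROPIN_DIR/50-no-restart.conf"
```

To try it for real, the file would go under /etc/systemd/system/pacemaker.service.d/ followed by "systemctl daemon-reload". Pacemaker itself would still need to exit with the chosen status only when corosync is lost, which this sketch does not address.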





Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-28 Thread Jan Pokorný
On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
> Sent from iPhone
>
>> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>>
>> Andrei Borzenkov writes:
>>
>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>
>>>> In one of the guides, the suggested procedure to simulate split brain was to
>>>> kill the corosync process. It actually worked on one cluster, but on another
>>>> the corosync process was restarted after being killed without the cluster
>>>> noticing anything. Except after several attempts pacemaker died with
>>>> stopping resources ... :)
>>>>
>>>> This is SLES12 SP2; I do not see any Restart= in the service definition, so
>>>> it is probably not systemd.
>>>>
>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>> Restart=on-failure, so killing corosync caused pacemaker to fail and be
>>> restarted by systemd.
>>
>> And starting corosync via a Requires dependency?
>
> Exactly.

From my testing it looks like we should change
"Requires=corosync.service" to "BindsTo=corosync.service"
in pacemaker.service.

Could you give it a try?
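
A minimal sketch of how this could be tried without editing the packaged unit file, assuming a drop-in is acceptable: systemd drop-ins can only add dependencies, not remove them, so this adds BindsTo= on top of the existing Requires= rather than replacing it (BindsTo= is the stronger of the two). The helper name write_bindsto_dropin and the file name 10-bindsto.conf are made up for illustration, and the script writes to a scratch directory by default.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that binds pacemaker.service to
# corosync.service. $1 is the target directory, for example
# /etc/systemd/system/pacemaker.service.d on a real host.
write_bindsto_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/10-bindsto.conf" <<'EOF'
[Unit]
BindsTo=corosync.service
After=corosync.service
EOF
}

# Default to a scratch directory so nothing on the system is modified.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
write_bindsto_dropin "$DROPIN_DIR"
cat "$DROPIN_DIR/10-bindsto.conf"
```

After placing the file in the real drop-in directory, "systemctl daemon-reload" is needed, and "systemctl show -p BindsTo pacemaker.service" should then list corosync.service.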

-- 
Jan (Poki)




Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-27 Thread Andrei Borzenkov


Sent from iPhone

> On 27 Nov 2017, at 14:36, Ferenc Wágner wrote:
>
> Andrei Borzenkov writes:
>
>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>
>>> In one of the guides, the suggested procedure to simulate split brain was to
>>> kill the corosync process. It actually worked on one cluster, but on another
>>> the corosync process was restarted after being killed without the cluster
>>> noticing anything. Except after several attempts pacemaker died with
>>> stopping resources ... :)
>>>
>>> This is SLES12 SP2; I do not see any Restart= in the service definition, so
>>> it is probably not systemd.
>>>
>> FTR - it was not corosync, but pacemaker; its unit file specifies
>> Restart=on-failure, so killing corosync caused pacemaker to fail and be
>> restarted by systemd.
> 
> And starting corosync via a Requires dependency?

Exactly.


> -- 
> Feri


Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-27 Thread Ferenc Wágner
Andrei Borzenkov writes:

> 25.11.2017 10:05, Andrei Borzenkov wrote:
>
>> In one of the guides, the suggested procedure to simulate split brain was to
>> kill the corosync process. It actually worked on one cluster, but on another
>> the corosync process was restarted after being killed without the cluster
>> noticing anything. Except after several attempts pacemaker died with
>> stopping resources ... :)
>>
>> This is SLES12 SP2; I do not see any Restart= in the service definition, so
>> it is probably not systemd.
>>
> FTR - it was not corosync, but pacemaker; its unit file specifies
> Restart=on-failure, so killing corosync caused pacemaker to fail and be
> restarted by systemd.

And starting corosync via a Requires dependency?
-- 
Feri



Re: [ClusterLabs] Is corosync supposed to be restarted if it dies?

2017-11-26 Thread Andrei Borzenkov
25.11.2017 10:05, Andrei Borzenkov wrote:
> In one of the guides, the suggested procedure to simulate split brain was to
> kill the corosync process. It actually worked on one cluster, but on another
> the corosync process was restarted after being killed without the cluster
> noticing anything. Except after several attempts pacemaker died with
> stopping resources ... :)
>
> This is SLES12 SP2; I do not see any Restart= in the service definition, so
> it is probably not systemd.
>
FTR - it was not corosync, but pacemaker; its unit file specifies
Restart=on-failure, so killing corosync caused pacemaker to fail and be
restarted by systemd.

I wish systemd could dynamically "unmanage" services ...
