[ClusterLabs] Stopping the last node with pcs

2021-04-27 Thread Digimer
Hi all,

  I noticed something odd.


[root@an-a02n01 ~]# pcs cluster status
Cluster Status:
 Cluster Summary:
   * Stack: corosync
   * Current DC: an-a02n01 (version 2.0.4-6.el8_3.2-2deceaa3ae) - partition with quorum
   * Last updated: Tue Apr 27 23:20:45 2021
   * Last change:  Tue Apr 27 23:12:40 2021 by root via cibadmin on an-a02n01
   * 2 nodes configured
   * 12 resource instances configured (4 DISABLED)
 Node List:
   * Online: [ an-a02n01 ]
   * OFFLINE: [ an-a02n02 ]

PCSD Status:
  an-a02n01: Online
  an-a02n02: Offline

[root@an-a02n01 ~]# pcs cluster stop
Error: Stopping the node will cause a loss of the quorum, use --force to override


  Shouldn't pcs know it's the last node and shut down without complaint?
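
  To make the question concrete, here is a minimal sketch of the distinction
  being asked for, assuming plain majority quorum. It is not how pcs
  implements its check: the function names quorum_needed and stop_verdict are
  hypothetical, and corosync options such as two_node are ignored.

```shell
#!/bin/sh
# Sketch only: simple-majority quorum arithmetic, NOT pcs's real implementation.
# Ignores corosync options such as two_node, wait_for_all or last_man_standing.

# quorum_needed TOTAL_VOTES -> votes required for a simple majority
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# stop_verdict TOTAL_VOTES ONLINE_VOTES STOPPING_VOTES
# Prints "last-node" when no node would be left running, "loses-quorum" when
# the surviving nodes fall below a majority, and "ok" otherwise.
stop_verdict() {
    total=$1
    online=$2
    stopping=$3
    remaining=$(( online - stopping ))
    if [ "$remaining" -le 0 ]; then
        echo "last-node"
    elif [ "$remaining" -lt "$(quorum_needed "$total")" ]; then
        echo "loses-quorum"
    else
        echo "ok"
    fi
}

stop_verdict 2 1 1   # 2 nodes configured, 1 online, stopping that one
stop_verdict 3 3 1   # 3 nodes configured, all online, stopping one
stop_verdict 3 2 1   # 3 nodes configured, 2 online, stopping one
```

  The first call matches the status output above and prints "last-node". A
  check along these lines could skip the warning in that case, since no
  partition remains that could lose quorum.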

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Autostart/Enabling of Pacemaker and corosync

2021-04-27 Thread Jehan-Guillaume de Rorthais
On Mon, 26 Apr 2021 18:04:41 +0000 (UTC)
Strahil Nikolov wrote:

> I prefer that the stack is auto-enabled. Imagine that you have a replicated
> DB and the primary DB node is fenced. You would like that node to rejoin
> the cluster and, if possible, to sync with the new primary instead of staying
> down.

In the case of PostgreSQL, the failed primary may not be able to fail back
automatically to the new primary. Worse, if it actually enters replication,
it might just silently become a corrupted standby, giving a false sense of
safety until a new failover occurs.

PAF doesn't handle auto-failback (e.g. pg_rewind) by design, to avoid code
complexity. We don't want to give a false impression of a perfect, fully
available, fully automated PgSQL cluster with automatic failback. If something
went wrong with your DB, you had better check and fix it. You need both a
sysadmin and a DBA on board to take care of the availability and safety of
your cluster.

Note that auto-failback of secondary nodes is safe, as long as they are able
to actually keep up with production. Maybe we can imagine some safety
belts in PAF's code that allow Pacemaker to auto-start on boot but refuse
to start a PostgreSQL instance in bad shape.

> One such example is the SAP HANA DB. Imagine that the current primary
> node loses storage and fails to commit all transactions to disk. Without
> replication you will lose the last 1-2 minutes of data (depending on
> your monitoring interval).

PAF takes a shared-nothing approach; it requires replication between nodes.




[ClusterLabs] Antw: [EXT] Re: Autostart/Enabling of Pacemaker and corosync

2021-04-27 Thread Ulrich Windl
>>> Ken Gaillot wrote on 26.04.2021 at 20:06:
> On Mon, 2021-04-26 at 17:04 +0000, Moneta, Howard wrote:
>> Hello community.  I have read that it is not recommended to set
>> Pacemaker and corosync to enabled/auto start on the nodes.  Is this
>> how people have it configured? If a computer restarts unexpectedly,
>> is it better to manually investigate first or allow the node to come
>> back online and rejoin the cluster automatically in order to minimize
>> downtime?  If the auto start is not enabled, how do you handle
>> patching?  I’m using Pacemaker with PAF, PostgreSQL Automatic
>> Failover. I had thought to follow the published guidance and not set
>> those processes to enabled but other coworkers are resisting and
>> saying that the systems should be configured to recover by themselves
>> around patching or even a temporary unplanned network/virtualization
>> glitch.
>>  
>> Thanks,
>> Howard
> 
> Hi Howard,
> 
> It's a matter of preference. You summed up the pros and cons of each
> side quite well. :)
> 
> The manual approach leans more to safety. For example, if a node got
> fenced because its network card is flaky, or a disk is having write
> errors, then having it automatically rejoin is just going to repeat the
> problem.
> 
> The automated approach leans more to quick self-recovery, and is more
> convenient in larger organizations where not every administrator that
> has access to the host for applying updates etc. is trained on the
> cluster software.

As auto-start is usually paired with auto-stop, enabling both has the
advantage that a clean node restart can be as simple as pressing
Ctrl+Alt+Del, whereas some stressed admins may forget to shut down the cluster
node before restarting...
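
For reference, a minimal sketch of toggling boot-time start of the stack on
one node, assuming a systemd-based distribution where the units are named
corosync and pacemaker. The run helper, the DRY_RUN variable and the
set_cluster_autostart function are hypothetical names used only in this
sketch; where pcs is installed, "pcs cluster enable" and "pcs cluster disable"
serve the same purpose.

```shell
#!/bin/sh
# Sketch only: enable or disable boot-time start of the cluster stack locally.
# Assumes systemd units named "corosync" and "pacemaker".

# run CMD...: execute the command, or just print it when DRY_RUN=1
# (DRY_RUN is a hypothetical switch for this sketch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_cluster_autostart on|off
set_cluster_autostart() {
    case "$1" in
        on)  run systemctl enable corosync pacemaker ;;
        off) run systemctl disable corosync pacemaker ;;
        *)   echo "usage: set_cluster_autostart on|off" >&2
             return 2 ;;
    esac
}
```

Setting DRY_RUN=1 before calling set_cluster_autostart on prints the
systemctl command instead of running it, which makes it easy to review what
would change on a node.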

Regards,
Ulrich

> -- 
> Ken Gaillot 
> 


[ClusterLabs] Antw: [EXT] Re: Autostart/Enabling of Pacemaker and corosync

2021-04-27 Thread Ulrich Windl
>>> damiano giuliani wrote on 26.04.2021 at 20:04:
> Personally I discourage the use of auto restart/rejoin: if something went
> wrong, it is better to investigate the causes and then re-enable the failed
> node.

Well, actually I find it less stressful if the cluster is running and you are
examining the logs while it is running.
Typically it takes an hour or more to analyze what was going on, especially if
the cluster fenced multiple times.
The situation may be different if some resource can't be started any more; then
immediate action is required. Usually, though, external events (like network
outages, CPU load, full disks) cause the problems.

> Failovers shouldn't occur frequently, only if something went really bad. As
> far as I know, Pacemaker and PAF don't support any kind of auto-heal, so it
> is a good thing to check the causes first.

Checking the cluster periodically is even better: You can fix things before
they really cause a serious problem.

> This is just my opinion and way of working; someone more expert can probably
> join the conversation.

Regards,
Ulrich

> 
> Best,
> 
> Damiano
> 
> On Mon, 26 Apr 2021 at 19:04, Moneta, Howard <howard.mon...@csaa.com> wrote:
> 
>> Hello community.  I have read that it is not recommended to set Pacemaker
>> and corosync to enabled/auto start on the nodes.  Is this how people have
>> it configured? If a computer restarts unexpectedly, is it better to
>> manually investigate first or allow the node to come back online and rejoin
>> the cluster automatically in order to minimize downtime?  If the auto start
>> is not enabled, how do you handle patching?  I’m using Pacemaker with PAF,
>> PostgreSQL Automatic Failover. I had thought to follow the published
>> guidance and not set those processes to enabled but other coworkers are
>> resisting and saying that the systems should be configured to recover by
>> themselves around patching or even a temporary unplanned
>> network/virtualization glitch.
>>
>>
>>
>> Thanks,
>>
>> Howard
>>
>>
>>
>> This message may contain information, including personally identifiable
>> information that is confidential, privileged, or otherwise legally
>> protected. If you are not the intended recipient, please immediately notify
>> the sender and delete this message without copying, disclosing, or
>> distributing it.


[ClusterLabs] Antw: [EXT] Autostart/Enabling of Pacemaker and corosync

2021-04-27 Thread Ulrich Windl
>>> "Moneta, Howard" wrote on 26.04.2021 at 19:04:


> Hello community.  I have read that it is not recommended to set Pacemaker and 
> corosync to enabled/auto start on the nodes.  Is this how people have it 

I think it's basically about your sleep at night or at weekends:
If the cluster can manage a problem without requiring your intervention, that's
good (for your sleep).
In practice there have been many situations where both nodes in a 2-node
cluster were fenced.
So without auto-starting the cluster nodes after boot, you clearly have a "no
sleep" situation, especially if your cluster provides services to other
external machines.

> configured? If a computer restarts unexpectedly, is it better to manually 
> investigate first or allow the node to come back online and rejoin the 
> cluster automatically in order to minimize downtime?  If the auto start is not 
> enabled, how do you handle patching?  I'm using Pacemaker with PAF, 
> PostgreSQL Automatic Failover. I had thought to follow the published guidance 
> and not set those processes to enabled but other coworkers are resisting and 
> saying that the systems should be configured to recover by themselves around 
> patching or even a temporary unplanned network/virtualization glitch.

I think there is no "one size fits all" solution: Both variants have advantages 
and disadvantages.

Regards,
Ulrich

> 
> Thanks,
> Howard
> 
> 



