[ClusterLabs] Antw: Re: Antw: [EXT] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Ulrich Windl
>>> Tony Stocker wrote on 29.06.2020 at 19:20 in message:
> On Mon, Jun 29, 2020 at 11:08 AM Ulrich Windl
>  wrote:
>>
>> You could construct a script that generates the commands needed, so it
would
>> be rather easy to handle.
> 
> True. The initial population wouldn't be that burdensome. I was
> thinking of later when my coworkers have to add/remove mounts. I,
> honestly, don't want to be involved in that any more than I must.
> Currently they just make changes in their script and alles ist gut.

Well, it all depends: you could have a configuration file with lines like this:

    Action Configuration...

"Action" would describe what to do (e.g. Add/Remove/Keep) and "Configuration..."
would describe the details. The script would create the needed actions (script
commands) and maybe execute them...
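
For illustration, a minimal sketch of such a generator (the file name, the line
format and the resource-naming scheme are just assumptions) could be:

#!/bin/sh
# Read "Action Configuration..." lines and print the matching pcs commands.
# Example lines:   Add nfssrv:/export/archive /srv/archive
#                  Remove /srv/archive
# Review the output, then pipe it into sh to apply the changes.
while read -r action src dir; do
    case "$action" in
        Add)
            name="fs_$(echo "$dir" | tr '/' '_' | sed 's/^_//')"
            echo "pcs resource create $name ocf:heartbeat:Filesystem" \
                 "device=$src directory=$dir fstype=nfs op monitor interval=60s"
            ;;
        Remove)
            name="fs_$(echo "$src" | tr '/' '_' | sed 's/^_//')"
            echo "pcs resource remove $name"
            ;;
        Keep|'') ;;   # nothing to do
    esac
done < /etc/cluster/nfs-mounts.conf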

You could also (once you have a setup) use a graphical frontend like Hawk to
enable and disable services (adding and removing is a bit more tricky).


> But more than anything I don't want them mucking about with Pacemaker
> commands (which means I would have to do updates) since once they
> break things, I'm the one who would have to fix it and explain how it
> wasn't my fault.
> 
>>
>>
>> Have you considered using automount? It's like fstab, but won't mount
>> automatically.
> 
> We looked at it a few years ago, but it didn't seem to react too well
> to being used in a file server (https/ftps) role and so we abandoned
> it.

So far it was NFS...


> 
>>
>>
>> The most interesting part seems to be the question of how you define (and
>> detect) a failure that will cause a node switch.
> 
> That is a VERY good question! How many mounts failed is the critical
> number when you have 130+? If a single one fails, do you suddenly move
> everything to the other node (even though it's just as likely to fail
> there)? Do you just monitor and issue complaints? At the moment
> there's zero checking of this, so until someone complains that they
> can't reach something, we don't know that the mount isn't working
> properly -- so apparently I guess it's not viewed as that critical.

With manual checking you don't need a cluster: just set up both machines and
run one of them.


> But at the very least, the main home directory for the https/ftps file
> server operations should be operational, or else it's all moot.

Actually I wrote a monitoring plugin that can monitor even hanging NFS mounts
;-) (see attachment)
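
The attachment isn't reproduced here; the usual trick is to stat the mount
point under a timeout so that a hung NFS server cannot block the check.
Roughly, with a made-up mount point:

    timeout -k 5 10 stat -f /srv/home >/dev/null 2>&1 \
        && echo "OK - /srv/home responds" \
        || { echo "CRITICAL - /srv/home not responding"; exit 2; }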

> 
> Is ocf-tester still available? I installed via 'yum' from the High
> Availability repository and don't see it. I also did a 'yum
> whatprovides *bin/ocf-tester' and no package came back. Do I have to
> manually download it from somewhere? If so, could someone provide a link
> to the most up-to-date source?

In SLES (12) it's part of the resource-agents package:
> rpm -qf /usr/sbin/ocf-tester
resource-agents-4.3.018.a7fb5035-3.45.1.x86_64

Regards,
Ulrich


> 
> Thanks!





[ClusterLabs] Antw: Re: [Off-topic] Message threading (Was: Antw: [EXT] Re: Two node cluster and extended distance/site failure)

2020-06-29 Thread Ulrich Windl
>>> Andrei Borzenkov wrote on 29.06.2020 at 17:13 in message:
> 29.06.2020 14:57, Ulrich Windl wrote:
>> >>> Klaus Wenninger wrote on 29.06.2020 at 10:12 in message:
>> [...]
>>> My mailer was confused by all these combinations of
>>> "Antw: Re: Antw:" and didn't compose mails into a
>>> thread properly. Which is why I missed further
>>> discussion where it was definitely still about
>>> shared-storage and not watchdog fencing.
>> [...]
>> 
>> Some remarks: I never liked translations of mail headers, and specifically
>> the translation of "Re:", but any decent MUA should thread messages based on
>> "In-Reply-To:" (or "References:"), not on the subject. (RFC 5230, "5.8.
>> In-Reply-To and References")
> 
> 
> Fine.
> 
> Original message:
> 
> Message-ID:
> 
> 
> Your reply:
> 
> In-Reply-To:
> <31558_1593436561_5EF9E991_31558_332_1_CACLi31XRhAm41CczAS9yoHadd+y6ByBTt7YWXrf_fp4ffv1...@mail.gmail.com>

Thanks for the hint! It seems our IT staff implemented some "security feature"
that changes the Message-ID for incoming messages.
I'm asking for a fix...

> 
> 
> Last thread "Suggestions for multiple NFS mounts as LSB script".





Re: [ClusterLabs] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Strahil Nikolov
NFS mounting ... This sounds like the perfect candidate for autofs or systemd's
'.automount'. Have you thought about systemd automounting your NFS?
It will allow you to mount automatically on demand and unmount based on
inactivity, to prevent stale NFS mounts on network issues.
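
For example, a single fstab line of that kind would look roughly like this
(server, export and mount point are made up):

    nfssrv:/export/archive  /srv/archive  nfs  noauto,x-systemd.automount,x-systemd.idle-timeout=10min,soft,vers=4  0  0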

If you still wish to use your script, you can create a systemd service to
call it and ensure (via Pacemaker) that the service will always be running.


Best Regards,
Strahil Nikolov

On 29 June 2020 16:15:42 GMT+03:00, Tony Stocker wrote:
>Hello
>
>We have a system which has become critical in nature and that
>management wants to be made into a highly available pair of servers. We
>are building on CentOS-8 and using Pacemaker to accomplish this.
>
>Without going into too much detail as to why it's being done, and to
>avoid any comments/suggestions about changing it which I cannot do,
>the system currently uses a script (which is not LSB compliant) to
>mount 133 NFS mounts. Yes, it's a crap ton of NFS mounts. No, I cannot
>do anything to alter, change, or reduce it. I must implement a
>Pacemaker 2-node high-availability pair which mounts those 133 NFS
>mounts. This list of mounts also changes over time as some are removed
>(rarely) and others added (much too frequently) and occasionally
>changed.
>
>It seems to me that manually putting each individual NFS mount in
>using the 'pcs' command as an individual ocf:heartbeat:Filesystem
>resource would be time-consuming and ultimately futile given the
>frequency of changes.
>
>Also, the reason that we don't put all of these mounts in the
>/etc/fstab file is to speed up boot times and ensure that the systems
>can actually come up into a useable state (and not hang forever)
>during a period when the NFS mounts might not be available for
>whatever reason (e.g. archive maintenance periods.)
>
>So, I'm left with trying to turn my coworker's bare minimum bash
>script that mounts these volumes into a functional LSB script. I've
>read:
>https://www.clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/_linux_standard_base.html
>and
>http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
>
>My first question is: is there any kind of script within the Pacemaker
>world that one can use to verify that one's script passes muster and
>is compliant without actually trying to run it as a resource? ~8 years
>ago there used to be a script called ocf-tester that one used to check
>OCF scripts, but I notice that that doesn't seem to be available any
>more - and really I need one for Pacemaker-compatible LSB script
>testing.
>
>Second, just what is Pacemaker expecting from the script? Does it
>'exercise' it looking for all available options? Or is it simply
>relying on it to provide the correct responses when it calls 'start',
>'stop', and 'status'?
>
>Thanks in advance for help.


Re: [ClusterLabs] Antw: [EXT] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Ken Gaillot
On Mon, 2020-06-29 at 13:20 -0400, Tony Stocker wrote:
> On Mon, Jun 29, 2020 at 11:08 AM Ulrich Windl
>  wrote:
> > 
> > You could construct a script that generates the commands needed, so
> > it would
> > be rather easy to handle.
> 
> True. The initial population wouldn't be that burdensome. I was
> thinking of later when my coworkers have to add/remove mounts. I,
> honestly, don't want to be involved in that any more than I must.
> Currently they just make changes in their script and alles ist gut.
> But more than anything I don't want them mucking about with Pacemaker
> commands (which means I would have to do updates) since once they
> break things, I'm the one who would have to fix it and explain how it
> wasn't my fault.

Advantages of having a resource per mount:
- Each is monitored individually so you know if there are mount-
specific issues
- If one mount fails (to start, or later), it doesn't need to affect
the other mounts
- You can define colocation/ordering dependencies between specific
mounts if warranted
- You can choose a failover or load-balancing model

To get around the issues you mention, you could define ACLs and allow
the others access to just the resources section (still pretty broad,
but less room for trouble). You could also write a script for simply
adding/removing mounts so they don't have to know (or be able to
misuse) cluster commands.
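
A rough sketch of such a wrapper (resource-naming scheme and mount options are
assumptions) could be as small as:

#!/bin/sh
# add-nfs-mount.sh SERVER:/EXPORT /MOUNTPOINT
src=$1; dir=$2
name="fs_$(echo "$dir" | tr '/' '_' | sed 's/^_//')"
pcs resource create "$name" ocf:heartbeat:Filesystem \
    device="$src" directory="$dir" fstype="nfs" options="soft,vers=4" \
    op monitor interval=60s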

For background on ACLs see:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#idm47160746093920

and see the man page for pcs/crm/whatever you're using for
configuration.
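
As a sketch, the pcs side of such an ACL could look like this (role and user
names are illustrative):

    pcs acl role create mount-editor description="may edit resources" \
        write xpath //resources
    pcs acl user create someuser mount-editor
    pcs acl enable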


> > Have you considered using automount? It's like fstab, but won't
> > mount
> > automatically.
> 
> We looked at it a few years ago, but it didn't seem to react too well
> to being used in a file server (https/ftps) role and so we abandoned
> it.
> 
> > 
> > 
> > The most interesting part seems to be the question of how you define
> > (and
> > detect) a failure that will cause a node switch.
> 
> That is a VERY good question! How many mounts failed is the critical
> number when you have 130+? If a single one fails, do you suddenly
> move
> everything to the other node (even though it's just as likely to fail
> there)? Do you just monitor and issue complaints? At the moment
> there's zero checking of this, so until someone complains that they
> can't reach something, we don't know that the mount isn't working
> properly -- so apparently I guess it's not viewed as that critical.
> But at the very least, the main home directory for the https/ftps
> file
> server operations should be operational, or else it's all moot.
> 
> Is ocf-tester still available? I installed via 'yum' from the High
> Availability repository and don't see it. I also did a 'yum
> whatprovides *bin/ocf-tester' and no package came back. Do I have to
> manually download it from somewhere? If so, could someone provide a
> link
> to the most up-to-date source?
> 
> Thanks!
-- 
Ken Gaillot 



Re: [ClusterLabs] Antw: [EXT] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Andrei Borzenkov
29.06.2020 20:20, Tony Stocker wrote:
> 
>>
>>
>> The most interesting part seems to be the question of how you define (and
>> detect) a failure that will cause a node switch.
> 
> That is a VERY good question! How many mounts failed is the critical
> number when you have 130+? If a single one fails, do you suddenly move
> everything to the other node (even though it's just as likely to fail
> there)? Do you just monitor and issue complaints?

Is it a rhetorical question, or who are those "you"?

Pacemaker works with resources. A resource is considered failed if either
its state does not match expectations (the resource agent reports the resource
as stopped while Pacemaker started it a while back) or the resource agent
explicitly reports the resource as failed. What constitutes "failed" in the
latter case is entirely up to the resource agent. To notice a resource failure,
the resource definition must include a periodic monitor; otherwise there is
no way Pacemaker will become aware that anything happened.
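
For example, a mount you do want watched would be created with an explicit
recurring monitor, roughly like this (device and directory are placeholders):

    pcs resource create fs_home ocf:heartbeat:Filesystem \
        device="nfssrv:/export/home" directory="/srv/home" fstype="nfs" \
        op monitor interval=60s timeout=40s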

There are also resource dependencies, so you can define that resource B
must always be on the same node as A; if A ever needs to be switched
over, B will follow.
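
With pcs that would be roughly (resource names are placeholders):

    pcs constraint colocation add fs_data with fs_home INFINITY
    pcs constraint order start fs_home then start fs_data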

That's basically all. It is up to you to simply not configure
monitoring of "unimportant" resources, so that after the initial start
nothing ever happens. You can even ignore initial start failures if you
want.

> At the moment
> there's zero checking of this, so until someone complains that they
> can't reach something, we don't know that the mount isn't working
> properly -- so apparently I guess it's not viewed as that critical.
> But at the very least, the main home directory for the https/ftps file
> server operations should be operational, or else it's all moot.
> 

With a single monolithic script, your script is responsible for
distinguishing between "important" and "unimportant" mounts. With
individual resources, you have boilerplate to fill in for each mount point.

> Is ocf-tester still available? I installed via 'yum' from the High
> Availability repository and don't see it. I also did a 'yum
> whatprovides *bin/ocf-tester' and no package came back. Do I have to
> manually download it from somewhere? If so, could someone provide a link
> to the most up-to-date source?
> 

It is part of resource-agents:

https://github.com/ClusterLabs/resource-agents/blob/master/tools/ocf-tester.in

But ocf-tester performs exactly the actual resource operations
(start/stop/etc.) that you want to avoid. It is not a syntactic or semantic
offline checker.
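
For completeness, a typical invocation looks roughly like this (the parameters
are placeholders):

    ocf-tester -n test_fs \
        -o device="nfssrv:/export/home" -o directory="/srv/home" -o fstype="nfs" \
        /usr/lib/ocf/resource.d/heartbeat/Filesystem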


Re: [ClusterLabs] Antw: [EXT] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Tony Stocker
On Mon, Jun 29, 2020 at 11:08 AM Ulrich Windl
 wrote:
>
> You could construct a script that generates the commands needed, so it would
> be rather easy to handle.

True. The initial population wouldn't be that burdensome. I was
thinking of later when my coworkers have to add/remove mounts. I,
honestly, don't want to be involved in that any more than I must.
Currently they just make changes in their script and alles ist gut.
But more than anything I don't want them mucking about with Pacemaker
commands (which means I would have to do updates) since once they
break things, I'm the one who would have to fix it and explain how it
wasn't my fault.

>
>
> Have you considered using automount? It's like fstab, but won't mount
> automatically.

We looked at it a few years ago, but it didn't seem to react too well
to being used in a file server (https/ftps) role and so we abandoned
it.

>
>
> The most interesting part seems to be the question of how you define (and
> detect) a failure that will cause a node switch.

That is a VERY good question! How many mounts failed is the critical
number when you have 130+? If a single one fails, do you suddenly move
everything to the other node (even though it's just as likely to fail
there)? Do you just monitor and issue complaints? At the moment
there's zero checking of this, so until someone complains that they
can't reach something, we don't know that the mount isn't working
properly -- so apparently I guess it's not viewed as that critical.
But at the very least, the main home directory for the https/ftps file
server operations should be operational, or else it's all moot.

Is ocf-tester still available? I installed via 'yum' from the High
Availability repository and don't see it. I also did a 'yum
whatprovides *bin/ocf-tester' and no package came back. Do I have to
manually download it from somewhere? If so, could someone provide a link
to the most up-to-date source?

Thanks!


Re: [ClusterLabs] [Off-topic] Message threading (Was: Antw: [EXT] Re: Two node cluster and extended distance/site failure)

2020-06-29 Thread Andrei Borzenkov
29.06.2020 14:57, Ulrich Windl wrote:
> >>> Klaus Wenninger wrote on 29.06.2020 at 10:12 in message:
> [...]
>> My mailer was confused by all these combinations of
>> "Antw: Re: Antw:" and didn't compose mails into a
>> thread properly. Which is why I missed further
>> discussion where it was definitely still about
>> shared-storage and not watchdog fencing.
> [...]
> 
> Some remarks: I never liked translations of mail headers, and specifically 
> the translation of "Re:", but any decent MUA should thread messages based on 
> "In-Reply-To:" (or "References:"), not on the subject. (RFC 5230, "5.8. 
> In-Reply-To and References")


Fine.

Original message:

Message-ID:


Your reply:

In-Reply-To:
<31558_1593436561_5ef9e991_31558_332_1_cacli31xrham41cczas9yohadd+y6bybtt7ywxrf_fp4ffv1...@mail.gmail.com>


Last thread "Suggestions for multiple NFS mounts as LSB script".


[ClusterLabs] Antw: [EXT] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Ulrich Windl
>>> Tony Stocker wrote on 29.06.2020 at 15:15 in message
<31558_1593436561_5EF9E991_31558_332_1_CACLi31XRhAm41CczAS9yoHadd+y6ByBTt7YWXrF_p4ffv1...@mail.gmail.com>:
> Hello
> 
> We have a system which has become critical in nature and that
> management wants to be made into a highly available pair of servers. We
> are building on CentOS-8 and using Pacemaker to accomplish this.
> 
> Without going into too much detail as to why it's being done, and to
> avoid any comments/suggestions about changing it which I cannot do,
> the system currently uses a script (which is not LSB compliant) to
> mount 133 NFS mounts. Yes, it's a crap ton of NFS mounts. No, I cannot
> do anything to alter, change, or reduce it. I must implement a
> Pacemaker 2-node high-availability pair which mounts those 133 NFS
> mounts. This list of mounts also changes over time as some are removed
> (rarely) and others added (much too frequently) and occasionally
> changed.
> 
> It seems to me that manually putting each individual NFS mount in
> using the 'pcs' command as an individual ocf:heartbeat:Filesystem
> resource would be time-consuming and ultimately futile given the
> frequency of changes.

You could construct a script that generates the commands needed, so it would
be rather easy to handle.

> 
> Also, the reason that we don't put all of these mounts in the
> /etc/fstab file is to speed up boot times and ensure that the systems
> can actually come up into a useable state (and not hang forever)
> during a period when the NFS mounts might not be available for
> whatever reason (e.g. archive maintenance periods.)

Have you considered using automount? It's like fstab, but won't mount
automatically.

> 
> So, I'm left with trying to turn my coworker's bare minimum bash
> script that mounts these volumes into a functional LSB script. I've
> read:
> https://www.clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/_linux_standard_base.html
> and
> http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
> 
> My first question is: is there any kind of script within the Pacemaker
> world that one can use to verify that one's script passes muster and
> is compliant without actually trying to run it as a resource? ~8 years
> ago there used to be a script called ocf-tester that one used to check
> OCF scripts, but I notice that that doesn't seem to be available any
> more - and really I need one for Pacemaker-compatible LSB script
> testing.
> 
> Second, just what is Pacemaker expecting from the script? Does it
> 'exercise' it looking for all available options? Or is it simply
> relying on it to provide the correct responses when it calls 'start',
> 'stop', and 'status'?

If you decide to continue with the script, I would convert it to an OCF
resource agent (according to the docs). It's actually not too difficult, and
you can have automated testing via ocf-tester.
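
A bare-bones sketch of what such an agent could look like (the list-file name
and format, the agent name and the parameter name are made up; a real agent
would also implement validate-all and better error reporting):

#!/bin/sh
# nfsmounts: mounts every "server:/export /mountpoint nfs-options" line
# found in the list file given via the OCF parameter "list".
: "${OCF_ROOT:=/usr/lib/ocf}"
. "${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs"

LIST="${OCF_RESKEY_list:-/etc/cluster/nfs-mounts.list}"

each() {    # run "$1 src dir opts" for every entry, stop on first failure
    while read -r src dir opts; do
        [ -n "$src" ] || continue
        "$1" "$src" "$dir" "$opts" || return 1
    done < "$LIST"
}
is_mounted() { mountpoint -q "$2"; }
do_mount()   { mountpoint -q "$2" || mount -t nfs -o "${3:-defaults}" "$1" "$2"; }
do_umount()  { ! mountpoint -q "$2" || umount "$2"; }

case "$1" in
    start)   each do_mount   && exit "$OCF_SUCCESS"; exit "$OCF_ERR_GENERIC" ;;
    stop)    each do_umount  && exit "$OCF_SUCCESS"; exit "$OCF_ERR_GENERIC" ;;
    monitor) each is_mounted && exit "$OCF_SUCCESS"; exit "$OCF_NOT_RUNNING" ;;
    meta-data) cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="nfsmounts" version="0.1">
  <longdesc lang="en">Mounts a list of NFS filesystems (sketch).</longdesc>
  <shortdesc lang="en">NFS mount list</shortdesc>
  <parameters>
    <parameter name="list">
      <longdesc lang="en">File with one "src dir options" entry per line.</longdesc>
      <shortdesc lang="en">mount list file</shortdesc>
      <content type="string" default="/etc/cluster/nfs-mounts.list"/>
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="300s"/>
    <action name="stop" timeout="300s"/>
    <action name="monitor" timeout="60s" interval="120s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
        exit "$OCF_SUCCESS" ;;
    *)      exit "$OCF_ERR_UNIMPLEMENTED" ;;
esac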

The most interesting part seems to be the question of how you define (and
detect) a failure that will cause a node switch.

Regards,
Ulrich



> 
> Thanks in advance for help.





[ClusterLabs] Suggestions for multiple NFS mounts as LSB script

2020-06-29 Thread Tony Stocker
Hello

We have a system which has become critical in nature and that
management wants to be made into a highly available pair of servers. We
are building on CentOS-8 and using Pacemaker to accomplish this.

Without going into too much detail as to why it's being done, and to
avoid any comments/suggestions about changing it which I cannot do,
the system currently uses a script (which is not LSB compliant) to
mount 133 NFS mounts. Yes, it's a crap ton of NFS mounts. No, I cannot
do anything to alter, change, or reduce it. I must implement a
Pacemaker 2-node high-availability pair which mounts those 133 NFS
mounts. This list of mounts also changes over time as some are removed
(rarely) and others added (much too frequently) and occasionally
changed.

It seems to me that manually putting each individual NFS mount in
using the 'pcs' command as an individual ocf:heartbeat:Filesystem
resource would be time-consuming and ultimately futile given the
frequency of changes.

Also, the reason that we don't put all of these mounts in the
/etc/fstab file is to speed up boot times and ensure that the systems
can actually come up into a useable state (and not hang forever)
during a period when the NFS mounts might not be available for
whatever reason (e.g. archive maintenance periods.)

So, I'm left with trying to turn my coworker's bare minimum bash
script that mounts these volumes into a functional LSB script. I've
read:
https://www.clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/_linux_standard_base.html
and
http://refspecs.linux-foundation.org/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html

My first question is: is there any kind of script within the Pacemaker
world that one can use to verify that one's script passes muster and
is compliant without actually trying to run it as a resource? ~8 years
ago there used to be a script called ocf-tester that one used to check
OCF scripts, but I notice that that doesn't seem to be available any
more - and really I need one for Pacemaker-compatible LSB script
testing.

Second, just what is Pacemaker expecting from the script? Does it
'exercise' it looking for all available options? Or is it simply
relying on it to provide the correct responses when it calls 'start',
'stop', and 'status'?
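
For what it's worth: for an LSB resource Pacemaker only ever calls 'start',
'stop' and 'status' (a recurring monitor is mapped to 'status') and judges the
result purely by exit code - start/stop must return 0 on success and be safe to
repeat, and status must return 0 when everything is running and 3 when it is
not. A minimal sketch along those lines (the list-file name and format are
assumptions):

#!/bin/sh
# /etc/init.d/nfs-mounts -- minimal LSB-style sketch; expects lines of
# "server:/export /mountpoint nfs-options" in the list file.
LIST=/etc/cluster/nfs-mounts.list

do_all() {   # $1 = start|stop|status; non-zero on the first failing entry
    while read -r src dir opts; do
        [ -n "$src" ] || continue
        case "$1" in
            start)  mountpoint -q "$dir" || mount -t nfs -o "${opts:-defaults}" "$src" "$dir" || return 1 ;;
            stop)   ! mountpoint -q "$dir" || umount "$dir" || return 1 ;;
            status) mountpoint -q "$dir" || return 1 ;;
        esac
    done < "$LIST"
}

case "$1" in
    start|stop) do_all "$1" || exit 1 ;;     # 0 = success, non-zero = failure
    status)     do_all status || exit 3 ;;   # LSB: 0 = running, 3 = not running
    *)          echo "Usage: $0 {start|stop|status}" >&2; exit 2 ;;
esac
exit 0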

Thanks in advance for help.


[ClusterLabs] [Off-topic] Message threading (Was: Antw: [EXT] Re: Two node cluster and extended distance/site failure)

2020-06-29 Thread Ulrich Windl
>>> Klaus Wenninger wrote on 29.06.2020 at 10:12 in message

[...]
> My mailer was confused by all these combinations of
> "Antw: Re: Antw:" and didn't compose mails into a
> thread properly. Which is why I missed further
> discussion where it was definitely still about
> shared-storage and not watchdog fencing.
[...]

Some remarks: I never liked translations of mail headers, and specifically the 
translation of "Re:", but any decent MUA should thread messages based on 
"In-Reply-To:" (or "References:"), not on the subject. (RFC 5230, "5.8. 
In-Reply-To and References")

Regards,
Ulrich






Re: [ClusterLabs] clusterlabs.github.io

2020-06-29 Thread Jehan-Guillaume de Rorthais
On Mon, 29 Jun 2020 10:37:27 +0100
Christine Caulfield  wrote:

> On 29/06/2020 10:27, Jehan-Guillaume de Rorthais wrote:
> > On Mon, 29 Jun 2020 09:27:00 +0100
> > Christine Caulfield  wrote:
> >   
> >> Is anyone (else) using this?  
> > 
> > I do: https://clusterlabs.github.io/PAF/
> >   
> >> We publish the libqb man pages to clusterlabs.github.io/libqb but I
> >> can't see any other clusterlabs projects using it (just by adding, eg,
> >> /pacemaker to the hostname).
> >>
> >> With libqb 2.0.1 having actual man pages installed with it - which seems
> >> far more useful to me -  I was considering dropping it if no-one else is
> >> using the facility.  
> > 
> > I have some refactoring to do on PAF website style, to adopt clusterlabs
> > look & feel. This is a long overdue task on my todo list. If you drop
> > clusterlabs.github.io, where will I be able to host PAF docs & stuffs?
> >   
> ]
> 
> 
> To be clear, I'm not planning to remove clusterlabs.github.io, just to
> deprecate libqb from it.

Oh, ok :)

Have a good day.

Regards,


Re: [ClusterLabs] clusterlabs.github.io

2020-06-29 Thread Christine Caulfield
On 29/06/2020 10:27, Jehan-Guillaume de Rorthais wrote:
> On Mon, 29 Jun 2020 09:27:00 +0100
> Christine Caulfield  wrote:
> 
>> Is anyone (else) using this?
> 
> I do: https://clusterlabs.github.io/PAF/
> 
>> We publish the libqb man pages to clusterlabs.github.io/libqb but I
>> can't see any other clusterlabs projects using it (just by adding, eg,
>> /pacemaker to the hostname).
>>
>> With libqb 2.0.1 having actual man pages installed with it - which seems
>> far more useful to me -  I was considering dropping it if no-one else is
>> using the facility.
> 
> I have some refactoring to do on PAF website style, to adopt clusterlabs
> look & feel. This is a long overdue task on my todo list. If you drop
> clusterlabs.github.io, where will I be able to host PAF docs & stuffs?
> 
]


To be clear, I'm not planning to remove clusterlabs.github.io, just to
deprecate libqb from it.

Chrissie



Re: [ClusterLabs] clusterlabs.github.io

2020-06-29 Thread Jehan-Guillaume de Rorthais
On Mon, 29 Jun 2020 09:27:00 +0100
Christine Caulfield  wrote:

> Is anyone (else) using this?

I do: https://clusterlabs.github.io/PAF/

> We publish the libqb man pages to clusterlabs.github.io/libqb but I
> can't see any other clusterlabs projects using it (just by adding, eg,
> /pacemaker to the hostname).
> 
> With libqb 2.0.1 having actual man pages installed with it - which seems
> far more useful to me -  I was considering dropping it if no-one else is
> using the facility.

I have some refactoring to do on PAF website style, to adopt clusterlabs
look & feel. This is a long overdue task on my todo list. If you drop
clusterlabs.github.io, where will I be able to host PAF docs & stuffs?

Regards,


[ClusterLabs] clusterlabs.github.io

2020-06-29 Thread Christine Caulfield
Is anyone (else) using this?

We publish the libqb man pages to clusterlabs.github.io/libqb but I
can't see any other clusterlabs projects using it (just by adding, eg,
/pacemaker to the hostname).

With libqb 2.0.1 having actual man pages installed with it - which seems
far more useful to me -  I was considering dropping it if no-one else is
using the facility.

Any opinions?

Chrissie



Re: [ClusterLabs] Two node cluster and extended distance/site failure

2020-06-29 Thread Klaus Wenninger
On 6/29/20 10:12 AM, Klaus Wenninger wrote:
> On 6/29/20 9:56 AM, Klaus Wenninger wrote:
>> On 6/24/20 8:09 AM, Andrei Borzenkov wrote:
>>> Two node is what I almost exclusively deal with. It works reasonably
>>> well in one location where failures to perform fencing are rare and can
>>> be mitigated by two different fencing methods. Usually SBD is reliable
>>> enough, as failure of shared storage also implies failure of the whole
>>> cluster.
>>>
>>> When two nodes are located on separate sites (not necessary
>>> Asia/America, two buildings across the street is already enough) we have
>>> issue of complete site isolation where normal fencing becomes impossible
>>> together with missing node (power outage, network outage etc).
>>>
>>> Usual recommendation is third site which functions as witness. This
>>> works fine up to failure of this third site itself. Unavailability of
>>> the witness makes normal maintenance of either of two nodes impossible.
>>> If witness is not available and (pacemaker on) one of two nodes needs to
>>> be restarted the remaining node goes out of quorum or commits suicide.
>>> At most we can statically designate one node as tiebreaker (and this is
>>> already incompatible with qdevice).
>>>
>>> I think I finally can formulate what I miss. The behavior that I would
>>> really want is
>>>
>>> - if (pacemaker on) one node performs normal shutdown, remaining node
>>> continues managing services, independently of witness state or
>>> availability. Usually this is achieved either by two_node or by
>>> no-quorum-policy=ignore, but that absolutely requires successful
>>> fencing, so cannot be used alone. Such feature likely mandates WFA, but
>>> that is probably unavoidable.
>>>
>>> - if other node is lost unexpectedly, first try normal fencing between
>>> two nodes, independently of witness state or availability. If fencing
>>> succeeds, we can continue managing services.
>>>
>>> - if normal fencing fails (due to other site isolation), consult witness
>>> - and follow normal procedure. If witness is not available/does not
>>> grant us quorum - suicide/go out of quorum, if witness is available and
>>> grants us quorum - continue managing services.
>>>
>>> Any potential issues with this? If it is possible to implement using
>>> current tools I did not find it.
>> I see the idea but I see a couple of issues:
> My mailer was confused by all these combinations of
> "Antw: Re: Antw:" and didn't compose mails into a
> thread properly. Which is why I missed further
> discussion where it was definitely still about
> shared-storage and not watchdog fencing.
> Had guessed - from the initial post - that there was
> a shift in direction of qdevice.
> But maybe thoughts below are still interesting in
> that light ...
And what I had said about timing for quorum-loss is of
course true as well for loss of access to a shared disk,
which is why your "consult the witness as a last resort" step is critical.
>>
>> - watchdog-fencing is timing critical. So when losing quorum
>>   we have to suicide after a defined time. So just trying normal
>>   fencing first and then going for watchdog-fencing is no way.
>>   But what could be considered is right away starting with
>>   watchdog fencing upon quorum-loss - remember I said defined
>>   which doesn't necessarily mean short - and try other means
>>   of fencing in parallel and if that succeeds e.g. somehow
>>   regain quorum (additional voting or something one would have
>>   to think over a little more).
>>
>> - usually we are using quorum to prevent a fence race which
>>   this approach jeopardizes. Of course we can introduce an
>>   additional wait before normal fencing on the node that
>>   doesn't have quorum to mitigate that effect.
>>
>> - why I think current configuration possibilities won't give
>>   you your desired behavior is that we finally end up with
>>   2 quorum sources.
>>   Only case where I'm aware of a similar thing is
>>   2-node + shared disk, where sbd decides not to go with quorum
>>   gotten from pacemaker but does node-counting internally
>>   instead.
>>
>> Klaus
>>> And note, that this is not actually limited to two node cluster - we
>>> have more or less the same issue with any 50-50 split cluster and
>>> witness on third site.
>



Re: [ClusterLabs] Two node cluster and extended distance/site failure

2020-06-29 Thread Klaus Wenninger
On 6/29/20 9:56 AM, Klaus Wenninger wrote:
> On 6/24/20 8:09 AM, Andrei Borzenkov wrote:
>> Two node is what I almost exclusively deal with. It works reasonably
>> well in one location where failures to perform fencing are rare and can
>> be mitigated by two different fencing methods. Usually SBD is reliable
>> enough, as failure of shared storage also implies failure of the whole
>> cluster.
>>
>> When two nodes are located on separate sites (not necessary
>> Asia/America, two buildings across the street is already enough) we have
>> issue of complete site isolation where normal fencing becomes impossible
>> together with missing node (power outage, network outage etc).
>>
>> Usual recommendation is third site which functions as witness. This
>> works fine up to failure of this third site itself. Unavailability of
>> the witness makes normal maintenance of either of two nodes impossible.
>> If witness is not available and (pacemaker on) one of two nodes needs to
>> be restarted the remaining node goes out of quorum or commits suicide.
>> At most we can statically designate one node as tiebreaker (and this is
>> already incompatible with qdevice).
>>
>> I think I finally can formulate what I miss. The behavior that I would
>> really want is
>>
>> - if (pacemaker on) one node performs normal shutdown, remaining node
>> continues managing services, independently of witness state or
>> availability. Usually this is achieved either by two_node or by
>> no-quorum-policy=ignore, but that absolutely requires successful
>> fencing, so cannot be used alone. Such feature likely mandates WFA, but
>> that is probably unavoidable.
>>
>> - if other node is lost unexpectedly, first try normal fencing between
>> two nodes, independently of witness state or availability. If fencing
>> succeeds, we can continue managing services.
>>
>> - if normal fencing fails (due to other site isolation), consult witness
>> - and follow normal procedure. If witness is not available/does not
>> grant us quorum - suicide/go out of quorum, if witness is available and
>> grants us quorum - continue managing services.
>>
>> Any potential issues with this? If it is possible to implement using
>> current tools I did not find it.
> I see the idea but I see a couple of issues:
My mailer was confused by all these combinations of
"Antw: Re: Antw:" and didn't compose mails into a
thread properly. Which is why I missed further
discussion where it was definitely still about
shared-storage and not watchdog fencing.
Had guessed - from the initial post - that there was
a shift in direction of qdevice.
But maybe thoughts below are still interesting in
that light ...
>
>
> - watchdog-fencing is timing critical. So when losing quorum
>   we have to suicide after a defined time. So just trying normal
>   fencing first and then going for watchdog-fencing is no way.
>   But what could be considered is right away starting with
>   watchdog fencing upon quorum-loss - remember I said defined
>   which doesn't necessarily mean short - and try other means
>   of fencing in parallel and if that succeeds e.g. somehow
>   regain quorum (additional voting or something one would have
>   to think over a little more).
>
> - usually we are using quorum to prevent a fence race which
>   this approach jeopardizes. Of course we can introduce an
>   additional wait before normal fencing on the node that
>   doesn't have quorum to mitigate that effect.
>
> - why I think current configuration possibilities won't give
>   you your desired behavior is that we finally end up with
>   2 quorum sources.
>   Only case where I'm aware of a similar thing is
>   2-node + shared disk, where sbd decides not to go with quorum
>   gotten from pacemaker but does node-counting internally
>   instead.
>
> Klaus
>> And note, that this is not actually limited to two node cluster - we
>> have more or less the same issue with any 50-50 split cluster and
>> witness on third site.


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Base Operating Systems

Red Hat

kwenn...@redhat.com   

Red Hat GmbH, http://www.de.redhat.com/, Sitz: Grasbrunn, 
Handelsregister: Amtsgericht München, HRB 153243,
Geschäftsführer: Charles Cachera, Laurie Krebs, Michael O'Neill, Thomas Savage



Re: [ClusterLabs] Two node cluster and extended distance/site failure

2020-06-29 Thread Klaus Wenninger
On 6/24/20 8:09 AM, Andrei Borzenkov wrote:
> Two node is what I almost exclusively deal with. It works reasonably
> well in one location where failures to perform fencing are rare and can
> be mitigated by two different fencing methods. Usually SBD is reliable
> enough, as failure of shared storage also implies failure of the whole
> cluster.
>
> When two nodes are located on separate sites (not necessary
> Asia/America, two buildings across the street is already enough) we have
> issue of complete site isolation where normal fencing becomes impossible
> together with missing node (power outage, network outage etc).
>
> Usual recommendation is third site which functions as witness. This
> works fine up to failure of this third site itself. Unavailability of
> the witness makes normal maintenance of either of two nodes impossible.
> If witness is not available and (pacemaker on) one of two nodes needs to
> be restarted the remaining node goes out of quorum or commits suicide.
> At most we can statically designate one node as tiebreaker (and this is
> already incompatible with qdevice).
>
> I think I finally can formulate what I miss. The behavior that I would
> really want is
>
> - if (pacemaker on) one node performs normal shutdown, remaining node
> continues managing services, independently of witness state or
> availability. Usually this is achieved either by two_node or by
> no-quorum-policy=ignore, but that absolutely requires successful
> fencing, so cannot be used alone. Such feature likely mandates WFA, but
> that is probably unavoidable.
>
> - if other node is lost unexpectedly, first try normal fencing between
> two nodes, independently of witness state or availability. If fencing
> succeeds, we can continue managing services.
>
> - if normal fencing fails (due to other site isolation), consult witness
> - and follow normal procedure. If witness is not available/does not
> grant us quorum - suicide/go out of quorum, if witness is available and
> grants us quorum - continue managing services.
>
> Any potential issues with this? If it is possible to implement using
> current tools I did not find it.
I see the idea but I see a couple of issues:


- watchdog-fencing is timing critical. So when losing quorum
  we have to suicide after a defined time. So just trying normal
  fencing first and then going for watchdog-fencing is no way.
  But what could be considered is right away starting with
  watchdog fencing upon quorum-loss - remember I said defined
  which doesn't necessarily mean short - and try other means
  of fencing in parallel and if that succeeds e.g. somehow
  regain quorum (additional voting or something one would have
  to think over a little more).

- usually we are using quorum to prevent a fence race which
  this approach jeopardizes. Of course we can introduce an
  additional wait before normal fencing on the node that
  doesn't have quorum to mitigate that effect.

- why I think current configuration possibilities won't give
  you your desired behavior is that we finally end up with
  2 quorum sources.
  Only case where I'm aware of a similar thing is
  2-node + shared disk, where sbd decides not to go with quorum
  gotten from pacemaker but does node-counting internally
  instead.
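
For reference, the two quorum setups being contrasted in this thread look
roughly like the corosync.conf sketch below (the witness host name is made up;
with two_node set, wait_for_all is implied):

  # plain two-node cluster
  quorum {
      provider: corosync_votequorum
      two_node: 1
  }

  # two nodes plus a qdevice witness on a third site
  quorum {
      provider: corosync_votequorum
      device {
          model: net
          votes: 1
          net {
              host: witness.example.com
              algorithm: ffsplit
          }
      }
  }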

Klaus
>
> And note, that this is not actually limited to two node cluster - we
> have more or less the same issue with any 50-50 split cluster and
> witness on third site.
>
