Re: [ClusterLabs] Antw: [EXT] How to stop removed resources when replacing cib.xml via cibadmin or crm_shadow

2020-10-02 Thread Ken Gaillot
On Fri, 2020-10-02 at 21:35 +0300, Igor Tverdovskiy wrote:
> 
> 
> On Thu, Oct 1, 2020 at 5:55 PM Ken Gaillot 
> wrote:
> > There's no harm on the Pacemaker side in doing so.
> > 
> > A resource that's running but removed from the configuration is
> > what
> > Pacemaker calls an "orphan". By default (the stop-orphan-resources
> > cluster property) it will try to stop these. Pacemaker keeps the
> > set of
> > parameters that a resource was started with in memory, so it
> > doesn't
> > need the now-removed configuration to perform the stop. So, the
> > "ORPHANED" part of this is normal and appropriate.
> > 
> > The problem in this particular case is the "FAILED ... (blocked)".
> > Removing the configuration shouldn't cause the resource to fail,
> > and
> > something is blocking the stop. You should be able to see in the
> > failed
> > action section of the status, or in the logs, what failed and why
> > it's
> > blocked. My guess is the stop itself failed, in which case you'd
> > need
> > to investigate why that happened.
> 
> Hi Ken,
> 
> As always, thanks a lot for pointing me in the right direction!
> I have dug through the logs, but something illogical is happening. Maybe
> you can shed some light on it?
> 
> Just in case: I have a pretty old Pacemaker (1.1.15-11.el7). The version
> was frozen because of changes in the stickiness=-1 attribute handling
> logic. I'm considering an update to a newer stable version later on, but
> at the moment I have to deal with this version.
> 
> First of all "stop-orphan-resources" was not set and thus by default
> should be true, but still I see a strange message
> > Cluster configured not to stop active orphans. vip-10.0.0.115 must
> be stopped manually on tt738741-ip2
> 
> I even explicitly set "stop-orphan-resources" to true
> > sudo crm_attribute --type crm_config --name stop-orphan-resources
> --query
> scope=crm_config  name=stop-orphan-resources value=true
> 
> To be fair, Pacemaker does still try to stop ORPHANED
> resources, i.e.:
> > Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
> native_color:Stopping orphan resource vip-10.0.0.115
> 
> To my mind the issue is the following:
> > Clearing failcount for monitor on vip-10.0.0.115, tt738741-ip2
> failed and now resource parameters have changed.
> 
> I suppose this operation clears the resource parameters that
> were used at start.
> 
> The error is pretty straightforward:
> > IPaddr2(vip-10.0.0.115)[21351]: 2020/10/02_09:20:56 ERROR: IP
> address (the ip parameter) is mandatory
> 
> As I understand it, this means the ip parameter had already vanished by
> the time of the "stop" action.
> 
> It looks like a bug, but who knows.
> 
> Trimmed logs below; if required I can provide the full log, cib.xml, etc.
> ```
> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine:  warning:
> process_rsc_state:   Cluster configured not to stop active
> orphans. vip-10.0.0.115 must be stopped manually on tt738741-ip2

It'll log this message anytime the orphan is unmanaged, not just when
stop-orphan-resources is false, so I'll make a note to change the
message.

Did the resource happen to be unmanaged via the configuration at the
time it was removed? Obviously it couldn't be unmanaged via its own
(now gone) configuration, but maybe by resource defaults or maintenance
mode?
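
For example, something along these lines should show whether either of those
applies (just one way to check; adjust for your version if needed):

    # cluster-wide maintenance mode?
    crm_attribute --type crm_config --name maintenance-mode --query

    # any is-managed=false lurking in resource defaults?
    cibadmin --query --xpath "//rsc_defaults" 2>/dev/null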

> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
> native_add_running:  resource vip-10.0.0.115 isn't managed
> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
> native_add_running:  resource haproxy-10.0.0.115 isn't managed
> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
> determine_op_status: Operation monitor found resource vip-
> 10.0.0.115 active on tt738741-ip2
> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
> check_operation_expiry:  Clearing failcount for monitor on vip-
> 10.0.0.115, tt738741-ip2 failed and now resource parameters have
> changed.
> ...
> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine:  warning:
> process_rsc_state:   Detected active orphan vip-10.0.0.115
> running on tt738741-ip2
> ...
> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
> native_print:vip-10.0.0.115  (ocf::heartbeat:IPaddr2):
> ORPHANED Started tt738741-ip2
> ...
> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
> native_color:Stopping orphan resource vip-10.0.0.115

There should be a "saving inputs" log message with a filename shortly
after this. If you could email me that file, I could check whether there are
any issues on the scheduler side of things.
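
For reference, such a file can also be replayed locally with crm_simulate; the
path and file name below are only an example, use the one from the "saving
inputs" message:

    crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-123.bz2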

> ...
> Oct 02 09:20:56 [21763] tt738741-ip2.ops   lrmd: info:
> log_execute: executing - rsc:vip-10.0.0.115 action:stop
> call_id:4358
> Oct 02 09:20:56 [21766] tt738741-ip2.ops   crmd: info:
> te_crm_command:  Executing crm-event (1): clear_failcount on
> tt738741-ip2
> Oct 02 09:20:56 [21766] tt738741-ip2.ops   crmd: info:
> process_lrm_event:   Result of 

Re: [ClusterLabs] pacemaker and cluster hostname reconfiguration

2020-10-02 Thread Igor Tverdovskiy
Hi Riccardo,

As I see, you have already handled the issue, but I would recommend using
static node names in corosync.conf instead of referencing the hostname.
I did so years ago and have had no issues with hostname changes since.

e.g. (inside the nodelist section of corosync.conf):

nodelist {
    node {
        ring0_addr: 1.1.1.1
        name: my.node
        nodeid: 123456
    }
}
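
If it helps, what name the cluster is actually using can be double-checked with
something like:

    # name the local node reports to the cluster
    crm_node --name
    # nodelist as seen by the running corosync
    corosync-cmapctl | grep nodelist.node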

On Thu, Oct 1, 2020 at 10:10 PM Riccardo Manfrin <
riccardo.manf...@athonet.com> wrote:

> Thank you for your suggestion Ken; I'm indeed on Centos7, but using
>
> hostnamectl set-hostname newHostname
>
> in place of
>
> hostname -f /etc/hostname
>
> didn't have any beneficial effect. As soon as I powered off one of the
> two nodes, the other one took the old hostnames back and drifted out of
> sync.
>
> The only way of doing this in the end was
>
> 1. rebooting the machines (close together in time, so that the first new
> corosync instance coming up never sees the old instance from the other node,
> or it picks up the old hostnames again)
>
> 2. killing pacemakerd and corosync (and letting systemd bring them back up
> again).
>
> This second method appears to be the cleanest and most robust, and has
> the advantage that, while primitives/services are briefly unsupervised, they
> are not reloaded.
>
> I hope this can be of help to someone although I tend to think that my
> case was really a rare beast not to be seen around.
>
> R
>
> On 01/10/20 16:41, Ken Gaillot wrote:
> > Does "uname -n" also revert?
> >
> > It looks like you're using RHEL 7 or a derivative -- if so, use
> > hostnamectl to change the host name. That will make sure it's updated
> > in the right places.
> 
>


Re: [ClusterLabs] Antw: [EXT] How to stop removed resources when replacing cib.xml via cibadmin or crm_shadow

2020-10-02 Thread Igor Tverdovskiy
On Thu, Oct 1, 2020 at 5:55 PM Ken Gaillot  wrote:

> There's no harm on the Pacemaker side in doing so.
>
> A resource that's running but removed from the configuration is what
> Pacemaker calls an "orphan". By default (the stop-orphan-resources
> cluster property) it will try to stop these. Pacemaker keeps the set of
> parameters that a resource was started with in memory, so it doesn't
> need the now-removed configuration to perform the stop. So, the
> "ORPHANED" part of this is normal and appropriate.
>
> The problem in this particular case is the "FAILED ... (blocked)".
> Removing the configuration shouldn't cause the resource to fail, and
> something is blocking the stop. You should be able to see in the failed
> action section of the status, or in the logs, what failed and why it's
> blocked. My guess is the stop itself failed, in which case you'd need
> to investigate why that happened.
>

Hi Ken,

As always, thanks a lot for pointing me in the right direction!
I have dug through the logs, but something illogical is happening. Maybe you can
shed some light on it?

Just in case: I have a pretty old Pacemaker (1.1.15-11.el7). The version was
frozen because of changes in the stickiness=-1 attribute handling logic. I'm
considering an update to a newer stable version later on, but
at the moment I have to deal with this version.

First of all "stop-orphan-resources" was not set and thus by default should
be true, but still I see a strange message
> Cluster configured not to stop active orphans. vip-10.0.0.115 must be
stopped manually on tt738741-ip2

I even explicitly set "stop-orphan-resources" to true
> sudo crm_attribute --type crm_config --name stop-orphan-resources --query
scope=crm_config  name=stop-orphan-resources value=true
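
For reference, the property can be set explicitly with something along these
lines (shown only for illustration):

    sudo crm_attribute --type crm_config --name stop-orphan-resources --update true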

To be fair, Pacemaker does still try to stop ORPHANED resources,
i.e.:
> Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
native_color:Stopping orphan resource vip-10.0.0.115

To my mind the issue is the following:
> Clearing failcount for monitor on vip-10.0.0.115, tt738741-ip2 failed and
now resource parameters have changed.

I suppose this operation clears the resource parameters that were
used at start.

The error is pretty straightforward:
> IPaddr2(vip-10.0.0.115)[21351]: 2020/10/02_09:20:56 ERROR: IP address
(the ip parameter) is mandatory

As I understand it, this means the ip parameter had already vanished by the
time of the "stop" action.
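
For context, this is roughly how the ip parameter normally sits in the CIB for
such a resource (the netmask and ids below are only illustrative):

    <primitive id="vip-10.0.0.115" class="ocf" provider="heartbeat" type="IPaddr2">
      <instance_attributes id="vip-10.0.0.115-instance_attributes">
        <nvpair id="vip-10.0.0.115-ip" name="ip" value="10.0.0.115"/>
        <nvpair id="vip-10.0.0.115-cidr_netmask" name="cidr_netmask" value="24"/>
      </instance_attributes>
    </primitive>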

It looks like a bug, but who knows.

Trimmed logs below; if required I can provide the full log, cib.xml, etc.
```
Oct 02 09:20:56 [21765] tt738741-ip2.opspengine:  warning:
process_rsc_state:   Cluster configured not to stop active orphans.
vip-10.0.0.115 must be stopped manually on tt738741-ip2
Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
native_add_running:  resource vip-10.0.0.115 isn't managed
Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
native_add_running:  resource haproxy-10.0.0.115 isn't managed
Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
determine_op_status: Operation monitor found resource vip-10.0.0.115
active on tt738741-ip2
Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
check_operation_expiry:  Clearing failcount for monitor on vip-10.0.0.115,
tt738741-ip2 failed and now resource parameters have changed.
...
Oct 02 09:20:56 [21765] tt738741-ip2.opspengine:  warning:
process_rsc_state:   Detected active orphan vip-10.0.0.115 running on
tt738741-ip2
...
Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
native_print:vip-10.0.0.115  (ocf::heartbeat:IPaddr2): ORPHANED
Started tt738741-ip2
...
Oct 02 09:20:56 [21765] tt738741-ip2.opspengine: info:
native_color:Stopping orphan resource vip-10.0.0.115
...
Oct 02 09:20:56 [21763] tt738741-ip2.ops   lrmd: info: log_execute:
executing - rsc:vip-10.0.0.115 action:stop call_id:4358
Oct 02 09:20:56 [21766] tt738741-ip2.ops   crmd: info:
te_crm_command:  Executing crm-event (1): clear_failcount on tt738741-ip2
Oct 02 09:20:56 [21766] tt738741-ip2.ops   crmd: info:
process_lrm_event:   Result of monitor operation for vip-10.0.0.115 on
tt738741-ip2: Cancelled | call=4323 key=vip-10.0.0.115_monitor_1
confirmed=true
Oct 02 09:20:56 [21766] tt738741-ip2.ops   crmd: info:
handle_failcount_op: Removing failcount for vip-10.0.0.115
...
IPaddr2(vip-10.0.0.115)[21351]: 2020/10/02_09:20:56 ERROR: IP address (the
ip parameter) is mandatory
Oct 02 09:20:56 [21763] tt738741-ip2.ops   lrmd:   notice:
operation_finished:  vip-10.0.0.115_stop_0:21351:stderr [
ocf-exit-reason:IP address (the ip parameter) is mandatory ]
Oct 02 09:20:56 [21763] tt738741-ip2.ops   lrmd: info:
log_finished:finished - rsc:vip-10.0.0.115 action:stop call_id:4358
pid:21351 exit-code:6 exec-time:124ms queue-time:0ms
Oct 02 09:20:56 [21766] tt738741-ip2.ops   crmd:   notice:
process_lrm_event:   Result of stop 

[ClusterLabs] pcs 0.10.7 released

2020-10-02 Thread Tomas Jelinek

I am happy to announce the latest release of pcs, version 0.10.7.

Source code is available at:
https://github.com/ClusterLabs/pcs/archive/0.10.7.tar.gz
or
https://github.com/ClusterLabs/pcs/archive/0.10.7.zip

Complete change log for this release:
## [0.10.7] - 2020-09-30

### Added
- Support for multiple sets of resource and operation defaults,
  including support for rules ([rhbz#1222691], [rhbz#1817547],
  [rhbz#1862966], [rhbz#1867516], [rhbz#1869399])
- Support for "demote" value of resource operation's "on-fail" option
  ([rhbz#1843079])
- Support for 'number' type in rules ([rhbz#1869399])
- It is possible to set custom (promotable) clone id in `pcs resource
  create` and `pcs resource clone/promotable` commands ([rhbz#1741056])

### Fixed
- Prevent removing non-empty tag by removing tagged resource group or
  clone ([rhbz#1857295])
- Clarify documentation for 'resource move' and 'resource ban' commands
  with regards to the 'lifetime' option.
- Allow moving both promoted and demoted promotable clone resources
  ([rhbz#1875301])
- Improved error message with a hint in `pcs cluster cib-push`
  ([ghissue#241])

### Deprecated
- `pcs resource [op] defaults <name>=<value>...` commands are deprecated
  now. Use `pcs resource [op] defaults update <name>=<value>...` if you
  only manage one set of defaults, or `pcs resource [op] defaults set`
  if you manage several sets of defaults. ([rhbz#1817547])
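
  For illustration, with resource-stickiness as an example option, the old and
  new forms look roughly like this:

    # deprecated form
    pcs resource defaults resource-stickiness=100
    # preferred form when managing a single set of defaults
    pcs resource defaults update resource-stickiness=100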


Thanks / congratulations to everyone who contributed to this release,
including Ivan Devat, Miroslav Lisik, Ondrej Mular, Reid Wahl and Tomas
Jelinek.

Cheers,
Tomas


[ghissue#241]: https://github.com/ClusterLabs/pcs/issues/241
[rhbz#1222691]: https://bugzilla.redhat.com/show_bug.cgi?id=1222691
[rhbz#1741056]: https://bugzilla.redhat.com/show_bug.cgi?id=1741056
[rhbz#1817547]: https://bugzilla.redhat.com/show_bug.cgi?id=1817547
[rhbz#1843079]: https://bugzilla.redhat.com/show_bug.cgi?id=1843079
[rhbz#1857295]: https://bugzilla.redhat.com/show_bug.cgi?id=1857295
[rhbz#1862966]: https://bugzilla.redhat.com/show_bug.cgi?id=1862966
[rhbz#1867516]: https://bugzilla.redhat.com/show_bug.cgi?id=1867516
[rhbz#1869399]: https://bugzilla.redhat.com/show_bug.cgi?id=1869399
[rhbz#1875301]: https://bugzilla.redhat.com/show_bug.cgi?id=1875301



Re: [ClusterLabs] Tuchanka

2020-10-02 Thread Klaus Wenninger
On 10/2/20 3:15 PM, Jehan-Guillaume de Rorthais wrote:
> On Fri, 2 Oct 2020 15:18:18 +0300
> Олег Самойлов  wrote:
>
>>> On 29 Sep 2020, at 11:34, Jehan-Guillaume de Rorthais 
>>> wrote:
>>>
>>>
>>> Vagrant uses VirtualBox by default, which supports softdog, but it supports
>>> many other virtualization platforms, including e.g. libvirt/kvm, where you
>>> can use a virtualized watchdog card.
>>>   
   
>>> Vagrant can use Chef, Ansible, Salt, puppet, and others to provision VM:
>>>
>>>  https://www.vagrantup.com/docs/provisioning
>>>
>>>
>>> There are many, many vagrant images available:
>>> https://app.vagrantup.com/boxes/search There are many vagrant images... because
>>> building a vagrant image is easy. I built some when RH8 wasn't available yet.
>>> So if you need a special box, with e.g. some predefined setup, you can do it
>>> quite fast.
>> My English is poor, I'll try to find other words. My primary and main task
>> was to create a prototype for an automatic deploy system. So I used only the
>> same technique that will be used on the real hardware servers: RedHat DVD
>> image + kickstart. And to test such deployment too. That's why I do not use
>> any special image for virtual machines.
> How exactly is using a vagrant box you built yourself different from
> VirtualBox, where you clone (I suppose) an existing VM you built?
>
>>> Watchdog is kind of a self-fencing method. Cluster with quorum+watchdog, or
>>> SBD+watchdog or quorum+SBD+watchdog are fine...without "active" fencing.  
>> quorum+watchdog or SBD+watchdog are useless. Quorum+SBD+watchdog is a
>> solution, but it also has some drawbacks, so this is not perfect or fine yet.
> Well, by "SBD", I meant "Storage Based Death": using shared storage to
> poison-pill other nodes. Not just the sbd daemon, which is used for both SBD
> and watchdog. Sorry for the shortcut and the confusion.
>
>> I'll write about it below.
>>   
> Now, in regard to your multi-site clusters and how you deal with them
> using quorum, did you read the chapter about the Cluster Ticket Registry
> in the Pacemaker doc? See:
>
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/ch15.html
>
>> Yep, I read the whole documentation two years ago. Yep, the ticket system
>> looked interesting at first glance, but I didn't see how to
>> use it with PAF. :)
>>> It could be interesting to have detailed feedback about that. Could you
>>> share your experience?
>> Heh, I don't have experience of using the ticket system because I can't even
>> imagine how to use the ticket system with PAF.
> OK
>
>> As for pacemaker without STONITH, the idea was simple: quorum + SBD as a
>> watchdog daemon.
> (this is what I described as "quorum+watchdog", again sorry for the
> confusion :))
>
>> It is described more precisely in the README. As proven by my test
>> system, this mostly works. :)
>>
>> What are the possible caveats? First of all, softdog is not good for this
>> (only for testing), and the system will heavily depend on the reliability of
>> the watchdog device.
> +1
>
>> SBD is not good as a watchdog daemon. In my version it does not check
>> that corosync and the pacemaker processes are not frozen (for
>> instance by kill -STOP). It looks like the check for corosync has already
>> been done: https://github.com/ClusterLabs/sbd/pull/83
> Good.
>
>> I don't know about checking all of the pacemaker processes.
> This is moving in the right direction, I would say:
>
>   https://lists.clusterlabs.org/pipermail/users/2020-August/027602.html
>
> The main Pacemaker process is now checked by sbd. Maybe other processes will
> be included in future releases as "more in-depth health checks", as written in
> this email.
We are targeting a hierarchical approach:

SBD is checking pacemakerd - more explicitly, a timestamp of
when pacemaker was last considered fine. So this task
of checking liveness of the whole group of pacemaker
daemons can be passed over to pacemakerd without risking
that pacemakerd might be stalled or something.

Klaus
>
> Regards,



Re: [ClusterLabs] Tuchanka

2020-10-02 Thread Jehan-Guillaume de Rorthais
On Fri, 2 Oct 2020 15:18:18 +0300
Олег Самойлов  wrote:

> > On 29 Sep 2020, at 11:34, Jehan-Guillaume de Rorthais 
> > wrote:
> > 
> > 
> > Vagrant uses VirtualBox by default, which supports softdog, but it supports
> > many other virtualization platforms, including e.g. libvirt/kvm, where you
> > can use a virtualized watchdog card.
> >   
> >>   
> > 
> > Vagrant can use Chef, Ansible, Salt, puppet, and others to provision VM:
> > 
> >  https://www.vagrantup.com/docs/provisioning
> > 
> > 
> > There are many, many vagrant images available:
> > https://app.vagrantup.com/boxes/search There are many vagrant images... because
> > building a vagrant image is easy. I built some when RH8 wasn't available yet.
> > So if you need a special box, with e.g. some predefined setup, you can do it
> > quite fast.
> 
> My English is poor, I'll try to find other words. My primary and main task
> was to create a prototype for an automatic deploy system. So I used only the
> same technique that will be used on the real hardware servers: RedHat DVD
> image + kickstart. And to test such deployment too. That's why I do not use
> any special image for virtual machines.

How exactly is using a vagrant box you built yourself different from
VirtualBox, where you clone (I suppose) an existing VM you built?

> > Watchdog is kind of a self-fencing method. Cluster with quorum+watchdog, or
> > SBD+watchdog or quorum+SBD+watchdog are fine...without "active" fencing.  
> 
> quorum+watchdog or SBD+watchdog are useless. Quorum+SBD+watchdog is a
> solution, but it also has some drawbacks, so this is not perfect or fine yet.

Well, by "SBD", I meant "Storage Based Death": using shared storage to
poison-pill other nodes. Not just the sbd daemon, which is used for both SBD
and watchdog. Sorry for the shortcut and the confusion.

> I'll write about it below.
>   
> >>> Now, in regard to your multi-site clusters and how you deal with them
> >>> using quorum, did you read the chapter about the Cluster Ticket Registry
> >>> in the Pacemaker doc? See:
> >>> 
> >>> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/ch15.html
> >>> 
> >> 
> >> Yep, I read the whole documentation two years ago. Yep, the ticket system
> >> looked interesting at first glance, but I didn't see how to
> >> use it with PAF. :)
> > 
> > It could be interesting to have detailed feedback about that. Could you
> > share your experience?  
> 
> Heh, I don't have experience of using the ticket system because I can't even
> imagine how to use the ticket system with PAF.

OK

> As for pacemaker without STONITH, the idea was simple: quorum + SBD as a
> watchdog daemon.

(this is what I described as "quorum+watchdog", again sorry for the
confusion :))

> It is described more precisely in the README. As proven by my test
> system, this mostly works. :)
> 
> What are the possible caveats? First of all, softdog is not good for this
> (only for testing), and the system will heavily depend on the reliability of
> the watchdog device.

+1

> SBD is not good as a watchdog daemon. In my version it does not check
> that corosync and the pacemaker processes are not frozen (for
> instance by kill -STOP). It looks like the check for corosync has already
> been done: https://github.com/ClusterLabs/sbd/pull/83

Good.

> I don't know about checking all of the pacemaker processes.

This is moving in the right direction, I would say:

  https://lists.clusterlabs.org/pipermail/users/2020-August/027602.html

The main Pacemaker process is now checked by sbd. Maybe other processes will be
included in future releases as "more in-depth health checks", as written in this
email.

Regards,


Re: [ClusterLabs] Tuchanka

2020-10-02 Thread Олег Самойлов



> On 29 Sep 2020, at 11:34, Jehan-Guillaume de Rorthais  wrote:
> 
> 
> Vagrant uses VirtualBox by default, which supports softdog, but it supports many
> other virtualization platforms, including e.g. libvirt/kvm, where you can use a
> virtualized watchdog card.
> 
>> 
> 
> Vagrant can use Chef, Ansible, Salt, puppet, and others to provision VM:
> 
>  https://www.vagrantup.com/docs/provisioning
> 
> 
> There are many, many vagrant images available:
> https://app.vagrantup.com/boxes/search
> There are many vagrant images... because building a vagrant image is easy. I built
> some when RH8 wasn't available yet. So if you need a special box, with e.g. some
> predefined setup, you can do it quite fast.

My English is poor, I'll try to find other words. My primary and main task was
to create a prototype for an automatic deploy system. So I used only the same
technique that will be used on the real hardware servers: RedHat DVD image +
kickstart. And to test such deployment too. That's why I do not use any special
image for virtual machines.


> Watchdog is kind of a self-fencing method. Cluster with quorum+watchdog, or
> SBD+watchdog or quorum+SBD+watchdog are fine...without "active" fencing.

quorum+watchdog or SBD+watchdog are useless. Quorum+SBD+watchdog is a solution,
but it also has some drawbacks, so this is not perfect or fine yet.
I'll write about it below.


> 
>>> Now, in regard to your multi-site clusters and how you deal with them using
>>> quorum, did you read the chapter about the Cluster Ticket Registry in
>>> the Pacemaker doc? See:
>>> 
>>> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/ch15.html
>>>   
>> 
>> Yep, I read the whole documentation two years ago. Yep, the ticket system
>> looked interesting at first glance, but I didn't see how to use it
>> with PAF. :)
> 
> It could be interesting to have detailed feedback about that. Could you share
> your experience?

Heh, I don't have experience of using the ticket system because I can't even
imagine how to use the ticket system with PAF.

As for pacemaker without STONITH, the idea was simple: quorum + SBD as a
watchdog daemon. It is described more precisely in the README.
As proven by my test system, this mostly works. :)

What are the possible caveats? First of all, softdog is not good for this (only
for testing), and the system will heavily depend on the reliability of the
watchdog device.
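
For reference, a watchdog-only ("diskless") sbd setup is typically configured
roughly like this (values purely illustrative):

    # /etc/sysconfig/sbd -- no SBD_DEVICE set, i.e. watchdog-only mode
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5
    SBD_PACEMAKER=yes
    SBD_STARTMODE=always

    # plus, on the Pacemaker side, stonith-watchdog-timeout must be set, e.g.:
    #   crm_attribute --type crm_config --name stonith-watchdog-timeout --update 10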
SBD is not good as a watchdog daemon. In my version it does not check that
corosync and the pacemaker processes are not frozen (for instance by
kill -STOP).
It looks like the check for corosync has already been done:
https://github.com/ClusterLabs/sbd/pull/83
I don't know about checking all of the pacemaker processes. Yep, these
problems may look artificial, but they must be fixed.
There are other problems because such a solution has not been heavily tested.
For instance, with the default sync_timeout for the quorum device, both nodes
end up being rebooted: the faulty one and the healthy one.
https://lists.clusterlabs.org/pipermail/users/2019-August/026145.html
I don't know if this has been fixed in the mainstream.
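
For reference, the qdevice timeouts in question live in corosync.conf, roughly
like this (host and values are only illustrative):

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            timeout: 20000
            sync_timeout: 30000
            net {
                host: 10.0.0.1
                algorithm: ffsplit
            }
        }
    }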




Re: [ClusterLabs] VirtualDomain stop operation traced - but nothing appears in /var/lib/heartbeat/trace_ra/

2020-10-02 Thread Lentes, Bernd


- Am 30. Sep 2020 um 19:24 schrieb Vladislav Bogdanov bub...@hoster-ok.com:

> Hi

> Try to enable trace_ra for start op.

I'm now tracing both start and stop, and that works fine.
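
For reference, a traced operation in the CIB typically looks roughly like this
(ids and timeout below are purely illustrative):

    <op id="vm-example-stop-0" name="stop" interval="0" timeout="180">
      <instance_attributes id="vm-example-stop-0-instance_attributes">
        <nvpair id="vm-example-stop-0-trace_ra" name="trace_ra" value="1"/>
      </instance_attributes>
    </op>

With trace_ra=1 on the operation, the agent's trace output should end up under
/var/lib/heartbeat/trace_ra/ (in a per-agent subdirectory, if I recall correctly).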



Thanks for any hint.

Bernd
Helmholtz Zentrum München


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/