Re: [ClusterLabs] [rgmanager] Recovering a failed (but running) server in rgmanager
On 19/09/16 03:13 PM, Digimer wrote: > On 19/09/16 03:07 PM, Digimer wrote: >> On 19/09/16 02:39 PM, Digimer wrote: >>> On 19/09/16 02:30 PM, Jan Pokorný wrote: On 18/09/16 15:37 -0400, Digimer wrote: > If, for example, a server's definition file is corrupted while the > server is running, rgmanager will put the server into a 'failed' state. > That's fine and fair. Please, be more precise. Is it "vm" resource agent that you are talking about, hence server is the particular virtual machine to be managed? Is the agent in the role of a service (defined at a top-level) or a standard resource (without special treatment, possibly with dependent services further in the group)? >>> >>> In 'clustat', vm:foo reports 'failed' after the vm.sh calls a status and >>> gets a bad return (because the foo.xml file was corrupted by creating a >>> typo that breaks the XML, as an example). >>> >>> I'm not sure if that answers your question, sorry. >>> > The problem is that, once the file is fixed, there appears to be no > way to go failed -> started without disabling (and thus powering off) > the VM. This is troublesom because it forces an interruption when the > service could have been placed under resource management without a reboot. > > For example, doing 'clusvcadm -e ' when the service was > 'disabled' (say because of a manual boot of the server), rgmanager > detects that the server is running fine and simply marks the server as > 'started'. Is there no way to do something similar to go 'failed' -> > 'started' without the 'disable' step? In case it's a VM as a service, this could possibly be "exploited" (never tested that, though): # MANWIDTH=72 man rgmanager | col -b \ | sed -n '/^VIRTUAL MACHINE/{:a;p;n;/^\s*$/d;ba}' > VIRTUAL MACHINE FEATURES >Apart from what is noted in the VM resource agent, rgman- >ager provides a few convenience features when dealing >with virtual machines. > * it will use live migration when transferring a virtual > machine to a more-preferred host in the cluster as a > consequence of failover domain operation > * it will search the other instances of rgmanager in the > cluster in the case that a user accidentally moves a > virtual machine using other management tools > * unlike services, adding a virtual machine to rgman- > ager’s configuration will not cause the virtual machine > to be restarted > * removing a virtual machine from rgmanager’s > configuration will leave the virtual machine running. (see the last two items). >>> >>> So a possible "recover" would be to remove the VM from rgmanager, then >>> add it back? I can see that working, but it seems heavy handed. :) >>> > I tried freezing the service, no luck. I also tried coalescing via > '-c', but that didn't help either. Any path from "failed" in the resource (group) life-cycle goes either through "disabled" or "stopped" if I am not mistaken, so would rather experiment with adding a new service and dropping the old one per the above description as a possible workaround (perhaps in the reverse order so as to retain the same name for the service, indeed unless rgmanager would actively prevent that anyway -- no idea). >>> >>> This is my understanding as well, yes (that failed must go through >>> 'disabled' or 'stopped'). >>> >>> I'll try the remove/re-add option and report back. >> >> OK, didn't work. >> >> I corrupted the XML definition to cause rgmanager to report it as >> 'failed', removed it from rgmanager (clustat no longer reported it at >> all), re-added it and when it came back, it was still listed as 'failed'. 
>
> Ha!
>
> So, it was still flagged as 'failed', so I called '-d' to disable it
> (after adding it back to rgmanager) and it went 'disabled' WITHOUT
> stopping the server. When I called '-e' on node 2 (the server was on
> node 1), it started on node 1 properly and returned to a 'started' state
> without restarting.
>
> I wonder if I could call disable directly from the other node...

So yes, I can. If I call '-d' on a node that ISN'T the host, it flags the
server as stopped without actually shutting it down. Then I can call '-e'
and bring it back up fine.

This feels like I am exploiting a bug, though... I wonder if there is a
more "proper" way to recover the server?

-- 
Digimer
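A minimal sketch of that recovery sequence, reusing the vm:foo service name from this thread; node1 and node2 are hypothetical names for the current host and the other cluster member:

    # Run on node2, i.e. NOT on the node currently hosting the VM; this
    # marks vm:foo 'disabled' in rgmanager without shutting the guest down.
    clusvcadm -d vm:foo

    # Confirm the state change; the guest itself keeps running.
    clustat

    # Re-enable the service on its current host so rgmanager adopts the
    # running guest and reports it as 'started' again.
    clusvcadm -e vm:foo -m node1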
Re: [ClusterLabs] [rgmanager] Recovering a failed (but running) server in rgmanager
On 19/09/16 03:07 PM, Digimer wrote: > On 19/09/16 02:39 PM, Digimer wrote: >> On 19/09/16 02:30 PM, Jan Pokorný wrote: >>> On 18/09/16 15:37 -0400, Digimer wrote: If, for example, a server's definition file is corrupted while the server is running, rgmanager will put the server into a 'failed' state. That's fine and fair. >>> >>> Please, be more precise. Is it "vm" resource agent that you are talking >>> about, hence server is the particular virtual machine to be managed? >>> Is the agent in the role of a service (defined at a top-level) or >>> a standard resource (without special treatment, possibly with >>> dependent services further in the group)? >> >> In 'clustat', vm:foo reports 'failed' after the vm.sh calls a status and >> gets a bad return (because the foo.xml file was corrupted by creating a >> typo that breaks the XML, as an example). >> >> I'm not sure if that answers your question, sorry. >> The problem is that, once the file is fixed, there appears to be no way to go failed -> started without disabling (and thus powering off) the VM. This is troublesom because it forces an interruption when the service could have been placed under resource management without a reboot. For example, doing 'clusvcadm -e ' when the service was 'disabled' (say because of a manual boot of the server), rgmanager detects that the server is running fine and simply marks the server as 'started'. Is there no way to do something similar to go 'failed' -> 'started' without the 'disable' step? >>> >>> In case it's a VM as a service, this could possibly be "exploited" >>> (never tested that, though): >>> >>> # MANWIDTH=72 man rgmanager | col -b \ >>> | sed -n '/^VIRTUAL MACHINE/{:a;p;n;/^\s*$/d;ba}' VIRTUAL MACHINE FEATURES Apart from what is noted in the VM resource agent, rgman- ager provides a few convenience features when dealing with virtual machines. * it will use live migration when transferring a virtual machine to a more-preferred host in the cluster as a consequence of failover domain operation * it will search the other instances of rgmanager in the cluster in the case that a user accidentally moves a virtual machine using other management tools * unlike services, adding a virtual machine to rgman- ager’s configuration will not cause the virtual machine to be restarted * removing a virtual machine from rgmanager’s configuration will leave the virtual machine running. >>> >>> (see the last two items). >> >> So a possible "recover" would be to remove the VM from rgmanager, then >> add it back? I can see that working, but it seems heavy handed. :) >> I tried freezing the service, no luck. I also tried coalescing via '-c', but that didn't help either. >>> >>> Any path from "failed" in the resource (group) life-cycle goes either >>> through "disabled" or "stopped" if I am not mistaken, so would rather >>> experiment with adding a new service and dropping the old one per >>> the above description as a possible workaround (perhaps in the reverse >>> order so as to retain the same name for the service, indeed unless >>> rgmanager would actively prevent that anyway -- no idea). >> >> This is my understanding as well, yes (that failed must go through >> 'disabled' or 'stopped'). >> >> I'll try the remove/re-add option and report back. > > OK, didn't work. > > I corrupted the XML definition to cause rgmanager to report it as > 'failed', removed it from rgmanager (clustat no longer reported it at > all), re-added it and when it came back, it was still listed as 'failed'. Ha! 
So, it was still flagged as 'failed', so I called '-d' to disable it
(after adding it back to rgmanager) and it went 'disabled' WITHOUT
stopping the server. When I called '-e' on node 2 (the server was on
node 1), it started on node 1 properly and returned to a 'started' state
without restarting.

I wonder if I could call disable directly from the other node...

-- 
Digimer
Re: [ClusterLabs] [rgmanager] Recovering a failed (but running) server in rgmanager
On 19/09/16 02:39 PM, Digimer wrote: > On 19/09/16 02:30 PM, Jan Pokorný wrote: >> On 18/09/16 15:37 -0400, Digimer wrote: >>> If, for example, a server's definition file is corrupted while the >>> server is running, rgmanager will put the server into a 'failed' state. >>> That's fine and fair. >> >> Please, be more precise. Is it "vm" resource agent that you are talking >> about, hence server is the particular virtual machine to be managed? >> Is the agent in the role of a service (defined at a top-level) or >> a standard resource (without special treatment, possibly with >> dependent services further in the group)? > > In 'clustat', vm:foo reports 'failed' after the vm.sh calls a status and > gets a bad return (because the foo.xml file was corrupted by creating a > typo that breaks the XML, as an example). > > I'm not sure if that answers your question, sorry. > >>> The problem is that, once the file is fixed, there appears to be no >>> way to go failed -> started without disabling (and thus powering off) >>> the VM. This is troublesom because it forces an interruption when the >>> service could have been placed under resource management without a reboot. >>> >>> For example, doing 'clusvcadm -e ' when the service was >>> 'disabled' (say because of a manual boot of the server), rgmanager >>> detects that the server is running fine and simply marks the server as >>> 'started'. Is there no way to do something similar to go 'failed' -> >>> 'started' without the 'disable' step? >> >> In case it's a VM as a service, this could possibly be "exploited" >> (never tested that, though): >> >> # MANWIDTH=72 man rgmanager | col -b \ >> | sed -n '/^VIRTUAL MACHINE/{:a;p;n;/^\s*$/d;ba}' >>> VIRTUAL MACHINE FEATURES >>>Apart from what is noted in the VM resource agent, rgman- >>>ager provides a few convenience features when dealing >>>with virtual machines. >>> * it will use live migration when transferring a virtual >>> machine to a more-preferred host in the cluster as a >>> consequence of failover domain operation >>> * it will search the other instances of rgmanager in the >>> cluster in the case that a user accidentally moves a >>> virtual machine using other management tools >>> * unlike services, adding a virtual machine to rgman- >>> ager’s configuration will not cause the virtual machine >>> to be restarted >>> * removing a virtual machine from rgmanager’s >>> configuration will leave the virtual machine running. >> >> (see the last two items). > > So a possible "recover" would be to remove the VM from rgmanager, then > add it back? I can see that working, but it seems heavy handed. :) > >>> I tried freezing the service, no luck. I also tried coalescing via >>> '-c', but that didn't help either. >> >> Any path from "failed" in the resource (group) life-cycle goes either >> through "disabled" or "stopped" if I am not mistaken, so would rather >> experiment with adding a new service and dropping the old one per >> the above description as a possible workaround (perhaps in the reverse >> order so as to retain the same name for the service, indeed unless >> rgmanager would actively prevent that anyway -- no idea). > > This is my understanding as well, yes (that failed must go through > 'disabled' or 'stopped'). > > I'll try the remove/re-add option and report back. OK, didn't work. I corrupted the XML definition to cause rgmanager to report it as 'failed', removed it from rgmanager (clustat no longer reported it at all), re-added it and when it came back, it was still listed as 'failed'. 
Re: [ClusterLabs] [rgmanager] Recovering a failed (but running) server in rgmanager
On 18/09/16 15:37 -0400, Digimer wrote:
> If, for example, a server's definition file is corrupted while the
> server is running, rgmanager will put the server into a 'failed' state.
> That's fine and fair.

Please, be more precise. Is it the "vm" resource agent that you are talking
about, hence the server is the particular virtual machine to be managed?
Is the agent in the role of a service (defined at the top level) or
a standard resource (without special treatment, possibly with
dependent services further in the group)?

> The problem is that, once the file is fixed, there appears to be no
> way to go failed -> started without disabling (and thus powering off)
> the VM. This is troublesome because it forces an interruption when the
> service could have been placed under resource management without a reboot.
>
> For example, doing 'clusvcadm -e <service>' when the service was
> 'disabled' (say because of a manual boot of the server), rgmanager
> detects that the server is running fine and simply marks the server as
> 'started'. Is there no way to do something similar to go 'failed' ->
> 'started' without the 'disable' step?

In case it's a VM as a service, this could possibly be "exploited"
(never tested that, though):

# MANWIDTH=72 man rgmanager | col -b \
    | sed -n '/^VIRTUAL MACHINE/{:a;p;n;/^\s*$/d;ba}'
> VIRTUAL MACHINE FEATURES
>    Apart from what is noted in the VM resource agent, rgman-
>    ager provides a few convenience features when dealing
>    with virtual machines.
>    * it will use live migration when transferring a virtual
>      machine to a more-preferred host in the cluster as a
>      consequence of failover domain operation
>    * it will search the other instances of rgmanager in the
>      cluster in the case that a user accidentally moves a
>      virtual machine using other management tools
>    * unlike services, adding a virtual machine to rgman-
>      ager’s configuration will not cause the virtual machine
>      to be restarted
>    * removing a virtual machine from rgmanager’s
>      configuration will leave the virtual machine running.

(see the last two items).

> I tried freezing the service, no luck. I also tried coalescing via
> '-c', but that didn't help either.

Any path from "failed" in the resource (group) life-cycle goes either
through "disabled" or "stopped" if I am not mistaken, so I would rather
experiment with adding a new service and dropping the old one per
the above description as a possible workaround (perhaps in the reverse
order so as to retain the same name for the service, indeed unless
rgmanager would actively prevent that anyway -- no idea).

-- 
Jan (Poki)
Re: [ClusterLabs] [rgmanager] Recovering a failed (but running) server in rgmanager
On 19/09/16 02:30 PM, Jan Pokorný wrote: > On 18/09/16 15:37 -0400, Digimer wrote: >> If, for example, a server's definition file is corrupted while the >> server is running, rgmanager will put the server into a 'failed' state. >> That's fine and fair. > > Please, be more precise. Is it "vm" resource agent that you are talking > about, hence server is the particular virtual machine to be managed? > Is the agent in the role of a service (defined at a top-level) or > a standard resource (without special treatment, possibly with > dependent services further in the group)? In 'clustat', vm:foo reports 'failed' after the vm.sh calls a status and gets a bad return (because the foo.xml file was corrupted by creating a typo that breaks the XML, as an example). I'm not sure if that answers your question, sorry. >> The problem is that, once the file is fixed, there appears to be no >> way to go failed -> started without disabling (and thus powering off) >> the VM. This is troublesom because it forces an interruption when the >> service could have been placed under resource management without a reboot. >> >> For example, doing 'clusvcadm -e ' when the service was >> 'disabled' (say because of a manual boot of the server), rgmanager >> detects that the server is running fine and simply marks the server as >> 'started'. Is there no way to do something similar to go 'failed' -> >> 'started' without the 'disable' step? > > In case it's a VM as a service, this could possibly be "exploited" > (never tested that, though): > > # MANWIDTH=72 man rgmanager | col -b \ > | sed -n '/^VIRTUAL MACHINE/{:a;p;n;/^\s*$/d;ba}' >> VIRTUAL MACHINE FEATURES >>Apart from what is noted in the VM resource agent, rgman- >>ager provides a few convenience features when dealing >>with virtual machines. >> * it will use live migration when transferring a virtual >> machine to a more-preferred host in the cluster as a >> consequence of failover domain operation >> * it will search the other instances of rgmanager in the >> cluster in the case that a user accidentally moves a >> virtual machine using other management tools >> * unlike services, adding a virtual machine to rgman- >> ager’s configuration will not cause the virtual machine >> to be restarted >> * removing a virtual machine from rgmanager’s >> configuration will leave the virtual machine running. > > (see the last two items). So a possible "recover" would be to remove the VM from rgmanager, then add it back? I can see that working, but it seems heavy handed. :) >> I tried freezing the service, no luck. I also tried coalescing via >> '-c', but that didn't help either. > > Any path from "failed" in the resource (group) life-cycle goes either > through "disabled" or "stopped" if I am not mistaken, so would rather > experiment with adding a new service and dropping the old one per > the above description as a possible workaround (perhaps in the reverse > order so as to retain the same name for the service, indeed unless > rgmanager would actively prevent that anyway -- no idea). This is my understanding as well, yes (that failed must go through 'disabled' or 'stopped'). I'll try the remove/re-add option and report back. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? 
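One untested way to do the remove/re-add step discussed above (an assumption based on the cman/ccs tooling that usually accompanies rgmanager, not something verified in this thread) is to edit the cluster configuration directly and push the new version:

    # Remove (or later re-add) the VM element for the affected guest and
    # bump config_version in /etc/cluster/cluster.conf, then propagate it.
    vi /etc/cluster/cluster.conf
    ccs_config_validate      # sanity-check the edited configuration
    cman_tool version -r     # push the new config_version to all nodes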
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
On 09/19/2016 10:04 AM, Jan Pokorný wrote: > On 19/09/16 10:18 +, Auer, Jens wrote: >> Ok, after reading the log files again I found >> >> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Initiating action 3: stop >> mda-ip_stop_0 on MDA1PFP-PCS01 (local) >> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: >> MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface >> [bond0] No such device.\n ] >> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface >> [bond0] No such device. >> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed >> Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]: notice: mda-ip_stop_0:8745:stderr [ >> ocf-exit-reason:Unknown interface [bond0] No such device. ] >> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Operation mda-ip_stop_0: ok >> (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true) >> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: Transition 3 (Complete=2, >> Pending=0, Fired=0, Skipped=0, Incomplete=0, >> Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete >> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition >> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL >> origin=notify_crmd ] >> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_IDLE -> >> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL >> origin=abort_transition_graph ] >> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: notice: On loss of CCM Quorum: >> Ignore >> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op >> monitor for mda-ip on MDA1PFP-PCS01: not configured (6) >> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: error: Preventing mda-ip from >> re-starting anywhere: operation monitor failed 'not configured' (6) >> >> I think that explains why the resource is not started on the other >> node, but I am not sure this is a good decision. It seems to be a >> little harsh to prevent the resource from starting anywhere, >> especially considering that the other node will be able to start the >> resource. The resource agent is supposed to return "not configured" only when the *pacemaker* configuration of the resource is inherently invalid, so there's no chance of it starting anywhere. As Jan suggested, make sure you've applied any resource-agents updates. If that doesn't fix it, it sounds like a bug in the agent, or something really is wrong with your pacemaker resource config. > > The problem to start with is that based on > >> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface >> [bond0] No such device. >> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed > > you may be using too ancient version resource-agents: > > https://github.com/ClusterLabs/resource-agents/pull/320 > > so until you update, the troubleshooting would be quite moot. ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
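A hedged sketch of those two suggestions on an RPM-based node like the ones in this thread (package and resource names as used above):

    # Check the installed resource-agents build and update it if it
    # predates the findif fix referenced in the quoted reply
    # (https://github.com/ClusterLabs/resource-agents/pull/320).
    rpm -q resource-agents
    yum update resource-agents

    # Once the agent (or the resource configuration) is fixed, clear the
    # recorded 'not configured' failure so the policy engine will place
    # mda-ip again.
    pcs resource cleanup mda-ip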
Re: [ClusterLabs] No DRBD resource promoted to master in Active/Passive setup
On 09/19/2016 09:48 AM, Auer, Jens wrote: > Hi, > >> Is the network interface being taken down here used for corosync >> communication? If so, that is a node-level failure, and pacemaker will >> fence. > > We have different connections on each server: > - A bonded 10GB network card for data traffic that will be accessed via a > virtual ip managed by pacemaker in 192.168.120.1/24. In the cluster nodes > MDA1PFP-S01 and MDA1PFP-S02 are assigned to 192.168.120.10 and 192.168.120.11. > > - A dedicated back-to-back connection for corosync heartbeats in > 192.168.121.1/24. MDA1PFP-PCS01 and MDA1PFP-S02 are assigned to > 192.168.121.10 and 192.168.121.11. When the cluster is created, we use these > as primary node names and use the 10GB device as a second backup connection > for increased reliability: pcs cluster setup --name MDA1PFP > MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02 > > - A dedicated back-to-back connection for drbd in 192.168.122.1/24. Hosts > MDA1PFP-DRBD01 and MDA1PFP-DRBD02 are assigned 192.168.23.10 and > 192.168.123.11. Ah, nice. > Given that I think it is not a node-level failure. pcs status also reports > the nodes as online. I think this should not trigger fencing from pacemaker. > >> When DRBD is configured with 'fencing resource-only' and 'fence-peer >> "/usr/lib/drbd/crm-fence-peer.sh";', and DRBD detects a network outage, >> it will try to add a constraint that prevents the other node from >> becoming master. It removes the constraint when connectivity is restored. > >> I am not familiar with all the under-the-hood details, but IIUC, if >> pacemaker actually fences the node, then the other node can still take >> over the DRBD. But if there is a network outage and no pacemaker >> fencing, then you'll see the behavior you describe -- DRBD prevents >> master takeover, to avoid stale data being used. > > This is my understanding as well, but there should be no network outage for > DRBD. I can reproduce the behavior by stopping cluster nodes which DRBD seems > to interpret as network outages since it cannot communicate with the shutdown > node anymore. Maybe I should ask on the DRBD mailing list? OK, I think I follow you now: you're ifdown'ing the data traffic interface, but the interfaces for both corosync and DRBD traffic are still up. So, pacemaker detects the virtual IP failure on the traffic interface, and correctly recovers the IP on the other node, but the DRBD master role is not recovered. If the behavior goes away when you remove the DRBD fencing config, then it sounds like DRBD is seeing it as a network outage, and is adding the constraint to prevent a stale master. Yes, I think that would be worth bringing up on the DRBD list, though there might be some DRBD users here who can chime in, too. > Cheers, > Jens > -- > Jens Auer | CGI | Software-Engineer > CGI (Germany) GmbH & Co. KG > Rheinstraße 95 | 64295 Darmstadt | Germany > T: +49 6151 36860 154 > jens.a...@cgi.com > Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie unter > de.cgi.com/pflichtangaben. > > CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI > Group Inc. and its affiliates may be contained in this message. If you are > not a recipient indicated or intended in this message (or responsible for > delivery of this message to such person), or you think for any reason that > this message may have been addressed to you in error, you may not use or copy > or deliver this message to anyone else. 
In such case, you should destroy this > message and are asked to notify the sender by reply e-mail. > > > Von: Ken Gaillot [kgail...@redhat.com] > Gesendet: Montag, 19. September 2016 16:28 > An: Auer, Jens; Cluster Labs - All topics related to open-source clustering > welcomed > Betreff: Re: [ClusterLabs] No DRBD resource promoted to master in > Active/Passive setup > > On 09/19/2016 02:31 AM, Auer, Jens wrote: >> Hi, >> >> I am not sure that pacemaker should do any fencing here. In my setting, >> corosync is configured to use a back-to-back connection for heartbeats. This >> is different subnet then used by the ping resource that checks the network >> connectivity and detects a failure. In my test, I bring down the network >> device used by ping and this triggers the failover. The node status is known >> by pacemaker since it receives heartbeats and it only a resource failure. I >> asked for fencing conditions a few days ago, and basically was asserted that >> resource failure should not trigger STONITH actions if not explicitly >> configured. > > Is the network interface being taken down here used for corosync > communication? If so, that is a node-level failure, and pacemaker will > fence. > > There is a bit of a distinction between DRBD fencing and pacemaker > fencing. The DRBD configuration is designed so that DRBD's fencing > method is to go through pacemaker. > > When DRBD is configu
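For reference, the DRBD-side configuration Ken describes normally lives in the resource definition. A sketch for a DRBD 8.4-style setup, using the shared_fs resource name from this thread; the after-resync-target handler is the usual companion to crm-fence-peer.sh and is an assumption here:

    resource shared_fs {
      disk {
        # on replication-link loss, run the fence-peer handler
        # (resource-level fencing only, no node-level fencing from DRBD)
        fencing resource-only;
      }
      handlers {
        # adds a location constraint in Pacemaker that blocks promotion
        # of the DRBD master role on the peer
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        # removes that constraint again once the peer has fully resynced
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }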
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
On 19/09/16 10:18 +, Auer, Jens wrote: > Ok, after reading the log files again I found > > Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Initiating action 3: stop > mda-ip_stop_0 on MDA1PFP-PCS01 (local) > Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: > MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface > [bond0] No such device.\n ] > Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface > [bond0] No such device. > Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed > Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]: notice: mda-ip_stop_0:8745:stderr [ > ocf-exit-reason:Unknown interface [bond0] No such device. ] > Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Operation mda-ip_stop_0: ok > (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true) > Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: Transition 3 (Complete=2, > Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete > Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition > S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL > origin=notify_crmd ] > Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_IDLE -> > S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL > origin=abort_transition_graph ] > Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: notice: On loss of CCM Quorum: > Ignore > Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op > monitor for mda-ip on MDA1PFP-PCS01: not configured (6) > Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: error: Preventing mda-ip from > re-starting anywhere: operation monitor failed 'not configured' (6) > > I think that explains why the resource is not started on the other > node, but I am not sure this is a good decision. It seems to be a > little harsh to prevent the resource from starting anywhere, > especially considering that the other node will be able to start the > resource. The problem to start with is that based on > Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface > [bond0] No such device. > Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed you may be using too ancient version resource-agents: https://github.com/ClusterLabs/resource-agents/pull/320 so until you update, the troubleshooting would be quite moot. -- Jan (Poki) pgpSfUQcIcCaO.pgp Description: PGP signature ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] No DRBD resource promoted to master in Active/Passive setup
Hi, > Is the network interface being taken down here used for corosync > communication? If so, that is a node-level failure, and pacemaker will > fence. We have different connections on each server: - A bonded 10GB network card for data traffic that will be accessed via a virtual ip managed by pacemaker in 192.168.120.1/24. In the cluster nodes MDA1PFP-S01 and MDA1PFP-S02 are assigned to 192.168.120.10 and 192.168.120.11. - A dedicated back-to-back connection for corosync heartbeats in 192.168.121.1/24. MDA1PFP-PCS01 and MDA1PFP-S02 are assigned to 192.168.121.10 and 192.168.121.11. When the cluster is created, we use these as primary node names and use the 10GB device as a second backup connection for increased reliability: pcs cluster setup --name MDA1PFP MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02 - A dedicated back-to-back connection for drbd in 192.168.122.1/24. Hosts MDA1PFP-DRBD01 and MDA1PFP-DRBD02 are assigned 192.168.23.10 and 192.168.123.11. Given that I think it is not a node-level failure. pcs status also reports the nodes as online. I think this should not trigger fencing from pacemaker. > When DRBD is configured with 'fencing resource-only' and 'fence-peer > "/usr/lib/drbd/crm-fence-peer.sh";', and DRBD detects a network outage, > it will try to add a constraint that prevents the other node from > becoming master. It removes the constraint when connectivity is restored. > I am not familiar with all the under-the-hood details, but IIUC, if > pacemaker actually fences the node, then the other node can still take > over the DRBD. But if there is a network outage and no pacemaker > fencing, then you'll see the behavior you describe -- DRBD prevents > master takeover, to avoid stale data being used. This is my understanding as well, but there should be no network outage for DRBD. I can reproduce the behavior by stopping cluster nodes which DRBD seems to interpret as network outages since it cannot communicate with the shutdown node anymore. Maybe I should ask on the DRBD mailing list? Cheers, Jens -- Jens Auer | CGI | Software-Engineer CGI (Germany) GmbH & Co. KG Rheinstraße 95 | 64295 Darmstadt | Germany T: +49 6151 36860 154 jens.a...@cgi.com Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie unter de.cgi.com/pflichtangaben. CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI Group Inc. and its affiliates may be contained in this message. If you are not a recipient indicated or intended in this message (or responsible for delivery of this message to such person), or you think for any reason that this message may have been addressed to you in error, you may not use or copy or deliver this message to anyone else. In such case, you should destroy this message and are asked to notify the sender by reply e-mail. Von: Ken Gaillot [kgail...@redhat.com] Gesendet: Montag, 19. September 2016 16:28 An: Auer, Jens; Cluster Labs - All topics related to open-source clustering welcomed Betreff: Re: [ClusterLabs] No DRBD resource promoted to master in Active/Passive setup On 09/19/2016 02:31 AM, Auer, Jens wrote: > Hi, > > I am not sure that pacemaker should do any fencing here. In my setting, > corosync is configured to use a back-to-back connection for heartbeats. This > is different subnet then used by the ping resource that checks the network > connectivity and detects a failure. In my test, I bring down the network > device used by ping and this triggers the failover. 
The node status is known > by pacemaker since it receives heartbeats and it only a resource failure. I > asked for fencing conditions a few days ago, and basically was asserted that > resource failure should not trigger STONITH actions if not explicitly > configured. Is the network interface being taken down here used for corosync communication? If so, that is a node-level failure, and pacemaker will fence. There is a bit of a distinction between DRBD fencing and pacemaker fencing. The DRBD configuration is designed so that DRBD's fencing method is to go through pacemaker. When DRBD is configured with 'fencing resource-only' and 'fence-peer "/usr/lib/drbd/crm-fence-peer.sh";', and DRBD detects a network outage, it will try to add a constraint that prevents the other node from becoming master. It removes the constraint when connectivity is restored. I am not familiar with all the under-the-hood details, but IIUC, if pacemaker actually fences the node, then the other node can still take over the DRBD. But if there is a network outage and no pacemaker fencing, then you'll see the behavior you describe -- DRBD prevents master takeover, to avoid stale data being used. > I am also wondering why this is "sticky". After a failover test the DRBD > resources are not working even if I restart the cluster on all nodes. > > Best wishes, > Jens > > -- > Dr. Jens Auer | CGI | Software Engineer > CGI Deutschland L
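To check whether DRBD itself still considers the replication link healthy while the traffic interface is down, its state can be queried directly on each node. A sketch using the shared_fs resource name from this thread and standard drbdadm subcommands:

    # Should stay 'Connected' if only the traffic NIC is down, since
    # replication runs over the dedicated back-to-back link.
    drbdadm cstate shared_fs

    # Primary/Secondary role and disk state on this node.
    drbdadm role shared_fs
    drbdadm dstate shared_fs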
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
Hi, >> After the restart ifconfig still shows the device bond0 to be not RUNNING: >> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig >> bond0: flags=5123 mtu 1500 >> inet 192.168.120.20 netmask 255.255.255.255 broadcast 0.0.0.0 >> ether a6:17:2c:2a:72:fc txqueuelen 3 (Ethernet) >> RX packets 2034 bytes 286728 (280.0 KiB) >> RX errors 0 dropped 29 overruns 0 frame 0 >> TX packets 2284 bytes 355975 (347.6 KiB) >> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 There seems to be some difference because the device is not RUNNING; mdaf-pf-pep-spare 14:17:53 999 0 ~ # ifconfig bond0: flags=5187 mtu 1500 inet 192.168.120.10 netmask 255.255.255.0 broadcast 192.168.120.255 inet6 fe80::5eb9:1ff:fe9c:e7fc prefixlen 64 scopeid 0x20 ether 5c:b9:01:9c:e7:fc txqueuelen 3 (Ethernet) RX packets 15455692 bytes 22377220306 (20.8 GiB) RX errors 0 dropped 2392 overruns 0 frame 0 TX packets 14706747 bytes 21361519159 (19.8 GiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 Also the netmask and the ip address are wrong. I have configured the device to 192.168.120.10 with netmask 192.168.120.10. How does IpAddr2 get the wrong configuration? I have no idea. >Anyway, you should rather be using "ip" command from iproute suite >than various if* tools that come short in some cases: >http://inai.de/2008/02/19 >This would also be consistent with IPaddr2 uses under the hood. We are using RedHat 7 and this uses either NetworkManager or the network scripts. We use the later and ifup/ifdown should be the correct way to use the network card. I also tried using ip link set dev bond0 up/down and it brings up the device with the correct ip address and network mask. Best wishes, Jens -- Jens Auer | CGI | Software-Engineer CGI (Germany) GmbH & Co. KG Rheinstraße 95 | 64295 Darmstadt | Germany T: +49 6151 36860 154 jens.a...@cgi.com Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie unter de.cgi.com/pflichtangaben. CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI Group Inc. and its affiliates may be contained in this message. If you are not a recipient indicated or intended in this message (or responsible for delivery of this message to such person), or you think for any reason that this message may have been addressed to you in error, you may not use or copy or deliver this message to anyone else. In such case, you should destroy this message and are asked to notify the sender by reply e-mail. Von: Jan Pokorný [jpoko...@redhat.com] Gesendet: Montag, 19. September 2016 14:57 An: Cluster Labs - All topics related to open-source clustering welcomed Betreff: Re: [ClusterLabs] Virtual ip resource restarted on node with down network device On 19/09/16 09:15 +, Auer, Jens wrote: > After the restart ifconfig still shows the device bond0 to be not RUNNING: > MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig > bond0: flags=5123 mtu 1500 > inet 192.168.120.20 netmask 255.255.255.255 broadcast 0.0.0.0 > ether a6:17:2c:2a:72:fc txqueuelen 3 (Ethernet) > RX packets 2034 bytes 286728 (280.0 KiB) > RX errors 0 dropped 29 overruns 0 frame 0 > TX packets 2284 bytes 355975 (347.6 KiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 This seems to suggest bond0 interface is up and address-assigned (well, the netmask is strange). So there would be nothing contradictory to what I said on the address of IPaddr2. 
Anyway, you should rather be using the "ip" command from the iproute suite
than various if* tools that come short in some cases:
http://inai.de/2008/02/19
This would also be consistent with what IPaddr2 uses under the hood.

-- 
Jan (Poki)
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
On Mon, Sep 19, 2016 at 02:57:57PM +0200, Jan Pokorný wrote:
> On 19/09/16 09:15 +, Auer, Jens wrote:
> > After the restart ifconfig still shows the device bond0 to be not RUNNING:
> > MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
> > bond0: flags=5123 mtu 1500
> >         inet 192.168.120.20 netmask 255.255.255.255 broadcast 0.0.0.0
> >         ether a6:17:2c:2a:72:fc txqueuelen 3 (Ethernet)
> >         RX packets 2034 bytes 286728 (280.0 KiB)
> >         RX errors 0 dropped 29 overruns 0 frame 0
> >         TX packets 2284 bytes 355975 (347.6 KiB)
> >         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
>
> This seems to suggest the bond0 interface is up and address-assigned
> (well, the netmask is strange). So there would be nothing
> contradictory to what I said on the address of IPaddr2.
>
> Anyway, you should rather be using the "ip" command from the iproute suite
> than various if* tools that come short in some cases:
> http://inai.de/2008/02/19
> This would also be consistent with what IPaddr2 uses under the hood.

The resource agent only controls and checks the presence of a certain IP
on a certain NIC (and some parameters).

What you likely ended up with after the "restart" is an "empty" bonding
device with that IP assigned, but without any "slave" devices, or at
least with the slave devices still set to link down.

If you really wanted the RA to also know about the slaves, and be able
to properly and fully configure a bonding, you'd have to enhance that
resource agent.

If you want the IP to move to some other node when it has connectivity
problems, use a "ping" and/or "ethmonitor" resource in addition to the IP.

If you wanted to test-drive cluster response against a failing network
device, your test was wrong. If you wanted to test-drive cluster response
against a "fat fingered" (or even evil) operator or admin: give up right
there... You'll never be able to cover it all :-)

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT
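A hedged sketch of the ethmonitor approach Lars mentions, reusing the bond0 and mda-ip names from this thread and assuming the agent's default node attribute name (ethmonitor-bond0):

    # Publish the link state of bond0 as a node attribute on every node.
    pcs resource create bond0-link ocf:heartbeat:ethmonitor interface=bond0 --clone

    # Keep the virtual IP away from any node whose bond0 link is down.
    pcs constraint location mda-ip rule score=-INFINITY ethmonitor-bond0 ne 1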
Re: [ClusterLabs] No DRBD resource promoted to master in Active/Passive setup
On 09/19/2016 02:31 AM, Auer, Jens wrote: > Hi, > > I am not sure that pacemaker should do any fencing here. In my setting, > corosync is configured to use a back-to-back connection for heartbeats. This > is different subnet then used by the ping resource that checks the network > connectivity and detects a failure. In my test, I bring down the network > device used by ping and this triggers the failover. The node status is known > by pacemaker since it receives heartbeats and it only a resource failure. I > asked for fencing conditions a few days ago, and basically was asserted that > resource failure should not trigger STONITH actions if not explicitly > configured. Is the network interface being taken down here used for corosync communication? If so, that is a node-level failure, and pacemaker will fence. There is a bit of a distinction between DRBD fencing and pacemaker fencing. The DRBD configuration is designed so that DRBD's fencing method is to go through pacemaker. When DRBD is configured with 'fencing resource-only' and 'fence-peer "/usr/lib/drbd/crm-fence-peer.sh";', and DRBD detects a network outage, it will try to add a constraint that prevents the other node from becoming master. It removes the constraint when connectivity is restored. I am not familiar with all the under-the-hood details, but IIUC, if pacemaker actually fences the node, then the other node can still take over the DRBD. But if there is a network outage and no pacemaker fencing, then you'll see the behavior you describe -- DRBD prevents master takeover, to avoid stale data being used. > I am also wondering why this is "sticky". After a failover test the DRBD > resources are not working even if I restart the cluster on all nodes. > > Best wishes, > Jens > > -- > Dr. Jens Auer | CGI | Software Engineer > CGI Deutschland Ltd. & Co. KG > Rheinstraße 95 | 64295 Darmstadt | Germany > T: +49 6151 36860 154 > jens.a...@cgi.com > Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie unter > de.cgi.com/pflichtangaben. > > CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI > Group Inc. and its affiliates may be contained in this message. If you are > not a recipient indicated or intended in this message (or responsible for > delivery of this message to such person), or you think for any reason that > this message may have been addressed to you in error, you may not use or copy > or deliver this message to anyone else. In such case, you should destroy this > message and are asked to notify the sender by reply e-mail. 
> >> -Original Message- >> From: Ken Gaillot [mailto:kgail...@redhat.com] >> Sent: 16 September 2016 17:56 >> To: users@clusterlabs.org >> Subject: Re: [ClusterLabs] No DRBD resource promoted to master in >> Active/Passive >> setup >> >> On 09/16/2016 10:02 AM, Auer, Jens wrote: >>> Hi, >>> >>> I have an Active/Passive configuration with a drbd mast/slave resource: >>> >>> MDA1PFP-S01 14:40:27 1803 0 ~ # pcs status Cluster name: MDA1PFP >>> Last updated: Fri Sep 16 14:41:18 2016Last change: Fri Sep 16 >>> 14:39:49 2016 by root via cibadmin on MDA1PFP-PCS01 >>> Stack: corosync >>> Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition >>> with quorum >>> 2 nodes and 7 resources configured >>> >>> Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ] >>> >>> Full list of resources: >>> >>> Master/Slave Set: drbd1_sync [drbd1] >>> Masters: [ MDA1PFP-PCS02 ] >>> Slaves: [ MDA1PFP-PCS01 ] >>> mda-ip(ocf::heartbeat:IPaddr2):Started MDA1PFP-PCS02 >>> Clone Set: ping-clone [ping] >>> Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ] >>> ACTIVE(ocf::heartbeat:Dummy):Started MDA1PFP-PCS02 >>> shared_fs(ocf::heartbeat:Filesystem):Started MDA1PFP-PCS02 >>> >>> PCSD Status: >>> MDA1PFP-PCS01: Online >>> MDA1PFP-PCS02: Online >>> >>> Daemon Status: >>> corosync: active/disabled >>> pacemaker: active/disabled >>> pcsd: active/enabled >>> >>> MDA1PFP-S01 14:41:19 1804 0 ~ # pcs resource --full >>> Master: drbd1_sync >>> Meta Attrs: master-max=1 master-node-max=1 clone-max=2 >>> clone-node-max=1 notify=true >>> Resource: drbd1 (class=ocf provider=linbit type=drbd) >>>Attributes: drbd_resource=shared_fs >>>Operations: start interval=0s timeout=240 (drbd1-start-interval-0s) >>>promote interval=0s timeout=90 (drbd1-promote-interval-0s) >>>demote interval=0s timeout=90 (drbd1-demote-interval-0s) >>>stop interval=0s timeout=100 (drbd1-stop-interval-0s) >>>monitor interval=60s (drbd1-monitor-interval-60s) >>> Resource: mda-ip (class=ocf provider=heartbeat type=IPaddr2) >>> Attributes: ip=192.168.120.20 cidr_netmask=32 nic=bond0 >>> Operations: start interval=0s timeout=20s (mda-ip-start-interval-0s) >>> stop interval=0s timeout=20s (mda-ip-stop-interval-0s) >>> monitor interval=1s (mda-ip-monitor-interval-1
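Regarding the "sticky" behaviour reported above: crm-fence-peer.sh works by inserting a location constraint whose id conventionally begins with drbd-fence-by-handler, so a leftover constraint is easy to spot and, once both sides are Connected and UpToDate again, to remove by hand. A sketch; the exact constraint id depends on the resource names:

    # List constraints with their ids and look for the DRBD fencing one.
    pcs constraint --full | grep drbd-fence-by-handler

    # Only after verifying both nodes are Connected/UpToDate
    # (the id below is illustrative; copy the one actually shown).
    pcs constraint remove drbd-fence-by-handler-shared_fs-drbd1_sync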
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
On 19/09/16 09:15 +, Auer, Jens wrote:
> After the restart ifconfig still shows the device bond0 to be not RUNNING:
> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
> bond0: flags=5123 mtu 1500
>         inet 192.168.120.20 netmask 255.255.255.255 broadcast 0.0.0.0
>         ether a6:17:2c:2a:72:fc txqueuelen 3 (Ethernet)
>         RX packets 2034 bytes 286728 (280.0 KiB)
>         RX errors 0 dropped 29 overruns 0 frame 0
>         TX packets 2284 bytes 355975 (347.6 KiB)
>         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

This seems to suggest the bond0 interface is up and address-assigned
(well, the netmask is strange). So there would be nothing
contradictory to what I said on the address of IPaddr2.

Anyway, you should rather be using the "ip" command from the iproute suite
than various if* tools that come short in some cases:
http://inai.de/2008/02/19
This would also be consistent with what IPaddr2 uses under the hood.

-- 
Jan (Poki)
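A few equivalent checks with the ip tool (a sketch; same bond0 device as above):

    # Link-layer state of the bond (UP/DOWN, LOWER_UP, MASTER/SLAVE flags).
    ip link show dev bond0

    # Addresses currently assigned, including any secondary address added by IPaddr2.
    ip addr show dev bond0

    # Routes that would carry traffic for the 192.168.120.0/24 network.
    ip route show dev bond0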
Re: [ClusterLabs] where do I find the null fencing device?
Dan Swartzendruber writes:

> I wanted to do some experiments, and the null fencing agent seemed to be
> just what I wanted. I don't find it anywhere, even after installing
> fence-agents-all and cluster-glue (this is on CentOS 7, btw...)
> Thanks...

On SUSE distributions, it's packaged in a separate cluster-glue-devel
package. Not sure what the packages for CentOS look like.

Cheers,
Kristoffer

-- 
// Kristoffer Grönlund // kgronl...@suse.com
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
Ok, after reading the log files again I found Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Initiating action 3: stop mda-ip_stop_0 on MDA1PFP-PCS01 (local) Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface [bond0] No such device.\n ] Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface [bond0] No such device. Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]: notice: mda-ip_stop_0:8745:stderr [ ocf-exit-reason:Unknown interface [bond0] No such device. ] Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]: notice: Operation mda-ip_stop_0: ok (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true) Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: Transition 3 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: notice: On loss of CCM Quorum: Ignore Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op monitor for mda-ip on MDA1PFP-PCS01: not configured (6) Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: error: Preventing mda-ip from re-starting anywhere: operation monitor failed 'not configured' (6) I think that explains why the resource is not started on the other node, but I am not sure this is a good decision. It seems to be a little harsh to prevent the resource from starting anywhere, especially considering that the other node will be able to start the resource. Cheers, Jens -- Jens Auer | CGI | Software-Engineer CGI (Germany) GmbH & Co. KG Rheinstraße 95 | 64295 Darmstadt | Germany T: +49 6151 36860 154 jens.a...@cgi.com Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie unter de.cgi.com/pflichtangaben. CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI Group Inc. and its affiliates may be contained in this message. If you are not a recipient indicated or intended in this message (or responsible for delivery of this message to such person), or you think for any reason that this message may have been addressed to you in error, you may not use or copy or deliver this message to anyone else. In such case, you should destroy this message and are asked to notify the sender by reply e-mail. Von: Auer, Jens Gesendet: Montag, 19. September 2016 12:08 An: Cluster Labs - All topics related to open-source clustering welcomed Betreff: AW: [ClusterLabs] Virtual ip resource restarted on node with down network device Hi, > Would "rmmod " be a better hammer of choice? I am just testing what happens in case of hardware/network issues. Any hammer is good enough. Worst case would be that I unplug the machine, maybe with ILO. I have created a simple testing setup of a two-node cluter with a virtual ip and a ping resource which should move to the other node when I unload the drivers on the active node. The configuration is MDA1PFP-S02 10:02:53 1203 0 ~ # pcs cluster setup --name MDA1PFP MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02 Shutting down pacemaker/corosync services... 
Redirecting to /bin/systemctl stop pacemaker.service Redirecting to /bin/systemctl stop corosync.service Killing any remaining services... Removing all cluster configuration files... MDA1PFP-PCS01: Succeeded MDA1PFP-PCS02: Succeeded Synchronizing pcsd certificates on nodes MDA1PFP-PCS01, MDA1PFP-PCS02... MDA1PFP-PCS01: Success MDA1PFP-PCS02: Success Restaring pcsd on the nodes in order to reload the certificates... MDA1PFP-PCS01: Success MDA1PFP-PCS02: Success MDA1PFP-S02 10:03:02 1204 0 ~ # pcs cluster start --all MDA1PFP-PCS01: Starting Cluster... MDA1PFP-PCS02: Starting Cluster... MDA1PFP-S02 10:03:03 1205 0 ~ # sleep 5 rm -f mda; pcs cluster cib mda pcs -f mda property set no-quorum-policy=ignore pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s MDA1PFP-S02 10:03:08 1206 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS01 --name ServerRole --update PRIME MDA1PFP-S02 10:03:08 1207 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS02 --name ServerRole --update BACKUP MDA1PFP-S02 10:03:08 1208 0 ~ # pcs property set stonith-enabled=false MDA1PFP-S02 10:03:08 1209 0 ~ # rm -f mda; pcs cluster cib mda MDA1PFP-S02 10:03:08 1210 0 ~ # pcs -f mda property set no-quorum-policy=ignore MDA1PFP-S02 10:03:08 1211 0 ~ # MDA1PFP-S02 10:03:08 1211 0 ~ # pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=b
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
Hi, > Would "rmmod " be a better hammer of choice? I am just testing what happens in case of hardware/network issues. Any hammer is good enough. Worst case would be that I unplug the machine, maybe with ILO. I have created a simple testing setup of a two-node cluter with a virtual ip and a ping resource which should move to the other node when I unload the drivers on the active node. The configuration is MDA1PFP-S02 10:02:53 1203 0 ~ # pcs cluster setup --name MDA1PFP MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02 Shutting down pacemaker/corosync services... Redirecting to /bin/systemctl stop pacemaker.service Redirecting to /bin/systemctl stop corosync.service Killing any remaining services... Removing all cluster configuration files... MDA1PFP-PCS01: Succeeded MDA1PFP-PCS02: Succeeded Synchronizing pcsd certificates on nodes MDA1PFP-PCS01, MDA1PFP-PCS02... MDA1PFP-PCS01: Success MDA1PFP-PCS02: Success Restaring pcsd on the nodes in order to reload the certificates... MDA1PFP-PCS01: Success MDA1PFP-PCS02: Success MDA1PFP-S02 10:03:02 1204 0 ~ # pcs cluster start --all MDA1PFP-PCS01: Starting Cluster... MDA1PFP-PCS02: Starting Cluster... MDA1PFP-S02 10:03:03 1205 0 ~ # sleep 5 rm -f mda; pcs cluster cib mda pcs -f mda property set no-quorum-policy=ignore pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s MDA1PFP-S02 10:03:08 1206 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS01 --name ServerRole --update PRIME MDA1PFP-S02 10:03:08 1207 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS02 --name ServerRole --update BACKUP MDA1PFP-S02 10:03:08 1208 0 ~ # pcs property set stonith-enabled=false MDA1PFP-S02 10:03:08 1209 0 ~ # rm -f mda; pcs cluster cib mda MDA1PFP-S02 10:03:08 1210 0 ~ # pcs -f mda property set no-quorum-policy=ignore MDA1PFP-S02 10:03:08 1211 0 ~ # MDA1PFP-S02 10:03:08 1211 0 ~ # pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s MDA1PFP-S02 10:03:08 1212 0 ~ # pcs -f mda resource create ping ocf:pacemaker:ping dampen=5s multiplier=1000 host_list=pf-pep-dev-1 params timeout=1 attempts=3 op monitor interval=1 --clone MDA1PFP-S02 10:03:12 1213 0 ~ # pcs -f mda constraint location mda-ip rule score=-INFINITY pingd lt 1 or not_defined pingd MDA1PFP-S02 10:03:12 1214 0 ~ # pcs cluster cib-push mda CIB updated When I now unload the drivers on the active node the VIP resource is stopped but never started on the other node although it can ping. 
MDA1PFP-S01 10:02:49 2162 0 ~ # modprobe -r bonding; modprobe -r ixgbe MDA1PFP-S01 10:03:45 2163 0 ~ # pcs status Cluster name: MDA1PFP Last updated: Mon Sep 19 10:04:38 2016 Last change: Mon Sep 19 10:03:25 2016 by hacluster via crmd on MDA1PFP-PCS01 Stack: corosync Current DC: MDA1PFP-PCS01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum 2 nodes and 3 resources configured Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ] Full list of resources: mda-ip (ocf::heartbeat:IPaddr2): Stopped Clone Set: ping-clone [ping] Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ] Failed Actions: * mda-ip_monitor_1000 on MDA1PFP-PCS01 'not configured' (6): call=14, status=complete, exitreason='Unknown interface [bond0] No such device.', last-rc-change='Mon Sep 19 10:03:45 2016', queued=0ms, exec=0ms PCSD Status: MDA1PFP-PCS01: Online MDA1PFP-PCS02: Online Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled The log from the otehr node to which the resource should be migrated is: Sep 19 10:03:12 MDA1PFP-S02 pcsd: Starting pcsd: Sep 19 10:03:12 MDA1PFP-S02 systemd: Starting PCS GUI and remote configuration interface... Sep 19 10:03:12 MDA1PFP-S02 systemd: Started PCS GUI and remote configuration interface. Sep 19 10:03:15 MDA1PFP-S02 attrd[12444]: notice: Updating all attributes after cib_refresh_notify event Sep 19 10:03:15 MDA1PFP-S02 crmd[12446]: notice: Notifications disabled Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]: notice: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ] Sep 19 10:03:25 MDA1PFP-S02 crmd[12446]: notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ] Sep 19 10:03:25 MDA1PFP-S02 attrd[12444]: notice: Processing sync-response from MDA1PFP-PCS01 Sep 19 10:03:26 MDA1PFP-S02 crmd[12446]: notice: Operation ping_monitor_0: not running (node=MDA1PFP-PCS02, call=10, rc=7, cib-update=13, confirmed=true) Sep 19 10:03:26 MDA1PFP-S02 crmd[12446]: notice: Operation mda-ip_monitor_0: not running (node=MDA1PFP-PCS02, call=5, rc=7, cib-update=14, confirmed=true) Sep 19 10:03:28 MDA1PFP-S02 crmd[12446]: notice: Operation ping_start_0: ok (node=MDA1PFP-PCS02, c
Re: [ClusterLabs] Virtual ip resource restarted on node with down network device
Hi,

I just checked whether the VIP resource brings up the network device, and it turns out it doesn't. I created a simple cluster with one VIP resource:

MDA1PFP-S01 09:06:34 2115 0 ~ # pcs cluster setup --name MDA1PFP MDA1PFP-PCS01,MDA1PFP-S01 MDA1PFP-PCS02,MDA1PFP-S02
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop pacemaker.service
Redirecting to /bin/systemctl stop corosync.service
Killing any remaining services...
Removing all cluster configuration files...
MDA1PFP-PCS01: Succeeded
MDA1PFP-PCS02: Succeeded
Synchronizing pcsd certificates on nodes MDA1PFP-PCS01, MDA1PFP-PCS02...
MDA1PFP-PCS01: Success
MDA1PFP-PCS02: Success
Restaring pcsd on the nodes in order to reload the certificates...
MDA1PFP-PCS01: Success
MDA1PFP-PCS02: Success
MDA1PFP-S01 09:06:40 2116 0 ~ # pcs cluster start --all
MDA1PFP-PCS01: Starting Cluster...
MDA1PFP-PCS02: Starting Cluster...
MDA1PFP-S01 09:06:41 2117 0 ~ # sleep 5
rm -f mda; pcs cluster cib mda
pcs -f mda property set no-quorum-policy=ignore
pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s
pcs -f mda constraint location mda-ip prefers MDA1PFP-PCS01=50
MDA1PFP-S01 09:06:46 2118 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS01 --name ServerRole --update PRIME
MDA1PFP-S01 09:06:46 2119 0 ~ # crm_attribute --type nodes --node MDA1PFP-PCS02 --name ServerRole --update BACKUP
MDA1PFP-S01 09:06:46 2120 0 ~ # pcs property set stonith-enabled=false
MDA1PFP-S01 09:06:47 2121 0 ~ # rm -f mda; pcs cluster cib mda
MDA1PFP-S01 09:06:47 2122 0 ~ # pcs -f mda property set no-quorum-policy=ignore
MDA1PFP-S01 09:06:47 2123 0 ~ #
MDA1PFP-S01 09:06:47 2123 0 ~ # pcs -f mda resource create mda-ip ocf:heartbeat:IPaddr2 ip=192.168.120.20 cidr_netmask=32 nic=bond0 op monitor interval=1s
MDA1PFP-S01 09:06:47 2124 0 ~ # pcs -f mda constraint location mda-ip prefers MDA1PFP-PCS01=50
MDA1PFP-S01 09:06:47 2125 0 ~ # pcs cluster cib-push mda
CIB updated

Now I bring down the network device and wait for the failure and the restart:

MDA1PFP-S01 09:06:48 2126 0 ~ # ifdown bond0

Last updated: Mon Sep 19 09:10:29 2016    Last change: Mon Sep 19 09:07:03 2016 by hacluster via crmd on MDA1PFP-PCS01
Stack: corosync
Current DC: MDA1PFP-PCS01 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
2 nodes and 1 resource configured

Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]

 mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS01

Failed Actions:
* mda-ip_monitor_1000 on MDA1PFP-PCS01 'not running' (7): call=7, status=complete, exitreason='none',
    last-rc-change='Mon Sep 19 09:07:54 2016', queued=0ms, exec=0ms

After the restart, ifconfig still shows the device bond0 as not RUNNING:

MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
bond0: flags=5123 mtu 1500
        inet 192.168.120.20 netmask 255.255.255.255 broadcast 0.0.0.0
        ether a6:17:2c:2a:72:fc txqueuelen 3 (Ethernet)
        RX packets 2034 bytes 286728 (280.0 KiB)
        RX errors 0 dropped 29 overruns 0 frame 0
        TX packets 2284 bytes 355975 (347.6 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr0: flags=4099 mtu 1500
        inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
        ether 52:54:00:74:d9:39 txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Pinging another node via that device fails as expected:

MDA1PFP-S01 09:08:00 2128 0 ~ # ping pf-pep-dev-1
PING pf-pep-dev-1 (192.168.120.1) 56(84) bytes of data.
^C
--- pf-pep-dev-1 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 2999ms

The question is why the monitor operation detects the failure once after the device is brought down, but then restarts the resource and does not detect any further errors.

Best wishes,
  Jens

--
Jens Auer | CGI | Software Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.a...@cgi.com

From: Jan Pokorný [jpoko...@redhat.com]
Sent: Friday, 16 September 2016 23:13
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with d
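As far as I can tell, the IPaddr2 monitor only checks that the address is still configured on the interface, not whether the link is up, which would explain the behaviour above: ifdown removes the address, one monitor fails with 'not running', the restart re-adds the address to the still-down device, and every subsequent monitor is satisfied again. A quick way to confirm is to run the agent's monitor action by hand; this is a sketch assuming the stock resource-agents paths:

  # invoke the agent directly with the same parameters the cluster uses
  OCF_ROOT=/usr/lib/ocf \
  OCF_RESKEY_ip=192.168.120.20 \
  OCF_RESKEY_cidr_netmask=32 \
  OCF_RESKEY_nic=bond0 \
  /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
  echo $?   # 0 = agent considers the IP running, 7 = not running

  # this is roughly what the agent checks: is the address on the NIC?
  ip -o addr show bond0 | grep 192.168.120.20

If that keeps returning 0 while the link is down, link-state monitoring has to come from something else, e.g. the ping clone or an ocf:heartbeat:ethmonitor resource.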
Re: [ClusterLabs] No DRBD resource promoted to master in Active/Passive setup
Hi,

I am not sure that Pacemaker should do any fencing here. In my setup, corosync is configured to use a back-to-back connection for its heartbeats. This is a different subnet from the one used by the ping resource that checks network connectivity and detects the failure. In my test I bring down the network device used by ping, and this triggers the failover. The node status is still known to Pacemaker, since it keeps receiving heartbeats; it is only a resource failure. I asked about fencing conditions a few days ago and was basically assured that a resource failure should not trigger STONITH actions unless explicitly configured.

I am also wondering why this is "sticky". After a failover test the DRBD resources are not working, even if I restart the cluster on all nodes.

Best wishes,
  Jens

--
Dr. Jens Auer | CGI | Software Engineer
CGI Deutschland Ltd. & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.a...@cgi.com

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: 16 September 2016 17:56
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] No DRBD resource promoted to master in Active/Passive setup
>
> On 09/16/2016 10:02 AM, Auer, Jens wrote:
> > Hi,
> >
> > I have an Active/Passive configuration with a drbd master/slave resource:
> >
> > MDA1PFP-S01 14:40:27 1803 0 ~ # pcs status
> > Cluster name: MDA1PFP
> > Last updated: Fri Sep 16 14:41:18 2016    Last change: Fri Sep 16 14:39:49 2016 by root via cibadmin on MDA1PFP-PCS01
> > Stack: corosync
> > Current DC: MDA1PFP-PCS02 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> > 2 nodes and 7 resources configured
> >
> > Online: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
> >
> > Full list of resources:
> >
> >  Master/Slave Set: drbd1_sync [drbd1]
> >      Masters: [ MDA1PFP-PCS02 ]
> >      Slaves: [ MDA1PFP-PCS01 ]
> >  mda-ip (ocf::heartbeat:IPaddr2): Started MDA1PFP-PCS02
> >  Clone Set: ping-clone [ping]
> >      Started: [ MDA1PFP-PCS01 MDA1PFP-PCS02 ]
> >  ACTIVE (ocf::heartbeat:Dummy): Started MDA1PFP-PCS02
> >  shared_fs (ocf::heartbeat:Filesystem): Started MDA1PFP-PCS02
> >
> > PCSD Status:
> >   MDA1PFP-PCS01: Online
> >   MDA1PFP-PCS02: Online
> >
> > Daemon Status:
> >   corosync: active/disabled
> >   pacemaker: active/disabled
> >   pcsd: active/enabled
> >
> > MDA1PFP-S01 14:41:19 1804 0 ~ # pcs resource --full
> >  Master: drbd1_sync
> >   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
> >   Resource: drbd1 (class=ocf provider=linbit type=drbd)
> >    Attributes: drbd_resource=shared_fs
> >    Operations: start interval=0s timeout=240 (drbd1-start-interval-0s)
> >                promote interval=0s timeout=90 (drbd1-promote-interval-0s)
> >                demote interval=0s timeout=90 (drbd1-demote-interval-0s)
> >                stop interval=0s timeout=100 (drbd1-stop-interval-0s)
> >                monitor interval=60s (drbd1-monitor-interval-60s)
> >  Resource: mda-ip (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: ip=192.168.120.20
> >               cidr_netmask=32 nic=bond0
> >   Operations: start interval=0s timeout=20s (mda-ip-start-interval-0s)
> >               stop interval=0s timeout=20s (mda-ip-stop-interval-0s)
> >               monitor interval=1s (mda-ip-monitor-interval-1s)
> >  Clone: ping-clone
> >   Resource: ping (class=ocf provider=pacemaker type=ping)
> >    Attributes: dampen=5s multiplier=1000 host_list=pf-pep-dev-1 timeout=1 attempts=3
> >    Operations: start interval=0s timeout=60 (ping-start-interval-0s)
> >                stop interval=0s timeout=20 (ping-stop-interval-0s)
> >                monitor interval=1 (ping-monitor-interval-1)
> >  Resource: ACTIVE (class=ocf provider=heartbeat type=Dummy)
> >   Operations: start interval=0s timeout=20 (ACTIVE-start-interval-0s)
> >               stop interval=0s timeout=20 (ACTIVE-stop-interval-0s)
> >               monitor interval=10 timeout=20 (ACTIVE-monitor-interval-10)
> >  Resource: shared_fs (class=ocf provider=heartbeat type=Filesystem)
> >   Attributes: device=/dev/drbd1 directory=/shared_fs fstype=xfs
> >   Operations: start interval=0s timeout=60 (shared_fs-start-interval-0s)
> >               stop interval=0s timeout=60 (shared_fs-stop-interval-0s)
> >               monitor interval=20 timeout=40 (shared_fs-m
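Two hedged thoughts on why this might be "sticky", using the resource names from the quoted config; whether the DRBD fence-peer handler is configured at all is an assumption on my part. If the DRBD resource file uses fence-peer "crm-fence-peer.sh", that handler writes a -INFINITY location constraint (named drbd-fence-by-handler-...) into the CIB when it considers the peer outdated; because it lives in the CIB it survives cluster restarts and blocks promotion until it is removed. It is also worth double-checking that the filesystem is tied to the DRBD master with colocation and ordering constraints, if that is not already the case:

  # look for a leftover constraint added by DRBD's crm-fence-peer.sh
  # (only present if the handler is configured in the DRBD resource file)
  pcs constraint --full | grep -i drbd-fence

  # check what DRBD itself thinks about the connection and disk states
  drbdadm status shared_fs    # or: cat /proc/drbd on DRBD 8.x

  # clear recorded failures for the master/slave set and recompute placement
  pcs resource cleanup drbd1_sync

  # ensure the filesystem only runs where DRBD is master, and only after promotion
  pcs constraint colocation add shared_fs with master drbd1_sync INFINITY
  pcs constraint order promote drbd1_sync then start shared_fs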