Re: [ClusterLabs] Cluster active/active

2016-10-07 Thread Dayvidson Bezerra
The company only uses Ubuntu and does not want another distro in its
environment.

I'm struggling to solve this. I've built an active/passive cluster with DRBD
before, but never active/active.

What I have read so far points to Pacemaker + Corosync with GFS2.

Has anyone already succeeded in setting up active/active on Ubuntu Linux?
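
For reference, the active/active building blocks described in the ClusterLabs
documentation for RHEL/CentOS look roughly like the sketch below (pcs syntax;
fencing is assumed to be configured already, and the device path and resource
names are illustrative - Ubuntu package and agent availability may differ):

  # DLM is required for GFS2; clone it so it runs on every node
  pcs resource create dlm ocf:pacemaker:controld \
      op monitor interval=30s on-fail=fence clone interleave=true ordered=true

  # the shared GFS2 filesystem, also cloned so it is mounted everywhere
  pcs resource create clusterfs Filesystem device="/dev/cluster_vg/cluster_lv" \
      directory="/mnt/gfs2" fstype="gfs2" options="noatime" \
      op monitor interval=10s on-fail=fence clone interleave=true

  # DLM must start before the filesystem mounts, and on the same nodes
  pcs constraint order start dlm-clone then clusterfs-clone
  pcs constraint colocation add clusterfs-clone with dlm-clone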


2016-10-07 20:50 GMT-03:00 Digimer :

> On 07/10/16 07:46 PM, Dayvidson Bezerra wrote:
> > Hello.
> > I want to set up an active/active cluster on Ubuntu 16.04 with
> > Pacemaker and Corosync, and following the ClusterLabs documentation I'm
> > not getting anywhere.
> > Does someone have documentation that might help?
>
> Last I checked (and it's not been recently), Ubuntu's support for HA is
> still lacking. It's recommended that people new to HA use either RHEL
> (CentOS) or SUSE. Red Hat and SUSE both have paid staff who make sure
> that HA works well.
>
> If you want to use Ubuntu, get a working config in either EL or SUSE
> first, then port it. That way, if you run into issues, you will know
> your config is good and that you're dealing with an OS issue. That keeps
> the fewest variables in play at a time.
>
> Also, I don't know of any good docs for HA on Ubuntu, for the same reason.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 

*Dayvidson Bezerra*


Postgraduate in Network Management - FIR-PE
Graduate in Computer Networks - FMN
Phone: +55 81 9877-5127
Skype: dayvidson.bezerra
Lattes: http://lattes.cnpq.br/3299061783823913
LinkedIn: http://br.linkedin.com/pub/dayvidson-bezerra/2a/772/bb7


Re: [ClusterLabs] Cluster active/active

2016-10-07 Thread Digimer
On 07/10/16 07:46 PM, Dayvidson Bezerra wrote:
> Hello.
> I want to set up an active/active cluster on Ubuntu 16.04 with
> Pacemaker and Corosync, and following the ClusterLabs documentation I'm
> not getting anywhere.
> Does someone have documentation that might help?

Last I checked (and it's not been recently), Ubuntu's support for HA is
still lacking. It's recommended that people new to HA use either RHEL
(CentOS) or SUSE. Red Hat and SUSE both have paid staff who make sure
that HA works well.

If you want to use Ubuntu, get a working config in either EL or SUSE
first, then port it. That way, if you run into issues, you will know
your config is good and that you're dealing with an OS issue. That keeps
the fewest variables in play at a time.

Also, I don't know of any good docs for HA on Ubuntu, for the same reason.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



[ClusterLabs] Cluster active/active

2016-10-07 Thread Dayvidson Bezerra
Hello.
I want to set up an active/active cluster on Ubuntu 16.04 with Pacemaker
and Corosync, and following the ClusterLabs documentation I'm not getting
anywhere.
Does someone have documentation that might help?







-- 

*Dayvidson Bezerra*


Postgraduate in Network Management - FIR-PE
Graduate in Computer Networks - FMN
Phone: +55 81 9877-5127
Skype: dayvidson.bezerra
Lattes: http://lattes.cnpq.br/3299061783823913
LinkedIn: http://br.linkedin.com/pub/dayvidson-bezerra/2a/772/bb7


Re: [ClusterLabs] [Question] About the timing of the stop of the monitor of the Slave resource.

2016-10-07 Thread renayama19661014
Hi All,

(Sorry, I am sending this again because the formatting of my previous mail
collapsed.)

I have a question about the behavior of a Master/Slave resource.

Is the following behavior expected, or is it a problem?

Step 1) Build a cluster.
-
[root@rh72-01 ~]# crm_mon -1
Stack: corosync
Current DC: rh72-01 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Fri Oct  7 22:12:31 2016          Last change: Fri Oct  7
22:12:29 2016 by root via cibadmin on rh72-01

2 nodes and 3 resources configured

Online: [ rh72-01 rh72-02 ]

 prmDummy       (ocf::pacemaker:Dummy): Started rh72-01
 Master/Slave Set: msStateful [prmStateful]
     Masters: [ rh72-01 ]
     Slaves: [ rh72-02 ]
-
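
A minimal configuration sketch that would reproduce the layout shown above
(pcs syntax; the monitor intervals are illustrative and not taken from the
report above):

  pcs resource create prmDummy ocf:pacemaker:Dummy op monitor interval=10s
  pcs resource create prmStateful ocf:pacemaker:Stateful \
      op monitor interval=10s role=Master \
      op monitor interval=20s role=Slave
  pcs resource master msStateful prmStateful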

Step 2) Inject a pseudo-failure into the start action of prmDummy on the
rh72-02 node.
-
dummy_start() {
    return $OCF_ERR_GENERIC   # injected: make the start action always fail
    local RETVAL

    dummy_monitor
(snip)
-

Step 3) Stop the rh72-01 node.
The monitor of msStateful is cancelled,
even though promote has not yet been carried out.
-
[root@rh72-01 ~]# systemctl stop pacemaker

[root@rh72-02 ~]# crm_mon -1
Stack: corosync
Current DC: rh72-02 (version 1.1.15-e174ec8) - partition WITHOUT quorum
Last updated: Fri Oct  7 22:14:30 2016          Last change: Fri Oct  7
22:14:11 2016 by root via cibadmin on rh72-01

2 nodes and 3 resources configured

Online: [ rh72-02 ]
OFFLINE: [ rh72-01 ]

 Master/Slave Set: msStateful [prmStateful]
     Slaves: [ rh72-02 ]

Failed Actions:
* prmDummy_start_0 on rh72-02 'unknown error' (1): call=14, status=complete,
exitreason='none',
    last-rc-change='Fri Oct  7 22:14:27 2016', queued=0ms, exec=36ms


Oct  7 22:14:27 rh72-02 lrmd[2772]:    info: Cancelling ocf operation
prmStateful_monitor_2
Oct  7 22:14:27 rh72-02 crmd[2775]:    info: Result of monitor operation for
prmStateful on rh72-02: Cancelled

-

crm_mon also still shows msStateful as a Slave.

Because the Slave resource was never promoted, I think its monitor should
not have been cancelled.

Sorry, perhaps I have simply forgotten a past discussion.

Was there a reason, from a past discussion, for cancelling the monitor of
the Slave resource first?


I registered this issue in Bugzilla and attached the crm_report file there:
http://bugs.clusterlabs.org/show_bug.cgi?id=5302

Best Regards,
Hideo Yamauchi.



[ClusterLabs] [Question] About the timing of the stop of the monitor of the Slave resource.

2016-10-07 Thread renayama19661014
Hi All,

I have a question about the behavior of a Master/Slave resource.

Is the following behavior expected, or is it a problem?

Step 1) Build a cluster.
-
[root@rh72-01 ~]# crm_mon -1
Stack: corosync
Current DC: rh72-01 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Fri Oct  7 22:12:31 2016  Last change: Fri Oct  7
22:12:29 2016 by root via cibadmin on rh72-01

2 nodes and 3 resources configured

Online: [ rh72-01 rh72-02 ]

 prmDummy   (ocf::pacemaker:Dummy): Started rh72-01
 Master/Slave Set: msStateful [prmStateful]
     Masters: [ rh72-01 ]
     Slaves: [ rh72-02 ]
-

Step 2) Inject a pseudo-failure into the start action of prmDummy on the
rh72-02 node.
-
dummy_start() {
    return $OCF_ERR_GENERIC   # injected: make the start action always fail
    local RETVAL

    dummy_monitor
(snip)
-

Step 3) Stop the rh72-01 node.
The monitor of msStateful is cancelled,
even though promote has not yet been carried out.
-
[root@rh72-01 ~]# systemctl stop pacemaker

[root@rh72-02 ~]# crm_mon -1
Stack: corosync
Current DC: rh72-02 (version 1.1.15-e174ec8) - partition WITHOUT quorum
Last updated: Fri Oct  7 22:14:30 2016  Last change: Fri Oct  7
22:14:11 2016 by root via cibadmin on rh72-01

2 nodes and 3 resources configured

Online: [ rh72-02 ]
OFFLINE: [ rh72-01 ]

 Master/Slave Set: msStateful [prmStateful]
     Slaves: [ rh72-02 ]

Failed Actions:
* prmDummy_start_0 on rh72-02 'unknown error' (1): call=14, status=complete,
exitreason='none', last-rc-change='Fri Oct  7 22:14:27 2016', queued=0ms,
exec=36ms

Oct  7 22:14:27 rh72-02 lrmd[2772]:    info: Cancelling ocf operation
prmStateful_monitor_2
Oct  7 22:14:27 rh72-02 crmd[2775]:    info: Result of monitor operation for
prmStateful on rh72-02: Cancelled
-

crm_mon also still shows msStateful as a Slave.

Because the Slave resource was never promoted, I think its monitor should
not have been cancelled.

Sorry, perhaps I have simply forgotten a past discussion.

Was there a reason, from a past discussion, for cancelling the monitor of
the Slave resource first?

* I registered this issue in Bugzilla
  (http://bugs.clusterlabs.org/show_bug.cgi?id=5302).
* I attached the crm_report file to Bugzilla.

Best Regards,
Hideo Yamauchi.



Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-10-07 Thread renayama19661014
Hi All,

Our users may not necessarily use sbd.

I confirmed that there is a method that uses corosync's WD (watchdog) service
as one way of avoiding sbd.
With the WD service, Pacemaker's processes can be watched via CMAP and the
watchdog can be triggered if one of them hangs.


We can prepare a patch for Pacemaker.

Has there already been a discussion about using the WD service?


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Klaus Wenninger 
> To: Ulrich Windl ; users@clusterlabs.org
> Cc: 
> Date: 2016/10/7, Fri 17:47
> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is 
> frozen, cluster decisions are delayed infinitely
> 
> On 10/07/2016 08:14 AM, Ulrich Windl wrote:
>  Klaus Wenninger  schrieb am 
> 06.10.2016 um 18:03 in
>>  Nachricht <3980cfdd-ebd9-1597-f6bd-a1ca808f7...@redhat.com>:
>>>  On 10/05/2016 04:22 PM, renayama19661...@ybb.ne.jp wrote:
  Hi All,
 
>>  If a user uses sbd, can the cluster evade a problem of 
> SIGSTOP of crmd?
>   
>  As pointed out earlier, maybe crmd should feed a watchdog. Then 
> stopping 
>>>  crmd 
>  will reboot the node (unless the watchdog fails).
  Thank you for comment.
 
  We examine watchdog of crmd, too.
  In addition, I comment after examination advanced.
>>>  Was thinking of doing a small test implementation going
>>>  a little in the direction Lars Ellenberg had been pointing out.
>>> 
>>>  a couple of thoughts I had so far:
>>> 
>>>  - add an API (via DBus or libqb - favoring libqb atm) to sbd
>>>    an application can use to create a watchdog within sbd
>>  Why has it to be done within sbd?
> Not necessarily, could be spawned out as well into an own project or
> something already existent could be taken.
> Remember to have added a dbus-interface to
> https://sourceforge.net/projects/watchdog/ for a project once.
> If you have a suggestion I'm open.
> Going off sbd would have the advantage of a smooth start:
> 
> - cluster/pacemaker-watcher are there already and can
>   be replaced/moved over time
> - the lifecycle of the daemon (when started/stopped) is
>   already something that is in the code and in the people's minds
> 
>>>  - parameters for the first are a name and a timeout
>>> 
>>>  - first use-case would be crmd observation
>>> 
>>>  - later on we could think of removing pacemaker dependencies
>>>    from sbd by moving the actual implementation of
>>>    pacemaker-watcher and probably cluster-watcher as well
>>>    into pacemaker - using the new API
>>> 
>>>  - this of course creates sbd dependency within pacemaker so
>>>    that it would make sense to offer a simpler and self-contained
>>>    implementation within pacemaker as an alternative
>>  I think the watchdog interface is so simple that you don't need a relay 
> for it. The only limit I can imagine is the number of watchdogs available of 
> some specific hardware.
> That is the point ;-)
>>>    thus it would be favorable to have the dependency
>>>    within a non-compulsory pacemaker-rpm so that
>>>    we can offer an alternative that doesn't use sbd
>>>    at maybe the cost of being less reliable or one
>>>    that owns a hardware-watchdog by itself for systems
>>>    where this is still unused.
>>> 
>>>    - e.g. via some kind of plugin (Andrew forgive me -
>>>                                                     no pils ;-) )
>>>    - or via an additional daemon
>>> 
>>>  What did you have in mind?
>>>  Maybe it makes sense to synchronize...
>>> 
>>>  Regards,
>>>  Klaus
>>>   
 
  Best Regards,
  Hideo Yamauchi.
 
 
 
  - Original Message -
>  From: Ulrich Windl 
>  To: users@clusterlabs.org; renayama19661...@ybb.ne.jp 
>  Cc: 
>  Date: 2016/10/5, Wed 23:08
>  Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is 
> frozen, 
>>>  cluster decisions are delayed infinitely
    schrieb am 
> 21.09.2016 um 11:52 
>  in Nachricht
>  <876439.61305...@web200311.mail.ssk.yahoo.co.jp>:
>>   Hi All,
>> 
>>   Was the final conclusion given about this problem?
>> 
>>   If a user uses sbd, can the cluster evade a problem of 
> SIGSTOP of crmd?
>  As pointed out earlier, maybe crmd should feed a watchdog. Then 
> stopping 
>>>  crmd 
>  will reboot the node (unless the watchdog fails).
> 
>>   We are interested in this problem, too.
>> 
>>   Best Regards,
>> 
>>   Hideo Yamauchi.
>> 
>> 
>>   ___
>>   Users mailing list: Users@clusterlabs.org 
>>   http://clusterlabs.org/mailman/listinfo/users 
>> 
>>   Project Home: http://www.clusterlabs.org 
>>   Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>   Bugs: http://bugs.clusterlabs.org 
  ___
  Users mailing list: Users@clusterlabs.org 
  http://clusterlabs.org/mailman/listinfo/users 
 
  Project Home: http://www

[ClusterLabs] Pacemaker and Oracle ASM

2016-10-07 Thread Chad Cravens
Hello:

I'm working on a project where the client is using Oracle ASM (volume
manager) for database storage. I have implemented a cluster before using
LVM with ext4 and understand there are resource agents (RA) already
existing within the ocf:heartbeat group that can manage which nodes connect
and disconnect to the filesystem and prevents data corruption. For example:

pcs resource create my_lvm LVM volgrpname=my_vg \
exclusive=true --group apachegroup

pcs resource create my_fs Filesystem \
device="/dev/my_vg/my_lv" directory="/var/www" fstype="ext4" --group \
apachegroup

I'm curious if anyone has had a situation where Oracle ASM is used instead
of LVM. ASM seems pretty standard for Oracle databases, but I'm not sure
which resource agent I could use to manage ASM.

Thanks!

-- 
Kindest Regards,
Chad Cravens
(843) 291-8340

Chad Cravens
(843) 291-8340
chad.crav...@ossys.com
http://www.ossys.com


Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-07 Thread Martin Schlegel
Thanks for all the responses from Jan, Ulrich and Digimer!

We are already using bonded network interfaces, but we are also forced to go
across IP subnets. Certain routes between routers can go missing, and have
done so in the past.

This has happened on one of our nodes' public network, which became
unreachable from other local public IP subnets. If the same thing happened in
parallel on another node's private network, the entire cluster would be down,
just because - as Ulrich said, "It's a ring!" - both heartbeat rings are
marked faulty. That is not an optimal result, because cluster communication
is in fact 100% possible between all nodes.

With an increasing number of nodes this risk is fairly big. Just think about
providers of bigger cloud infrastructures.

With the above scenario in mind - is there a better (tested and recommended)
way to configure this?
... or is knet the way to go in the future?
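
For context, the two-ring passive-RRP layout being discussed boils down to a
corosync.conf along these lines (a minimal sketch only; addresses, ports and
the node list are placeholders):

  totem {
      version: 2
      transport: udpu
      rrp_mode: passive

      interface {
          ringnumber: 0
          bindnetaddr: 192.168.0.0
          mcastport: 5405
      }
      interface {
          ringnumber: 1
          bindnetaddr: 10.0.0.0
          mcastport: 5407
      }
  }

  nodelist {
      node {
          nodeid: 1
          ring0_addr: 192.168.0.11
          ring1_addr: 10.0.0.11
      }
      # ... one node { } block per cluster member
  }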


Regards,
Martin Schlegel


> Jan Friesse wrote on 7 October 2016 at 11:28:
> 
> Martin Schlegel wrote:
> 
> > Thanks for the confirmation Jan, but this sounds a bit scary to me !
> > 
> > Spinning this experiment a bit further ...
> > 
> > Would this not also mean that with a passive rrp with 2 rings it only takes
> > 2
> > different nodes that are not able to communicate on different networks at
> > the
> > same time to have all rings marked faulty on _every_node ... therefore all
> > cluster members loosing quorum immediately even though n-2 cluster members
> > are
> > technically able to send and receive heartbeat messages through all 2 rings
> > ?
> 
> Not exactly, but this situation causes corosync to start behaving really 
> badly spending most of the time in "creating new membership" loop.
> 
> Yes, RRP is simply bad. If you can, use bonding. Improvement of RRP by 
> replace it by knet is biggest TODO for 3.x.
> 
> Regards,
>  Honza
> 
> > I really hope the answer is no and the cluster still somehow has a quorum in
> > this case.
> > 
> > Regards,
> > Martin Schlegel
> 
> >> Jan Friesse wrote on 5 October 2016 at 09:01:
> >>
> >> Martin,
> >>
> >>> Hello all,
> >>>
> >>> I am trying to understand why the following 2 Corosync heartbeat ring
> >>> failure
> >>> scenarios
> >>> I have been testing and hope somebody can explain why this makes any
> >>> sense.
> >>>
> >>> Consider the following cluster:
> >>>
> >>> * 3x Nodes: A, B and C
> >>> * 2x NICs for each Node
> >>> * Corosync 2.3.5 configured with "rrp_mode: passive" and
> >>> udpu transport with ring id 0 and 1 on each node.
> >>> * On each node "corosync-cfgtool -s" shows:
> >>> [...] ring 0 active with no faults
> >>> [...] ring 1 active with no faults
> >>>
> >>> Consider the following scenarios:
> >>>
> >>> 1. On node A only block all communication on the first NIC configured with
> >>> ring id 0
> >>> 2. On node A only block all communication on all NICs configured with
> >>> ring id 0 and 1
> >>>
> >>> The result of the above scenarios is as follows:
> >>>
> >>> 1. Nodes A, B and C (!) display the following ring status:
> >>> [...] Marking ringid 0 interface  FAULTY
> >>> [...] ring 1 active with no faults
> >>> 2. Node A is shown as OFFLINE - B and C display the following ring status:
> >>> [...] ring 0 active with no faults
> >>> [...] ring 1 active with no faults
> >>>
> >>> Questions:
> >>> 1. Is this the expected outcome ?
> >>
> >> Yes
> >>
> >>> 2. In experiment 1. B and C can still communicate with each other over
> >>> both
> >>> NICs, so why are
> >>> B and C not displaying a "no faults" status for ring id 0 and 1 just like
> >>> in experiment 2.
> >>
> >> Because this is how RRP works. RRP marks whole ring as failed so every
> >> node sees that ring as failed.
> >>
> >>> when node A is completely unreachable ?
> >>
> >> Because it's different scenario. In scenario 1 there are 3 nodes
> >> membership where one of them has failed one ring -> whole ring is
> >> failed. In scenario 2 there are 2 nodes membership where both rings
> >> works as expected. Node A is completely unreachable and it's not in the
> >> membership.
> >>
> >> Regards,
> >> Honza
> >>
> >>> Regards,
> >>> Martin Schlegel
> >>>
> >>> ___
> >>> Users mailing list: Users@clusterlabs.org
> >>> http://clusterlabs.org/mailman/listinfo/users
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>
> >>>
> 
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> >


Re: [ClusterLabs] Antw: Re: Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-07 Thread Dmitri Maziuk

On 2016-10-07 01:18, Ulrich Windl wrote:


Any hardware may fail at any time. We even had an onboard NIC that
stopped operating correctly one day; we have had CPU cache errors, RAM
parity errors, PCI bus errors, and everything else you can imagine.


:) http://dilbert.com/strip/1995-06-24

Our vendor's been good to us: over the last dozen or so years we only
had about 4 dead mobos, 3 PSUs (same batch), a few DIMMs and one SATA
backplane.  But we mostly run storage, so my perception
is heavily biased towards disks.


Dima




Re: [ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-07 Thread Jan Friesse

Martin Schlegel wrote:

Thanks for the confirmation Jan, but this sounds a bit scary to me !

Spinning this experiment a bit further ...

Would this not also mean that with a passive rrp with 2 rings it only takes 2
different nodes that are not able to communicate on different networks at the
same time to have all rings marked faulty on _every_node ... therefore all
cluster members loosing quorum immediately even though n-2 cluster members are
technically able to send and receive heartbeat messages through all 2 rings ?


Not exactly, but this situation causes corosync to start behaving really
badly, spending most of its time in a "creating new membership" loop.


Yes, RRP is simply bad. If you can, use bonding. Replacing RRP with knet is
the biggest TODO for 3.x.
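
For completeness, an active-backup bond on a Debian/Ubuntu-style system is
roughly the sketch below (/etc/network/interfaces with the ifenslave package;
interface names and addresses are placeholders - on RHEL-like systems the
equivalent is an ifcfg-bond0 file or an nmcli bond):

  auto bond0
  iface bond0 inet static
      address 192.168.0.11
      netmask 255.255.255.0
      bond-slaves eno1 eno2
      bond-mode active-backup
      bond-miimon 100
      bond-primary eno1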


Regards,
  Honza



I really hope the answer is no and the cluster still somehow has a quorum in
this case.

Regards,
Martin Schlegel



Jan Friesse wrote on 5 October 2016 at 09:01:

Martin,


Hello all,

I am trying to understand why the following 2 Corosync heartbeat ring
failure
scenarios
I have been testing and hope somebody can explain why this makes any sense.

Consider the following cluster:

  * 3x Nodes: A, B and C
  * 2x NICs for each Node
  * Corosync 2.3.5 configured with "rrp_mode: passive" and
  udpu transport with ring id 0 and 1 on each node.
  * On each node "corosync-cfgtool -s" shows:
  [...] ring 0 active with no faults
  [...] ring 1 active with no faults

Consider the following scenarios:

  1. On node A only block all communication on the first NIC configured with
ring id 0
  2. On node A only block all communication on all NICs configured with
ring id 0 and 1

The result of the above scenarios is as follows:

  1. Nodes A, B and C (!) display the following ring status:
  [...] Marking ringid 0 interface  FAULTY
  [...] ring 1 active with no faults
  2. Node A is shown as OFFLINE - B and C display the following ring status:
  [...] ring 0 active with no faults
  [...] ring 1 active with no faults

Questions:
  1. Is this the expected outcome ?


Yes


2. In experiment 1. B and C can still communicate with each other over both
NICs, so why are
  B and C not displaying a "no faults" status for ring id 0 and 1 just like
in experiment 2.


Because this is how RRP works. RRP marks whole ring as failed so every
node sees that ring as failed.


when node A is completely unreachable ?


Because it's different scenario. In scenario 1 there are 3 nodes
membership where one of them has failed one ring -> whole ring is
failed. In scenario 2 there are 2 nodes membership where both rings
works as expected. Node A is completely unreachable and it's not in the
membership.

Regards,
  Honza


Regards,
Martin Schlegel

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org












Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-10-07 Thread Klaus Wenninger
On 10/07/2016 08:14 AM, Ulrich Windl wrote:
 Klaus Wenninger  wrote on 06.10.2016 at 18:03 in
> message <3980cfdd-ebd9-1597-f6bd-a1ca808f7...@redhat.com>:
>> On 10/05/2016 04:22 PM, renayama19661...@ybb.ne.jp wrote:
>>> Hi All,
>>>
> If a user uses sbd, can the cluster evade a problem of SIGSTOP of crmd?
  
 As pointed out earlier, maybe crmd should feed a watchdog. Then stopping 
>> crmd 
 will reboot the node (unless the watchdog fails).
>>> Thank you for comment.
>>>
>>> We examine watchdog of crmd, too.
>>> In addition, I comment after examination advanced.
>> Was thinking of doing a small test implementation going
>> a little in the direction Lars Ellenberg had been pointing out.
>>
>> a couple of thoughts I had so far:
>>
>> - add an API (via DBus or libqb - favoring libqb atm) to sbd
>>   an application can use to create a watchdog within sbd
> Why has it to be done within sbd?
Not necessarily; it could be spun out into its own project, or something
that already exists could be used.
I remember adding a dbus-interface to
https://sourceforge.net/projects/watchdog/ for a project once.
If you have a suggestion, I'm open.
Starting from sbd would have the advantage of a smooth start:

- cluster/pacemaker-watcher are there already and can
  be replaced/moved over time
- the lifecycle of the daemon (when started/stopped) is
  already something that is in the code and in the people's minds

>> - parameters for the first are a name and a timeout
>>
>> - first use-case would be crmd observation
>>
>> - later on we could think of removing pacemaker dependencies
>>   from sbd by moving the actual implementation of
>>   pacemaker-watcher and probably cluster-watcher as well
>>   into pacemaker - using the new API
>>
>> - this of course creates sbd dependency within pacemaker so
>>   that it would make sense to offer a simpler and self-contained
>>   implementation within pacemaker as an alternative
> I think the watchdog interface is so simple that you don't need a relay for 
> it. The only limit I can imagine is the number of watchdogs available of some 
> specific hardware.
That is the point ;-)
>>   thus it would be favorable to have the dependency
>>   within a non-compulsory pacemaker-rpm so that
>>   we can offer an alternative that doesn't use sbd
>>   at maybe the cost of being less reliable or one
>>   that owns a hardware-watchdog by itself for systems
>>   where this is still unused.
>>
>>   - e.g. via some kind of plugin (Andrew forgive me -
>>no pils ;-) )
>>   - or via an additional daemon
>>
>> What did you have in mind?
>> Maybe it makes sense to synchronize...
>>
>> Regards,
>> Klaus
>>  
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>>
>>> - Original Message -
 From: Ulrich Windl 
 To: users@clusterlabs.org; renayama19661...@ybb.ne.jp 
 Cc: 
 Date: 2016/10/5, Wed 23:08
 Subject: Antw: Re: [ClusterLabs] Antw: Re: When the DC crmd is frozen, 
>> cluster decisions are delayed infinitely
>>>   schrieb am 21.09.2016 um 11:52 
 in Nachricht
 <876439.61305...@web200311.mail.ssk.yahoo.co.jp>:
>  Hi All,
>
>  Was the final conclusion given about this problem?
>
>  If a user uses sbd, can the cluster evade a problem of SIGSTOP of crmd?
 As pointed out earlier, maybe crmd should feed a watchdog. Then stopping 
>> crmd 
 will reboot the node (unless the watchdog fails).

>  We are interested in this problem, too.
>
>  Best Regards,
>
>  Hideo Yamauchi.
>
>
>  ___
>  Users mailing list: Users@clusterlabs.org 
>  http://clusterlabs.org/mailman/listinfo/users 
>
>  Project Home: http://www.clusterlabs.org 
>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>  Bugs: http://bugs.clusterlabs.org 
>>> ___
>>> Users mailing list: Users@clusterlabs.org 
>>> http://clusterlabs.org/mailman/listinfo/users 
>>>
>>> Project Home: http://www.clusterlabs.org 
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>> Bugs: http://bugs.clusterlabs.org 
>>
>>
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://clusterlabs.org/mailman/listinfo/users 
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
>
>
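
For readers unfamiliar with the watchdog interface mentioned above: "feeding"
the kernel watchdog is nothing more than writing to the watchdog device at
regular intervals. A minimal shell sketch (assuming /dev/watchdog backed by
softdog or a hardware driver, with a timeout longer than the write interval):

  # keep one file descriptor open on the device
  exec 3> /dev/watchdog
  while true; do
      printf '.' >&3   # each write "pets" the watchdog
      sleep 1          # must stay well below the watchdog timeout
  done
  # if this loop is ever stopped (e.g. SIGSTOP), the watchdog expires and the
  # node reboots - the behaviour wanted above for a frozen crmd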


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org