[ClusterLabs] How to declare ping primitive with rule

2018-06-08 Thread Salvatore D'angelo
Hi All,

I have a PostgreSQL cluster on three nodes (Master/Sync/Async) with WAL files 
stored on two GlusterFS nodes. In total, 5 machines.
Let's call the first three machines pg1, pg2, and pg3, and the other two (without 
pacemaker) pgalog1 and pgalog2.

This code works fine on some bare metal machines, and I was able to port it 
to Docker (because this simplifies tests and allows us to experiment). So far so 
good.

I wasn't the original author of the code. We have some scripts that create 
this cluster and, as said before, work fine on bare metal. In particular, I 
have this piece of code:

cat - 
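The snippet above is truncated in the archive; in crm shell, a ping primitive gated by a location rule is typically declared along these lines (resource names, host list, and the msPostgresql target below are illustrative assumptions, not the poster's actual values):

```
primitive ping-gw ocf:pacemaker:ping \
    params host_list="10.0.0.1 10.0.0.2" multiplier="1000" dampen="5s" \
    op monitor interval="10s" timeout="60s"
clone ping-clone ping-gw
location master-needs-connectivity msPostgresql \
    rule -inf: not_defined pingd or pingd lte 0
```

The rule moves the resource away from any node where the pingd attribute is undefined or zero, i.e. a node that cannot reach any of the listed hosts.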

Re: [ClusterLabs] Ansible role to configure Pacemaker

2018-06-08 Thread Jan Pokorný
On 07/06/18 17:57 +0100, Adam Spiers wrote:
> Jan Pokorný wrote:
>> While I see why Ansible is compelling, I feel it's important to
>> challenge this trend of trying to bend/rebrand _machine-local
>> configuration management tool_ as _distributed system management tool_
>> (pacemaker is distributed application/framework of sorts), which Ansible
>> alone is _not_, as far as I know, hence the effort doesn't seem to be
>> 100% sound (which really matters if reliability is the goal).
> 
> I'm not sure I understand.  Are you saying Ansible is a machine-local
> configuration management tool not a distributed system management
> tool?  Because I don't think that statement is accurate; Ansible was
> absolutely designed from the beginning for orchestrating config
> management over multiple machines (unlike Chef or Puppet).  But as a
> RH employee you must know that already, so I'm probably missing
> something ;-)

I know as little as anyone who doesn't care much about that surface,
so yes, I may be inaccurate and will gladly stand corrected and
enlightened.  In part, I am playing devil's advocate here; that is
in line with the precautionary approach I'd suggest to anyone
taking HA seriously, and at worst I'll end up being just overly
pessimistic (feel free to shame me, then ;-)
But without the like-minded, we wouldn't be using seatbelts...

>> Once more, this has nothing to do with the announced project; it's
>> just the trending fuss on this topic that indicates to me that people
>> independently, as they keenly reinvent the wheel (here: Ansible
>> roles), become blind to the fallacy that everything must work nicely
>> in multi-machine shared-state scenarios, without any shortcomings,
>> just as they are used to with single-host bootstrapping.
> 
> Ansible is not intended purely for single-host bootstrapping.
> But again I'm sure you already know that, so I'm a bit confused what
> your point is here.

As a counter-question: is it then fully qualified to control distributed
systems, where holistic knowledge about the cluster partition would have
to be inherently present in the equations?

>> But there are, and precisely because the optimal tool for the
>> task does not get selected!  Just imagine what would happen if a single
>> machine got configured independently with multiple Ansible actors
>> (there may be mechanisms -- relatively easy within the same host --
>> that would prevent such interferences, but assume now they are not
>> strong enough).
> 
> ICBW but it sounds like you are imagining a problem which isn't always
> there, and even when it is there, it's not big enough to justify
> chucking away the other benefits of automating deployment of Pacemaker
> via something like Ansible.  In other words, don't throw the baby out
> with the bathwater[0].
> 
> [0] 
> https://en.wikipedia.org/wiki/Don%27t_throw_the_baby_out_with_the_bathwater

Sorry if I came across like that; I am just afraid that the possible
shortcomings of automation like this (presumably running unattended)
are not very apparent to anyone who would just pick an allegedly
"stock solution for my task", and in HA everyone should tread
especially carefully, as mentioned.

> For example I work on a product which uses Ansible running from a
> central node to deploy clusters.  By virtue of the documented contract
> with the customer about what deployment / maintenance procedures are
> supported, we can assume that only one Ansible actor will ever run
> concurrently.  If we are worried that the customer will ignore the
> documentation and take actions we don't support, we can implement some
> kind of simple locking on the deployer node and that's plenty good
> enough.  And yes, this makes the deployer node a SPoF, but again there
> are perfectly acceptable and simple ways to mitigate that issue
> (briefly: make it easy to turn any node into the deployer).

Documenting limitations is vital, and I don't have a single bit against
solutions that underwent such scrutiny to prevent surprises.
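For concreteness, the "simple locking on the deployer node" that Adam mentions could be sketched like this (the file path and function name are invented for illustration; this is not anyone's actual tooling):

```python
import fcntl
import os

# Hypothetical sketch: take an exclusive, non-blocking flock(2) on a
# well-known file before running a deployment, so a second concurrent
# actor fails fast instead of racing the first.
def acquire_deploy_lock(path="/var/lock/cluster-deploy.lock"):
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # keep this fd open for the whole deployment
    except BlockingIOError:
        os.close(fd)
        return None  # another deployer already holds the lock
```

A wrapper would call this before invoking ansible-playbook and refuse to run when it returns None; since the lock dies with the process, a crashed deployer cannot wedge the node.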

> So whilst the concerns you write about here are potentially
> correct from a theoretical perspective, in the real world they are
> most likely not strong enough to prevent us from being interested in
> using (say) Ansible to deploy Pacemaker.
> 
>> What will happen?  Likely some mess-ups will occur as
>> glorified idempotence is hard to achieve atomically.  Voila, inflicted
>> race conditions, one by one, get exercised, until there's enough of
>> bad luck that the rule of idempotence gets broken, just because of
>> these processes emulating a schizophrenic (at the same time
>> multitasking) admin.  Ouch!
>> 
>> Now, reflect this onto the situation with possibly concurrent
>> cluster configuration.  One cannot really expect the cluster
>> stack to be bullet-proof against these sorts of mishandling.
>> Single cluster administrator operating at a time?  Ideal!
>> Few administrators presumably with separate areas of
>> configuration interest?  Pacemaker is quite ready.
>> Cluster configuration 
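The kind of race sketched above can be shown with a deterministic toy (pure illustration; the names are invented): each actor's step is idempotent in isolation, yet an interleaved read-modify-write still loses an update.

```python
# Toy model of two config-management actors doing read-modify-write on
# shared cluster state. Each step is idempotent on its own, but the
# interleaving silently drops one actor's update.
def add_member(snapshot, name):
    # Idempotent in isolation: adds the member only if it is missing.
    return snapshot if name in snapshot else snapshot + [name]

state = ["a"]
snap1 = list(state)              # actor 1 reads
snap2 = list(state)              # actor 2 reads the same stale snapshot
state = add_member(snap1, "b")   # actor 1 writes ["a", "b"]
state = add_member(snap2, "c")   # actor 2 overwrites: ["a", "c"]
# "b" is lost, even though neither actor did anything non-idempotent
```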

Re: [ClusterLabs] corosync not able to form cluster

2018-06-08 Thread Prasad Nagaraj
Hi Christine - Thanks for looking into the logs.
I also see that the node eventually comes out of GATHER state here:

Jun 07 16:56:10 corosync [TOTEM ] entering GATHER state from 0.
Jun 07 16:56:10 corosync [TOTEM ] Creating commit token because I am the rep.

Does that mean it has timed out or given up and then come out?

Second point: I saw some unexpected entries when I ran tcpdump on the
node coro.4 [it's also pasted in one of the earlier threads]. You can see
that it was receiving messages like

10:23:17.117347 IP 172.22.0.13.50468 > 172.22.0.4.netsupport: UDP, length
332
10:23:17.140960 IP 172.22.0.8.50438 > 172.22.0.4.netsupport: UDP, length 82
10:23:17.141319 IP 172.22.0.6.38535 > 172.22.0.4.netsupport: UDP, length 156

Please note that 172.22.0.8 and 172.22.0.6 are not part of my group, and I
was wondering why these messages are arriving.

Thanks!

On Fri, Jun 8, 2018 at 2:34 PM, Christine Caulfield wrote:

> On 07/06/18 18:32, Prasad Nagaraj wrote:
> > Hi Christine - Got it:)
> >
> > I have collected a few seconds of debug logs from all nodes after startup.
> > Please find them attached.
> > Please let me know if this will help us to identify the root cause.
> >
>
> The problem is on the node coro.4 - it never gets out of the JOIN process:
>
> "Jun 07 16:55:37 corosync [TOTEM ] entering GATHER state from 11."
>
> So something is wrong on that node: either a rogue routing table
> entry, a dangling iptables rule, or even a broken NIC.
>
> Chrissie
>
>
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync not able to form cluster

2018-06-08 Thread Christine Caulfield
On 07/06/18 18:32, Prasad Nagaraj wrote:
> Hi Christine - Got it:)
> 
> I have collected a few seconds of debug logs from all nodes after startup.
> Please find them attached.
> Please let me know if this will help us to identify the root cause.
> 

The problem is on the node coro.4 - it never gets out of the JOIN process:

"Jun 07 16:55:37 corosync [TOTEM ] entering GATHER state from 11."

So something is wrong on that node: either a rogue routing table
entry, a dangling iptables rule, or even a broken NIC.

Chrissie

> Thanks!
> 
> On Thu, Jun 7, 2018 at 8:43 PM, Christine Caulfield wrote:
> 
> On 07/06/18 15:53, Prasad Nagaraj wrote:
> > Hi - As you can see in the corosync.conf details - i have already kept
> > debug: on
> > 
> 
> But only in the (disabled) AMF subsystem, not for corosync as a whole :)
> 
>     logger_subsys {
>     subsys: AMF
>     debug: on
>     }
> 
> 
> Chrissie
> 
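The change Chrissie suggests, enabling debug for corosync as a whole rather than only inside the AMF logger_subsys, would make the logging stanza look roughly like this (the other option values here are placeholders, not taken from Prasad's configuration):

```
logging {
    to_syslog: yes
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    logger_subsys {
        subsys: AMF
        debug: off
    }
    debug: on
}
```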
> 
> > 
> > On Thu, 7 Jun 2018, 8:03 pm Christine Caulfield wrote:
> >
> >     On 07/06/18 15:24, Prasad Nagaraj wrote:
> >     >
> >     > No iptables or otherwise firewalls are setup on these nodes.
> >     >
> >     > One observation is that each node sends messages with its own ring
> >     > sequence number, which is not converging. I have seen that in a good
> >     > cluster, when nodes respond with the same sequence number, the
> >     > membership is automatically formed. But in our case, that does not
> >     > happen.
> >     >
> >
> >     That's just a side-effect of the cluster not forming. It's not
> >     causing it. Can you enable full corosync debugging (just add debug: on
> >     to the end of the logging {} stanza) and see if that has any more
> >     useful information? (I only need the corosync bits, not the pcmk ones.)
> >
> >     Chrissie
> >
> >     > Example: we can see that one node sends
> >     > Jun 07 07:55:04 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> >     > membership event on ring 71084: memb=1, new=0, lost=0
> >     > ...
> >     > Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> >     > membership event on ring 71096: memb=1, new=0, lost=0
> >     > Jun 07 07:55:16 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> >     > membership event on ring 71096: memb=1, new=0, lost=0
> >     >
> >     > The other node sends messages with its own numbers:
> >     > Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> >     > membership event on ring 71088: memb=1, new=0, lost=0
> >     > Jun 07 07:55:12 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> >     > membership event on ring 71088: memb=1, new=0, lost=0
> >     > ...
> >     > Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
> >     > membership event on ring 71100: memb=1, new=0, lost=0
> >     > Jun 07 07:55:24 corosync [pcmk  ] notice: pcmk_peer_update: Stable
> >     > membership event on ring 71100: memb=1, new=0, lost=0
> >     >
> >     > Any idea why this happens, and why the sequence numbers from
> >     > different nodes are not converging?
> >     >
> >     > Thanks!
> >     >
> >     >
> >     >
> >     >
> >     >