When you say:

"So if corosync dies, pacemaker loses it's foundation and falls over dead." 


You mean the pacemaker instance on all of the nodes, right?


Also (my emphasis >><<)

"The other node(s) in the cluster will declare the >>node<< as failed, fence it 
and then decide if any services need to be restarted"

Which >>node<< are you talking about? The failed node itself? Will it
try to restart itself if corosync (and therefore pacemaker) dies?

________________________________
 From: Digimer <li...@alteeve.ca>
To: Hermes Flying <flyingher...@yahoo.com> 
Cc: General Linux-HA mailing list <linux-ha@lists.linux-ha.org> 
Sent: Sunday, December 2, 2012 9:10 PM
Subject: Re: [Linux-HA] Corosync on cluster with 3+ nodes
 
Corosync is used for cluster communications, membership and quorum. It
doesn't care what the cluster is used for, and it coordinates no
services itself. Think of it like the foundation of a house.

Pacemaker, in turn, doesn't care how members in the cluster come or go;
it only cares about who is a member now and whether or not it has
quorum. When membership changes, it looks at the service(s) on the
cluster and decides if anything should be stopped, started, relocated or
restarted. Think of this as the house on top of the foundation.

So if corosync dies, pacemaker loses its foundation and falls over
dead. The other node(s) in the cluster will declare the node as failed,
fence it and then decide if any services need to be restarted. So if the
node had a VIP, and pacemaker was configured to restart the VIP
elsewhere, it would then do so.
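
As a concrete (and invented) illustration, a minimal VIP in crm shell
might look like this; the address, netmask and monitor interval are
made-up values:

    # A floating IP managed by the ocf:heartbeat:IPaddr2 agent.
    # Pacemaker starts it on one node; after that node is fenced,
    # it can be started again on a surviving node.
    primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.0.100 cidr_netmask=24 \
        op monitor interval=30s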

Now, if the VIP itself fails, but all the nodes are healthy, then
pacemaker may simply try to restart the VIP on the same machine. If it
keeps failing, then it may relocate the VIP to another node in the
cluster. This, again, depends on how pacemaker is configured to manage
the VIP.
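
Extending the p_vip sketch above, that "retry locally, then relocate"
behaviour is typically expressed with meta attributes; the values here
are just examples:

    # After 3 failures on a node, move the VIP elsewhere; the
    # failure count expires after 10 minutes.
    primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.0.100 cidr_netmask=24 \
        op monitor interval=30s \
        meta migration-threshold=3 failure-timeout=600s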

On 12/02/2012 01:59 PM, Hermes Flying wrote:
> But the VIP service works in coordination with corosync, right? I mean when
> you say:
> 
> "if the node hosting the VIP fails, pacemaker may try to restart it or
> it might relocate it, depending on how you've configured things"
> 
> What does failure mean? That the service crashed, OR also that corosync
> failed (so it cannot reach the rest of the nodes)?
> 
> 
> ------------------------------------------------------------------------
> *From:* Digimer <li...@alteeve.ca>
> *To:* Hermes Flying <flyingher...@yahoo.com>
> *Cc:* General Linux-HA mailing list <linux-ha@lists.linux-ha.org>
> *Sent:* Sunday, December 2, 2012 8:46 PM
> *Subject:* Re: [Linux-HA] Corosync on cluster with 3+ nodes
> 
> As I said: each _service_ can have the concept of "primary", just not
> pacemaker itself. I gave an example earlier:
> 
> Pacemaker might have two services:
> * DRBD, active on all nodes.
> * VIP, active on one node only.
> 
> In this example, the DRBD service is Active/Active. If it fails on a
> given node, it will try to restart. If that fails, it will *not*
> relocate. Here, there is no "primary".
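> 
> As a hedged sketch (resource names invented, assuming the
> ocf:linbit:drbd agent), such an Active/Active DRBD resource might be
> defined as:
> 
>     primitive p_drbd ocf:linbit:drbd \
>         params drbd_resource=r0 \
>         op monitor interval=30s
>     # Dual-primary: promoted to Master on both nodes at once.
>     ms ms_drbd p_drbd \
>         meta master-max=2 clone-max=2 notify=true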
> 
> The VIP on the other hand runs on one node at a time only. Generally it
> will start on the first active node, but you might configure it to
> prefer one node. If that preferred node comes online later, pacemaker
> will migrate it. If there is no preferred node, then the VIP will stay
> where it is. If the node hosting the VIP fails, pacemaker may try to
> restart it or it might relocate it, depending on how you've configured
> things. In this case, the VIP service has the concept of "primary",
> though it's better to think of it as "Active".
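> 
> To make the "preferred node" idea concrete, a minimal location
> constraint in crm shell (node name and score are invented) might be:
> 
>     # Prefer node1 for the VIP; a positive score is a preference,
>     # not a requirement, so the VIP can still run elsewhere.
>     location loc_vip_prefers_node1 p_vip 100: node1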
> 
> Make sense?
> 
> On 12/02/2012 01:35 PM, Hermes Flying wrote:
>> Hi,
>> So you are saying I should not use the notion of "primary" ok.
>> When I have 3 nodes, won't 1 node have the VIP? How is this node defined
>> in Pacemaker's terminology if "primary" is inappropriate?
>>
>> Best Regards
>>
>> ------------------------------------------------------------------------
>> *From:* Digimer <li...@alteeve.ca>
>> *To:* Hermes Flying <flyingher...@yahoo.com>; General Linux-HA mailing
>> list <linux-ha@lists.linux-ha.org>
>> *Sent:* Sunday, December 2, 2012 8:22 PM
>> *Subject:* Re: [Linux-HA] Corosync on cluster with 3+ nodes
>>
>> On 12/02/2012 02:56 AM, Hermes Flying wrote:
>>> Hi,
>>> For a two-node cluster, it was explained to me what would happen: the
>>> other node will take over using fencing.
>>
>> It will take over *after* fencing. Two separate concepts.
>>
>> Fencing ensures that a lost node is truly gone and not just partitioned.
>> Once fencing succeeds and the lost node is known to be down, _then_
>> recovery of service(s) that had been running on the victim will begin.
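>>
>> As a sketch, fencing means enabling stonith and defining a fence
>> device for each node; the agent and parameters below are invented
>> placeholders for whatever fence hardware you actually have:
>>
>>     property stonith-enabled=true
>>     # IPMI-based fence device that can power off node1.
>>     primitive fence_node1 stonith:fence_ipmilan \
>>         params ipaddr=10.0.0.1 login=admin passwd=secret \
>>             pcmk_host_list=node1 \
>>         op monitor interval=60s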
>>
>>> But in clusters with 3+ nodes, what happens when corosync fails? I
>>> assume that if communication with the primary fails, all other nodes
>>> consider themselves eligible to become primaries. Is this the case?
>>
>> Corosync failing will be treated as a failure of the node, and the node
>> will be removed and fenced. Any services that had been running on it may
>> or may not be recovered, depending on the rules defined for each given
>> service. If a service is recovered, then where it is restarted again
>> depends on how that service was configured.
>>
>>> 1) If a node has a problem communicating with the primary AND has a
>>> network problem with the rest of the network (clients), does it still
>>> try to become the primary (and try to kill the other nodes)?
>>
>> Please drop the idea of pacemaker being "primary"; that's the wrong way
>> to look at it.
>>
>> If pacemaker (via corosync) loses contact with its peer(s), then it
>> checks the quorum policy. If quorum is enabled, it checks to see if it
>> still has quorum. If it does, it will try to fence its peer. If it
>> doesn't, it will shut down any services it might have been running.
>> Likely in this case, one of the nodes with quorum will fence it shortly.
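>>
>> In crm shell terms, that quorum behaviour is a cluster property;
>> "stop" below is just one possible choice:
>>
>>     # Without quorum, stop all resources in this partition
>>     # (other values include ignore, freeze and suicide).
>>     property no-quorum-policy=stop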
>>
>>> 2) In practice, if corosync fails but the primary is still up,
>>> running and serving requests, do the other nodes attempt to "kill" the
>>> primary? Or do you use some other way to figure out that this is a
>>> network failure and the primary has not crashed?
>>
>> Again, drop the notion of "primary". Whether a node tries to fence its
>> peer is a question of whether it has quorum (or whether quorum is
>> disabled). Failing corosync is the same as failing the whole node.
>> Pacemaker will fail if corosync dies.
>>
>>> 3) Finally, on corosync failure I assume the primary does nothing, as
>>> it does not care about the backups. Is this correct?
>>
>> This question doesn't make sense.
>>
>>> Thank you!
>>
>> np
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>>
>>
> 
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
