Genius!
Jamison Maxwell
Sr. Systems Administrator

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Digimer
Sent: Wednesday, April 24, 2013 2:22 PM
To: Michael Richmond
Cc: linux clustering
Subject: Re: [Linux-cluster] Fedora 19 cluster stack and Cluster registry components

Hi,

The way I deal with avoiding dual-fence is to put a delay into one of the
nodes. For example, I can specify that if Node 1 is to be fenced, Node 2 will
pause for X seconds (usually 15 in my setups). This way, if both nodes try to
fence the other at the same time, Node 1 will have killed Node 2 long before
Node 2's 15-second timer expires. However, if Node 1 really was dead, Node 2
would still fence Node 1 and then recover, albeit with a 15-second delay in
recovery. Simple and effective. :)

I'm not sure if there is a specific RHEL 6.4 + Pacemaker tutorial up yet, but
keep an eye on clusterlabs. I *think* Andrew is working on that. If not, I
plan to go back to working on my tutorial when I return to the office in May.
However, that will still be *many* months before it's done.

digimer

On 04/24/2013 01:54 PM, Michael Richmond wrote:
> Hi Digimer,
> Thanks for your detailed comments.
>
> What you have described with regard to fencing is common practice for
> two-node clusters; I have implemented it in a few proprietary cluster
> implementations that I have worked on. However, fencing does not
> completely solve the split-brain problem in two-node clusters. There
> is still the potential for both NodeA and NodeB to decide to fence at
> the same time. In this case, each node performs the fencing operation
> to fence the other node, with the result that both nodes get fenced.
>
> To avoid this, most clustering systems can be optionally configured
> with a shared resource (usually a shared LUN) that is used to weight
> the decision about which node gets fenced. Additionally, the shared
> LUN can be used as a coarse communication mechanism to aid the
> election of a winning node. As I'm sure you are aware, a quorum disk
> is typically used to determine which partition has access to the
> larger/more important portion of the cluster resources, and thus which
> nodes must be fenced because they are in a separate network partition.
>
> Since you mention that qdiskd has an uncertain future, it would appear
> that the pacemaker-based stack has a potential functionality gap with
> regard to two-node clusters. That is, unless some other approach is
> taken to resolve network partitions.
>
> From what I understand, the CIB is at risk of an unintended roll-back
> of a write in the case where a two-node cluster has nodes up at
> differing times. For example, assuming time progresses as follows:
>
> Time 0:  Node A up                       Node B up                            (CIB contains "CIB0")
> Time 1:  Node A up                       Node B down
> Time 2:  Node A writes update to CIB     Node B booting (not joined cluster)
>          (CIB contains "CIB1")
> Time 3:  Node A down                     Node B up                            (CIB contains "CIB0")
>
> After Time 3, Node B is operating with a CIB that contains "CIB0" and
> has no way of seeing the CIB contents "CIB1" written by Node A. In
> effect, the write by Node A was rolled back when Node A went down.
>
> Thanks again for your input.
>
> Is there any description available about how to configure the
> pacemaker/corosync stack on RHEL 6.4?
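To make the delay trick Digimer describes at the top of this message concrete,
here is a minimal, purely conceptual Python sketch of the race. It is not a
fence agent; the node names, the shared "alive" table and the timings are all
invented for illustration, and the 1-second delay stands in for the usual 15:

    import threading
    import time

    # Both nodes lose contact at t=0 and decide to fence their peer.
    # node2 carries a configured pre-fence delay; node1 does not.
    FENCE_DELAY = {"node1": 0.0, "node2": 1.0}   # seconds (1.0 stands in for 15)
    alive = {"node1": True, "node2": True}
    lock = threading.Lock()

    def fence_peer(me: str, peer: str) -> None:
        time.sleep(FENCE_DELAY[me])        # wait out our configured delay first
        with lock:
            if not alive[me]:              # we were fenced while we waited
                return
            alive[peer] = False            # "power off" the peer
            print(f"{me} fenced {peer}")

    if __name__ == "__main__":
        t1 = threading.Thread(target=fence_peer, args=("node1", "node2"))
        t2 = threading.Thread(target=fence_peer, args=("node2", "node1"))
        t1.start(); t2.start(); t1.join(); t2.join()
        # Prints only "node1 fenced node2": node2 never fires because its
        # delay timer is still running when it gets killed.

In a real cluster the asymmetric delay would live in the fence/stonith device
configuration rather than in code; the point is simply that a deliberate bias
lets one node reliably win a simultaneous fence race.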
>
> Regards,
> Michael Richmond
>
> michael richmond | principal software engineer | flashsoft, sandisk |
> +1.408.425.6731
>
>
> On 23/4/13 6:07 PM, "Digimer" <[email protected]> wrote:
>
>> First up, before I begin, I am looking to Pacemaker for the future as
>> well and do not yet use it. So please take whatever I say about
>> Pacemaker with a grain of salt. Andrew, on the other hand, is the
>> author, and anything he says can be taken as authoritative on the topic.
>>
>> On the future:
>>
>> I also have a 2-node project/product that I am working to update in
>> time for the release of RHEL 7. Speaking entirely for myself, I can
>> tell you that I am planning to use Pacemaker from RHEL 7.0. As a Red
>> Hat outsider, I can only speak as a member of the community, but I
>> have every reason to believe that the Pacemaker resource manager will
>> be the one used from 7.0 forward.
>>
>> As for the CIB, yes, it's a local XML file stored on each node.
>> Synchronization occurs via updates pushed over corosync to nodes
>> active in the cluster. As I understand it, when a node that had been
>> offline connects to the cluster, it receives any updates to the CIB.
>>
>> Dealing with 2-node clusters, and setting aside qdisk, which I believe
>> has an uncertain future, you cannot use quorum. For this reason, it is
>> possible for a node to boot up, fail to reach its peer and think it is
>> the only one running. It will start your HA services and, voila, you
>> have two nodes offering the same services at the same time in an
>> uncoordinated manner. This is bad, and it is called a "split-brain".
>>
>> The way to avoid split-brains in 2-node clusters is to use fence
>> devices, aka stonith devices (the exact same thing by two different
>> names). This is _always_ wise to use, but in 2-node clusters it is
>> critical.
>>
>> So think back to your scenario:
>>
>> If a node came up and tried to connect to its peer but failed to do
>> so, then before proceeding it would fence (usually forcibly power off)
>> the other node. Only after doing so would it start the HA services.
>> In this way, both nodes can never be offering the same HA service at
>> the same time.
>>
>> The risk here, though, is a "fence loop". If you set the cluster to
>> start on boot and there is a break in the connection, you can have an
>> initial state where, upon the break in the network, both nodes try to
>> fence the other. The faster node wins, forcing the other node off and
>> resuming operation on its own. This is fine and exactly what you want.
>> However, now the fenced node powers back up, starts its cluster stack,
>> fails to reach its peer and fences it. It finishes starting, offers
>> the HA services and goes on its way ... until the other node boots
>> back up. :)
>>
>> Personally, I avoid this by _not_ starting the cluster stack on boot.
>> My reasoning is that, if a node fails and gets rebooted, I want to
>> check it over myself before I let it back into the cluster (I get
>> alert emails when something like this happens). It's not a risk from
>> an HA perspective because its services would have recovered on the
>> surviving peer long before it reboots anyway. This also has the added
>> benefit of avoiding a fence loop, no matter what happens.
>>
>> Cheers
>>
>> digimer
>>
>> On 04/23/2013 02:07 PM, Michael Richmond wrote:
>>> Andrew and Digimer,
>>> Thank you for taking the time to respond; you have corroborated some
>>> of what I've been putting together as the likely direction.
>>>
>>> I am working on adapting some cluster-aware storage features for use
>>> in a Linux cluster environment. With this kind of project it is
>>> useful to try to predict where the Linux community is heading so
>>> that I can focus my development work on what will be the "current"
>>> cluster stack around my anticipated release dates. Any predictions
>>> are simply educated guesses that may prove to be wrong, but they are
>>> useful for developing plans. From my reading of various web pages
>>> and piecing things together, I found that RHEL 7 is intended to be
>>> based on Fedora 18, so I assume that the new Pacemaker stack has a
>>> good chance of being rolled out in RHEL 7.1/7.2, or even possibly 7.0.
>>>
>>> Hearing that there is official word that the intention is for
>>> Pacemaker to be the official cluster stack helps me put my
>>> development plans together.
>>>
>>> The project I am working on is focused on two-node clusters. But I
>>> also need a persistent, cluster-wide data store to hold a small
>>> amount of state (less than 1KB). This data store is what I refer to
>>> as a cluster registry. The state data records the last-known
>>> operational state for the storage feature. This last-known state
>>> helps drive recovery operations for the storage feature during node
>>> bring-up. This project is specifically aimed at integrating generic
>>> functionality into the Linux cluster stack.
>>>
>>> I have been thinking about using the cluster configuration file for
>>> this storage, which I assume is the CIB referenced by Andrew. But I
>>> can imagine cases where the CIB file may lose updates if it does not
>>> utilize shared storage media. My understanding is that the CIB file
>>> is stored on each node using local disk storage.
>>>
>>> For example, consider a two-node cluster that is configured with a
>>> quorum disk on shared storage media, and suppose that at a given
>>> point in time NodeA is up and NodeB is down. NodeA can become
>>> quorate and start cluster services (including HA applications).
>>> Assume that NodeA updates the CIB to record some state update. If
>>> NodeB starts booting but, before NodeB joins the cluster, NodeA
>>> crashes, then at that point the updated CIB only resides on NodeA
>>> and cannot be accessed by NodeB, even if NodeB can access the quorum
>>> disk and become quorate. Effectively, NodeB cannot be aware of the
>>> update from NodeA, which results in an implicit roll-back of any
>>> updates performed by NodeA.
>>>
>>> With a two-node cluster, there are two options for resolving this:
>>> * prevent any update to the cluster registry/CIB unless all nodes
>>> are part of the cluster (this is not practical since it undermines
>>> some of the reasons for building clusters); or
>>> * store the cluster registry on shared storage so that there is one
>>> source of truth.
>>>
>>> It is possible that the nature of the data stored in the CIB is
>>> resilient to the example scenario that I describe. In that case,
>>> maybe the CIB is not an appropriate data store for my cluster
>>> registry data, and I am either looking for an appropriate Linux
>>> component to use for my cluster registry, or I will build a custom
>>> data store that provides atomic update semantics on shared storage.
>>>
>>> Any thoughts and/or pointers would be appreciated.
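On the last option above (a custom data store with atomic update semantics on
shared storage), here is a minimal sketch of one common approach,
write-to-temp-then-rename. It assumes the shared LUN carries a filesystem both
nodes can mount and that something else (fencing, a lock manager) keeps two
nodes from writing at the same moment; the path and field names are only
examples:

    import json
    import os
    import tempfile

    def write_registry(path: str, state: dict) -> None:
        """Atomically replace the small (<1 KB) registry file at `path`.

        Readers see either the old contents or the new contents, never a
        torn write, because rename() atomically replaces the old file.
        """
        dirname = os.path.dirname(path) or "."
        fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".registry.")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())       # data is on disk before the rename
            os.rename(tmp, path)           # atomic replacement of the old file
            dirfd = os.open(dirname, os.O_RDONLY)
            try:
                os.fsync(dirfd)            # persist the new directory entry too
            finally:
                os.close(dirfd)
        except BaseException:
            try:
                os.unlink(tmp)             # clean up the temp file on failure
            except OSError:
                pass
            raise

    def read_registry(path: str) -> dict:
        with open(path) as f:
            return json.load(f)

    # Hypothetical usage: record the last-known operational state.
    # write_registry("/mnt/shared/registry.json", {"state": "clean", "epoch": 7})

This only guarantees that a single writer never leaves a half-written file
behind; deciding which node is allowed to write at any given time is still the
cluster's (fencing/quorum) problem.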
>>>
>>> Thanks,
>>> Michael Richmond
>>>
>>> --
>>> michael richmond | principal software engineer | flashsoft, sandisk |
>>> +1.408.425.6731
>>>
>>>
>>> On 22/4/13 4:37 PM, "Andrew Beekhof" <[email protected]> wrote:
>>>
>>>>
>>>> On 23/04/2013, at 4:59 AM, Digimer <[email protected]> wrote:
>>>>
>>>>> On 04/22/2013 02:36 PM, Michael Richmond wrote:
>>>>>> Hello,
>>>>>> I am researching the new cluster stack that is scheduled to be
>>>>>> delivered in Fedora 19. Does anyone on this list have a sense of
>>>>>> the timeframe for this new stack to be rolled into a RHEL
>>>>>> release? (I assume the earliest would be RHEL 7.)
>>>>>>
>>>>>> On the Windows platform, Microsoft Cluster Services provides a
>>>>>> cluster-wide registry service that is basically a cluster-wide
>>>>>> key:value store with atomic updates and support for storing the
>>>>>> registry on shared disk. The storage on shared disk allows
>>>>>> access and use of the registry in cases where nodes are
>>>>>> frequently joining and leaving the cluster.
>>>>>>
>>>>>> Are there any component(s) that can be used to provide a similar
>>>>>> registry in the Linux cluster stack? (The current RHEL 6 stack,
>>>>>> and/or the new Fedora 19 stack.)
>>>>>>
>>>>>> Thanks in advance for your information,
>>>>>> Michael Richmond
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> First up, Red Hat's policy on what is coming is "we'll announce it
>>>>> on release day". So anything else is a guess. As it is, Pacemaker
>>>>> is in tech preview in RHEL 6, and the best guess is that it will
>>>>> be the official resource manager in RHEL 7, but it's just that, a guess.
>>>>
>>>> I believe we're officially allowed to say that it is our
>>>> _intention_ that Pacemaker will be the one and only supported stack
>>>> in RHEL 7.
>>>>
>>>>>
>>>>> As for the registry question; I am not entirely sure what it is
>>>>> you are asking here (sorry, not familiar with Windows). I can say
>>>>> that Pacemaker uses something called the CIB (cluster information
>>>>> base), which is an XML file containing the cluster's configuration
>>>>> and state. It can be updated from any node and the changes will
>>>>> push to the other nodes immediately.
>>>>
>>>> How many of these attributes are you planning to have?
>>>> You can throw a few in there, but I'd not use it for 100s or
>>>> 1000s of them - it's mainly designed to store the resource/service
>>>> configuration.
>>>>
>>>>
>>>>> Does this answer your question?
>>>>>
>>>>> The current RHEL 6 cluster stack is corosync + cman + rgmanager.
>>>>> It also uses an XML config, which can be updated from any node and
>>>>> pushed out to the other nodes.
>>>>>
>>>>> Perhaps a better way to help would be to ask what, exactly, you
>>>>> want to build your cluster for?
>>>>>
>>>>> Cheers
>>>>>
>>>>> --
>>>>> Digimer
>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>> without access to education?
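Relating this back to Andrew's point above about keeping only a handful of
attributes in the CIB: one way a script can read and write such a value is
Pacemaker's crm_attribute tool. A hedged sketch follows (the property name is
invented, and the exact flags and output behaviour, in particular --quiet
printing only the value, should be verified against the installed Pacemaker
version):

    import subprocess

    def cib_set(name: str, value: str) -> None:
        # Store a small key:value pair as a cluster property in the CIB.
        subprocess.check_call(
            ["crm_attribute", "--type", "crm_config",
             "--name", name, "--update", value])

    def cib_get(name: str) -> str:
        # --quiet is expected to print only the value on stdout.
        return subprocess.check_output(
            ["crm_attribute", "--type", "crm_config",
             "--name", name, "--query", "--quiet"], text=True).strip()

    # Hypothetical property name:
    # cib_set("flashsoft-last-known-state", "clean")
    # print(cib_get("flashsoft-last-known-state"))

A value stored this way still lives in each node's local copy of the CIB and
is synced over corosync, so it does not by itself address the roll-back
scenario Michael describes when the two nodes are never up at the same time.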
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person
>> without access to education?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access
to education?

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
