On 25/04/2013, at 4:21 AM, Digimer <li...@alteeve.ca> wrote:

> Hi,
>
> The way I deal with avoiding a dual fence is to put a delay into one of
> the nodes. For example, I can specify that if Node 1 is to be fenced,
> Node 2 will pause for X seconds (usually 15 in my setups). This way, if
> both nodes try to fence the other at the same time, Node 1 will have
> killed Node 2 long before Node 2's 15-second timer expires. However, if
> Node 1 really was dead, Node 2 would still fence 1 and then recover,
> albeit with a 15-second delay in recovery. Simple and effective. :)
>
> I'm not sure if there is a specific RHEL 6.4 + Pacemaker tutorial up
> yet, but keep an eye on clusterlabs. I *think* Andrew is working on
> that.
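For illustration, here is what Digimer's delay trick can look like in
cluster.conf form. This is a minimal sketch, not his actual
configuration: the node and device names are invented, but delay is a
real parameter of fence agents such as fence_ipmilan:

    <clusternodes>
      <clusternode name="node1" nodeid="1">
        <fence>
          <method name="ipmi">
            <!-- fencing node1 is delayed 15s, so node1 wins if both
                 nodes try to fence each other at once -->
            <device name="ipmi_node1" action="reboot" delay="15"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node2" nodeid="2">
        <fence>
          <method name="ipmi">
            <!-- no delay: node2 is fenced immediately and loses a tie -->
            <device name="ipmi_node2" action="reboot"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>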
There is a rhel-6 quickstart, and the "Clusters from Scratch" document
that includes cman is also applicable.

> If not, I plan to go back to working on my tutorial when I return to
> the office in May. However, that will still be *many* months before
> it's done.
>
> digimer
>
> On 04/24/2013 01:54 PM, Michael Richmond wrote:
>> Hi Digimer,
>> Thanks for your detailed comments.
>>
>> What you have described with regard to fencing is common practice for
>> two-node clusters; I have implemented it in a few proprietary cluster
>> implementations that I have worked on. However, fencing does not
>> completely solve the split-brain problem in two-node clusters. There
>> is still the potential for both NodeA and NodeB to decide to fence at
>> the same time. In this case, each node performs the fencing operation
>> to fence the other node, with the result that both nodes get fenced.
>>
>> To avoid this, most clustering systems can be optionally configured
>> with a shared resource (usually a shared LUN) that is used to weight
>> the decision about which node gets fenced. Additionally, the shared
>> LUN can be used as a coarse communication mechanism to aid the
>> election of a winning node. As I'm sure you are aware, a quorum disk
>> is typically used to determine which partition has access to the
>> larger/more important portion of the cluster resources, and thus
>> which nodes must be fenced because they are in a separate network
>> partition.
>>
>> Since you mention that qdiskd has an uncertain future, it would
>> appear that the pacemaker-based stack has a potential functionality
>> gap with regard to two-node clusters. That is, unless some other
>> approach is taken to resolve network partitions.
>>
>> From what I understand, the CIB is at risk of an unintended roll-back
>> of a write in the case where a two-node cluster has nodes up at
>> differing times. For example, assuming time passes as follows:
>>
>>   Time 0   Node A up                     Node B up             (CIB contains "CIB0")
>>   Time 1   Node A up                     Node B down
>>   Time 2   Node A writes update to CIB   Node B booting
>>            (CIB contains "CIB1")         (not joined cluster)
>>   Time 3   Node A down                   Node B up             (CIB contains "CIB0")
>>
>> After Time 3, Node B is operating with a CIB that contains "CIB0" and
>> has no way of seeing the CIB contents "CIB1" written by Node A. In
>> effect, the write by Node A was rolled back when Node A went down.
>>
>> Thanks again for your input.
>>
>> Is there any description available about how to configure the
>> pacemaker/corosync stack on RHEL 6.4?
>>
>> Regards,
>> Michael Richmond
>>
>> michael richmond | principal software engineer | flashsoft, sandisk |
>> +1.408.425.6731
>>
>> On 23/4/13 6:07 PM, "Digimer" <li...@alteeve.ca> wrote:
>>
>>> First up, before I begin, I am looking to pacemaker for the future
>>> as well and do not yet use it. So please take whatever I say about
>>> pacemaker with a grain of salt. Andrew, on the other hand, is the
>>> author, and anything he says can be taken as authoritative on the
>>> topic.
>>>
>>> On the future;
>>>
>>> I also have a 2-node project/product that I am working to update in
>>> time for the release of RHEL 7. Speaking entirely for myself, I can
>>> tell you that I am planning to use Pacemaker from RHEL 7.0. As a Red
>>> Hat outsider, I can only speak as a member of the community, but I
>>> have every reason to believe that the pacemaker resource manager
>>> will be the one used from 7.0 and forward.
>>>
>>> As for the CIB, yes, it's a local XML file stored on each node.
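(A side note on that file: the CIB's root element carries version
counters, and when nodes rejoin, the copy with the highest version
wins. A quick, illustrative check on any node; the sample output is
hypothetical and the exact attributes vary by Pacemaker version:

    # show the CIB root element and its version counters
    cibadmin --query | head -1
    # e.g. <cib admin_epoch="0" epoch="42" num_updates="7" ...>

This versioning handles most stale-copy cases, though not Michael's
scenario above, where the only node holding the newest copy is down.)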
>>> Synchronization occurs via updates pushed over corosync to nodes
>>> active in the cluster. As I understand it, when a node that had been
>>> offline connects to the cluster, it receives any updates to the CIB.
>>>
>>> Dealing with 2-node clusters, and setting aside qdisk, which I
>>> believe has an uncertain future, you cannot use quorum. For this
>>> reason, it is possible for a node to boot up, fail to reach its peer
>>> and think it's the only one running. It will start your HA services
>>> and, voila, two nodes are offering the same services at the same
>>> time in an uncoordinated manner. This is bad, and it is called a
>>> "split-brain".
>>>
>>> The way to avoid split-brains in 2-node clusters is to use fence
>>> devices, aka stonith devices (the exact same thing by two different
>>> names). This is _always_ wise to use, but in 2-node clusters, it is
>>> critical.
>>>
>>> So imagine back to your scenario:
>>>
>>> If a node came up and tried to connect to its peer but failed to do
>>> so, then before proceeding, it would fence (usually forcibly power
>>> off) the other node. Only after doing so would it start the HA
>>> services. In this way, both nodes can never be offering the same HA
>>> service at the same time.
>>>
>>> The risk here, though, is a "fence loop". If you set the cluster to
>>> start on boot and there is a break in the connection, you can have
>>> an initial state where, upon the break in the network, both try to
>>> fence the other. The faster node wins, forcing the other node off
>>> and resuming operation on its own. This is fine and exactly what you
>>> want. However, now the fenced node powers back up, starts its
>>> cluster stack, fails to reach its peer and fences it. It finishes
>>> starting, offers the HA services and goes on its way ... until the
>>> other node boots back up. :)
>>>
>>> Personally, I avoid this by _not_ starting the cluster stack on
>>> boot. My reasoning is that, if a node fails and gets rebooted, I
>>> want to check it over myself before I let it back into the cluster
>>> (I get alert emails when something like this happens). It's not a
>>> risk from an HA perspective, because its services would have
>>> recovered on the surviving peer long before it reboots anyway. This
>>> also has the added benefit of avoiding a fence loop, no matter what
>>> happens.
>>>
>>> Cheers
>>>
>>> digimer
>>>
>>> On 04/23/2013 02:07 PM, Michael Richmond wrote:
>>>> Andrew and Digimer,
>>>> Thank you for taking the time to respond; you have corroborated
>>>> some of what I've been putting together as the likely direction.
>>>>
>>>> I am working on adapting some cluster-aware storage features for
>>>> use in a Linux cluster environment. With this kind of project it is
>>>> useful to try and predict where the Linux community is heading so
>>>> that I can focus my development work on what will be the "current"
>>>> cluster stack around my anticipated release dates. Any predictions
>>>> are simply educated guesses that may prove to be wrong, but they
>>>> are useful for developing plans. From my reading of various web
>>>> pages and piecing things together, I found that RHEL 7 is intended
>>>> to be based on Fedora 18, so I assume that the new Pacemaker stack
>>>> has a good chance of being rolled out in RHEL 7.1/7.2, or even
>>>> possibly 7.0.
>>>>
>>>> Hearing that there is official word that the intention is for
>>>> Pacemaker to be the official cluster stack helps me put my
>>>> development plans together.
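As an aside, on a RHEL 6 cman/rgmanager stack the two safeguards
Digimer describes above map to something like the following sketch;
two_node is a real cman option, the rest assumes the stock init
scripts:

    # In /etc/cluster/cluster.conf: two-node mode, so either node can
    # run alone without quorum (this is what makes fencing critical):
    #   <cman two_node="1" expected_votes="1"/>

    # Do not join the cluster automatically after a reboot, so a fenced
    # node cannot come back up and immediately fence its healthy peer:
    chkconfig cman off
    chkconfig rgmanager off

    # After checking the rebooted node over, rejoin it by hand:
    service cman start
    service rgmanager start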
>>>>
>>>> The project I am working on is focused on two-node clusters, but I
>>>> also need a persistent, cluster-wide data store to hold a small
>>>> amount of state (less than 1KB). This data store is what I refer to
>>>> as a cluster registry. The state data records the last-known
>>>> operational state for the storage feature. This last-known state
>>>> helps drive recovery operations for the storage feature during node
>>>> bring-up. This project is specifically aimed at integrating generic
>>>> functionality into the Linux cluster stack.
>>>>
>>>> I have been thinking about using the cluster configuration file for
>>>> this storage, which I assume is the CIB referenced by Andrew. But I
>>>> can imagine cases where the CIB file may lose updates if it does
>>>> not utilize shared storage media. My understanding is that the CIB
>>>> file is stored on each node using local disk storage.
>>>>
>>>> For example, consider a two-node cluster that is configured with a
>>>> quorum disk on shared storage media. Suppose that at a given point
>>>> in time NodeA is up and NodeB is down. NodeA can become quorate and
>>>> start cluster services (including HA applications). Assume that
>>>> NodeA updates the CIB to record some state update. If NodeB starts
>>>> booting, but NodeA crashes before NodeB joins the cluster, then at
>>>> this point the updated CIB only resides on NodeA and cannot be
>>>> accessed by NodeB, even if NodeB can access the quorum disk and
>>>> become quorate. Effectively, NodeB cannot be aware of the update
>>>> from NodeA, which results in an implicit roll-back of any updates
>>>> performed by NodeA.
>>>>
>>>> With a two-node cluster, there are two options for resolving this:
>>>> * prevent any update to the cluster registry/CIB unless all nodes
>>>>   are part of the cluster (this is not practical, since it
>>>>   undermines some of the reasons for building clusters); or
>>>> * store the cluster registry on shared storage so that there is one
>>>>   source of truth.
>>>>
>>>> It is possible that the nature of the data stored in the CIB is
>>>> resilient to the example scenario that I describe, in which case
>>>> maybe the CIB is not an appropriate data store for my cluster
>>>> registry data. If so, I am either looking for an appropriate Linux
>>>> component to use for my cluster registry, or I will build a custom
>>>> data store that provides atomic update semantics on shared storage.
>>>>
>>>> Any thoughts and/or pointers would be appreciated.
>>>>
>>>> Thanks,
>>>> Michael Richmond
>>>>
>>>> --
>>>> michael richmond | principal software engineer | flashsoft, sandisk |
>>>> +1.408.425.6731
>>>>
>>>> On 22/4/13 4:37 PM, "Andrew Beekhof" <and...@beekhof.net> wrote:
>>>>
>>>>>
>>>>> On 23/04/2013, at 4:59 AM, Digimer <li...@alteeve.ca> wrote:
>>>>>
>>>>>> On 04/22/2013 02:36 PM, Michael Richmond wrote:
>>>>>>> Hello,
>>>>>>> I am researching the new cluster stack that is scheduled to be
>>>>>>> delivered in Fedora 19. Does anyone on this list have a sense
>>>>>>> for the timeframe for this new stack to be rolled into a RHEL
>>>>>>> release? (I assume the earliest would be RHEL 7.)
>>>>>>>
>>>>>>> On the Windows platform, Microsoft Cluster Services provides a
>>>>>>> cluster-wide registry service that is basically a cluster-wide
>>>>>>> key:value store with atomic updates and support to store the
>>>>>>> registry on shared disk. The storage on shared disk allows
>>>>>>> access and use of the registry in cases where nodes are
>>>>>>> frequently joining and leaving the cluster.
>>>>>>>
>>>>>>> Are there any component(s) that can be used to provide a similar
>>>>>>> registry in the Linux cluster stack? (The current RHEL 6 stack,
>>>>>>> and/or the new Fedora 19 stack.)
>>>>>>>
>>>>>>> Thanks in advance for your information,
>>>>>>> Michael Richmond
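Michael's second option above (one source of truth on shared storage)
is roughly a two-slot, sequence-numbered record on a shared LUN. Below
is a purely hypothetical shell sketch of the idea, not an existing
tool; it assumes a 512-byte atomic sector write, and that fencing
guarantees only one node writes at a time. The device path is made up:

    DEV=/dev/disk/by-id/shared-registry-lun   # hypothetical shared LUN
    SLOT0=0; SLOT1=1                          # sector offsets of the two slots

    read_slot() {   # prints "seq payload" if the slot's checksum is valid
        rec=$(dd if=$DEV bs=512 skip=$1 count=1 2>/dev/null | tr -d '\0')
        sum=${rec%% *}; rest=${rec#* }
        [ "$sum" = "$(printf '%s' "$rest" | cksum | cut -d' ' -f1)" ] \
            && printf '%s\n' "$rest"
    }

    write_registry() {   # $1 = new payload (well under one sector)
        s0=$(read_slot $SLOT0); s1=$(read_slot $SLOT1)
        seq0=${s0%% *}; seq1=${s1%% *}
        # bump the higher valid sequence number and write to the *other*
        # slot, so a torn write can never destroy the last good record
        if [ "${seq0:-0}" -ge "${seq1:-0}" ]; then
            seq=$(( ${seq0:-0} + 1 )); slot=$SLOT1
        else
            seq=$(( ${seq1:-0} + 1 )); slot=$SLOT0
        fi
        rec="$seq $1"
        printf '%s %s' "$(printf '%s' "$rec" | cksum | cut -d' ' -f1)" "$rec" \
            | dd of=$DEV bs=512 seek=$slot count=1 conv=sync \
                 oflag=direct,dsync 2>/dev/null
    }

    # a reader simply takes whichever slot holds the higher valid
    # sequence number:
    #   read_slot $SLOT0; read_slot $SLOT1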
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> First up, Red Hat's policy on what is coming is "we'll announce
>>>>>> on release day". So anything else is a guess. As it is, Pacemaker
>>>>>> is in tech-preview in RHEL 6, and the best guess is that it will
>>>>>> be the official resource manager in RHEL 7, but it's just that, a
>>>>>> guess.
>>>>>
>>>>> I believe we're officially allowed to say that it is our
>>>>> _intention_ that Pacemaker will be the one and only supported
>>>>> stack in RHEL 7.
>>>>>
>>>>>> As for the registry question; I am not entirely sure what it is
>>>>>> you are asking here (sorry, not familiar with Windows). I can say
>>>>>> that Pacemaker uses something called the CIB (cluster information
>>>>>> base), which is an XML file containing the cluster's
>>>>>> configuration and state. It can be updated from any node and the
>>>>>> changes will be pushed to the other nodes immediately.
>>>>>
>>>>> How many of these attributes are you planning to have?
>>>>> You can throw a few in there, but I'd not use it for 100s or 1000s
>>>>> of them - it's mainly designed to store the resource/service
>>>>> configuration.
>>>>>
>>>>>> Does this answer your question?
>>>>>>
>>>>>> The current RHEL 6 cluster stack is corosync + cman + rgmanager.
>>>>>> It also uses an XML config, and it can be updated from any node
>>>>>> and pushed out to the other nodes.
>>>>>>
>>>>>> Perhaps a better way to help would be to ask what, exactly, you
>>>>>> want to build your cluster for?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> --
>>>>>> Digimer
>>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>>> without access to education?
>>>>
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.ca/w/
>>> What if the cure for cancer is trapped in the mind of a person
>>> without access to education?
>>
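For Michael's sub-1KB registry, Andrew's suggestion above (a few
attributes in the CIB) can be tried with the stock crm_attribute tool;
the attribute name and value below are invented for illustration:

    # store a small piece of cluster-wide state in the CIB
    crm_attribute --type crm_config --name flashsoft-state --update "clean-shutdown"

    # read it back from any node
    crm_attribute --type crm_config --name flashsoft-state --query

    # remove it
    crm_attribute --type crm_config --name flashsoft-state --delete

Note that this inherits the CIB's replication semantics, so the
roll-back scenario discussed earlier in the thread still applies.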
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster