On 25/04/2013, at 4:21 AM, Digimer <li...@alteeve.ca> wrote:

> Hi,
>
> The way I deal with avoiding a dual fence is to put a delay into one of
> the nodes. For example, I can specify that if Node 1 is to be fenced,
> Node 2 will pause for X seconds (usually 15 in my setups). This way, if
> both nodes try to fence the other at the same time, Node 1 will have
> killed Node 2 long before Node 2's 15-second timer expires. However, if
> Node 1 really was dead, Node 2 would still fence 1 and then recover,
> albeit with a 15-second delay in recovery. Simple and effective. :)
>
> I'm not sure if there is a specific RHEL 6.4 + Pacemaker tutorial up
> yet, but keep an eye on clusterlabs. I *think* Andrew is working on
> that.
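For illustration, here is what Digimer's delay trick can look like in
cluster.conf form. This is a minimal sketch, not his actual
configuration: the node and device names are invented, but delay is a
real parameter of fence agents such as fence_ipmilan:

    <clusternodes>
      <clusternode name="node1" nodeid="1">
        <fence>
          <method name="ipmi">
            <!-- fencing node1 is delayed 15s, so node1 wins if both
                 nodes try to fence each other at once -->
            <device name="ipmi_node1" action="reboot" delay="15"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node2" nodeid="2">
        <fence>
          <method name="ipmi">
            <!-- no delay: node2 is fenced immediately and loses a tie -->
            <device name="ipmi_node2" action="reboot"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>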
There is a rhel-6 quickstart, and the "Clusters from Scratch" document
that includes cman is also applicable.

> If not, I plan to go back to working on my tutorial when I return to
> the office in May. However, that will still be *many* months before
> it's done.
>
> digimer
>
> On 04/24/2013 01:54 PM, Michael Richmond wrote:
>> Hi Digimer,
>> Thanks for your detailed comments.
>>
>> What you have described with regard to fencing is common practice for
>> two-node clusters; I have implemented it in a few proprietary cluster
>> implementations that I have worked on. However, fencing does not
>> completely solve the split-brain problem in two-node clusters. There
>> is still the potential for both NodeA and NodeB to decide to fence at
>> the same time. In this case, each node performs the fencing operation
>> to fence the other node, with the result that both nodes get fenced.
>>
>> To avoid this, most clustering systems can be optionally configured
>> with a shared resource (usually a shared LUN) that is used to weight
>> the decision about which node gets fenced. Additionally, the shared
>> LUN can be used as a coarse communication mechanism to aid the
>> election of a winning node. As I'm sure you are aware, a quorum disk
>> is typically used to determine which partition has access to the
>> larger/more important portion of the cluster resources, and thus
>> which nodes must be fenced because they are in a separate network
>> partition.
>>
>> Since you mention that qdiskd has an uncertain future, it would
>> appear that the pacemaker-based stack has a potential functionality
>> gap with regard to two-node clusters. That is, unless some other
>> approach is taken to resolve network partitions.
>>
>> From what I understand, the CIB is at risk of an unintended roll-back
>> of a write in the case where a two-node cluster has nodes up at
>> differing times. For example, assuming time passes as follows:
>>
>>   Time 0   Node A up                     Node B up             (CIB contains "CIB0")
>>   Time 1   Node A up                     Node B down
>>   Time 2   Node A writes update to CIB   Node B booting
>>            (CIB contains "CIB1")         (not joined cluster)
>>   Time 3   Node A down                   Node B up             (CIB contains "CIB0")
>>
>> After Time 3, Node B is operating with a CIB that contains "CIB0" and
>> has no way of seeing the CIB contents "CIB1" written by Node A. In
>> effect, the write by Node A was rolled back when Node A went down.
>>
>> Thanks again for your input.
>>
>> Is there any description available about how to configure the
>> pacemaker/corosync stack on RHEL 6.4?
>>
>> Regards,
>> Michael Richmond
>>
>> michael richmond | principal software engineer | flashsoft, sandisk |
>> +1.408.425.6731
>>
>> On 23/4/13 6:07 PM, "Digimer" <li...@alteeve.ca> wrote:
>>
>>> First up, before I begin, I am looking to pacemaker for the future
>>> as well and do not yet use it. So please take whatever I say about
>>> pacemaker with a grain of salt. Andrew, on the other hand, is the
>>> author, and anything he says can be taken as authoritative on the
>>> topic.
>>>
>>> On the future;
>>>
>>> I also have a 2-node project/product that I am working to update in
>>> time for the release of RHEL 7. Speaking entirely for myself, I can
>>> tell you that I am planning to use Pacemaker from RHEL 7.0. As a Red
>>> Hat outsider, I can only speak as a member of the community, but I
>>> have every reason to believe that the pacemaker resource manager
>>> will be the one used from 7.0 and forward.
>>>
>>> As for the CIB, yes, it's a local XML file stored on each node.
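(A side note on that file: the CIB's root element carries version
counters, and when nodes rejoin, the copy with the highest version
wins. A quick, illustrative check on any node; the sample output is
hypothetical and the exact attributes vary by Pacemaker version:

    # show the CIB root element and its version counters
    cibadmin --query | head -1
    # e.g. <cib admin_epoch="0" epoch="42" num_updates="7" ...>

This versioning handles most stale-copy cases, though not Michael's
scenario above, where the only node holding the newest copy is down.)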
>>> Synchronization occurs via updates pushed over corosync to nodes
>>> active in the cluster. As I understand it, when a node that had been
>>> offline connects to the cluster, it receives any updates to the CIB.
>>>
>>> Dealing with 2-node clusters, and setting aside qdisk, which I
>>> believe has an uncertain future, you cannot use quorum. For this
>>> reason, it is possible for a node to boot up, fail to reach its peer
>>> and think it's the only one running. It will start your HA services
>>> and, voila, two nodes are offering the same services at the same
>>> time in an uncoordinated manner. This is bad, and it is called a
>>> "split-brain".
>>>
>>> The way to avoid split-brains in 2-node clusters is to use fence
>>> devices, aka stonith devices (the exact same thing by two different
>>> names). This is _always_ wise to use, but in 2-node clusters, it is
>>> critical.
>>>
>>> So imagine back to your scenario:
>>>
>>> If a node came up and tried to connect to its peer but failed to do
>>> so, then before proceeding, it would fence (usually forcibly power
>>> off) the other node. Only after doing so would it start the HA
>>> services. In this way, both nodes can never be offering the same HA
>>> service at the same time.
>>>
>>> The risk here, though, is a "fence loop". If you set the cluster to
>>> start on boot and there is a break in the connection, you can have
>>> an initial state where, upon the break in the network, both try to
>>> fence the other. The faster node wins, forcing the other node off
>>> and resuming operation on its own. This is fine and exactly what you
>>> want. However, now the fenced node powers back up, starts its
>>> cluster stack, fails to reach its peer and fences it. It finishes
>>> starting, offers the HA services and goes on its way ... until the
>>> other node boots back up. :)
>>>
>>> Personally, I avoid this by _not_ starting the cluster stack on
>>> boot. My reasoning is that, if a node fails and gets rebooted, I
>>> want to check it over myself before I let it back into the cluster
>>> (I get alert emails when something like this happens). It's not a
>>> risk from an HA perspective, because its services would have
>>> recovered on the surviving peer long before it reboots anyway. This
>>> also has the added benefit of avoiding a fence loop, no matter what
>>> happens.
>>>
>>> Cheers
>>>
>>> digimer
>>>
>>> On 04/23/2013 02:07 PM, Michael Richmond wrote:
>>>> Andrew and Digimer,
>>>> Thank you for taking the time to respond; you have corroborated
>>>> some of what I've been putting together as the likely direction.
>>>>
>>>> I am working on adapting some cluster-aware storage features for
>>>> use in a Linux cluster environment. With this kind of project it is
>>>> useful to try and predict where the Linux community is heading so
>>>> that I can focus my development work on what will be the "current"
>>>> cluster stack around my anticipated release dates. Any predictions
>>>> are simply educated guesses that may prove to be wrong, but they
>>>> are useful for developing plans. From my reading of various web
>>>> pages and piecing things together, I found that RHEL 7 is intended
>>>> to be based on Fedora 18, so I assume that the new Pacemaker stack
>>>> has a good chance of being rolled out in RHEL 7.1/7.2, or even
>>>> possibly 7.0.
>>>>
>>>> Hearing that there is official word that the intention is for
>>>> Pacemaker to be the official cluster stack helps me put my
>>>> development plans together.
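As an aside, on a RHEL 6 cman/rgmanager stack the two safeguards
Digimer describes above map to something like the following sketch;
two_node is a real cman option, the rest assumes the stock init
scripts:

    # In /etc/cluster/cluster.conf: two-node mode, so either node can
    # run alone without quorum (this is what makes fencing critical):
    #   <cman two_node="1" expected_votes="1"/>

    # Do not join the cluster automatically after a reboot, so a fenced
    # node cannot come back up and immediately fence its healthy peer:
    chkconfig cman off
    chkconfig rgmanager off

    # After checking the rebooted node over, rejoin it by hand:
    service cman start
    service rgmanager start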
>>>>
>>>> The project I am working on is focused on two-node clusters, but I
>>>> also need a persistent, cluster-wide data store to hold a small
>>>> amount of state (less than 1KB). This data store is what I refer to
>>>> as a cluster registry. The state data records the last-known
>>>> operational state for the storage feature. This last-known state
>>>> helps drive recovery operations for the storage feature during node
>>>> bring-up. This project is specifically aimed at integrating generic
>>>> functionality into the Linux cluster stack.
>>>>
>>>> I have been thinking about using the cluster configuration file for
>>>> this storage, which I assume is the CIB referenced by Andrew. But I
>>>> can imagine cases where the CIB file may lose updates if it does
>>>> not utilize shared storage media. My understanding is that the CIB
>>>> file is stored on each node using local disk storage.
>>>>
>>>> For example, consider a two-node cluster that is configured with a
>>>> quorum disk on shared storage media. Suppose that at a given point
>>>> in time NodeA is up and NodeB is down. NodeA can become quorate and
>>>> start cluster services (including HA applications). Assume that
>>>> NodeA updates the CIB to record some state update. If NodeB starts
>>>> booting, but NodeA crashes before NodeB joins the cluster, then at
>>>> this point the updated CIB only resides on NodeA and cannot be
>>>> accessed by NodeB, even if NodeB can access the quorum disk and
>>>> become quorate. Effectively, NodeB cannot be aware of the update
>>>> from NodeA, which results in an implicit roll-back of any updates
>>>> performed by NodeA.
>>>>
>>>> With a two-node cluster, there are two options for resolving this:
>>>> * prevent any update to the cluster registry/CIB unless all nodes
>>>>   are part of the cluster (this is not practical, since it
>>>>   undermines some of the reasons for building clusters); or
>>>> * store the cluster registry on shared storage so that there is one
>>>>   source of truth.
>>>>
>>>> It is possible that the nature of the data stored in the CIB is
>>>> resilient to the example scenario that I describe, in which case
>>>> maybe the CIB is not an appropriate data store for my cluster
>>>> registry data. If so, I am either looking for an appropriate Linux
>>>> component to use for my cluster registry, or I will build a custom
>>>> data store that provides atomic update semantics on shared storage.
>>>>
>>>> Any thoughts and/or pointers would be appreciated.
>>>>
>>>> Thanks,
>>>> Michael Richmond
>>>>
>>>> --
>>>> michael richmond | principal software engineer | flashsoft, sandisk |
>>>> +1.408.425.6731
>>>>
>>>> On 22/4/13 4:37 PM, "Andrew Beekhof" <and...@beekhof.net> wrote:
>>>>
>>>>>
>>>>> On 23/04/2013, at 4:59 AM, Digimer <li...@alteeve.ca> wrote:
>>>>>
>>>>>> On 04/22/2013 02:36 PM, Michael Richmond wrote:
>>>>>>> Hello,
>>>>>>> I am researching the new cluster stack that is scheduled to be
>>>>>>> delivered in Fedora 19. Does anyone on this list have a sense
>>>>>>> for the timeframe for this new stack to be rolled into a RHEL
>>>>>>> release? (I assume the earliest would be RHEL 7.)
>>>>>>>
>>>>>>> On the Windows platform, Microsoft Cluster Services provides a
>>>>>>> cluster-wide registry service that is basically a cluster-wide
>>>>>>> key:value store with atomic updates and support to store the
>>>>>>> registry on shared disk. The storage on shared disk allows
>>>>>>> access and use of the registry in cases where nodes are
>>>>>>> frequently joining and leaving the cluster.
>>>>>>>
>>>>>>> Are there any component(s) that can be used to provide a similar
>>>>>>> registry in the Linux cluster stack? (The current RHEL 6 stack,
>>>>>>> and/or the new Fedora 19 stack.)
>>>>>>>
>>>>>>> Thanks in advance for your information,
>>>>>>> Michael Richmond
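Michael's second option above (one source of truth on shared storage)
is roughly a two-slot, sequence-numbered record on a shared LUN. Below
is a purely hypothetical shell sketch of the idea, not an existing
tool; it assumes a 512-byte atomic sector write, and that fencing
guarantees only one node writes at a time. The device path is made up:

    DEV=/dev/disk/by-id/shared-registry-lun   # hypothetical shared LUN
    SLOT0=0; SLOT1=1                          # sector offsets of the two slots

    read_slot() {   # prints "seq payload" if the slot's checksum is valid
        rec=$(dd if=$DEV bs=512 skip=$1 count=1 2>/dev/null | tr -d '\0')
        sum=${rec%% *}; rest=${rec#* }
        [ "$sum" = "$(printf '%s' "$rest" | cksum | cut -d' ' -f1)" ] \
            && printf '%s\n' "$rest"
    }

    write_registry() {   # $1 = new payload (well under one sector)
        s0=$(read_slot $SLOT0); s1=$(read_slot $SLOT1)
        seq0=${s0%% *}; seq1=${s1%% *}
        # bump the higher valid sequence number and write to the *other*
        # slot, so a torn write can never destroy the last good record
        if [ "${seq0:-0}" -ge "${seq1:-0}" ]; then
            seq=$(( ${seq0:-0} + 1 )); slot=$SLOT1
        else
            seq=$(( ${seq1:-0} + 1 )); slot=$SLOT0
        fi
        rec="$seq $1"
        printf '%s %s' "$(printf '%s' "$rec" | cksum | cut -d' ' -f1)" "$rec" \
            | dd of=$DEV bs=512 seek=$slot count=1 conv=sync \
                 oflag=direct,dsync 2>/dev/null
    }

    # a reader simply takes whichever slot holds the higher valid
    # sequence number:
    #   read_slot $SLOT0; read_slot $SLOT1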
>>>>>>
>>>>>> Hi Michael,
>>>>>>
>>>>>> First up, Red Hat's policy on what is coming is "we'll announce
>>>>>> on release day". So anything else is a guess. As it is, Pacemaker
>>>>>> is in tech-preview in RHEL 6, and the best guess is that it will
>>>>>> be the official resource manager in RHEL 7, but it's just that, a
>>>>>> guess.
>>>>>
>>>>> I believe we're officially allowed to say that it is our
>>>>> _intention_ that Pacemaker will be the one and only supported
>>>>> stack in RHEL 7.
>>>>>
>>>>>> As for the registry question; I am not entirely sure what it is
>>>>>> you are asking here (sorry, not familiar with Windows). I can say
>>>>>> that Pacemaker uses something called the CIB (cluster information
>>>>>> base), which is an XML file containing the cluster's
>>>>>> configuration and state. It can be updated from any node and the
>>>>>> changes will be pushed to the other nodes immediately.
>>>>>
>>>>> How many of these attributes are you planning to have?
>>>>> You can throw a few in there, but I'd not use it for 100s or 1000s
>>>>> of them - it's mainly designed to store the resource/service
>>>>> configuration.
>>>>>
>>>>>> Does this answer your question?
>>>>>>
>>>>>> The current RHEL 6 cluster stack is corosync + cman + rgmanager.
>>>>>> It also uses an XML config, and it can be updated from any node
>>>>>> and pushed out to the other nodes.
>>>>>>
>>>>>> Perhaps a better way to help would be to ask what, exactly, you
>>>>>> want to build your cluster for?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> --
>>>>>> Digimer
>>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>>> without access to education?
>>>>
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.ca/w/
>>> What if the cure for cancer is trapped in the mind of a person
>>> without access to education?
>>
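For Michael's sub-1KB registry, Andrew's suggestion above (a few
attributes in the CIB) can be tried with the stock crm_attribute tool;
the attribute name and value below are invented for illustration:

    # store a small piece of cluster-wide state in the CIB
    crm_attribute --type crm_config --name flashsoft-state --update "clean-shutdown"

    # read it back from any node
    crm_attribute --type crm_config --name flashsoft-state --query

    # remove it
    crm_attribute --type crm_config --name flashsoft-state --delete

Note that this inherits the CIB's replication semantics, so the
roll-back scenario discussed earlier in the thread still applies.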
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster