Genius!
Jamison Maxwell
Sr. Systems Administrator

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Digimer
Sent: Wednesday, April 24, 2013 2:22 PM
To: Michael Richmond
Cc: linux clustering
Subject: Re: [Linux-cluster] Fedora 19 cluster stack and Cluster registry components

Hi,

The way I deal with avoiding dual-fence is to put a delay into one of the
nodes. For example, I can specify that if Node 1 is to be fenced, Node 2 will
pause for X seconds (usually 15 in my setups). This way, if both nodes try to
fence the other at the same time, Node 1 will have killed Node 2 long before
Node 2's 15-second timer expires. However, if Node 1 really was dead, Node 2
would still fence Node 1 and then recover, albeit with a 15-second delay in
recovery. Simple and effective. :)

I'm not sure if there is a specific RHEL 6.4 + Pacemaker tutorial up yet, but
keep an eye on clusterlabs. I *think* Andrew is working on that. If not, I
plan to go back to working on my tutorial when I return to the office in May.
However, that will still be *many* months before it's done.

digimer

On 04/24/2013 01:54 PM, Michael Richmond wrote:
> Hi Digimer,
> Thanks for your detailed comments.
>
> What you have described with regard to fencing is common practice for
> two-node clusters; I have implemented it in a few proprietary cluster
> implementations that I have worked on. However, fencing does not
> completely solve the split-brain problem in two-node clusters. There
> is still the potential for both NodeA and NodeB to decide to fence at
> the same time. In this case, each node performs the fencing operation
> to fence the other node, with the result that both nodes get fenced.
>
> To avoid this, most clustering systems can be optionally configured
> with a shared resource (usually a shared LUN) that is used to weight
> the decision about which node gets fenced. Additionally, the shared
> LUN can be used as a coarse communication mechanism to aid the
> election of a winning node. As I'm sure you are aware, a quorum disk
> is typically used to determine which partition has access to the
> larger/more important portion of the cluster resources, and thus which
> nodes must be fenced because they are in a separate network partition.
>
> Since you mention that qdiskd has an uncertain future, it would appear
> that the pacemaker-based stack has a potential functionality gap with
> regard to two-node clusters. That is, unless some other approach is
> taken to resolve network partitions.
>
> From what I understand, the CIB is at risk of an unintended roll-back
> of a write in the case where a two-node cluster has nodes up at
> differing times. For example, assuming time progresses as follows:
>
> Time 0:  Node A up                       Node B up                            (CIB contains "CIB0")
> Time 1:  Node A up                       Node B down
> Time 2:  Node A writes update to CIB     Node B booting (not joined cluster)
>          (CIB contains "CIB1")
> Time 3:  Node A down                     Node B up                            (CIB contains "CIB0")
>
> After Time 3, Node B is operating with a CIB that contains "CIB0" and
> has no way of seeing the CIB contents "CIB1" written by Node A. In
> effect, the write by Node A was rolled back when Node A went down.
>
> Thanks again for your input.
>
> Is there any description available about how to configure the
> pacemaker/corosync stack on RHEL 6.4?
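To make the delay trick Digimer describes at the top of this message concrete,
here is a minimal, purely conceptual Python sketch of the race. It is not a
fence agent; the node names, the shared "alive" table and the timings are all
invented for illustration, and the 1-second delay stands in for the usual 15:

    import threading
    import time

    # Both nodes lose contact at t=0 and decide to fence their peer.
    # node2 carries a configured pre-fence delay; node1 does not.
    FENCE_DELAY = {"node1": 0.0, "node2": 1.0}   # seconds (1.0 stands in for 15)
    alive = {"node1": True, "node2": True}
    lock = threading.Lock()

    def fence_peer(me: str, peer: str) -> None:
        time.sleep(FENCE_DELAY[me])        # wait out our configured delay first
        with lock:
            if not alive[me]:              # we were fenced while we waited
                return
            alive[peer] = False            # "power off" the peer
            print(f"{me} fenced {peer}")

    if __name__ == "__main__":
        t1 = threading.Thread(target=fence_peer, args=("node1", "node2"))
        t2 = threading.Thread(target=fence_peer, args=("node2", "node1"))
        t1.start(); t2.start(); t1.join(); t2.join()
        # Prints only "node1 fenced node2": node2 never fires because its
        # delay timer is still running when it gets killed.

In a real cluster the asymmetric delay would live in the fence/stonith device
configuration rather than in code; the point is simply that a deliberate bias
lets one node reliably win a simultaneous fence race.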
>
> Regards,
> Michael Richmond
>
> michael richmond | principal software engineer | flashsoft, sandisk |
> +1.408.425.6731
>
>
> On 23/4/13 6:07 PM, "Digimer" <[email protected]> wrote:
>
>> First up, before I begin, I am looking to Pacemaker for the future as
>> well and do not yet use it. So please take whatever I say about
>> Pacemaker with a grain of salt. Andrew, on the other hand, is the
>> author, and anything he says can be taken as authoritative on the topic.
>>
>> On the future:
>>
>> I also have a 2-node project/product that I am working to update in
>> time for the release of RHEL 7. Speaking entirely for myself, I can
>> tell you that I am planning to use Pacemaker from RHEL 7.0. As a Red
>> Hat outsider, I can only speak as a member of the community, but I
>> have every reason to believe that the Pacemaker resource manager will
>> be the one used from 7.0 forward.
>>
>> As for the CIB, yes, it's a local XML file stored on each node.
>> Synchronization occurs via updates pushed over corosync to nodes
>> active in the cluster. As I understand it, when a node that had been
>> offline connects to the cluster, it receives any updates to the CIB.
>>
>> Dealing with 2-node clusters, and setting aside qdisk, which I believe
>> has an uncertain future, you cannot use quorum. For this reason, it is
>> possible for a node to boot up, fail to reach its peer and think it is
>> the only one running. It will start your HA services and, voila, you
>> have two nodes offering the same services at the same time in an
>> uncoordinated manner. This is bad, and it is called a "split-brain".
>>
>> The way to avoid split-brains in 2-node clusters is to use fence
>> devices, aka stonith devices (the exact same thing by two different
>> names). This is _always_ wise to use, but in 2-node clusters it is
>> critical.
>>
>> So think back to your scenario:
>>
>> If a node came up and tried to connect to its peer but failed to do
>> so, then before proceeding it would fence (usually forcibly power off)
>> the other node. Only after doing so would it start the HA services.
>> In this way, both nodes can never be offering the same HA service at
>> the same time.
>>
>> The risk here, though, is a "fence loop". If you set the cluster to
>> start on boot and there is a break in the connection, you can have an
>> initial state where, upon the break in the network, both nodes try to
>> fence the other. The faster node wins, forcing the other node off and
>> resuming operation on its own. This is fine and exactly what you want.
>> However, now the fenced node powers back up, starts its cluster stack,
>> fails to reach its peer and fences it. It finishes starting, offers
>> the HA services and goes on its way ... until the other node boots
>> back up. :)
>>
>> Personally, I avoid this by _not_ starting the cluster stack on boot.
>> My reasoning is that, if a node fails and gets rebooted, I want to
>> check it over myself before I let it back into the cluster (I get
>> alert emails when something like this happens). It's not a risk from
>> an HA perspective because its services would have recovered on the
>> surviving peer long before it reboots anyway. This also has the added
>> benefit of avoiding a fence loop, no matter what happens.
>>
>> Cheers
>>
>> digimer
>>
>> On 04/23/2013 02:07 PM, Michael Richmond wrote:
>>> Andrew and Digimer,
>>> Thank you for taking the time to respond; you have corroborated some
>>> of what I've been putting together as the likely direction.
>>>
>>> I am working on adapting some cluster-aware storage features for use
>>> in a Linux cluster environment. With this kind of project it is
>>> useful to try to predict where the Linux community is heading so
>>> that I can focus my development work on what will be the "current"
>>> cluster stack around my anticipated release dates. Any predictions
>>> are simply educated guesses that may prove to be wrong, but they are
>>> useful for developing plans. From my reading of various web pages
>>> and piecing things together, I found that RHEL 7 is intended to be
>>> based on Fedora 18, so I assume that the new Pacemaker stack has a
>>> good chance of being rolled out in RHEL 7.1/7.2, or even possibly 7.0.
>>>
>>> Hearing that there is official word that the intention is for
>>> Pacemaker to be the official cluster stack helps me put my
>>> development plans together.
>>>
>>> The project I am working on is focused on two-node clusters. But I
>>> also need a persistent, cluster-wide data store to hold a small
>>> amount of state (less than 1KB). This data store is what I refer to
>>> as a cluster registry. The state data records the last-known
>>> operational state for the storage feature. This last-known state
>>> helps drive recovery operations for the storage feature during node
>>> bring-up. This project is specifically aimed at integrating generic
>>> functionality into the Linux cluster stack.
>>>
>>> I have been thinking about using the cluster configuration file for
>>> this storage, which I assume is the CIB referenced by Andrew. But I
>>> can imagine cases where the CIB file may lose updates if it does not
>>> utilize shared storage media. My understanding is that the CIB file
>>> is stored on each node using local disk storage.
>>>
>>> For example, consider a two-node cluster that is configured with a
>>> quorum disk on shared storage media, and suppose that at a given
>>> point in time NodeA is up and NodeB is down. NodeA can become
>>> quorate and start cluster services (including HA applications).
>>> Assume that NodeA updates the CIB to record some state update. If
>>> NodeB starts booting but, before NodeB joins the cluster, NodeA
>>> crashes, then at that point the updated CIB only resides on NodeA
>>> and cannot be accessed by NodeB, even if NodeB can access the quorum
>>> disk and become quorate. Effectively, NodeB cannot be aware of the
>>> update from NodeA, which results in an implicit roll-back of any
>>> updates performed by NodeA.
>>>
>>> With a two-node cluster, there are two options for resolving this:
>>> * prevent any update to the cluster registry/CIB unless all nodes
>>> are part of the cluster (this is not practical since it undermines
>>> some of the reasons for building clusters); or
>>> * store the cluster registry on shared storage so that there is one
>>> source of truth.
>>>
>>> It is possible that the nature of the data stored in the CIB is
>>> resilient to the example scenario that I describe. In that case,
>>> maybe the CIB is not an appropriate data store for my cluster
>>> registry data, and I am either looking for an appropriate Linux
>>> component to use for my cluster registry, or I will build a custom
>>> data store that provides atomic update semantics on shared storage.
>>>
>>> Any thoughts and/or pointers would be appreciated.
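On the last option above (a custom data store with atomic update semantics on
shared storage), here is a minimal sketch of one common approach,
write-to-temp-then-rename. It assumes the shared LUN carries a filesystem both
nodes can mount and that something else (fencing, a lock manager) keeps two
nodes from writing at the same moment; the path and field names are only
examples:

    import json
    import os
    import tempfile

    def write_registry(path: str, state: dict) -> None:
        """Atomically replace the small (<1 KB) registry file at `path`.

        Readers see either the old contents or the new contents, never a
        torn write, because rename() atomically replaces the old file.
        """
        dirname = os.path.dirname(path) or "."
        fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".registry.")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())       # data is on disk before the rename
            os.rename(tmp, path)           # atomic replacement of the old file
            dirfd = os.open(dirname, os.O_RDONLY)
            try:
                os.fsync(dirfd)            # persist the new directory entry too
            finally:
                os.close(dirfd)
        except BaseException:
            try:
                os.unlink(tmp)             # clean up the temp file on failure
            except OSError:
                pass
            raise

    def read_registry(path: str) -> dict:
        with open(path) as f:
            return json.load(f)

    # Hypothetical usage: record the last-known operational state.
    # write_registry("/mnt/shared/registry.json", {"state": "clean", "epoch": 7})

This only guarantees that a single writer never leaves a half-written file
behind; deciding which node is allowed to write at any given time is still the
cluster's (fencing/quorum) problem.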
>>>
>>> Thanks,
>>> Michael Richmond
>>>
>>> --
>>> michael richmond | principal software engineer | flashsoft, sandisk |
>>> +1.408.425.6731
>>>
>>>
>>> On 22/4/13 4:37 PM, "Andrew Beekhof" <[email protected]> wrote:
>>>
>>>>
>>>> On 23/04/2013, at 4:59 AM, Digimer <[email protected]> wrote:
>>>>
>>>>> On 04/22/2013 02:36 PM, Michael Richmond wrote:
>>>>>> Hello,
>>>>>> I am researching the new cluster stack that is scheduled to be
>>>>>> delivered in Fedora 19. Does anyone on this list have a sense of
>>>>>> the timeframe for this new stack to be rolled into a RHEL
>>>>>> release? (I assume the earliest would be RHEL 7.)
>>>>>>
>>>>>> On the Windows platform, Microsoft Cluster Services provides a
>>>>>> cluster-wide registry service that is basically a cluster-wide
>>>>>> key:value store with atomic updates and support for storing the
>>>>>> registry on shared disk. The storage on shared disk allows
>>>>>> access and use of the registry in cases where nodes are
>>>>>> frequently joining and leaving the cluster.
>>>>>>
>>>>>> Are there any component(s) that can be used to provide a similar
>>>>>> registry in the Linux cluster stack? (The current RHEL 6 stack,
>>>>>> and/or the new Fedora 19 stack.)
>>>>>>
>>>>>> Thanks in advance for your information,
>>>>>> Michael Richmond
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> First up, Red Hat's policy on what is coming is "we'll announce it
>>>>> on release day". So anything else is a guess. As it is, Pacemaker
>>>>> is in tech preview in RHEL 6, and the best guess is that it will
>>>>> be the official resource manager in RHEL 7, but it's just that, a guess.
>>>>
>>>> I believe we're officially allowed to say that it is our
>>>> _intention_ that Pacemaker will be the one and only supported stack
>>>> in RHEL 7.
>>>>
>>>>>
>>>>> As for the registry question; I am not entirely sure what it is
>>>>> you are asking here (sorry, not familiar with Windows). I can say
>>>>> that Pacemaker uses something called the CIB (cluster information
>>>>> base), which is an XML file containing the cluster's configuration
>>>>> and state. It can be updated from any node and the changes will
>>>>> push to the other nodes immediately.
>>>>
>>>> How many of these attributes are you planning to have?
>>>> You can throw a few in there, but I'd not use it for 100s or
>>>> 1000s of them - it's mainly designed to store the resource/service
>>>> configuration.
>>>>
>>>>
>>>>> Does this answer your question?
>>>>>
>>>>> The current RHEL 6 cluster stack is corosync + cman + rgmanager.
>>>>> It also uses an XML config, which can be updated from any node and
>>>>> pushed out to the other nodes.
>>>>>
>>>>> Perhaps a better way to help would be to ask what, exactly, you
>>>>> want to build your cluster for?
>>>>>
>>>>> Cheers
>>>>>
>>>>> --
>>>>> Digimer
>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>> without access to education?
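Relating this back to Andrew's point above about keeping only a handful of
attributes in the CIB: one way a script can read and write such a value is
Pacemaker's crm_attribute tool. A hedged sketch follows (the property name is
invented, and the exact flags and output behaviour, in particular --quiet
printing only the value, should be verified against the installed Pacemaker
version):

    import subprocess

    def cib_set(name: str, value: str) -> None:
        # Store a small key:value pair as a cluster property in the CIB.
        subprocess.check_call(
            ["crm_attribute", "--type", "crm_config",
             "--name", name, "--update", value])

    def cib_get(name: str) -> str:
        # --quiet is expected to print only the value on stdout.
        return subprocess.check_output(
            ["crm_attribute", "--type", "crm_config",
             "--name", name, "--query", "--quiet"], text=True).strip()

    # Hypothetical property name:
    # cib_set("flashsoft-last-known-state", "clean")
    # print(cib_get("flashsoft-last-known-state"))

A value stored this way still lives in each node's local copy of the CIB and
is synced over corosync, so it does not by itself address the roll-back
scenario Michael describes when the two nodes are never up at the same time.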
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person
>> without access to education?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access
to education?

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
