Re: [Linux-HA] Looking for a suitable Stonith Solution
On Wed, Mar 2, 2011 at 9:05 AM, Stallmann, Andreas <astallm...@conet.de> wrote:

> Hi Andrew,
>
> > > If suicide is no supported fencing option, why is it still included with stonith?
> >
> > Left over from heartbeat v1 days, I guess. Could also be a testing-only device, like ssh.
>
> www.clusterlabs.org tells me you're the Pacemaker project leader.

Yes, but the stonith devices come from cluster-glue. So I guess Dejan or Florian are nominally in charge of those, but they've not been changed in forever.

> Would you, by chance, know who maintains or maintained the suicide stonith plugin? It may be testing-only, yes. But at least, ssh works as intended.
>
> > > It's badly documented, and I didn't find a single (official) document on how to implement a (stable!) suicide stonith.
> >
> > Because you can't. Suicide is not, will not, can not be reliable.
>
> Yes, you're right. But under certain circumstances (1. the nodes are still alive, 2. both redundant communication channels [networks] are down, 3. policy requires that no node without quorum stays up) it might be a good addition to a regular stonith (because if [2] happens, pacemaker/stonith will probably not be able to control a network power switch etc.). Could we agree on that?

Sure. But even if you have a functioning suicide plugin, Pacemaker cannot ever make decisions that assume it worked, because for all it knows the other side might consider itself to be perfectly healthy.

> If not: what's your recommended setup for (resp. against) such situations? Think of split sites here!

You still need reliable fencing; if you can't provide that, there needs to be a human in the loop.

> > The whole point of stonith is to create a known node state (off) in situations where you cannot be sure if your peer is alive, dead or some state in-between.
>
> Yes, so don't file suicide under stonith! We implemented a different approach in a two-node cluster: we wrote a script, run by cron, that checks the connectivity (by means of ping) to the peer (if connected, everything is fine) and then (if the peer is not reachable) to some quorum nodes. If either the peer or a majority of the quorum nodes are alive, nothing happens. If quorum is lost, the node shuts itself down.

Wonderful, but the healthy side still can't do anything, because it can't know that the bad side is down. So what have you gained over no-quorum-policy=stop (which is the default)?

> We did that because drbd tended to misbehave in situations where all network connectivity was lost. We'd rather have a clean shutdown on both sides than a corrupt filesystem. I always considered this solution inelegant, mainly because it wasn't controllable via crm. Thus I hoped I could forget this solution when using pacemaker. It seems I cannot. If there's any interest from the community in our suicide-by-cron solution, tell me if and how to contribute.
>
> > It requires a sick node to suddenly start functioning correctly - so attempting to self-terminate makes some sense, relying on it to succeed does not seem prudent.
>
> Yes! But it's not always the node that's sick. Sometimes (even with the best and most redundant network) the connectivity between the nodes is the problem, not a marauding pacemaker or openais! Again: please tell me, what's your solution in that case?

Again, tell me how the other side is supposed to know, and what you gain?

> On the other hand, it doesn't make any sense to name a no-quorum-policy "suicide" if it's anything but a suicide (if at all, one could call it assisted suicide). This question is still unanswered.
> Does no-quorum-policy=suicide really have a meaning?

Yes; for N > 2 it is a faster version of stop.

> Or is it as well a leftover from the times of heartbeat?

No.

> Is it still functional?

Yes.
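For illustration, a minimal sketch of the kind of cron-driven quorum check described above. The host names, the majority rule, and the shutdown action are placeholder assumptions; the actual script is not shown in the thread:

    #!/bin/bash
    # Hypothetical reconstruction of the suicide-by-cron check.
    # PEER and QUORUM_NODES are placeholders; adjust to your environment.
    PEER="peer-node"
    QUORUM_NODES="10.0.0.1 10.0.0.2 10.0.0.3"

    # If the cluster peer answers, everything is fine.
    ping -c 2 -W 2 "$PEER" >/dev/null 2>&1 && exit 0

    # Peer unreachable: count how many quorum nodes we can still see.
    reachable=0; total=0
    for q in $QUORUM_NODES; do
        total=$((total + 1))
        ping -c 2 -W 2 "$q" >/dev/null 2>&1 && reachable=$((reachable + 1))
    done

    # A majority of quorum nodes reachable means we still have quorum.
    [ $((reachable * 2)) -gt "$total" ] && exit 0

    # Quorum lost: shut ourselves down cleanly before drbd makes a mess.
    logger "quorum lost: shutting down"
    shutdown -h now

As Andrew points out above, a check like this only gives the losing side a clean exit; the surviving side still cannot know the shutdown happened.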
Re: [Linux-HA] Looking for a suitable Stonith Solution
On Fri, Feb 25, 2011 at 12:51 PM, Stallmann, Andreas <astallm...@conet.de> wrote:

> Hi!
>
> I concentrate both your answers into one mail; I hope that's all right for you.
>
> > > For now, I need an interim solution, which is, as of now, stonith via suicide.
> >
> > Doesn't work, as suicide is not considered reliable - by definition the remaining nodes have no way to verify that the fencing operation was successful.
> >
> > Suspect it will still fail though, suicide isn't a supported fencing option - since obviously the other nodes can't confirm it happened.
>
> Ok then, I know I'm a little bit provocative right now: if suicide is no supported fencing option, why is it still included with stonith?

Left over from heartbeat v1 days, I guess. Could also be a testing-only device, like ssh.

> It's badly documented, and I didn't find a single (official) document on how to implement a (stable!) suicide stonith.

Because you can't. Suicide is not, will not, can not be reliable.

The whole point of stonith is to create a known node state (off) in situations where you cannot be sure if your peer is alive, dead or some state in-between. Suicide does not achieve this in any way, shape or form.

It requires a sick node to suddenly start functioning correctly - so attempting to self-terminate makes some sense; relying on it to succeed does not seem prudent.

> [...]
Re: [Linux-HA] Looking for a suitable Stonith Solution
Hi!

I concentrate both your answers into one mail; I hope that's all right for you.

> > For now, I need an interim solution, which is, as of now, stonith via suicide.
>
> Doesn't work, as suicide is not considered reliable - by definition the remaining nodes have no way to verify that the fencing operation was successful.
>
> Suspect it will still fail though, suicide isn't a supported fencing option - since obviously the other nodes can't confirm it happened.

Ok then, I know I'm a little bit provocative right now: if suicide is no supported fencing option, why is it still included with stonith? It's badly documented, and I didn't find a single (official) document on how to implement a (stable!) suicide stonith. But it's there, and thus it should be usable. If it isn't, the maintainer should please (please!) remove it or supply something that works. I do know that's quite demanding, because the maintainer probably does the development in his (or her) free time. Still...

I do agree that suicide is a very special way of keeping a cluster consistent, very different from the other stonith methods. I wouldn't expect it under stonith; I'd rather think...

> Yes, no-quorum-policy=suicide means that all nodes in the partition will end up being shot, but you still require a real stonith device so that _someone_else_ can perform it.

...that if you set no-quorum-policy=suicide, the suicide script is executed by the node itself. It should be an *extra* feature *besides* stonith. The procedure should be something like:

1) node1: All right, I have no quorum anymore. Let's wait for a while...
2) ... a while passes ...
3) node1: OK, I'm still without quorum, no contact to my peers whatsoever. I'd rather shut myself down before I cause a mess.

If, during (2), the other nodes find a way to shut down the node externally (be it through ssh, a power switch, a virtualisation host...), that's even better, because then the cluster knows that it's still consistent.

I'm with you here. If a split brain happens in a split-site scenario, a suicide might be the only way to keep up consistency, because no one will be able to reach any device on the other site... Please correct me if I'm wrong. What do you do in such a case? What's your exemplary implementation of Linux-HA then?

On the other hand, it doesn't make any sense to name a no-quorum-policy "suicide" if it's anything but a suicide (if at all, one could call it assisted suicide).

Please correct me: do I have an utterly wrong understanding of the whole process (that could very well be the case), is the implementation not entirely thought through, or is the naming of certain components not as good as it could be? I might point you to http://osdir.com/ml/linux.highavailability.devel/2007-11/msg00026.html, because the same thing has been discussed then, and I very much think that Lars was right with what he wrote. Has anything changed in the concept of suicide/quorum-loss/stonith since then? That's not a provocative question - well, maybe it is, but it's not meant to be.

In addition: something that's missing from the manuals is a case study (or something the like) on how to implement a split-site scenario. How should the cluster be built then? If you have two sites? If you have one? How should the storage replication be set up? Is synchronous replication like in drbd really a good idea then, performance-wise? I think I'll finally have to buy a book. :-) Any recommendations (either English or German preferred)?
Well, thanks a lot again; my brain didn't explode (that's something good, I feel), but I'm not entirely happy, though.

Cheers and have a nice weekend,

Andreas
Re: [Linux-HA] Looking for a suitable Stonith Solution
Hi!

TNX for your answer. We will switch to sbd after the shared storage has been set up. For now, I need an interim solution, which is, as of now, stonith via suicide. My configuration doesn't work, though. I tried:

~~ Output from "crm configure show" ~~

    primitive suicide_res stonith:suicide
    ...
    clone fenc_clon suicide_res
    ...
    property $id="cib-bootstrap-options" \
        dc-version="1.1.2-8b9ec9ccc5060457ac761dce1de719af86895b10" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="3" \
        stonith-enabled="true" \
        no-quorum-policy="suicide" \
        stonith-action="poweroff"

If I disconnect one node from the network, crm_mon shows:

    Current DC: mgmt03 - partition WITHOUT quorum
    ...
    Node mgmt01: UNCLEAN (offline)
    Node mgmt02: UNCLEAN (offline)
    Online: [ mgmt03 ]
    Clone Set: fenc_clon
        Started: [ ipfuie-mgmt03 ]
        Stopped: [ suicide_res:0 suicide_res:1 ]

No action, neither reboot nor poweroff, is taken.

1. What did I do wrong here?
2. OK, let's be more precise: I have the feeling that the suicide resource should be in a default state of "stopped" (on all nodes) and should only be started on the node which has to fence itself. Am I right? And if yes, how is that accomplished?
3. How does the no-quorum-policy relate to the stonith resources? I didn't find any documentation on whether the two have any connection at all.
4. Am I correct that the no-quorum-policy is what a node (or a cluster partition) should do to itself when it loses quorum (for example, shut itself down), and that stonith is what the nodes with quorum try to do to the nodes without?
5. Shouldn't no-quorum-policy=suicide then be obsolete when suicide is used as the stonith method?

TNX for your help (again),

Andreas
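As an aside (not from the thread): the stonith(8) command-line tool from cluster-glue can exercise a plugin outside the cluster, which helps separate plugin problems from policy problems. A sketch, assuming the cluster-glue tools are installed; exact flags may vary by version, so check stonith -h first:

    # list the stonith plugin types installed on this node
    stonith -L
    # show which parameters the suicide plugin expects
    stonith -t suicide -n
    # show the plugin's own documentation
    stonith -t suicide -h
    # fire the device by hand - careful, this really tries to fence!
    # (the node name is a placeholder)
    stonith -t suicide -T reset mgmt03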
Re: [Linux-HA] Looking for a suitable Stonith Solution
On Thu, Feb 24, 2011 at 2:49 PM, Stallmann, Andreas <astallm...@conet.de> wrote:

> Hi!
>
> TNX for your answer. We will switch to sbd after the shared storage has been set up. For now, I need an interim solution, which is, as of now, stonith via suicide.

Doesn't work, as suicide is not considered reliable - by definition the remaining nodes have no way to verify that the fencing operation was successful.

Yes, no-quorum-policy=suicide means that all nodes in the partition will end up being shot, but you still require a real stonith device so that _someone_else_ can perform it.

> My configuration doesn't work, though. I tried:
>
> [...]
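To make Andrew's point concrete, a hedged sketch (the resource name and device path are placeholders, and sbd is only an example of a "real" device, as planned earlier in the thread): the policy and the device are configured independently, and the policy relies on the device being there:

    # a real stonith device that the *other* nodes can use to shoot a peer
    # (device path is a placeholder)
    crm configure primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/disk/by-id/scsi-SHARED-LUN-part1"

    # the no-quorum policy only says what should happen to a partition
    # without quorum; the shooting itself still goes through a device
    crm configure property stonith-enabled="true" no-quorum-policy="suicide"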
[Linux-HA] Looking for a suitable Stonith Solution
Hello!

I'm currently looking for a suitable stonith solution for our environment:

1. We have three cluster nodes running OpenSuSE 10.3 with corosync and pacemaker.
2. The nodes reside on two VMware ESXi servers (v. 4.1.0) in two locations, where one VMware server hosts two, the other hosts one node of our cluster.
3. We will finally have a shared storage, which might or might not be under our governance (this depends on the respective customer).

I examined some stonith methods, with the following results (and I'd like to have your opinion on my conclusions):

- (2) rules out any stonith method that relies on a power switch or a UPS, as the nodes are in no way physically connected to a power circuit.
- (3) rules out sbd, as this method requires access to a physical device that offers the shared storage. Am I right? The manual explicitly says that sbd may not even be used on a DRBD partition. Question: is there a way to insert the sbd header on a mounted drive instead of a physical partition? Are there any other methods of resource fencing besides sbd?
- (2) is not compatible with the vmware stonith method, as it requires the VMware host to be reachable via one single IP address. This isn't the case in our scenario; the ESXi is not clustered. Question: has anyone of you modified the vmware stonith script to fit a setup similar to ours?
- The ssh method is said to be not the wisest idea, as it requires the host to respond to ssh requests. If the SSH daemon hangs or the cluster runs into a split brain, this might not be the case anymore. Any other opinions?

This, finally, leaves only one method: the suicide. Again, the SuSE HAE manual claims that this method is not suitable for production environments because "this requires action by the node's operating system and can fail under certain circumstances. Therefore avoid using this device whenever possible." Well, as far as I'm concerned, letting a node shut itself down does not seem that bad an idea. How's your experience with this method?

Final question: have I missed any suitable method? Are there any other concepts of fencing / stonith which I should give a closer look?

Thanks in advance for all your answers (and I can well live with an RTFM, as long as you point me to the fine manual),

Andreas
Re: [Linux-HA] Looking for a suitable Stonith Solution
Hi,

On Wed, Feb 23, 2011 at 09:55:00AM +0000, Stallmann, Andreas wrote:

> Hello!
>
> I'm currently looking for a suitable stonith solution for our environment:
>
> 1. We have three cluster nodes running OpenSuSE 10.3 with corosync and pacemaker.
> 2. The nodes reside on two VMware ESXi servers (v. 4.1.0) in two locations, where one VMware server hosts two, the other hosts one node of our cluster.
> 3. We will finally have a shared storage, which might or might not be under our governance (this depends on the respective customer).
>
> [...]
>
> - (3) rules out sbd, as this method requires access to a physical device that offers the shared storage. Am I right? The manual explicitly says that sbd may not even be used on a DRBD partition. Question: is there a way to insert the sbd header on a mounted drive instead of a physical partition? Are there any other methods of resource fencing besides sbd?

The only requirement for sbd is to have a dedicated disk on shared storage. That disk (or partition, if you will) doesn't need to be big (1MB is enough). I don't see how (3) then is an obstacle.

> [...]
>
> This, finally, leaves only one method: the suicide. [...] Well, as far as I'm concerned, letting a node shut itself down does not seem that bad an idea. How's your experience with this method?

No way of telling whether the suicide succeeded or not.

> Final question: have I missed any suitable method? Are there any other concepts of fencing / stonith which I should give a closer look?

There's also external/libvirt, which hasn't been in any release yet but seems to be of very good quality. You can get it here:

http://hg.linux-ha.org/glue/file/tip/lib/plugins/stonith/external/libvirt

and just put it into /usr/lib64/stonith/plugins/external.

> Thanks in advance for all your answers (and I can well live with an RTFM, as long as you point me to the fine manual),

There's a document on fencing at http://clusterlabs.org

Thanks,

Dejan
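A sketch of what installing and wiring up that plugin might look like. The raw-download URL, the resource definition, and the parameter values are assumptions for illustration; check the plugin's own help output for the authoritative parameter list:

    # fetch the plugin from the cluster-glue repository and install it
    # (raw-file URL is an assumption derived from the link above)
    wget -O /usr/lib64/stonith/plugins/external/libvirt \
      http://hg.linux-ha.org/glue/raw-file/tip/lib/plugins/stonith/external/libvirt
    chmod 755 /usr/lib64/stonith/plugins/external/libvirt

    # hypothetical resource definition for a libvirt-managed hypervisor;
    # hostlist and hypervisor_uri values are placeholders
    crm configure primitive st-libvirt stonith:external/libvirt \
        params hostlist="mgmt01,mgmt02,mgmt03" \
               hypervisor_uri="qemu+ssh://vmhost/system"

Note that the plugin targets libvirt-managed hypervisors; whether it can be adapted to the ESXi setup described earlier is a separate question.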
Re: [Linux-HA] Looking for a suitable Stonith Solution
Hi!

> > - (3) rules out sbd, as this method requires access to a physical device that offers the shared storage. Am I right? The manual explicitly says that sbd may not even be used on a DRBD partition. Question: is there a way to insert the sbd header on a mounted drive instead of a physical partition? Are there any other methods of resource fencing besides sbd?
>
> The only requirement for sbd is to have a dedicated disk on shared storage. That disk (or partition, if you will) doesn't need to be big (1MB is enough). I don't see how (3) then is an obstacle.

Let me see if I got the manual (see http://www.linux-ha.org/wiki/SBD_Fencing) right:

a) Our customer might only grant us storage access via NFS. Can one create an sbd device on an NFS share?

b) If we set up a shared storage ourselves, we want it to be redundant itself, thus setting it up with drbd is very likely. The manual says: "The SBD device must not make use of host-based RAID" and "The SBD device must not reside on a drbd instance." Did I get this right: the sbd partition is not allowed to reside on either a RAID or a DRBD? Well? Doesn't that mess with the concept of redundancy? Let's say we have a three-node shared storage, using DRBD to keep the partitions redundant between the shared-storage nodes, exporting the storage to the other nodes via NFS: where and how shall the sbd device be created? Only on one of the storage nodes? Or on each of the storage nodes? Or somehow on a clustered partition (that would mean drbd again, wouldn't it?). To me only the latter makes sense, because, as the manual says: "This can be a logical unit, partition, or a logical volume; but it must be accessible from all nodes."

A... my brain is starting to explode... ;-) Please, I feel that I'm getting something entirely wrong here. May the sbd, for example, be created on a partition or logical volume that I created on a drbd device (or RAID), with the "no drbd" rule (or "no RAID" rule) only meaning that the sbd may not be created on the drbd (or RAID) directly?

> No way of telling whether the suicide succeeded or not.

Yes, but on the other hand, suicide is quite independent of the network, while for all the power-off methods (including vmware) I have to have at least access to the power device (or VM host), which might not be the case if all communication between two locations is demolished (a classical split brain).

> There's also external/libvirt, which hasn't been in any release yet but seems to be of very good quality. You can get it here:

Thanks, I'll check it out!

> There's a document on fencing at http://clusterlabs.org

Which has been written by you, right? Don't get me wrong, it is excellent, and I already read it (it's nearly word-by-word included in the SLES HAE manual): http://www.clusterlabs.org/doc/crm_fencing.html

Further help is still very welcome. TNX in advance,

Andreas
Re: [Linux-HA] Looking for a suitable Stonith Solution
On Wed, Feb 23, 2011 at 12:19:20PM +0000, Stallmann, Andreas wrote:

> Hi!
>
> [...]
>
> a) Our customer might only grant us storage access via NFS. Can one create an sbd device on an NFS share?

Please, no-one try a loop-mounted image file on NFS ;-) Even though in theory it may work, if you mount -o sync ... *Ouch*

> b) If we set up a shared storage ourselves, we want it to be redundant itself, thus setting it up with drbd is very likely. The manual says: "The SBD device must not make use of host-based RAID" and "The SBD device must not reside on a drbd instance." Did I get this right: the sbd partition is not allowed to reside on either a RAID or a DRBD? Well? Doesn't that mess with the concept of redundancy? [...] A... my brain is starting to explode... ;-)

Does this help?
http://www.linux-ha.org/w/index.php?title=SBD_Fencing&diff=481&oldid=97

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
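For reference, a short sketch of how an sbd header is typically created and tested on a small dedicated shared LUN. The device path and node name are placeholders; see the SBD_Fencing page above for the authoritative steps:

    # initialise the sbd metadata on the dedicated shared disk
    # (device path is a placeholder)
    sbd -d /dev/disk/by-id/scsi-SHARED-LUN-part1 create
    # inspect the message slots
    sbd -d /dev/disk/by-id/scsi-SHARED-LUN-part1 list
    # deliver a harmless test message to a peer node's slot
    sbd -d /dev/disk/by-id/scsi-SHARED-LUN-part1 message mgmt02 test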
Re: [Linux-HA] Looking for a suitable Stonith Solution
Hi there!

> ...
>
> Please, no-one try a loop-mounted image file on NFS ;-) Even though in theory it may work, if you mount -o sync ... *Ouch*
>
> ...
>
> Does this help?
> http://www.linux-ha.org/w/index.php?title=SBD_Fencing&diff=481&oldid=97

Yes, this helps... somehow. Well, I should use iSCSI to share my storage, right? And use an iSCSI LUN to write the sbd on, right? Does the sbd device have to be accessible to all the nodes at the same time? I mean: do they all have to mount the sbd device at the same time? And if the sbd resides on an iSCSI LUN which itself resides on a DRBD, and the communication crashes, won't that again destroy the effect of the poison pill?

How about the storage itself? We definitely have to use a different fencing approach there, right? (Or we have to give shared storage to our shared storage, which then again needs shared storage... ad infinitum.)

Thanks for your help so far,

Andreas
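The thread ends here. For completeness, a sketch of how sbd is usually wired into the cluster once a shared LUN exists: each node runs the sbd daemon against the same disk, plus a single stonith:external/sbd resource as sketched earlier. The sysconfig path matches SLES-style setups and the device path is a placeholder; the SBD_Fencing page has the authoritative procedure:

    # contents of /etc/sysconfig/sbd, identical on every cluster node
    # (SLES-style location; device path is a placeholder)
    SBD_DEVICE="/dev/disk/by-id/scsi-SHARED-LUN-part1"
    SBD_OPTS="-W"   # -W enables the hardware watchdog

    # each node watches that same disk for poison-pill messages
    # (normally started by the init integration; shown here for clarity)
    sbd -d "$SBD_DEVICE" -W watch

All nodes access the sbd disk concurrently; it carries a tiny slot table per node rather than a filesystem, so nothing is "mounted" in the usual sense.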