>>> On Wed, Nov 7, 2007 at  6:25 PM, in message <[EMAIL PROTECTED]>,
Yan Fitterer <[EMAIL PROTECTED]> wrote: 

>>> In addition, I have been thinking of complementing this mechanism with a
>>> disk-based "STONITH" (otherwise known as "poison pill"...) so that the
>>> unreachable node may (if things aren't too badly broken) take its
>>> resources down, and stop the disk heartbeat, which would then allow the
>>> rest of the cluster to consider it having left the cluster safely, and
>>> migrate the resources.

I've missed a lot of this thread, so I offer these comments in the hope
that they add some value...

The reason we implemented a shared-disk-based communication channel
for cluster split-brain detection, and suicide via poison pill, back in the
late 90s was that the Fibre Channel Arbitrated Loop SANs we had then for
clusters of more than two nodes had no I/O fencing intelligence whatsoever,
and SCSI-3 reservations weren't reliable or even supported in many cases.
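
To make the poison-pill idea concrete, here is a minimal sketch in Python.
It is not the code we ran back then, and the on-disk layout is purely an
assumption for illustration: two 512-byte sectors per node, one written only
by the owning node (heartbeat counter plus state) and one written only by
its peers (the poison flag). The names send_poison, heartbeat_tick and
has_left_cleanly are hypothetical, and an in-memory buffer stands in for
the shared LUN.

```python
import io
import struct

SECTOR = 512                      # assumed sector size of the shared disk
OWNER = struct.Struct("<QB")      # owner sector: heartbeat counter, state
ALIVE, LEFT = 0, 1                # states a node records about itself
POISON = b"\x01"                  # flag peers write into the mailbox sector


def _owner_off(node):
    # Sector written only by the node itself (hypothetical layout).
    return node * 2 * SECTOR


def _mail_off(node):
    # Sector written only by the node's peers (hypothetical layout).
    return (node * 2 + 1) * SECTOR


def send_poison(disk, target):
    """A surviving node asks `target` to take itself down."""
    disk.seek(_mail_off(target))
    disk.write(POISON)
    disk.flush()


def heartbeat_tick(disk, node, counter, stop_resources):
    """One pass of a node's own disk-heartbeat loop.

    Returns the next counter value, or None once the node has swallowed
    the pill, stopped its resources and recorded that it has left.
    """
    disk.seek(_mail_off(node))
    poisoned = disk.read(1) == POISON
    if poisoned:
        stop_resources()          # take resources down before signing off
    state = LEFT if poisoned else ALIVE
    next_counter = counter if poisoned else counter + 1
    disk.seek(_owner_off(node))
    disk.write(OWNER.pack(next_counter, state))
    disk.flush()
    return None if poisoned else next_counter


def has_left_cleanly(disk, node):
    """Peers check this before migrating the node's resources."""
    disk.seek(_owner_off(node))
    _, state = OWNER.unpack(disk.read(OWNER.size))
    return state == LEFT


if __name__ == "__main__":
    disk = io.BytesIO(bytes(2 * SECTOR * 3))   # stand-in LUN, three nodes
    counter = heartbeat_tick(disk, 1, 0, lambda: None)
    send_poison(disk, 1)
    heartbeat_tick(disk, 1, counter, lambda: print("stopping resources"))
    print("node 1 left cleanly:", has_left_cleanly(disk, 1))
```

On a real shared disk the reads and writes would have to bypass the page
cache, and the survivors would still need a timeout and some harder fence
for the case where the node is too broken to read its mailbox at all. The
sketch leaves both out.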

The rather brutal approach of killing a node just to be sure it doesn't leak
out an I/O onto a shared disk that's since received I/O from other servers in
the same cluster was the excuse for today's smart storage subsystems
and SAN fabrics that can be programmed to disable the initiator at the target
side...

The rather tricky behavior of file systems to hang up the server OS because they
can't be umount'ed reliably can be somewhat worked around by running more
than one kernel on the same server - i.e. run your server applications inside
a VM, and get the unreliable code out of the kernel that runs the cluster 
software...

Hth,
Robert


_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
