On Tue, 22 Jun 2004 18:52:09 +0200
Tore Anderson <[EMAIL PROTECTED]> wrote:

>   There's a third option, which is the one I prefer the most:  shared
>  block device.  Connect your two servers to a SAN, and store all of
>  Cyrus' data on one LUN, which both servers have access to.  Then, set
>  your cluster software to automatically mount the file system before
>  starting Cyrus.  You'll need STONITH or IO-fencing to protect against
>  file system corruption in a split-brain scenario, but other than that
>  it's a fairly simple solution that's unlikely to break in spectacular
>  ways.  You could share a SCSI cabinet between the servers instead of
>  using a SAN, though I can't say I recommend it - too failure-prone.
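
(For reference, if your cluster software is Heartbeat 1.x, the "mount the
filesystem, then start Cyrus" ordering Tore describes is just a resource
line in /etc/ha.d/haresources. A rough sketch only; the node name, service
IP, device, mountpoint and init script name below are made up:

  # resources start left to right and stop in reverse order:
  # take over the service IP, mount the shared LUN, then start Cyrus
  mail1 IPaddr::192.0.2.10/24 Filesystem::/dev/sdb1::/var/spool/cyrus::ext3 cyrus-imapd

STONITH itself is configured separately in ha.cf.)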

It _can_ break in a spectacular way ... easily.

Our batch of disks turned out to have some fubar firmware, which caused them
to randomly fall out of the array under a specific load. That problem went
undetected during the testing phase.
So after you manage to get the array back together, you have a heavily
corrupted filesystem. Journaling and fast recovery? Not this time. It turned
out that a full fsck on half a terabyte of reiserfs takes about 3 days to
finish. (That was more than a year ago; reiserfsck has improved a lot since
then.)

Two things to be learned here ...
* use disks from different batches (or vendors), so one firmware bug can't
  take out the whole array at once
* the filesystem itself is a single point of failure

Since then, the only type of HA systems I trust are Google-like setups ...
designed to die, designed to corrupt data, designed to do other ugly things,
but with the application running on top designed to handle all of that.
Hint: Cyrus has room for improvement here ...


-- 

Jure Pečar