Hi Loic,

This thread has been evolving, but I'd like to push it back a bit. Earlier in the thread you pointed out the CERN study on silent data corruption:

http://fuji.web.cern.ch/fuji/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf

Actually, I was not the one who pointed out this study but I can't
remember who did.

Oops. Sorry Loic, sorry Leif. I'm clearly too senile to deal with two different four-letter names starting with 'L'.

If you are not already doing this, would it be possible for you to run
fsprobe(8) on your X4500 boxes to see if there are any silent data
corruption issues there?  You have a large enough storage farm to gather
meaningful statistics.

We are not using fsprobe on our X4500.

There are two reasons:

<SNIP>

I still think that the results would be interesting.

In response to the reasons you gave:

[1] I agree that if ZFS + hardware works as it is supposed to, there will not be any corruption. But it would be nice to prove this via experiment.

[2] You can probably force writes to disk simply by writing files too large to fit in the memory cache. Or modify fsprobe (or ask Peter to modify it) so that it fsync()s after writes, rather than using direct I/O to bypass the device block buffer layer.
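For concreteness, here is a minimal sketch of the fsync()-after-write idea in [2], written in Python for brevity. It is not fsprobe's code or command-line interface; the function names, block size and file layout are hypothetical. One caveat: a read-back issued right after the write may be served from memory (the page cache, or ZFS's ARC) rather than from the disk, so unless the file is larger than RAM, as suggested above, a clean result does not by itself show that the drives returned the right bits.

```python
import hashlib
import os


def block_pattern(index: int, block_size: int) -> bytes:
    """Deterministic, block-specific test pattern (SHA-256 of the index, repeated)."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()
    return (seed * (block_size // len(seed) + 1))[:block_size]


def write_pattern(path: str, n_blocks: int, block_size: int) -> None:
    """Write n_blocks patterned blocks, then fsync() so they reach stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for i in range(n_blocks):
            view = memoryview(block_pattern(i, block_size))
            while view:  # os.write() may perform a short write
                view = view[os.write(fd, view):]
        os.fsync(fd)  # flush file data and metadata to the device
        if hasattr(os, "posix_fadvise"):
            try:
                # Advisory only: ask the kernel to drop the now-clean cached pages
                # so the read-back is more likely to hit the disk.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
            except OSError:
                pass  # not supported here; the check still runs, just less strictly
    finally:
        os.close(fd)


def count_bad_blocks(path: str, n_blocks: int, block_size: int) -> int:
    """Read the file back and count blocks that differ from the expected pattern."""
    bad = 0
    with open(path, "rb") as f:
        for i in range(n_blocks):
            if f.read(block_size) != block_pattern(i, block_size):
                bad += 1
    return bad
```

Run in a loop over many files, any non-zero count from count_bad_blocks() is a candidate silent corruption worth logging with the host, disk and block index.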

In any case, by the end of the year I should have at least ten X4500s and can do some testing myself. But your collection is an order of magnitude larger, so you can collect much more useful statistics. If those statistics show no data corruption, then someone like me with far fewer systems can be very confident that no silent corruption is occurring.
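To put a rough number on "very confident": if N independent verifications (say, N file read-backs; the unit of a "trial" is a hypothetical choice) all come back clean, the standard zero-failure bound says the per-trial corruption probability p satisfies (1 - p)^N >= 0.05 at 95% confidence, i.e. p < 1 - 0.05^(1/N), which is roughly 3/N (the "rule of three"). The bound shrinks in proportion to N, which is why a farm ten times larger gives a limit roughly ten times tighter. A sketch, assuming the trials are independent:

```python
def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper confidence limit on the per-trial failure probability p after
    observing zero failures in n_trials independent trials.

    Solves (1 - p) ** n_trials == 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


if __name__ == "__main__":
    # Hypothetical trial counts, for illustration only (not measured data).
    for n in (10_000, 100_000):
        print(f"{n:>7} clean trials -> p < {zero_failure_upper_bound(n):.2e} (95%)")
```

With these illustrative counts the bounds come out at about 3.0e-04 and 3.0e-05: ten times the clean trials, one tenth the upper limit.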

Cheers,
     Bruce
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
