What was needed to fix the systems? Reboot? Hardware replacement?
On Wed, 16 Apr 2008, Gerry Creager wrote:
We've had two fail rather randomly. The failures did cause disk corruption,
but not of the undetected/undetectable sort. They started throwing errors
to syslog, then fell over and stopped accessing disks.
gerry
Bruce Allen wrote:
Hi Gerry,
So far the only problem we have had is with one Areca card that had a bad
2GB memory module. This generated lots of (correctable) single-bit errors
but eventually caused real problems. Could you say something about the
reliability issues you have seen?
Cheers,
Bruce
On Wed, 16 Apr 2008, Gerry Creager wrote:
We've used AoE (CoRAID hardware) with pretty good success (modulo one RAID
shelf fire that was caused by a manufacturing defect and dealt with
promptly by CoRAID). We've had some reliability issues with Areca cards
but no data corruption on the systems we've built that way.
gerry
Bruce Allen wrote:
Hi Xavier,
PPS: We've also been doing some experiments with putting
OpenSolaris+ZFS on some of our generic (Supermicro + Areca) 16-disk
RAID systems, which were originally intended to run Linux.
I think that DESY found some data corruption with such a
configuration, so they switched to OpenSolaris+ZFS.
I'm confused. I am also talking about OpenSolaris+ZFS. What did DESY
try, and what did they switch to?
Sorry, I was indeed not clear. As far as I know, DESY found data
corruption using Linux and Areca cards. They moved from Linux to
OpenSolaris and ZFS, avoiding further corruption. This has been discussed
in the HEPiX storage workgroup. However, I cannot speak on their behalf at
all. I'll try to put you in touch with someone more aware of this issue,
as my statements lack figures.
I think that would be very interesting to the entire Beowulf mailing
list, so please suggest that they respond to the entire group, not just
to me personally. Here is an LKML thread about silent data corruption:
http://kerneltrap.org/mailarchive/linux-kernel/2007/9/10/191697
So far we have not seen any signs of data corruption on Linux+Areca
systems (and our data files carry both internal and external checksums,
so we would be sensitive to this).
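
As an illustration of the external-checksum idea mentioned above, here is a
minimal Python sketch. It assumes a SHA-256 digest stored in a sidecar file
next to each data file; the helper names and the ".sha256" suffix are
hypothetical and are not a description of the setup actually used on the
systems discussed in this thread.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(path):
    """Record the file's digest in a hypothetical '<name>.sha256' sidecar."""
    sidecar = Path(str(path) + ".sha256")
    sidecar.write_text(file_sha256(path) + "\n")
    return sidecar


def verify_sidecar(path):
    """Recompute the digest and compare it with the stored one."""
    sidecar = Path(str(path) + ".sha256")
    expected = sidecar.read_text().strip()
    return file_sha256(path) == expected
```

Running verify_sidecar() periodically over the stored files would flag any
file whose contents no longer match the digest recorded at write time, which
is the kind of check that makes silent corruption visible.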
Cheers,
Bruce
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf