Hi!

* On Sat, Dec 02, 2006 at 05:17 PM (-0800), Kurtis D. Rader wrote:
> On Sat, 2006-12-02 01:56:06, Christoph Anton Mitterer wrote:
> > The issue was basically the following: I found a severe bug mainly by
> > fortune because it occurs very rarely. My test looks like the following:
> > I have about 30GB of testing data on my harddisk,... I repeat verifying
> > sha512 sums on these files and check if errors occur. One test pass
> > verifies the 30GB 50 times,... about one to four differences are found in
> > each pass.
>
> I'm also experiencing silent data corruption on writes to SATA disks
> connected to a Nvidia controller (nForce 4 chipset). The problem is
> 100% reproducible. Details of my configuration (mainboard model, lspci,
> etc.) are near the bottom of this message. What follows is a summation
> of my findings.
>
> I have confirmed the corruption is occurring on the writes and not the
> reads. Furthermore, if I compare the original and copy while both are
> still cached in memory no corruption is found. But as soon as I flush the
> pagecache (by reading another file larger than memory) to force the copy
> of the file to be read from disk the corruption is seen. The corruption
> occurs with direct I/O and normal buffered filesystem I/O (ext3).
>
> Booting with "mem=1g" (system has 4 GiB installed) makes no difference.
> So it isn't due to remapping memory above the 4 GiB boundary. Booting to
> single user and ensuring no unnecessary modules (video, etc.) are loaded
> also makes no difference.
>
> The problem affects both disks attached to the nVidia SATA controller but
> not the two disks attached to the PATA side of the same controller. All
> four disks are different models. The same SATA disks attached to
> the Silicon Image 3114 SATA RAID controller (on the same mainboard)
> experiences the same corruption but at a lower probability. The same
> disks attached to a Promise TX2 SATA controller (in the same system)
> experience no corruption.
>
> The system has run memtest86 for 24 hours with no errors.
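
For reference, the repeated checksum test described in the quoted text could be sketched roughly as follows. This is only a minimal Python sketch under stated assumptions: the function names (sha512_of_file, record_checksums, verify_repeatedly) and the directory walk are hypothetical illustrations, not the script either poster actually used.

```python
import hashlib
import os


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(root):
    """Walk `root` and map each file path to its SHA-512 digest."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            sums[path] = sha512_of_file(path)
    return sums


def verify_repeatedly(reference, passes=50):
    """Re-hash every file `passes` times.

    Returns a list of (pass_number, path) tuples, one per mismatch found.
    """
    mismatches = []
    for pass_no in range(1, passes + 1):
        for path, expected in reference.items():
            if sha512_of_file(path) != expected:
                mismatches.append((pass_no, path))
    return mismatches
```

Note that such a loop only exercises the disk path if the data is actually re-read from disk. As described above, files that are still in the pagecache compare clean, so the data set has to be larger than memory or the cache has to be flushed between passes.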

Although your problem report seems rather clearly related to the disk
subsystem (e.g. as it only seems to appear on writes), I would just like
to point out that running "memtest86" for some time without getting any
errors does not necessarily prove that the memory is faultless.

I recently had a case where a machine (Athlon XP 2200+) crashed at
irregular intervals. "memtest86", running for several days, didn't find
anything, but running the "stress test" of "Prime95" [1] for a few minutes
clearly showed that the machine simply miscalculated (Prime95's stress
test stops in this case).

Just removing and reinserting the two memory modules (2 x 256 MB DDR-RAM)
fixed it. The machine is now stable, and "Prime95" has not stopped due to
computational errors since then. I suppose that one of the modules wasn't
seated correctly in its socket, but I don't know why "memtest86" (and
"memtest86+") didn't find it.

Bye,
Steffen

[1] http://www.mersenne.org/freesoft.htm