On Wed, Sep 9, 2009 at 9:26 AM, Jens Axboe<jens.ax...@oracle.com> wrote:
> On Wed, Sep 09 2009, Daniel J Blueman wrote:
>> On Wed, Sep 9, 2009 at 8:01 AM, Jens Axboe<jens.ax...@oracle.com> wrote:
>> > On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
>> >> On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
>> >> > On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
>> >> > > On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
>> >> > > > On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
>> >> > > > > Just got this error today in my dmesg:
>> >> > > > > btrfs csum failed ino 1483065 off 158482432 csum 4283543305
>> >> > > > > private 43905798
>> >> > > > >
>> >> > > > > linux % find . -inum 1483065
>> >> > > > > ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
>> >> > > > >
>> >> > > > > It's the main pack file from my git linux kernel tree:
>> >> > > > >
>> >> > > >
>> >> > > > Hmm, I ran into something very similar. Care to check what the
>> >> > > > corrupted block of data looks like (and how big it is)?
>> >> > >
>> >> > > I've already deleted the file in question unfortunately.
>> >> > > On IRC Chris decided that either bad RAM or a harddrive error was the
>> >> > > most likely reason for this checksum mismatch.
>> >> >
>> >> > Darn, that's too bad. The corruption issue I had was also in a git pack
>> >> > file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
>> >> > in the file, and I blamed it on the (cheap) SSD drive that hosted the
>> >> > local git repo. It's still the most likely explanation given the nature
>> >> > of the problem, however it would have been really interesting to see
>> >> > what corruption you had.
>> >>
>> >> If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
>> >> be using the same hardware (30GB Vertex in my case).
>> >
>> > Spooky, yes indeed that's the very same drive I'm using. Also see my
>> > postings on this very issue here, top two entries:
>> >
>> > http://axboe.livejournal.com/
>> >
>> > So that pretty much looks like it reaffirms some of my suspicions. Is
>> > the drive in a laptop that you suspend and resume?
>>
>> If you're on firmware < 1.30, the changelog includes some fixes which
>> may be relevant, eg if "block 0" is relative, or you're
>> suspending/resuming:
>>
>> - Race condition occurred during soft reset handler
>> - If read fail occurs during reading stamp information, firmware
>> corrupted block 0.
>> - Power off recovery had bug in certain circumstances
>>
>> http://www.ocztechnologyforum.com/forum/showthread.php?t=57516
>
> The issue is pretty much moot at this point, since OCZ support were not
> really interested in providing any sort of real technical support to
> find out what really caused this issue. My main worry was reliability of
> these cheaper SSD drives, and that worry is still not resolved. If you
> read the blog entries, I do comment on the apparently scary basic bugs
> that are still being fixed on the Indilinx controllers. I do expect some
> basic level of data integrity from a consumer product and at least some
> interest in resolving weird corruption issues if things go wrong. Since
> OCZ cannot provide anything like that, I have a hard time recommending
> these drives for anything but very casual use. Fast, cheap, reliable.
> Pick any two.
>
> My drive was running 1.10 at the time of the problem.

It looks like we need a small tool, lower level than fsx, which performs
patterned block I/O to the device, updating a checksum as it goes, and
runs integrity sweeps at intervals. Either the device can be trusted or
it can't. I had a problem like this with nVidia CK804/MCP55 chipsets
corrupting data under a triple-edge-case workload.
-- 
Daniel J Blueman
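
As a rough illustration of the kind of tool described above, here is a
minimal Python sketch. Everything in it is an assumption made for the
example and not an existing utility: the 4096-byte BLOCK_SIZE, the
SHA-256-derived pattern, CRC32 as the checksum, and the function names
pattern_block, write_pattern and sweep. It goes through the page cache,
so a real test against a device would probably want O_DIRECT (or a
cache drop) between the write pass and each sweep.

```python
import hashlib
import os
import zlib

BLOCK_SIZE = 4096  # assumed block size; adjust to the device under test


def pattern_block(seed, index, size=BLOCK_SIZE):
    """Return a deterministic pseudo-random payload for block `index`."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        # Each 32-byte chunk depends on seed, block index and chunk counter,
        # so no two blocks share a payload.
        out += hashlib.sha256(f"{seed}:{index}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


def write_pattern(path, nblocks, seed):
    """Write `nblocks` patterned blocks to `path`; return their CRC32s."""
    checksums = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for i in range(nblocks):
            block = pattern_block(seed, i)
            os.pwrite(fd, block, i * BLOCK_SIZE)
            checksums.append(zlib.crc32(block))
        os.fsync(fd)  # push the data out of the page cache to the device
    finally:
        os.close(fd)
    return checksums


def sweep(path, checksums):
    """Re-read every block; return the indices whose CRC32 has changed."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for i, expected in enumerate(checksums):
            data = os.pread(fd, BLOCK_SIZE, i * BLOCK_SIZE)
            if zlib.crc32(data) != expected:
                bad.append(i)
    finally:
        os.close(fd)
    return bad
```

The idea would be to call write_pattern() once, keep the returned
checksums somewhere off the device under test, and then call sweep() at
intervals (for instance after each suspend/resume cycle). Any index it
returns points at a block whose contents changed without a write.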