On Wed, Sep 9, 2009 at 9:26 AM, Jens Axboe<jens.ax...@oracle.com> wrote:
> On Wed, Sep 09 2009, Daniel J Blueman wrote:
>> On Wed, Sep 9, 2009 at 8:01 AM, Jens Axboe<jens.ax...@oracle.com> wrote:
>> > On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
>> >> On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
>> >> > On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
>> >> > > On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
>> >> > > > On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
>> >> > > > > Just got this error today in my dmesg:
>> >> > > > > btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 43905798
>> >> > > > >
>> >> > > > > linux % find . -inum 1483065
>> >> > > > > ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
>> >> > > > >
>> >> > > > > It's the main pack file from my git linux kernel tree:
>> >> > > > >
>> >> > > >
>> >> > > > Hmm, I ran into something very similar. Care to check what the
>> >> > > > corrupted block of data looks like (and how big it is)?
>> >> > >
>> >> > > I've already deleted the file in question unfortunately.
>> >> > > On IRC Chris decided that either bad RAM or a hard drive error was the
>> >> > > most likely reason for this checksum mismatch.
>> >> >
>> >> > Darn, that's too bad. The corruption issue I had was also in a git pack
>> >> > file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
>> >> > in the file, and I blamed it on the (cheap) SSD drive that hosted the
>> >> > local git repo. It's still the most likely explanation given the nature
>> >> > of the problem, however it would have been really interesting to see
>> >> > what corruption you had.
>> >>
>> >> If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
>> >> be using the same hardware (30GB Vertex in my case).
>> >
>> > Spooky, yes indeed that's the very same drive I'm using. Also see my
>> > postings on this very issue here, top two entries:
>> >
>> > http://axboe.livejournal.com/
>> >
>> > So that pretty much looks like it reaffirms some of my suspicions. Is
>> > the drive in a laptop that you suspend and resume?
>>
>> If you're on firmware < 1.30, the changelog includes some fixes which
>> may be relevant, e.g. if "block 0" is relative, or you're
>> suspending/resuming:
>>
>> - Race condition occurred during soft reset handler
>> - If read fail occurs during reading stamp information, firmware
>> corrupted block 0.
>> - Power off recovery had bug in certain circumstances
>>
>> http://www.ocztechnologyforum.com/forum/showthread.php?t=57516
>
> The issue is pretty much moot at this point, since OCZ support were not
> really interested in providing any sort of real technical support to
> find out what really caused this issue. My main worry was reliability of
> these cheaper SSD drives, and that worry is still not resolved. If you
> read the blog entries, I do comment on the apparently scary basic bugs
> that are still being fixed on the Indilinx controllers. I do expect some
> basic level of data integrity from a consumer product and at least some
> interest in resolving weird corruption issues if things go wrong. Since
> OCZ cannot provide anything like that, I have a hard time recommending
> these drives for anything but very casual use. Fast, cheap, reliable.
> Pick any two.
>
> My drive was running 1.10 at the time of the problem.

It looks like we need a small tool which performs patterned block I/O
to the device, updating a checksum as it goes and performing
integrity sweeps at intervals, at a lower level than fsx. Either the
drive can be trusted or it can't.
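
To make the idea concrete, here is a rough sketch in Python of what such
a tool might do. It is only an illustration, not an existing utility: the
names make_block, write_pattern and sweep, the 4096-byte block size and
the choice of CRC32 are all assumptions of mine. It also goes through
the page cache, so a real tool would need O_DIRECT (or to drop caches)
before each sweep to be sure the reads actually hit the device.

```python
import os
import struct
import zlib

BLOCK_SIZE = 4096  # assumed block size; a real tool would query the device


def make_block(index, generation, size=BLOCK_SIZE):
    """Build a block whose contents encode its own index and write pass."""
    header = struct.pack("<QQ", index, generation)
    # Repeating the 16-byte header fills the block, so a misplaced or
    # stale block is distinguishable from the one expected at this offset.
    return (header * (size // len(header) + 1))[:size]


def write_pattern(path, nblocks, generation, size=BLOCK_SIZE):
    """Write patterned blocks; return the expected CRC32 of each block."""
    checksums = []
    with open(path, "r+b") as dev:
        for i in range(nblocks):
            block = make_block(i, generation, size)
            dev.seek(i * size)
            dev.write(block)
            checksums.append(zlib.crc32(block))
        dev.flush()
        os.fsync(dev.fileno())  # push the data out of the page cache
    return checksums


def sweep(path, checksums, size=BLOCK_SIZE):
    """Re-read every block; return the indices whose CRC32 mismatches."""
    bad = []
    with open(path, "rb") as dev:
        for i, expected in enumerate(checksums):
            dev.seek(i * size)
            if zlib.crc32(dev.read(size)) != expected:
                bad.append(i)
    return bad
```

Run write_pattern() once against a scratch device or file, then call
sweep() at intervals (for example after each suspend/resume cycle); any
returned index identifies a block that changed underneath us.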

I had a problem like this with nVidia CK804/MCP55 chipsets corrupting
data under a triple-edge case workload.
-- 
Daniel J Blueman