Re: Oops when mounting btrfs partition

2013-02-11 Thread Arnd Bergmann
On Friday 08 February 2013, David Sterba wrote:
> On Mon, Feb 04, 2013 at 09:55:50PM +, Arnd Bergmann wrote:
> > On Saturday 02 February 2013, Chris Mason wrote:
> > 
> > I've done a full backup of all data now, without any further Ooops 
> > messages, but
> > I did get these:
> > 
> > [66155.429029] btrfs no csum found for inode 1212139 start 23707648
> 
> The missing csums were caused by a bug introcuded in 3.8-rc1 and fixed
> in rc5.
> 

Ok, thanks for the information.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Oops when mounting btrfs partition

2013-02-08 Thread David Sterba
On Mon, Feb 04, 2013 at 09:55:50PM +, Arnd Bergmann wrote:
> On Saturday 02 February 2013, Chris Mason wrote:
> 
> I've done a full backup of all data now, without any further Ooops messages, 
> but
> I did get these:
> 
> [66155.429029] btrfs no csum found for inode 1212139 start 23707648

The missing csums were caused by a bug introcuded in 3.8-rc1 and fixed
in rc5.

david
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Oops when mounting btrfs partition

2013-02-04 Thread Arnd Bergmann
On Saturday 02 February 2013, Chris Mason wrote:

> > Feb  1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at 
> > a01fdcf7 [verbose debug info unavailable]
> 
> > Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 
> > 15619835 off 454656 csum 2755731641 private 864823192
> > Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 
> > errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
> > ...
> > Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify 
> > failed on 17006399488 wanted 54700 found 54764
> 
> These aren't good.  With a few exceptions for really tight races in fsx
> use cases, csum errors are bad data from the disk.  The transid verify
> failed shows we wanted to find a metadata block from generation 54700
> but found 54764 instead:
> 

I've done a full backup of all data now, without any further Ooops messages, but
I did get these:

[66155.429029] btrfs no csum found for inode 1212139 start 23707648
[66155.429035] btrfs no csum found for inode 1212139 start 23711744
[66155.429039] btrfs no csum found for inode 1212139 start 23715840
[66155.429042] btrfs no csum found for inode 1212139 start 23719936
[66155.452298] btrfs csum failed ino 1212139 off 23707648 csum 4112094897 
private 0
[66155.452310] btrfs csum failed ino 1212139 off 23711744 csum 3308812742 
private 0
[66155.452316] btrfs csum failed ino 1212139 off 23715840 csum 2566472073 
private 0
[66155.452322] btrfs csum failed ino 1212139 off 23719936 csum 2290008602 
private 0
[66159.876785] btrfs no csum found for inode 1212139 start 69992448
[66159.876792] btrfs no csum found for inode 1212139 start 69996544
[66159.876797] btrfs no csum found for inode 1212139 start 7640
[66159.876801] btrfs no csum found for inode 1212139 start 70004736
[66159.921506] btrfs csum failed ino 1212139 off 69992448 csum 2290360822 
private 0
[66159.921517] btrfs csum failed ino 1212139 off 69996544 csum 954182507 
private 0
[66159.921524] btrfs csum failed ino 1212139 off 7640 csum 2594579850 
private 0
[66159.921532] btrfs csum failed ino 1212139 off 70004736 csum 25334750 private 0
[66932.289905] btrfs csum failed ino 2461761 off 94208 csum 3824674580 private 
1950015541
[92042.101540] btrfs csum failed ino 687755 off 7048040448 csum 2502110259 
private 2186199747
[110952.542245] btrfs csum failed ino 5423479 off 475136 csum 490948044 private 
3797189576
[122692.216371] btrfs csum failed ino 7959218 off 2818048 csum 1904746846 
private 2392844122
[138205.726897] btrfs: sdb1 checksum verify failed on 20495056896 wanted 
8C9759CB found 9BFAB73B level 0

Inode 1212139 is the akonadi database that was used by kmail, so it constantly
got written to during the crashes. The file was completely corrupt. The
other inodes are mostly files that were backed up from the other machine
and have been on the drive I started using it, without ever being accessed.
I've probably had a few bit flips the entire time I was using the machine,
but never noticed before I started using a checksumming file system.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Oops when mounting btrfs partition

2013-02-02 Thread Arnd Bergmann
On Saturday 02 February 2013 10:20:35 Chris Mason wrote:
> Hi Arnd,
> 
> First things first, nospace_cache is a safe thing to use.  It is slow
> because it's finding free extents, but it's just a cache and always safe
> to discard.  With your other errors, I'd just mount it readonly
> and then you won't waste time on atime updates.

Ok, I see. Thanks for taking a look so quickly.

> I'll take a look at the BUG you got during log recovery.  We've fixed a
> few of those during the 3.8 rc cycle.

Well, it happened on 3.8-rc4 and on 3.5 here, so I'd guess it's a
different one.

> > Feb  1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at 
> > a01fdcf7 [verbose debug info unavailable]
> 
> > Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 
> > 15619835 off 454656 csum 2755731641 private 864823192
> > Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 
> > errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
> > ...
> > Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify 
> > failed on 17006399488 wanted 54700 found 54764
> 
> These aren't good.  With a few exceptions for really tight races in fsx
> use cases, csum errors are bad data from the disk.  The transid verify
> failed shows we wanted to find a metadata block from generation 54700
> but found 54764 instead:
> 
> 54700 = 0xD5AC
> 54764 = 0xD5EC
> 
> This same bad block comes up a few different times.

The machine has had problems with data consistency in the past, so
I'm not too surprised with getting a single-bit error, although this
is the first time in a year that I've seen problems, and I replaced
the faulty memory modules some time ago.

Anyway, I already ordered a replacement box a few weeks ago, and that
one will have ECC memory besides being a modern Opteron system to replace
the aging Core 2.

> > Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error 
> > corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288)
> 
> This shows we pulled from the second copy of this block and got the
> right answer, and then wrote the right answer to the duplicate.
> Inode 1 means it was metadata.
> 
> But for some reason still aborted the transaction.  It could have been
> an EIO on the correction, but the auto correction code in 3.5 did work
> well.
> 
> I think your plan to pull the data off and reformat is a good one.  I'd
> also look hard at your ram since drives don't usually send back single bit
> errors.

Ok. I'll wait before reformmatting though, in case you need to take
a look at the data later to find out why it crashed without fsck finding
a problem.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Oops when mounting btrfs partition

2013-02-02 Thread Chris Mason
Hi Arnd,

First things first, nospace_cache is a safe thing to use.  It is slow
because it's finding free extents, but it's just a cache and always safe
to discard.  With your other errors, I'd just mount it readonly
and then you won't waste time on atime updates.

I'll take a look at the BUG you got during log recovery.  We've fixed a
few of those during the 3.8 rc cycle.

> Feb  1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at 
> a01fdcf7 [verbose debug info unavailable]

> Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 
> 15619835 off 454656 csum 2755731641 private 864823192
> Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 
> errs: wr 0, rd 0, flush 0, corrupt 17, gen 0
> ...
> Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify 
> failed on 17006399488 wanted 54700 found 54764

These aren't good.  With a few exceptions for really tight races in fsx
use cases, csum errors are bad data from the disk.  The transid verify
failed shows we wanted to find a metadata block from generation 54700
but found 54764 instead:

54700 = 0xD5AC
54764 = 0xD5EC

This same bad block comes up a few different times.

> Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error 
> corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288)

This shows we pulled from the second copy of this block and got the
right answer, and then wrote the right answer to the duplicate.
Inode 1 means it was metadata.

But for some reason still aborted the transaction.  It could have been
an EIO on the correction, but the auto correction code in 3.5 did work
well.

I think your plan to pull the data off and reformat is a good one.  I'd
also look hard at your ram since drives don't usually send back single bit
errors.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Oops when mounting btrfs partition

2013-02-02 Thread Arnd Bergmann
As mentioned on Google+, I have a partition that I can no longer mount
normally, containing a lot of my personal data and all backups from
my laptop.

I found now that I am still able to mount it using the 'nospace_cache'
option, but it takes a couple of minutes and I get "INFO: task
btrfs-transacti:1698 blocked for more than 120 seconds." messages
reporting the thread to be stuck in

[] wait_for_completion+0x1d/0x20
[] btrfs_wait_and_free_delalloc_work+0x16/0x30 [btrfs]
[] btrfs_run_ordered_operations+0x290/0x2f0 [btrfs]
[] btrfs_commit_transaction+0x5f/0xab0 [btrfs]
[] transaction_kthread+0x1bd/0x240 [btrfs]
[ 2040.620221]  [] kthread+0xc0/0xd0

but that seems harmless otherwise. The system seems slow but usable after
this, but I have seen a corrupt "akonadi" database after that, so I'm
still suspicious and will try to rescue the data and reformat the
partition before I do any real work on this again.

Looking back at the log I actually found multiple errors. There was
a btrfs checksum error a couple of weeks ago, but fsck did not find
anything, so I blamed that on a spurious RAM error.
This is the earliest WARNING I found in the logs I still have:

Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 
15619835 off 454656 csum 2755731641 private 864823192
Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 errs: 
wr 0, rd 0, flush 0, corrupt 17, gen 0
...
Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify failed 
on 17006399488 wanted 54700 found 54764
Jan 21 16:35:40 localhost kernel: [1655047.752517] parent transid verify failed 
on 17006399488 wanted 54700 found 54764
Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error corrected: 
ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288)
Jan 21 16:35:40 localhost kernel: [1655047.752704] btrfs: run_one_delayed_ref 
returned -5
Jan 21 16:35:40 localhost kernel: [1655047.752706] [ cut here 
]
Jan 21 16:35:40 localhost kernel: [1655047.752746] WARNING: at 
/build/buildd/linux-3.5.0/fs/btrfs/super.c:221 
__btrfs_abort_transaction+0xad/0xc0 [btrfs]()
Jan 21 16:35:40 localhost kernel: [1655047.752747] Hardware name: P5Q-EM
Jan 21 16:35:40 localhost kernel: [1655047.752748] btrfs: Transaction aborted
Jan 21 16:35:40 localhost kernel: [1655047.752750] Modules linked in: ufs qnx4 
hfsplus hfs minix ntfs msdos jfs xfs reiserfs ext2 bnep rfcomm parport_pc ppdev 
bluetooth lp parport binfmt_misc dm_crypt coretemp kvm_intel kvm 
snd_hda_codec_hdmi snd_hda_codec_realtek microcode snd_hda_intel snd_hda_codec 
snd_hwdep snd_pcm snd_seq_midi snd_rawmidi serio_raw snd_seq_midi_event lpc_ich 
snd_seq snd_timer snd_seq_device snd soundcore snd_page_alloc asus_atk0110 
mac_hid btrfs zlib_deflate libcrc32c hid_generic i915 firewire_ohci 
drm_kms_helper firewire_core crc_itu_t r8169 drm i2c_algo_bit video 
pata_marvell usbhid hid usb_storage
Jan 21 16:35:40 localhost kernel: [1655047.752789] Pid: 7652, comm: 
btrfs-endio-wri Tainted: GW3.5.0-7-generic #7-Ubuntu
Jan 21 16:35:40 localhost kernel: [1655047.752791] Call Trace:
Jan 21 16:35:40 localhost kernel: [1655047.752800]  [] 
warn_slowpath_common+0x7f/0xc0
Jan 21 16:35:40 localhost kernel: [1655047.752803]  [] 
warn_slowpath_fmt+0x46/0x50
Jan 21 16:35:40 localhost kernel: [1655047.752814]  [] 
__btrfs_abort_transaction+0xad/0xc0 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752826]  [] 
btrfs_run_delayed_refs+0x3f4/0x440 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752840]  [] ? 
merge_state+0xd9/0x150 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752845]  [] ? 
_raw_spin_lock+0xe/0x20
Jan 21 16:35:40 localhost kernel: [1655047.752856]  [] ? 
block_rsv_release_bytes+0x131/0x190 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752869]  [] 
__btrfs_end_transaction+0x9f/0x350 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752882]  [] 
btrfs_end_transaction+0x15/0x20 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752895]  [] 
btrfs_finish_ordered_io+0x17d/0x410 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752907]  [] 
finish_ordered_fn+0x15/0x20 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752921]  [] 
worker_loop+0x15f/0x5a0 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752923]  [] ? 
__schedule+0x3cf/0x7c0
Jan 21 16:35:40 localhost kernel: [1655047.752937]  [] ? 
btrfs_queue_worker+0x330/0x330 [btrfs]
Jan 21 16:35:40 localhost kernel: [1655047.752941]  [] 
kthread+0x93/0xa0
Jan 21 16:35:40 localhost kernel: [1655047.752943]  [] 
kernel_thread_helper+0x4/0x10
Jan 21 16:35:40 localhost kernel: [1655047.752946]  [] ? 
flush_kthread_worker+0x80/0x80
Jan 21 16:35:40 localhost kernel: [1655047.752948]  [] ? 
gs_change+0x13/0x13
Jan 21 16:35:40 localhost kernel: [1655047.752950] ---[ end trace 
5ec1dfe1fd2d ]---
Jan 21 16:35:40 localhost kernel: [1655047.752952] BTRFS error (device sdb1) in 
btrfs_run_delayed_refs:2455: IO failure
Jan 21 16:35:40 localhost kernel: [1655047.75