Re: Oops when mounting btrfs partition
On Friday 08 February 2013, David Sterba wrote: > On Mon, Feb 04, 2013 at 09:55:50PM +, Arnd Bergmann wrote: > > On Saturday 02 February 2013, Chris Mason wrote: > > > > I've done a full backup of all data now, without any further Ooops > > messages, but > > I did get these: > > > > [66155.429029] btrfs no csum found for inode 1212139 start 23707648 > > The missing csums were caused by a bug introcuded in 3.8-rc1 and fixed > in rc5. > Ok, thanks for the information. Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops when mounting btrfs partition
On Mon, Feb 04, 2013 at 09:55:50PM +, Arnd Bergmann wrote: > On Saturday 02 February 2013, Chris Mason wrote: > > I've done a full backup of all data now, without any further Ooops messages, > but > I did get these: > > [66155.429029] btrfs no csum found for inode 1212139 start 23707648 The missing csums were caused by a bug introcuded in 3.8-rc1 and fixed in rc5. david -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops when mounting btrfs partition
On Saturday 02 February 2013, Chris Mason wrote: > > Feb 1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at > > a01fdcf7 [verbose debug info unavailable] > > > Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino > > 15619835 off 454656 csum 2755731641 private 864823192 > > Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 > > errs: wr 0, rd 0, flush 0, corrupt 17, gen 0 > > ... > > Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify > > failed on 17006399488 wanted 54700 found 54764 > > These aren't good. With a few exceptions for really tight races in fsx > use cases, csum errors are bad data from the disk. The transid verify > failed shows we wanted to find a metadata block from generation 54700 > but found 54764 instead: > I've done a full backup of all data now, without any further Ooops messages, but I did get these: [66155.429029] btrfs no csum found for inode 1212139 start 23707648 [66155.429035] btrfs no csum found for inode 1212139 start 23711744 [66155.429039] btrfs no csum found for inode 1212139 start 23715840 [66155.429042] btrfs no csum found for inode 1212139 start 23719936 [66155.452298] btrfs csum failed ino 1212139 off 23707648 csum 4112094897 private 0 [66155.452310] btrfs csum failed ino 1212139 off 23711744 csum 3308812742 private 0 [66155.452316] btrfs csum failed ino 1212139 off 23715840 csum 2566472073 private 0 [66155.452322] btrfs csum failed ino 1212139 off 23719936 csum 2290008602 private 0 [66159.876785] btrfs no csum found for inode 1212139 start 69992448 [66159.876792] btrfs no csum found for inode 1212139 start 69996544 [66159.876797] btrfs no csum found for inode 1212139 start 7640 [66159.876801] btrfs no csum found for inode 1212139 start 70004736 [66159.921506] btrfs csum failed ino 1212139 off 69992448 csum 2290360822 private 0 [66159.921517] btrfs csum failed ino 1212139 off 69996544 csum 954182507 private 0 [66159.921524] btrfs csum failed ino 1212139 off 7640 csum 2594579850 private 0 [66159.921532] btrfs csum failed ino 1212139 off 70004736 csum 25334750 private 0 [66932.289905] btrfs csum failed ino 2461761 off 94208 csum 3824674580 private 1950015541 [92042.101540] btrfs csum failed ino 687755 off 7048040448 csum 2502110259 private 2186199747 [110952.542245] btrfs csum failed ino 5423479 off 475136 csum 490948044 private 3797189576 [122692.216371] btrfs csum failed ino 7959218 off 2818048 csum 1904746846 private 2392844122 [138205.726897] btrfs: sdb1 checksum verify failed on 20495056896 wanted 8C9759CB found 9BFAB73B level 0 Inode 1212139 is the akonadi database that was used by kmail, so it constantly got written to during the crashes. The file was completely corrupt. The other inodes are mostly files that were backed up from the other machine and have been on the drive I started using it, without ever being accessed. I've probably had a few bit flips the entire time I was using the machine, but never noticed before I started using a checksumming file system. Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops when mounting btrfs partition
On Saturday 02 February 2013 10:20:35 Chris Mason wrote: > Hi Arnd, > > First things first, nospace_cache is a safe thing to use. It is slow > because it's finding free extents, but it's just a cache and always safe > to discard. With your other errors, I'd just mount it readonly > and then you won't waste time on atime updates. Ok, I see. Thanks for taking a look so quickly. > I'll take a look at the BUG you got during log recovery. We've fixed a > few of those during the 3.8 rc cycle. Well, it happened on 3.8-rc4 and on 3.5 here, so I'd guess it's a different one. > > Feb 1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at > > a01fdcf7 [verbose debug info unavailable] > > > Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino > > 15619835 off 454656 csum 2755731641 private 864823192 > > Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 > > errs: wr 0, rd 0, flush 0, corrupt 17, gen 0 > > ... > > Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify > > failed on 17006399488 wanted 54700 found 54764 > > These aren't good. With a few exceptions for really tight races in fsx > use cases, csum errors are bad data from the disk. The transid verify > failed shows we wanted to find a metadata block from generation 54700 > but found 54764 instead: > > 54700 = 0xD5AC > 54764 = 0xD5EC > > This same bad block comes up a few different times. The machine has had problems with data consistency in the past, so I'm not too surprised with getting a single-bit error, although this is the first time in a year that I've seen problems, and I replaced the faulty memory modules some time ago. Anyway, I already ordered a replacement box a few weeks ago, and that one will have ECC memory besides being a modern Opteron system to replace the aging Core 2. > > Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error > > corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288) > > This shows we pulled from the second copy of this block and got the > right answer, and then wrote the right answer to the duplicate. > Inode 1 means it was metadata. > > But for some reason still aborted the transaction. It could have been > an EIO on the correction, but the auto correction code in 3.5 did work > well. > > I think your plan to pull the data off and reformat is a good one. I'd > also look hard at your ram since drives don't usually send back single bit > errors. Ok. I'll wait before reformmatting though, in case you need to take a look at the data later to find out why it crashed without fsck finding a problem. Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops when mounting btrfs partition
Hi Arnd, First things first, nospace_cache is a safe thing to use. It is slow because it's finding free extents, but it's just a cache and always safe to discard. With your other errors, I'd just mount it readonly and then you won't waste time on atime updates. I'll take a look at the BUG you got during log recovery. We've fixed a few of those during the 3.8 rc cycle. > Feb 1 22:57:37 localhost kernel: [ 8561.599482] Kernel BUG at > a01fdcf7 [verbose debug info unavailable] > Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino > 15619835 off 454656 csum 2755731641 private 864823192 > Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 > errs: wr 0, rd 0, flush 0, corrupt 17, gen 0 > ... > Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify > failed on 17006399488 wanted 54700 found 54764 These aren't good. With a few exceptions for really tight races in fsx use cases, csum errors are bad data from the disk. The transid verify failed shows we wanted to find a metadata block from generation 54700 but found 54764 instead: 54700 = 0xD5AC 54764 = 0xD5EC This same bad block comes up a few different times. > Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error > corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288) This shows we pulled from the second copy of this block and got the right answer, and then wrote the right answer to the duplicate. Inode 1 means it was metadata. But for some reason still aborted the transaction. It could have been an EIO on the correction, but the auto correction code in 3.5 did work well. I think your plan to pull the data off and reformat is a good one. I'd also look hard at your ram since drives don't usually send back single bit errors. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Oops when mounting btrfs partition
As mentioned on Google+, I have a partition that I can no longer mount normally, containing a lot of my personal data and all backups from my laptop. I found now that I am still able to mount it using the 'nospace_cache' option, but it takes a couple of minutes and I get "INFO: task btrfs-transacti:1698 blocked for more than 120 seconds." messages reporting the thread to be stuck in [] wait_for_completion+0x1d/0x20 [] btrfs_wait_and_free_delalloc_work+0x16/0x30 [btrfs] [] btrfs_run_ordered_operations+0x290/0x2f0 [btrfs] [] btrfs_commit_transaction+0x5f/0xab0 [btrfs] [] transaction_kthread+0x1bd/0x240 [btrfs] [ 2040.620221] [] kthread+0xc0/0xd0 but that seems harmless otherwise. The system seems slow but usable after this, but I have seen a corrupt "akonadi" database after that, so I'm still suspicious and will try to rescue the data and reformat the partition before I do any real work on this again. Looking back at the log I actually found multiple errors. There was a btrfs checksum error a couple of weeks ago, but fsck did not find anything, so I blamed that on a spurious RAM error. This is the earliest WARNING I found in the logs I still have: Jan 14 19:18:42 localhost kernel: [1060055.746373] btrfs csum failed ino 15619835 off 454656 csum 2755731641 private 864823192 Jan 14 19:18:42 localhost kernel: [1060055.746381] btrfs: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 17, gen 0 ... Jan 21 16:35:40 localhost kernel: [1655047.701147] parent transid verify failed on 17006399488 wanted 54700 found 54764 Jan 21 16:35:40 localhost kernel: [1655047.752517] parent transid verify failed on 17006399488 wanted 54700 found 54764 Jan 21 16:35:40 localhost kernel: [1655047.752692] btrfs read error corrected: ino 1 off 17006399488 (dev /dev/sdb1 sector 64689288) Jan 21 16:35:40 localhost kernel: [1655047.752704] btrfs: run_one_delayed_ref returned -5 Jan 21 16:35:40 localhost kernel: [1655047.752706] [ cut here ] Jan 21 16:35:40 localhost kernel: [1655047.752746] WARNING: at /build/buildd/linux-3.5.0/fs/btrfs/super.c:221 __btrfs_abort_transaction+0xad/0xc0 [btrfs]() Jan 21 16:35:40 localhost kernel: [1655047.752747] Hardware name: P5Q-EM Jan 21 16:35:40 localhost kernel: [1655047.752748] btrfs: Transaction aborted Jan 21 16:35:40 localhost kernel: [1655047.752750] Modules linked in: ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs reiserfs ext2 bnep rfcomm parport_pc ppdev bluetooth lp parport binfmt_misc dm_crypt coretemp kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek microcode snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer snd_seq_device snd soundcore snd_page_alloc asus_atk0110 mac_hid btrfs zlib_deflate libcrc32c hid_generic i915 firewire_ohci drm_kms_helper firewire_core crc_itu_t r8169 drm i2c_algo_bit video pata_marvell usbhid hid usb_storage Jan 21 16:35:40 localhost kernel: [1655047.752789] Pid: 7652, comm: btrfs-endio-wri Tainted: GW3.5.0-7-generic #7-Ubuntu Jan 21 16:35:40 localhost kernel: [1655047.752791] Call Trace: Jan 21 16:35:40 localhost kernel: [1655047.752800] [] warn_slowpath_common+0x7f/0xc0 Jan 21 16:35:40 localhost kernel: [1655047.752803] [] warn_slowpath_fmt+0x46/0x50 Jan 21 16:35:40 localhost kernel: [1655047.752814] [] __btrfs_abort_transaction+0xad/0xc0 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752826] [] btrfs_run_delayed_refs+0x3f4/0x440 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752840] [] ? merge_state+0xd9/0x150 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752845] [] ? _raw_spin_lock+0xe/0x20 Jan 21 16:35:40 localhost kernel: [1655047.752856] [] ? block_rsv_release_bytes+0x131/0x190 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752869] [] __btrfs_end_transaction+0x9f/0x350 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752882] [] btrfs_end_transaction+0x15/0x20 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752895] [] btrfs_finish_ordered_io+0x17d/0x410 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752907] [] finish_ordered_fn+0x15/0x20 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752921] [] worker_loop+0x15f/0x5a0 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752923] [] ? __schedule+0x3cf/0x7c0 Jan 21 16:35:40 localhost kernel: [1655047.752937] [] ? btrfs_queue_worker+0x330/0x330 [btrfs] Jan 21 16:35:40 localhost kernel: [1655047.752941] [] kthread+0x93/0xa0 Jan 21 16:35:40 localhost kernel: [1655047.752943] [] kernel_thread_helper+0x4/0x10 Jan 21 16:35:40 localhost kernel: [1655047.752946] [] ? flush_kthread_worker+0x80/0x80 Jan 21 16:35:40 localhost kernel: [1655047.752948] [] ? gs_change+0x13/0x13 Jan 21 16:35:40 localhost kernel: [1655047.752950] ---[ end trace 5ec1dfe1fd2d ]--- Jan 21 16:35:40 localhost kernel: [1655047.752952] BTRFS error (device sdb1) in btrfs_run_delayed_refs:2455: IO failure Jan 21 16:35:40 localhost kernel: [1655047.75