btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
Hi, I just got this oops on a computer running 3.4.2. A few minutes before I had started "btrfs device scrub /" and had a watcher process running "btrfs scrub status /" every 5 seconds. After a few gigabytes of scrubbing, I got this crash. The oops is transcribed from photos, so it may contain some errors. I tried to be careful, and double checked the backtrace. Sami general protection fault: [#1] SMP CPU 4 Modules linked in: tcp_diag inet_diag nfnetlink_log nfnetlink ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext3 jbd ext2 ip6_tables ebtable_nat ebtables cn rfcomm bnep parport_pc ppdev lp parport tun cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc fuse nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc iptable_filter ipt_MASQUERADE ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ip_tables x_tables xfs ext4 jbd2 mbcache radeon drm_kms_helper ttm drm i2c_algo_bit loop kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio snd_usbmidi_lib snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm acpi_cpufreq snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmi di ath3k snd_seq snd_seq_device snd_timer iTCO_wdt bluetooth eeepci_wmi asus_wmi sparse_keymap crc16 rfkill pcspkr psmouse coretemp serio_raw evdev mperf pci_hotplug i2c_i801 i2c_core processor button intel_agp snd mxm_wmi video wmi intel_gtt microcode soundcore sha256_generic dm_crypt dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq md_mod nbd btrfs libcrc32c zlib_deflate sd_mod crc_t10dif crc32c_intel ghash_cmulni_intel firewire_ohci r8196 firewire_core ahci aesni_intel libahci mii crc_itu_t aes_x86_64 libata aes_generic cryptd scsi_mod e1000e thermal fa n thermal_sys [last unloaded: scsi_wait_scan] Pid: 30863, comm: btrfs-endio-met Tainted: GW3.4.2 #1 System manufacturer System Product Name/P8P67 EVO RIP: 0010:[] [] memcpy+0xd/0x110 RSP: :88003174dba8 EFLAGS: 00010202 RAX: 88003174dc8f RBX: 0011 RCX: 0002 RDX: 0001 RSI: 00050803 RDI: 88003174dc8f RBP: 88003174dbf0 R08: 000a R09: R10: R11: R12: 88003174dca0 R13: 8800659f42b0 R14: 0048 R15: 0011 FS: () GS:88021ed0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 0973c000 CR3: 000167ef3000 CR4: 000407e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process btrfs-endio-met (pid: 30863, threadinfo 88003174c000, task 88006f818000) Stack: a026bd6b 8801960f5000 8003 1000 88003174dc58 03dd 88000ac13c60 88003174dc58 696f70203a61685f 88003174dc00 a026904d 88003174dcd0 Call Trace: [] ? read_extent_buffer+0xbb/0x110 [btrfs] [] btrfs_node_key+0x1d/0x20 [btrfs] [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs] [] btree_readahead_hook+0x1f/0x40 [btrfs] [] btree_readpage_end_io_hook+0x111/0x260 [btrfs] [] ? find_first_extent_bit_state+0x22/0x80 [btrfs] [] end_bio_extent_readpage+0xcb/0xa30 [btrfs] [] ? end_workqueue_fn+0x31/0x50 [btrfs] [] bio_endio+0x18/0x30 [] end_workqueue_fn+0x3c/0x50 [btrfs] [] worker_loop+0x157/0x560 [btrfs] [] ? btrfs_queue_worker+0x310/0x310 [btrfs] [] kthread+0x8e/0xa0 [] kernel_thread_helper+0x4/0x10 [] ? flush_kthread_worker+0x70/0x70 [] ? gs_change+0x13/0x13 Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 eb fb ff ff eb e1 90 90 90 90 90 90 90... 8 4c 8b 56 10 4c RIP [] memcpy+0xd/0x110 RSP signature.asc Description: Digital signature
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Tue, 3 Jul 2012 02:01:21 +0300, Sami Liedes wrote: > Hi, > > I just got this oops on a computer running 3.4.2. > > A few minutes before I had started "btrfs device scrub /" and had a > watcher process running "btrfs scrub status /" every 5 seconds. After > a few gigabytes of scrubbing, I got this crash. > > The oops is transcribed from photos, so it may contain some errors. I > tried to be careful, and double checked the backtrace. > > Sami > > > general protection fault: [#1] SMP > CPU 4 > Modules linked in: tcp_diag inet_diag nfnetlink_log nfnetlink ufs qnx4 > hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext3 jbd ext2 ip6_tables > ebtable_nat ebtables cn rfcomm bnep > parport_pc ppdev lp parport tun cpufreq_userspace cpufreq_stats > cpufreq_powersave cpufreq_conservative binfmt_misc fuse nfsd nfs nfs_acl > auth_rpcgss fscache lockd sunrpc iptable_filter ipt_MASQUERADE > ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack > ip_tables x_tables xfs ext4 jbd2 mbcache radeon drm_kms_helper ttm drm > i2c_algo_bit loop kvm_intel kvm snd_hda_codec_hdmi > snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio > snd_usbmidi_lib snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm > acpi_cpufreq snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmi > di ath3k snd_seq snd_seq_device snd_timer iTCO_wdt bluetooth eeepci_wmi > asus_wmi sparse_keymap crc16 rfkill pcspkr psmouse coretemp serio_raw evdev > mperf pci_hotplug i2c_i801 i2c_core processor button > intel_agp snd mxm_wmi video wmi intel_gtt microcode soundcore sha256_generic > dm_crypt dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor > async_tx raid6_pq md_mod nbd btrfs libcrc32c > zlib_deflate sd_mod crc_t10dif crc32c_intel ghash_cmulni_intel firewire_ohci > r8196 firewire_core ahci aesni_intel libahci mii crc_itu_t aes_x86_64 libata > aes_generic cryptd scsi_mod e1000e thermal fa > n thermal_sys [last unloaded: scsi_wait_scan] > > Pid: 30863, comm: btrfs-endio-met Tainted: GW3.4.2 #1 System > manufacturer System Product Name/P8P67 EVO > RIP: 0010:[] [] memcpy+0xd/0x110 > RSP: :88003174dba8 EFLAGS: 00010202 > RAX: 88003174dc8f RBX: 0011 RCX: 0002 > RDX: 0001 RSI: 00050803 RDI: 88003174dc8f > RBP: 88003174dbf0 R08: 000a R09: > R10: R11: R12: 88003174dca0 > R13: 8800659f42b0 R14: 0048 R15: 0011 > FS: () GS:88021ed0() knlGS: > CS: 0010 DS: ES: CR0: 8005003b > CR2: 0973c000 CR3: 000167ef3000 CR4: 000407e0 > DR0: DR1: DR2: > DR3: DR6: 0ff0 DR7: 0400 > Process btrfs-endio-met (pid: 30863, threadinfo 88003174c000, task > 88006f818000) > Stack: > a026bd6b 8801960f5000 8003 1000 > 88003174dc58 03dd 88000ac13c60 88003174dc58 > 696f70203a61685f 88003174dc00 a026904d 88003174dcd0 > Call Trace: > [] ? read_extent_buffer+0xbb/0x110 [btrfs] > [] btrfs_node_key+0x1d/0x20 [btrfs] > [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs] > [] btree_readahead_hook+0x1f/0x40 [btrfs] > [] btree_readpage_end_io_hook+0x111/0x260 [btrfs] > [] ? find_first_extent_bit_state+0x22/0x80 [btrfs] > [] end_bio_extent_readpage+0xcb/0xa30 [btrfs] > [] ? end_workqueue_fn+0x31/0x50 [btrfs] > [] bio_endio+0x18/0x30 > [] end_workqueue_fn+0x3c/0x50 [btrfs] > [] worker_loop+0x157/0x560 [btrfs] > [] ? btrfs_queue_worker+0x310/0x310 [btrfs] > [] kthread+0x8e/0xa0 > [] kernel_thread_helper+0x4/0x10 > [] ? flush_kthread_worker+0x70/0x70 > [] ? gs_change+0x13/0x13 > Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 eb fb ff ff eb e1 90 90 90 > 90 90 90 90... > 8 4c 8b 56 10 4c > RIP [] memcpy+0xd/0x110 > RSP > Tonight my box had the same on a system running 3.7.10. Also with a Btrfs scrub running. general protection fault: [#1] SMP Modules linked in: xt_LOG xt_limit xt_multiport iptable_nat nf_nat_ipv4 nf_nat ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_mangle ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth sp5100_tco edac_core edac_mce_amd k8temp i2c_piix4 kvm_amd kvm e1000e serio_raw mac_hid shpchp lp parport btrfs zlib_deflate libcrc32c raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath linear ahci libahci [last unloaded: ipmi_msghandler] CPU 1 Pid: 10376, comm: btrfs-endio-met Not tainted 3.7.10-030710-generic #201302271235 MICRO-STAR INTERNATIONAL CO., LTD MS-96B3/MS-96B3 RIP: 0010:[] [] memc
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Tue, Jul 03, 2012 at 02:01:21AM +0300, Sami Liedes wrote: > The oops is transcribed from photos, so it may contain some errors. I > tried to be careful, and double checked the backtrace. Forgot to menion, the fs is a two-partition filesystem, with both data and metadata mirrored. Label: none uuid: f26f08b1-89e6-4f2d-b7f9-857010dd2517 Total devices 2 FS bytes used 778.31GB devid2 size 1.07TB used 1.05TB path /dev/dm-6 devid1 size 1.07TB used 1.05TB path /dev/dm-5 Sami signature.asc Description: Digital signature
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
> I just got this oops on a computer running 3.4.2. > > A few minutes before I had started "btrfs device scrub /" and had a > watcher process running "btrfs scrub status /" every 5 seconds. After > a few gigabytes of scrubbing, I got this crash. > > The oops is transcribed from photos, so it may contain some errors. I You did *what*? :-) Uploading a photo would be fine, just in case that's easier for you the next time. > tried to be careful, and double checked the backtrace. > > Sami > > > general protection fault: [#1] SMP > CPU 4 > Modules linked in: tcp_diag inet_diag nfnetlink_log nfnetlink ufs qnx4 > hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext3 jbd ext2 ip6_tables > ebtable_nat ebtables cn rfcomm bnep > parport_pc ppdev lp parport tun cpufreq_userspace cpufreq_stats > cpufreq_powersave cpufreq_conservative binfmt_misc fuse nfsd nfs nfs_acl > auth_rpcgss fscache lockd sunrpc iptable_filter ipt_MASQUERADE > ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack > ip_tables x_tables xfs ext4 jbd2 mbcache radeon drm_kms_helper ttm drm > i2c_algo_bit loop kvm_intel kvm snd_hda_codec_hdmi > snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio > snd_usbmidi_lib snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm > acpi_cpufreq snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmi > di ath3k snd_seq snd_seq_device snd_timer iTCO_wdt bluetooth eeepci_wmi > asus_wmi sparse_keymap crc16 rfkill pcspkr psmouse coretemp serio_raw evdev > mperf pci_hotplug i2c_i801 i2c_core processor button > intel_agp snd mxm_wmi video wmi intel_gtt microcode soundcore sha256_generic > dm_crypt dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor > async_tx raid6_pq md_mod nbd btrfs libcrc32c > zlib_deflate sd_mod crc_t10dif crc32c_intel ghash_cmulni_intel firewire_ohci > r8196 firewire_core ahci aesni_intel libahci mii crc_itu_t aes_x86_64 libata > aes_generic cryptd scsi_mod e1000e thermal fa > n thermal_sys [last unloaded: scsi_wait_scan] > > Pid: 30863, comm: btrfs-endio-met Tainted: GW3.4.2 #1 System > manufacturer System Product Name/P8P67 EVO > RIP: 0010:[] [] memcpy+0xd/0x110 > RSP: :88003174dba8 EFLAGS: 00010202 > RAX: 88003174dc8f RBX: 0011 RCX: 0002 > RDX: 0001 RSI: 00050803 RDI: 88003174dc8f > RBP: 88003174dbf0 R08: 000a R09: > R10: R11: R12: 88003174dca0 > R13: 8800659f42b0 R14: 0048 R15: 0011 > FS: () GS:88021ed0() knlGS: > CS: 0010 DS: ES: CR0: 8005003b > CR2: 0973c000 CR3: 000167ef3000 CR4: 000407e0 > DR0: DR1: DR2: > DR3: DR6: 0ff0 DR7: 0400 > Process btrfs-endio-met (pid: 30863, threadinfo 88003174c000, task > 88006f818000) > Stack: > a026bd6b 8801960f5000 8003 1000 > 88003174dc58 03dd 88000ac13c60 88003174dc58 > 696f70203a61685f 88003174dc00 a026904d 88003174dcd0 > Call Trace: > [] ? read_extent_buffer+0xbb/0x110 [btrfs] > [] btrfs_node_key+0x1d/0x20 [btrfs] > [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs] > [] btree_readahead_hook+0x1f/0x40 [btrfs] > [] btree_readpage_end_io_hook+0x111/0x260 [btrfs] > [] ? find_first_extent_bit_state+0x22/0x80 [btrfs] > [] end_bio_extent_readpage+0xcb/0xa30 [btrfs] > [] ? end_workqueue_fn+0x31/0x50 [btrfs] > [] bio_endio+0x18/0x30 > [] end_workqueue_fn+0x3c/0x50 [btrfs] > [] worker_loop+0x157/0x560 [btrfs] > [] ? btrfs_queue_worker+0x310/0x310 [btrfs] > [] kthread+0x8e/0xa0 > [] kernel_thread_helper+0x4/0x10 > [] ? flush_kthread_worker+0x70/0x70 > [] ? gs_change+0x13/0x13 > Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 eb fb ff ff eb e1 90 90 90 > 90 90 90 90... > 8 4c 8b 56 10 4c > RIP [] memcpy+0xd/0x110 > RSP That's looking strange. I checked the readahead code again: It deliberately skips locking and uses btrfs_node_key with a counter variable. This means, we might end up reading a key that's no longer actually there. However, it only operates on nodes of trees, not leaves. Node entries have a fixed size, so no matter what changes in the node, you won't reach behind the end of that node with an index that was valid the moment before. As far as I see it, that algorithm is safe. It could miss some keys or do some extra work that's not strictly required, but it should never reach a GPF from btrfs_node_key. If no other ideas come up, I'd try memtesting that machine. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Tue, Jul 03, 2012 at 02:01:21AM +0300, Sami Liedes wrote: > I just got this oops on a computer running 3.4.2. I now repeated this on 3.4.4. Merely running a "btrfs scrub start /" causes this after a couple of minutes of running. This time I didn't run "btrfs scrub status /" in a loop, so that's not the cause. And judging from the two tries and two crashes, this seems very repeatable. The backtrace is basically exactly the same as with 3.4.4. Sami signature.asc Description: Digital signature
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Tue, Jul 03, 2012 at 03:11:04PM +0200, Jan Schmidt wrote: > > The oops is transcribed from photos, so it may contain some errors. I > > You did *what*? :-) Uploading a photo would be fine, just in case that's > easier > for you the next time. Nah, it's sometimes refreshing to do something that doesn't take too much thinking for a while. Besides I fault myself for not having had netconsole logging on that machine and needed to punish myself for that omission ;) > That's looking strange. > > I checked the readahead code again: It deliberately skips locking and uses > btrfs_node_key with a counter variable. This means, we might end up reading a > key that's no longer actually there. However, it only operates on nodes of > trees, not leaves. Node entries have a fixed size, so no matter what changes > in > the node, you won't reach behind the end of that node with an index that was > valid the moment before. > > As far as I see it, that algorithm is safe. It could miss some keys or do some > extra work that's not strictly required, but it should never reach a GPF from > btrfs_node_key. > > If no other ideas come up, I'd try memtesting that machine. Ran a full pass of memtest86+ v4.20, with no errors found. Also the machine works very well in all other respects under heavy load. I think I might try setting up that netconsole to see if there are any interesting console messages before the oops... As I said, I also was able to reproduce this on 3.4.4, so ATM I assume I'm able to reproduce this at will. Sami signature.asc Description: Digital signature
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On 03.07.2012 15:58, Sami Liedes wrote: > I think I might try setting up that netconsole to see if there are any > interesting console messages before the oops... As I said, I also was > able to reproduce this on 3.4.4, so ATM I assume I'm able to reproduce > this at will. That would be helpful. It's good that it's reproducible. Please paste the entire output if possible. If there's nothing immediately useful in there, I'll try to make up some debugging patches for that problem. Thanks! -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Tue, Jul 03, 2012 at 04:35:25PM +0200, Jan Schmidt wrote: > On 03.07.2012 15:58, Sami Liedes wrote: > > I think I might try setting up that netconsole to see if there are any > > interesting console messages before the oops... As I said, I also was > > able to reproduce this on 3.4.4, so ATM I assume I'm able to reproduce > > this at will. > > That would be helpful. It's good that it's reproducible. Please paste the > entire > output if possible. > > If there's nothing immediately useful in there, I'll try to make up some > debugging patches for that problem. There's certainly something there; unfortunately netconsole eats most of it. (Is there a better way to capture full dmesg from panic? Assuming a computer without a serial port.) I've seen this before: An overly long "Modules linked in:" line causes a large gap in netconsole output. Hovever perhaps the most interesting piece is there: [30846.822395] WARNING: at fs/btrfs/extent_io.c:4522 read_extent_buffer+0xe6/0x110 [btrfs]() Unfortunately, no backtrace for that :( Ideas? Sami [30638.258946] netpoll: netconsole: local port [30638.258950] netpoll: netconsole: local IP 192.168.1.2 [30638.258952] netpoll: netconsole: interface 'eth0' [30638.258953] netpoll: netconsole: remote port 1194 [30638.258954] netpoll: netconsole: remote IP 192.168.1.73 [30638.258956] netpoll: netconsole: remote ethernet address 00:1c:10:44:47:2c [30638.259002] console [netcon0] enabled [30638.259004] netconsole: network logging started [30846.822356] [ cut here ] [30846.822395] WARNING: at fs/btrfs/extent_io.c:4522 read_extent_buffer+0xe6/0x110 [btrfs]() [30846.822400] Hardware name: System Product Name [30846.822402] Modules linked in: netconsole configfs tun cn ip6_tables ebtable_nat ebtables parport_pc ppdev lp parport bnep rfcomm cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc iptable_filter ipt_MASQUERADE ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack ip_tables x_tables xfs ext4 jbd2 mbcache radeon drm_kms_helper ttm drm i2c_algo_bit loop fuse kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_usb_audio snd_hda_intel snd_usbmidi_lib snd_hda_codec snd_hwdep snd_pcm_oss joydev snd_mixer_oss snd_pcm snd_page_alloc eeepc_wmi90 90 90 48 89 f8 48 89 90 48 c1 e9 03 83 e2 07 48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c [30846.825219] RIP [] memcpy+0xd/0x110 [30846.825248] RSP xord1 raid6_pq md_mod nbd btrfs libcrc32c zlib_deflate sd_mod crc_t10dif ahci libahci crc32c_intel r8169 pci_hotplug bluetooth snd_seq_device snd_timer snd intel_agp acpi_cpufreq mperf mxm_wmi psmouse wmi coretemp soundcore processor serio_raw i2c_i801 iTCO_wdt video rfkill microcode crc16 evdev i2c_core button intel_gtt[30846.822679] [] btree_readpage_end_io_hook+0x111/0x260 [btrfs] sha256_generic dm_crypt dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx[30846.822773] [] ? gs_change+0x13/0x13 md_mod nbd btrfs libcrc32c zlib_deflate sd_mod crc_t10dif ahci libahci raid6_pq r8169 mii firewire_ohci ghash_clmulni_intel firewire_core aesni_intel crc_itu_t aes_x86_64 bnep aes_generic cryptd scsi_mod e1000e fan thermal nfsd thermal_sys [last unloaded: netconsole] auth_rpcgss[30846.823704] Pid: 16224, comm: btrfs-endio-met Tainted: GW3.4.4 #1 System manufacturer System Product Name/P8P67 EVO [30846.823767] RIP: 0010:[] [] memcpy+0xd/0x110 [30846.823805] RSP: :88009ecebba8 EFLAGS: 00010202 [30846.823830] RAX: 88009ecebc8f RBX: 0011 RCX: 0002 [30846.823865] RDX: 0001 RSI: 00050803 RDI: 88009ecebc8f [30846.823933] R10: 88001d2395c0 R11: 8801f198c780 R12: 88009ecebca0 [30846.823968] R13: 88008eb01178 R14: 0048 R15: 0011 [30846.824002] FS: () GS:88021ec8() knlGS: [30846.824042] CS: 0010 DS: ES: CR0: 8005003b [30846.824069] CR2: 0add813c CR3: 000163305000 CR4: 000407e0 [30846.823694] [30846.824101] DR0: DR1: DR2: [30846.824136] DR3: DR6: 0ff0 DR7: 0400 [30846.824171] Process btrfs-endio-met (pid: 16224, threadinfo 88009ecea000, task 88020f3142f0) [30846.824225] a0275d5b 880167d3 8003 1000 [30846.824273] 88009ecebc58 03dd[30846.824214] Stack: 880099cdea20 [30846.824318] 6172745f69706d20 88009ecebc00 a026d03d 88009ecebc58 88009ecebcd0 [30846.824363] Call Trace: [30846.824434] [] btrfs_node_key+0x1d/0x20 [btrfs] [30846.824474] [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs] [30846.824517] [] btree_readahead_hook+0x1f/0x40 [btrfs]
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Wed, Jul 04, 2012 at 01:47:56AM +0300, Sami Liedes wrote: > I've seen this before: An overly long "Modules linked in:" line causes > a large gap in netconsole output. I managed to capture the entire output using netconsole by modifying the kernel to not output the list of modules. Sami [ 125.827919] netpoll: netconsole: local port [ 125.827946] netpoll: netconsole: local IP 192.168.1.2 [ 125.827969] netpoll: netconsole: interface 'eth0' [ 125.827990] netpoll: netconsole: remote port 1194 [ 125.828011] netpoll: netconsole: remote IP 192.168.1.73 [ 125.828034] netpoll: netconsole: remote ethernet address 00:1c:10:44:47:2c [ 125.828169] console [netcon0] enabled [ 125.828193] netconsole: network logging started [ 247.787472] [ cut here ] [ 247.787536] WARNING: at fs/btrfs/extent_io.c:4522 read_extent_buffer+0xe6/0x110 [btrfs]() [ 247.787573] Hardware name: System Product Name [ 247.787594] Modules linked in: [last unloaded: scsi_wait_scan] [ 247.787632] Pid: 1146, comm: btrfs-endio-met Tainted: GW 3.4.4-modded-oops+ #1 [ 247.787674] Call Trace: [ 247.787692] [] warn_slowpath_common+0x7a/0xb0 [ 247.787794] [] warn_slowpath_null+0x15/0x20 [ 247.787835] [] read_extent_buffer+0xe6/0x110 [btrfs] [ 247.787877] [] btrfs_node_key+0x1d/0x20 [btrfs] [ 247.787917] [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs] [ 247.787959] [] btree_readahead_hook+0x1f/0x40 [btrfs] [ 247.787999] [] btree_readpage_end_io_hook+0x111/0x260 [btrfs] [ 247.788043] [] ? find_first_extent_bit_state+0x22/0x80 [btrfs] [ 247.788087] [] end_bio_extent_readpage+0xcb/0xa30 [btrfs] [ 247.788129] [] ? end_workqueue_fn+0x31/0x50 [btrfs] [ 247.788161] [] bio_endio+0x18/0x30 [ 247.788194] [] end_workqueue_fn+0x3c/0x50 [btrfs] [ 247.788234] [] worker_loop+0x157/0x560 [btrfs] [ 247.788272] [] ? btrfs_queue_worker+0x310/0x310 [btrfs] [ 247.788306] [] kthread+0x8e/0xa0 [ 247.788332] [] kernel_thread_helper+0x4/0x10 [ 247.788360] [] ? flush_kthread_worker+0x70/0x70 [ 247.788389] [] ? gs_change+0x13/0x13 [ 247.788413] ---[ end trace e93713a9d40cd06e ]--- [ 247.788444] general protection fault: [#1] SMP [ 247.788473] CPU 5 [ 247.788484] Modules linked in: [last unloaded: scsi_wait_scan] [ 247.788523] [ 247.788533] Pid: 1146, comm: btrfs-endio-met Tainted: GW 3.4.4-modded-oops+ #1 System manufacturer System Product Name/P8P67 EVO [ 247.788595] RIP: 0010:[] [] memcpy+0xd/0x110 [ 247.788632] RSP: :8802152f3ba8 EFLAGS: 00010202 [ 247.788656] RAX: 8802152f3c8f RBX: 0011 RCX: 0002 [ 247.788687] RDX: 0001 RSI: 00050803 RDI: 8802152f3c8f [ 247.788719] RBP: 8802152f3bf0 R08: R09: [ 247.788750] R10: 880207dd89c0 R11: 8801f3a90780 R12: 8802152f3ca0 [ 247.788782] R13: 8801c1d5b830 R14: 0048 R15: 0011 [ 247.788814] FS: () GS:88021ed4() knlGS: [ 247.788850] CS: 0010 DS: ES: CR0: 8005003b [ 247.788876] CR2: 038942a0 CR3: 0001c26ce000 CR4: 000407e0 [ 247.788907] DR0: DR1: DR2: [ 247.788939] DR3: DR6: 0ff0 DR7: 0400 [ 247.788971] Process btrfs-endio-met (pid: 1146, threadinfo 8802152f2000, task 8801f4192ca0) [ 247.789009] Stack: [ 247.789021] a01d7d5b 8802010b7000 8003 1000 [ 247.789066] 8802152f3c58 03dd 88021032fa20 8802152f3c58 [ 247.789112] 6669646e65230a29 8802152f3c00 a01cf03d 8802152f3cd0 [ 247.789157] Call Trace: [ 247.789184] [] ? read_extent_buffer+0xbb/0x110 [btrfs] [ 247.789226] [] btrfs_node_key+0x1d/0x20 [btrfs] [ 247.789265] [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs] [ 247.789308] [] btree_readahead_hook+0x1f/0x40 [btrfs] [ 247.789348] [] btree_readpage_end_io_hook+0x111/0x260 [btrfs] [ 247.789391] [] ? find_first_extent_bit_state+0x22/0x80 [btrfs] [ 247.789434] [] end_bio_extent_readpage+0xcb/0xa30 [btrfs] [ 247.789475] [] ? end_workqueue_fn+0x31/0x50 [btrfs] [ 247.789506] [] bio_endio+0x18/0x30 [ 247.789539] [] end_workqueue_fn+0x3c/0x50 [btrfs] [ 247.789578] [] worker_loop+0x157/0x560 [btrfs] [ 247.789616] [] ? btrfs_queue_worker+0x310/0x310 [btrfs] [ 247.789648] [] kthread+0x8e/0xa0 [ 247.789672] [] kernel_thread_helper+0x4/0x10 [ 247.789700] [] ? flush_kthread_worker+0x70/0x70 [ 247.789729] [] ? gs_change+0x13/0x13 [ 247.789753] Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 eb fb ff ff eb e1 90 90 90 90 90 90 90 90 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c [ 247.790092] RIP [] memcpy+0xd/0x110 [ 247.790120] RSP [ 247.790138] ---
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On 04.07.2012 02:17, Sami Liedes wrote: > On Wed, Jul 04, 2012 at 01:47:56AM +0300, Sami Liedes wrote: >> I've seen this before: An overly long "Modules linked in:" line causes >> a large gap in netconsole output. > > I managed to capture the entire output using netconsole by modifying > the kernel to not output the list of modules. Okay, thanks for the output. Can you please apply the patch below and capture especially the line printed before the "cut here" line? Thanks! -Jan diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index c9018a0..beabe99 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4519,7 +4519,14 @@ void read_extent_buffer(struct extent_buffer *eb, void *dstv, size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1); unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT; - WARN_ON(start > eb->len); + if (start > eb->len) { + printk(KERN_ERR "btrfs: invalid parameters for read_extent_buffer: start (%lu) > eb->len (%lu). eb start is %llu, level %d, generation %llu, nritems %d. len param %lu. debug %llu/%llu/%llu/%llu\n", + start, eb->len, eb->start, btrfs_header_level(eb), + btrfs_header_generation(eb), btrfs_header_nritems(eb), + len, + eb->debug[0], eb->debug[1], eb->debug[2], eb->debug[3]); + WARN_ON(1); + } WARN_ON(start + len > eb->start + eb->len); offset = (start_offset + start) & ((unsigned long)PAGE_CACHE_SIZE - 1); diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index b516c3b..1bbf823 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -164,6 +164,8 @@ struct extent_buffer { wait_queue_head_t lock_wq; struct page *inline_pages[INLINE_EXTENT_BUFFER_PAGES]; struct page **pages; + + u64 debug[4]; }; static inline void extent_set_compress_type(unsigned long *bio_flags, diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c index ac5d010..d9c1146 100644 --- a/fs/btrfs/reada.c +++ b/fs/btrfs/reada.c @@ -168,10 +168,15 @@ static int __readahead_hook(struct btrfs_root *root, struct extent_buffer *eb, struct btrfs_key key; struct btrfs_key next_key; + eb->debug[0] = 1; + eb->debug[1] = i; + eb->debug[2] = nritems; + eb->debug[3] = generation; btrfs_node_key_to_cpu(eb, &key, i); - if (i + 1 < nritems) + if (i + 1 < nritems) { + eb->debug[0] = 2; btrfs_node_key_to_cpu(eb, &next_key, i + 1); - else + } else next_key = re->top; bytenr = btrfs_node_blockptr(eb, i); n_gen = btrfs_node_ptr_generation(eb, i); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Wed, Jul 04, 2012 at 01:26:46PM +0200, Jan Schmidt wrote: > On 04.07.2012 02:17, Sami Liedes wrote: > > On Wed, Jul 04, 2012 at 01:47:56AM +0300, Sami Liedes wrote: > >> I've seen this before: An overly long "Modules linked in:" line causes > >> a large gap in netconsole output. > > > > I managed to capture the entire output using netconsole by modifying > > the kernel to not output the list of modules. > > Okay, thanks for the output. Can you please apply the patch below and capture > especially the line printed before the "cut here" line? Here you go. Sami [ 121.524803] netpoll: netconsole: local port [ 121.524831] netpoll: netconsole: local IP 192.168.1.2 [ 121.524853] netpoll: netconsole: interface 'eth0' [ 121.524874] netpoll: netconsole: remote port 1194 [ 121.524894] netpoll: netconsole: remote IP 192.168.1.73 [ 121.524917] netpoll: netconsole: remote ethernet address 00:1c:10:44:47:2c [ 121.525055] console [netcon0] enabled [ 121.525074] netconsole: network logging started [ 200.980496] btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2243489562624, level 26, generation 3144240307695375391, nritems 620178657. len param 17. debug 2/989/620178657/3144240307695375391 [ 200.980594] [ cut here ] [ 200.980644] WARNING: at fs/btrfs/extent_io.c:4528 read_extent_buffer+0x167/0x1a0 [btrfs]() [ 200.980681] Hardware name: System Product Name [ 200.980701] Modules linked in: [last unloaded: scsi_wait_scan] [ 200.980739] Pid: 1145, comm: btrfs-endio-met Tainted: GW 3.4.4-modded-oops+ #2 [ 200.980774] Call Trace: [ 200.980792] [] warn_slowpath_common+0x7a/0xb0 [ 200.980821] [] warn_slowpath_null+0x15/0x20 [ 200.980860] [] read_extent_buffer+0x167/0x1a0 [btrfs] [ 200.980902] [] btrfs_node_key+0x1d/0x20 [btrfs] [ 200.980941] [] __readahead_hook.isra.5+0x3ff/0x460 [btrfs] [ 200.980982] [] btree_readahead_hook+0x1f/0x40 [btrfs] [ 200.981022] [] btree_readpage_end_io_hook+0x111/0x260 [btrfs] [ 200.981065] [] ? find_first_extent_bit_state+0x22/0x80 [btrfs] [ 200.981109] [] end_bio_extent_readpage+0xcb/0xa30 [btrfs] [ 200.981150] [] ? end_workqueue_fn+0x31/0x50 [btrfs] [ 200.981182] [] bio_endio+0x18/0x30 [ 200.981214] [] end_workqueue_fn+0x3c/0x50 [btrfs] [ 200.981253] [] worker_loop+0x157/0x560 [btrfs] [ 200.981291] [] ? btrfs_queue_worker+0x310/0x310 [btrfs] [ 200.981323] [] kthread+0x8e/0xa0 [ 200.981348] [] kernel_thread_helper+0x4/0x10 [ 200.981377] [] ? flush_kthread_worker+0x70/0x70 [ 200.981406] [] ? gs_change+0x13/0x13 [ 200.981430] ---[ end trace e93713a9d40cd06e ]--- [ 200.981459] general protection fault: [#1] SMP [ 200.981487] CPU 2 [ 200.981498] Modules linked in: [last unloaded: scsi_wait_scan] [ 200.981540] [ 200.981550] Pid: 1145, comm: btrfs-endio-met Tainted: GW 3.4.4-modded-oops+ #2 System manufacturer System Product Name/P8P67 EVO [ 200.981612] RIP: 0010:[] [] memcpy+0xd/0x110 [ 200.981650] RSP: :8801f4bf7b68 EFLAGS: 00010202 [ 200.981675] RAX: 8801f4bf7c8f RBX: 0011 RCX: 0002 [ 200.981707] RDX: 0001 RSI: 00050803 RDI: 8801f4bf7c8f [ 200.981738] RBP: 8801f4bf7be0 R08: R09: [ 200.981769] R10: 8801f4bf7c8f R11: 8801f3930780 R12: 8801f4bf7ca0 [ 200.981800] R13: 8801f7286178 R14: 0048 R15: 0011 [ 200.981832] FS: () GS:88021ec8() knlGS: [ 200.981868] CS: 0010 DS: ES: CR0: 8005003b [ 200.981894] CR2: f773a000 CR3: 00020d0de000 CR4: 000407e0 [ 200.981925] DR0: DR1: DR2: [ 200.981956] DR3: DR6: 0ff0 DR7: 0400 [ 200.981988] Process btrfs-endio-met (pid: 1145, threadinfo 8801f4bf6000, task 8801f4ff42f0) [ 200.982026] Stack: [ 200.982038] a01d6d5b 880124f72ce1 0011 0002 [ 200.982084] 03dd 24f72ce1 2ba295e6a4a5d41f 88020dc65000 [ 200.982130] 8801f4bf7c8f 1000 8801f4bf7c58 03dd [ 200.982175] Call Trace: [ 200.982201] [] ? read_extent_buffer+0xbb/0x1a0 [btrfs] [ 200.982243] [] btrfs_node_key+0x1d/0x20 [btrfs] [ 200.982281] [] __readahead_hook.isra.5+0x3ff/0x460 [btrfs] [ 200.982322] [] btree_readahead_hook+0x1f/0x40 [btrfs] [ 200.982362] [] btree_readpage_end_io_hook+0x111/0x260 [btrfs] [ 200.982405] [] ? find_first_extent_bit_state+0x22/0x80 [btrfs] [ 200.982448] [] end_bio_extent_readpage+0xcb/0xa30 [btrfs] [ 200.982489] [] ? end_workqueue_fn+0x31/0x50 [btrfs] [ 200.982519] [] bio_endio+0x18/0x30 [ 200.982552] [] end_workqueue_fn+0x3c/0x50 [btrfs] [ 200.982591] [] worker_loop+0x157/0x560 [btrfs] [ 200
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On 04.07.2012 18:03, Sami Liedes wrote: > Here you go. > > Sami > [...] > [ 200.980496] btrfs: invalid parameters for read_extent_buffer: start > (32771) > eb->len (32768). eb start is 2243489562624, level 26, generation > 3144240307695375391, nritems 620178657. len param 17. debug > 2/989/620178657/3144240307695375391 Wow, that's strange. Can you repeat your test once or twice and paste that line, please? I'd like to get a feeling if the values are completely random. Reading more of the readahead code now... Thanks, -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Wed, Jul 04, 2012 at 06:38:00PM +0200, Jan Schmidt wrote: > > [ 200.980496] btrfs: invalid parameters for read_extent_buffer: start > > (32771) > eb->len (32768). eb start is 2243489562624, level 26, generation > > 3144240307695375391, nritems 620178657. len param 17. debug > > 2/989/620178657/3144240307695375391 Let's call this try 1. I ran it three more times, so below we have tries 2, 3 and 4. > Wow, that's strange. Can you repeat your test once or twice and paste that > line, > please? I'd like to get a feeling if the values are completely random. Curiously, it clearly takes longer for it to crash after starting the scrub each time I run it. Also on try 4 I got an entirely different crash (backtrace below). Now it scrubs maybe the first 200G or so of both devices of the (raid-1) 2.2T filesystem before it crashes. start and eb->len seem to be the same (32771 and 32768) every time. eb start varies, but there's some pattern if you view them in hex: Try 1 20a5a66 Try 2 20bb0018000 Try 3 20a8bc28000 Try 4 (no output, different crash) The rest of the values seem to me to be completely different every time. Sami Try 2: [12961.870107] btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2249220784128, level 14, generation 2242260605927040034, nritems 117835525. len param 17. debug 2/989/117835525/2242260605927040034 [12961.870204] [ cut here ] [12961.870264] WARNING: at fs/btrfs/extent_io.c:4528 read_extent_buffer+0x167/0x1a0 [btrfs]() [12961.870302] Hardware name: System Product Name [12961.870322] Modules linked in: [last unloaded: scsi_wait_scan] [12961.870367] Pid: 1144, comm: btrfs-endio-met Tainted: GW 3.4.4-modded-oops+ #2 [12961.870403] Call Trace: [12961.870421] [] warn_slowpath_common+0x7a/0xb0 [12961.870449] [] warn_slowpath_null+0x15/0x20 [...] Try 3: [ 531.770984] btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2244317708288, level 170, generation 13639284858109917187, nritems 6171943. len param 17. debug 2/989/6171943/13639284858109917187 [ 531.771081] [ cut here ] [ 531.771133] WARNING: at fs/btrfs/extent_io.c:4528 read_extent_buffer+0x167/0x1a0 [btrfs]() [ 531.771169] Hardware name: System Product Name [ 531.771191] Modules linked in: [last unloaded: scsi_wait_scan] [ 531.771229] Pid: 1132, comm: btrfs-endio-met Tainted: GW 3.4.4-modded-oops+ #2 [ 531.771265] Call Trace: [ 531.771282] [] warn_slowpath_common+0x7a/0xb0 [...] Try 4: [ 95.933108] netconsole: network logging started [ 982.651987] unable to find logical 691402650139365534 len 32768 [ 982.652060] [ cut here ] [ 982.652085] kernel BUG at fs/btrfs/volumes.c:3725! [ 982.652109] invalid opcode: [#1] SMP [ 982.652138] CPU 4 [ 982.652149] Modules linked in: [last unloaded: scsi_wait_scan] [ 982.652190] [ 982.652201] Pid: 1127, comm: btrfs-endio-met Tainted: GW 3.4.4-modded-oops+ #2 System manufacturer System Product Name/P8P67 EVO [ 982.652264] RIP: 0010:[] [] __btrfs_map_block+0x668/0x680 [btrfs] [ 982.652323] RSP: :8801f43cfa70 EFLAGS: 00010286 [ 982.652347] RAX: 0049 RBX: 09985b0c0e52109e RCX: 0082 [ 982.652379] RDX: 00e8 RSI: 0046 RDI: 0246 [ 982.652411] RBP: 8801f43cfb10 R08: R09: [ 982.652443] R10: 8801f1fbf680 R11: 8801f3b00780 R12: 8801f0026108 [ 982.652475] R13: 8000 R14: 8801f0026fe0 R15: 8801f43cfbb0 [ 982.652507] FS: () GS:88021ed0() knlGS: [ 982.652542] CS: 0010 DS: ES: CR0: 8005003b [ 982.652568] CR2: f775c000 CR3: 0002110a1000 CR4: 000407e0 [ 982.652600] DR0: DR1: DR2: [ 982.652632] DR3: DR6: 0ff0 DR7: 0400 [ 982.652664] Process btrfs-endio-met (pid: 1127, threadinfo 8801f43ce000, task 8801f401c2f0) [ 982.652703] Stack: [ 982.652715] 8801f43cfa90 811dc553 88021ec0ce88 88021ed0ce80 [ 982.652761] 8801f43cfb00 0086 8801f43cfaf0 0086 [ 982.652807] 00fdf43cfad0 8801f43cfba8 0024 0004 [ 982.652852] Call Trace: [ 982.652869] [] ? cpumask_next_and+0x23/0x40 [ 982.652897] [] ? kmem_cache_alloc_trace+0xc2/0x100 [ 982.652940] [] btrfs_map_block+0x9/0x10 [btrfs] [ 982.652979] [] reada_add_block+0x1c2/0x890 [btrfs] [ 9
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On 04.07.2012 22:24, Sami Liedes wrote: > On Wed, Jul 04, 2012 at 06:38:00PM +0200, Jan Schmidt wrote: >>> [ 200.980496] btrfs: invalid parameters for read_extent_buffer: start >>> (32771) > eb->len (32768). eb start is 2243489562624, level 26, generation >>> 3144240307695375391, nritems 620178657. len param 17. debug >>> 2/989/620178657/3144240307695375391 > > Let's call this try 1. I ran it three more times, so below we have > tries 2, 3 and 4. > >> Wow, that's strange. Can you repeat your test once or twice and paste that >> line, >> please? I'd like to get a feeling if the values are completely random. > > Curiously, it clearly takes longer for it to crash after starting the > scrub each time I run it. Also on try 4 I got an entirely different > crash (backtrace below). Now it scrubs maybe the first 200G or so of > both devices of the (raid-1) 2.2T filesystem before it crashes. Can you double check that there's nothing about corrected errors in your logs? Scrub will correct errors along the way and log that. So maybe we've only a few tries left to find the root cause. > start and eb->len seem to be the same (32771 and 32768) every time. That's the tree block size in your setup. > eb start varies, but there's some pattern if you view them in hex: > > Try 1 20a5a66 > Try 2 20bb0018000 > Try 3 20a8bc28000 > Try 4 (no output, different crash) I fail to identify a real pattern there. > The rest of the values seem to me to be completely different every time. Which is itself interesting. > Try 4: That one is not that much different. We read some garbage from a tree block and started the next read ahead cycle for the alledged children. That way we came to a logical address that's out of bounds, instead of a logical address with even more garbage. I'd like to see if you corrupted your trees on disk in a really strange manner (with matching checksums?), or if data comes from the disk intact and becomes damaged thereafter. Could you store the output of btrfs-debug-tree /dev/[whatever] before try number 5 and afterwards? It will be quite a lot if you've got a lot of files in there. Don't send it anywhere right now, just store it away if possible. What I'd like to get in the next reply is the output of the attached patch, a single pass should do this time. NB: As we've already check_leaf doing exta leaf checks after reading them, we should probably add something like check_node as a general manner to make btrfs more robust. Thank you, -Jan --- diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index a7ffc88..34122c2 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -316,6 +316,11 @@ static int csum_tree_block(struct btrfs_root *root, struct extent_buffer *buf, return 0; } +int btrfs_csum_tree_block(struct btrfs_root *root, struct extent_buffer *buf) +{ + return csum_tree_block(root, buf, 1); +} + /* * we can't consider a given block up to date unless the transid of the * block matches the transid in the parent node's pointer. This is how we @@ -471,6 +476,12 @@ static int check_tree_block_fsid(struct btrfs_root *root, (unsigned long long)btrfs_header_bytenr(eb), \ (unsigned long long)root->objectid, slot) +#define CORRUPT_NODE(root, node, reason, ...) \ + printk(KERN_CRIT "btrfs: corrupt node block=%llu," \ + "root=%llu: " reason,\ + (unsigned long long)btrfs_header_bytenr(node), \ + (unsigned long long)root->objectid, ##__VA_ARGS__) + static noinline int check_leaf(struct btrfs_root *root, struct extent_buffer *leaf) { @@ -532,6 +543,42 @@ static noinline int check_leaf(struct btrfs_root *root, return 0; } +static noinline int check_node(struct btrfs_root *root, + struct extent_buffer *node) +{ + int i; + u32 nritems = btrfs_header_nritems(node); + u64 generation; + + if (nritems == 0) + return 0; + + if (nritems > BTRFS_NODEPTRS_PER_BLOCK(root)) { + CORRUPT_NODE(root, node, "nritems (%lu) too large (%lu)\n", +(unsigned long)nritems, +BTRFS_NODEPTRS_PER_BLOCK(root)); + return -EIO; + } + + if (node->len > root->nodesize) { + CORRUPT_NODE(root, node, "length (%lu) too large (%lu)\n", +node->len, (unsigned long)root->nodesize); + return -EIO; + } + + generation = btrfs_super_generation(root->fs_info->super_copy); + for (i = 0; i < nritems; i++) { + if (btrfs_node_ptr_generation(node, i) > generation) { + CORRUPT_NODE(root, node, "generation (%llu) too new in slot %d (maximum expected %llu)\n", +btrfs_node_ptr_generation(node, i), i, +
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Thu, Jul 05, 2012 at 03:41:33PM +0200, Jan Schmidt wrote: > I'd like to see if you corrupted your trees on disk in a really strange > manner (with matching checksums?), or if data comes from the disk intact > and becomes damaged thereafter. > > Could you store the output of > btrfs-debug-tree /dev/[whatever] > before try number 5 and afterwards? It will be quite a lot if you've got > a lot of files in there. Don't send it anywhere right now, just store it > away if possible. Ok, I have it stored. I mounted the fs read-only to do that just to be safe, because when I did it on a live fs I got some (~15) "parent transid verify failed" messages. > Can you double check that there's nothing about corrected errors in your > logs? Scrub will correct errors along the way and log that. So maybe > we've only a few tries left to find the root cause. Nope, definitely nothing there. A few "unlinked orphans" messages at mount time, but that's all. I'm sure I would have captured any further messages on my netconsole. Also nothing in any earlier logs that I can see. Well, until now (try 6) anyway. I actually ran btrfs scrub / twice now after running btrfs-debug-tree (tries 5 and 6), because on the first try I ran on wrong kernel (without the debug patch). On try 5 it crashed very quickly, probably in less than a second. I also noticed that it doesn't actually crash the machine or even make the filesystem unusable in any way (apart from unmount blocking at shutdown), so I also have the same oopses in my log files stored on the same filesystem. But no, nothing about corrected errors or checksum failures. So: Try 5 (without your new patch, but after running btrfs-debug-tree); no checksum failures as before, similar crash to before: [ 2036.512656] btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2239379931136, level 102, generation 309534250463128, nritems 56187918. len param 17. debug 2/989/56187918/309534250463128 [ 2036.512770] [ cut here ] [ 2036.512834] WARNING: at fs/btrfs/extent_io.c:4528 read_extent_buffer+0x167/0x1a0 [btrfs]() [ 2036.512880] Hardware name: System Product Name [ 2036.512906] Modules linked in: [last unloaded: netconsole] [ 2036.512954] Pid: 1135, comm: btrfs-endio-met Tainted: GW 3.4.4-modded-oops+ #2 [...] Try 6, with your new patch - as a new thing, got checksum errors, including one very slightly before the crash and with the same "eb start" as in the "invalid parameters" message. Crashed in a couple of minutes. Now that I look at the the log below it actually has two crashes in it, one of btrfs-endio-met pid 1212 and another one of btrfs-endio-met pid some 72 seconds later. I didn't notice that until after the fact. dm-6 (the device with checksum failures) is on a disk on which I don't have anything else besides this partition, so I suppose it could be corrupting data without me having noticed. I know it used to work at one time, but that doesn't mean too much. Anyway, I'm sure I haven't seen any "checksum verify failed" messages until now. None in the logs either. Ah, one thing I have forgotten to mention. I think I've been using this filesystem for maybe a month now. It started as a single-device filesystem, but I did restriping to raid-1 on June 12. dm-5 is the original device, while dm-6 (the one with the checksum failures in try 6) is the added one. I still have logs of the restriping and everything after that; grep reveals no unusual lines containing "btrfs" in all the logs since then, apart from the crashes. Besides the oopses and lines that mention btrfs just because of the kernel version 3.4.4+btrfsdebug, this is a summary of all the lines that mention btrfs: btrfs: disk space caching is enabled btrfs flagging fs with big metadata feature btrfs: new size for /dev/mapper/rootvg-scratch_crypt is 751617179648 * this device is not on the filesystem with problems - although I haven't tried scrubbing this filesystem btrfs: unlinked 104 orphans (10 times with 4 different numbers) And during restriping btrfs: found 10008 extents (and similar) btrfs: relocating block group 1019010351104 flags 1 (and similar) And two different warnings in the last few days, but I think this is only because the debug patches changed the code: WARNING: at fs/btrfs/extent_io.c:4522 read_extent_buffer+0xe6/0x110 [btrfs]() WARNING: at fs/btrfs/extent_io.c:4528 read_extent_buffer+0x167/0x1a0 [btrfs]() Another thing that comes to mind that I might well be the only one that uses btrfs with raid-1 on two separate dm-crypted devices (contrary to the way suggested on Btrfs FAQ - in my experience that can actually speed up things, especially on computers without hardware crypto, because the kernel has only one kcryptd thread per dm-crypt device and therefore only can u
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On 06.07.2012 01:47, Sami Liedes wrote: > On Thu, Jul 05, 2012 at 03:41:33PM +0200, Jan Schmidt wrote: >> I'd like to see if you corrupted your trees on disk in a really strange >> manner (with matching checksums?), or if data comes from the disk intact >> and becomes damaged thereafter. >> >> Could you store the output of >> btrfs-debug-tree /dev/[whatever] >> before try number 5 and afterwards? It will be quite a lot if you've got >> a lot of files in there. Don't send it anywhere right now, just store it >> away if possible. > > Ok, I have it stored. I mounted the fs read-only to do that just to be > safe, because when I did it on a live fs I got some (~15) "parent > transid verify failed" messages. You can drop that, your data on disk is fine. Furthermore, I'm getting the feeling that this isn't really related to readahead, but we will find out later if that's true. >> Can you double check that there's nothing about corrected errors in your >> logs? Scrub will correct errors along the way and log that. So maybe >> we've only a few tries left to find the root cause. > > Nope, definitely nothing there. A few "unlinked orphans" messages at > mount time, but that's all. I'm sure I would have captured any further > messages on my netconsole. Also nothing in any earlier logs that I can > see. Okay. At least that fits with the "everything is okay on disk" statement. > Try 6, with your new patch - as a new thing, got checksum errors, > including one very slightly before the crash and with the same "eb > start" as in the "invalid parameters" message. Crashed in a couple of > minutes. > > Now that I look at the the log below it actually has two crashes in > it, one of btrfs-endio-met pid 1212 and another one of btrfs-endio-met > pid some 72 seconds later. I didn't notice that until after the fact. > > dm-6 (the device with checksum failures) is on a disk on which I don't > have anything else besides this partition, so I suppose it could be > corrupting data without me having noticed. I know it used to work at > one time, but that doesn't mean too much. Anyway, I'm sure I haven't > seen any "checksum verify failed" messages until now. None in the logs > either. > > Ah, one thing I have forgotten to mention. I think I've been using > this filesystem for maybe a month now. It started as a single-device > filesystem, but I did restriping to raid-1 on June 12. dm-5 is the > original device, while dm-6 (the one with the checksum failures in try > 6) is the added one. I still have logs of the restriping and > everything after that; grep reveals no unusual lines containing > "btrfs" in all the logs since then, apart from the crashes. Besides > the oopses and lines that mention btrfs just because of the kernel > version 3.4.4+btrfsdebug, this is a summary of all the lines that > mention btrfs: > > btrfs: disk space caching is enabled > btrfs flagging fs with big metadata feature > btrfs: new size for /dev/mapper/rootvg-scratch_crypt is 751617179648 > * this device is not on the filesystem with problems - although I >haven't tried scrubbing this filesystem > btrfs: unlinked 104 orphans (10 times with 4 different numbers) > > And during restriping > > btrfs: found 10008 extents (and similar) > btrfs: relocating block group 1019010351104 flags 1 (and similar) > > And two different warnings in the last few days, but I think this is > only because the debug patches changed the code: > > WARNING: at fs/btrfs/extent_io.c:4522 read_extent_buffer+0xe6/0x110 > [btrfs]() > WARNING: at fs/btrfs/extent_io.c:4528 read_extent_buffer+0x167/0x1a0 > [btrfs]() > > Another thing that comes to mind that I might well be the only one > that uses btrfs with raid-1 on two separate dm-crypted devices > (contrary to the way suggested on Btrfs FAQ - in my experience that > can actually speed up things, especially on computers without hardware > crypto, because the kernel has only one kcryptd thread per dm-crypt > device and therefore only can use a single core per device. Plus I get > the benefits of btrfs "smart raid".). > > So, try 6. The three first lines are from me starting btrfs-debug-tree > again just as a quick way to verify that my netconsole setup was > working. The first checksum failure came pretty much instantly after > starting scrub: > > > [ 168.207151] device fsid f26f08b1-89e6-4f2d-b7f9-857010dd2517 devid 1 > transid 151087 /dev/dm-5 > [ 168.208541] device fsid f26f08b1-89e6-4f2d-b7f9-857010dd2517 devid 2 > transid 151087 /dev/dm-6 > [ 168.284668] device fsid ad75beab-b52b-4f63-9d14-956cc8e95fc7 devid 1 > transid 5001 /dev/dm-9 > [ 188.507789] btrfs: dm-6 checksum verify failed on 2237998628864 wanted > C3D64F found DCBF0869 level 88 > [ 188.507852] btrfs: node seems invalid now. checksum ok = 1 > [ 262.706585] btrfs: dm-6 checksum verify failed on 2246859587584 wanted > 3D013CF8 found AD06874A level 77 > [
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Fri, Jul 06, 2012 at 04:42:50AM -0600, Jan Schmidt wrote: > ... down here in the stack. The warning is printed from two levels above, > __readahead_hook. > > Either I'm absolutely blind and there's code along the (rather short) road > between those two that might do this I haven't seen. Or someone else messes > with our extent buffers or the underlying pages. What really confuses me is > that it happens so reproducibly. > > I've no good idea at the moment how to go on. It might help to get a feeling > if > it's shifting around at least a little bit or really constant in the timing of > occurrence. So can you please apply the next patch on top of the other two and > give it some more failure tries? The "checksum mismatch [1234]" line will be > of > most interest. I'm also curious what the additional debug variables will say > in > the extended version of the very first printk. You can leave out the stack > traces if you like, they won't matter much anyway. I would suggest turning on slab debug and CONFIG_DEBUG_PAGEALLOC. Something really strange is happening here. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote: > I've no good idea at the moment how to go on. It might help to get a feeling > if > it's shifting around at least a little bit or really constant in the timing of > occurrence. So can you please apply the next patch on top of the other two and > give it some more failure tries? The "checksum mismatch [1234]" line will be > of > most interest. I'm also curious what the additional debug variables will say > in > the extended version of the very first printk. You can leave out the stack > traces if you like, they won't matter much anyway. Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as suggested by Chris Mason. With those and the latest patch, there's an oops already at boot. I don't have netconsole yet at that point, but here's the important parts (sure I can capture it fully if you need). By the way, something seems to be untabifying your patches. I don't know if it's on my side or yours, but at least some other patches I receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs mostly makes them apply cleanly for me. Sami btrfs: disk space caching is enabled BUG: unable to handle kernel NULL pointer dereference at 0150 IP: [] check_node+0x138/0x250 [btrfs] PGD 0 Oops: [#1] SMP DEBUG_PAGEALLOC CPU 6 Modules linked in: [last unloaded: scsi_wait_scan] Pid: 1176, comm: btrfs-endio-met Tainted: GW 3.4.4+btrfsdebug2 #2 System Product Name/P8P67 EVO RIP: 0010:[] [] check_node+0x138/0x250 [btrfs [...] Process btrfs-endio-met (pid: 1176, [...]) Call trace: [...] btree_readpage_end_io_hook+0x1e5/0x2d0 [btrfs] [...] end_bio_extent_readpage+0xcb/0xa30 [btrfs] [...] ? end_workqueue_fn+0x31/0x50 [btrfs] [...] bio_endio+0x18/0x30 [...] end_workqueue_fn+0x3c/0x50 [btrfs] [...] worker_loop+0x157/0x560 [btrfs] [...] ? btrfs_queue_worker+0x310/0x310 [btrfs] [...] kthead+0x8e/0xa0 [...] kernel_thread_helper+0x4/0x10 [...] ? flush_kthread_worker+0x70/0x70 [...] ? gs_change+0x13/0x13 Code: [...] RIP [] check_node+0x138/0x250 [btrfs] RSP signature.asc Description: Digital signature
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote: > On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote: > > I've no good idea at the moment how to go on. It might help to get a > > feeling if > > it's shifting around at least a little bit or really constant in the timing > > of > > occurrence. So can you please apply the next patch on top of the other two > > and > > give it some more failure tries? The "checksum mismatch [1234]" line will > > be of > > most interest. I'm also curious what the additional debug variables will > > say in > > the extended version of the very first printk. You can leave out the stack > > traces if you like, they won't matter much anyway. > > Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as > suggested by Chris Mason. > > With those and the latest patch, there's an oops already at boot. I > don't have netconsole yet at that point, but here's the important > parts (sure I can capture it fully if you need). > > By the way, something seems to be untabifying your patches. I don't > know if it's on my side or yours, but at least some other patches I > receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs > mostly makes them apply cleanly for me. > > Sami > > > > btrfs: disk space caching is enabled > BUG: unable to handle kernel NULL pointer dereference at 0150 > IP: [] check_node+0x138/0x250 [btrfs] This isn't from any of the new debugging. Can you please try it on an unpatched kernel? -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Fri, July 06, 2012 at 16:40 (+0200), Chris Mason wrote: > On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote: >> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote: >>> I've no good idea at the moment how to go on. It might help to get a >>> feeling if >>> it's shifting around at least a little bit or really constant in the timing >>> of >>> occurrence. So can you please apply the next patch on top of the other two >>> and >>> give it some more failure tries? The "checksum mismatch [1234]" line will >>> be of >>> most interest. I'm also curious what the additional debug variables will >>> say in >>> the extended version of the very first printk. You can leave out the stack >>> traces if you like, they won't matter much anyway. >> >> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as >> suggested by Chris Mason. >> >> With those and the latest patch, there's an oops already at boot. I >> don't have netconsole yet at that point, but here's the important >> parts (sure I can capture it fully if you need). >> >> By the way, something seems to be untabifying your patches. I don't >> know if it's on my side or yours, but at least some other patches I >> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs >> mostly makes them apply cleanly for me. >> >> Sami >> >> >> >> btrfs: disk space caching is enabled >> BUG: unable to handle kernel NULL pointer dereference at 0150 >> IP: [] check_node+0x138/0x250 [btrfs] > > This isn't from any of the new debugging. Can you please try it on an > unpatched kernel? You're confusing that with check_leaf. I added check_node along the way, see my mail from Thu, July 05, 2012 at 15:41 (+0200). I'd really like to add something similar for the 3.6 series. Checking for the null pointer dereference. -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Fri, July 06, 2012 at 16:33 (+0200), Sami Liedes wrote: > On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote: >> I've no good idea at the moment how to go on. It might help to get a feeling >> if >> it's shifting around at least a little bit or really constant in the timing >> of >> occurrence. So can you please apply the next patch on top of the other two >> and >> give it some more failure tries? The "checksum mismatch [1234]" line will be >> of >> most interest. I'm also curious what the additional debug variables will say >> in >> the extended version of the very first printk. You can leave out the stack >> traces if you like, they won't matter much anyway. > > Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as > suggested by Chris Mason. > > With those and the latest patch, there's an oops already at boot. I > don't have netconsole yet at that point, but here's the important > parts (sure I can capture it fully if you need). Oh I see. root->node can be NULL during mount. Please add this on top: -- diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index df0b347..22838a3 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -578,7 +578,8 @@ static noinline int check_node(struct btrfs_root *root, } node->debug[5] = node->start; node->debug[6] = btrfs_header_level(node); - node->debug[6] |= btrfs_header_level(root->node) << 16; + if (root->node) + node->debug[6] |= btrfs_header_level(root->node) << 16; node->debug[7] = 0xb22f50b22f5; return 0; -- > By the way, something seems to be untabifying your patches. I don't > know if it's on my side or yours, but at least some other patches I > receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs > mostly makes them apply cleanly for me. Oh, I'm sorry. Should have been on my side. I hope it's better with the current diff? -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Fri, Jul 06, 2012 at 09:02:34AM -0600, Jan Schmidt wrote: > On Fri, July 06, 2012 at 16:40 (+0200), Chris Mason wrote: > > On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote: > >> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote: > >>> I've no good idea at the moment how to go on. It might help to get a > >>> feeling if > >>> it's shifting around at least a little bit or really constant in the > >>> timing of > >>> occurrence. So can you please apply the next patch on top of the other > >>> two and > >>> give it some more failure tries? The "checksum mismatch [1234]" line will > >>> be of > >>> most interest. I'm also curious what the additional debug variables will > >>> say in > >>> the extended version of the very first printk. You can leave out the stack > >>> traces if you like, they won't matter much anyway. > >> > >> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as > >> suggested by Chris Mason. > >> > >> With those and the latest patch, there's an oops already at boot. I > >> don't have netconsole yet at that point, but here's the important > >> parts (sure I can capture it fully if you need). > >> > >> By the way, something seems to be untabifying your patches. I don't > >> know if it's on my side or yours, but at least some other patches I > >> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs > >> mostly makes them apply cleanly for me. > >> > >>Sami > >> > >> > >> > >> btrfs: disk space caching is enabled > >> BUG: unable to handle kernel NULL pointer dereference at 0150 > >> IP: [] check_node+0x138/0x250 [btrfs] > > > > This isn't from any of the new debugging. Can you please try it on an > > unpatched kernel? > > You're confusing that with check_leaf. I added check_node along the way, see > my > mail from Thu, July 05, 2012 at 15:41 (+0200). I'd really like to add > something > similar for the 3.6 series. > > Checking for the null pointer dereference. Sorry, I wasn't clear. I meant it wasn't from slab debug or DEBUG_PAGEALLOC, so it must be new in your patches ;) -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Fri, Jul 06, 2012 at 10:59:24PM +0300, Sami Liedes wrote: > I think I might try running it overnight with KMEMCHECK to see if it > reports something. But for now, what there's in the log: My KMEMCHECK kernel didn't even boot (due to some weird KMEMCHECK/ACPI interaction), so I won't pursue this idea further at the moment... > * lots of checksum mismatch [234], no 1s One thing to notice from the logs, too, is that the device seems to always be dm-6, the second device of the filesystem. This never seems to happen to dm-5. There are 1583 lines of "btrfs: dm-6 checksum verify failed". Sami signature.asc Description: Digital signature
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
[Retry: I think this mail didn't make it to the list, probably because of the 73 kilobyte attached log. Here's a URL to the file:] http://www.niksula.hut.fi/~sliedes/btrfs-scrub-debug.log.gz Sami On Fri, Jul 06, 2012 at 05:09:00PM +0200, Jan Schmidt wrote: > Oh I see. root->node can be NULL during mount. Please add this on top: Ok. So, ran it with DEBUG_PAGEALLOC and slub debugging on. This time it took half an hour to crash, and there's _lots_ of checksum mismatch [234] messages even before the crash. gzipped dmesg attached. At 781 seconds there's an "irq 17: nobody cared". That's a known bug with this (and other Asus) motherboards and happens every now and then. I doubt it has anything to do with this. I think I might try running it overnight with KMEMCHECK to see if it reports something. But for now, what there's in the log: * lots of checksum mismatch [234], no 1s * a fair number of "csum_tree_block: [0-9]+ callbacks suppressed" lines * two "btrfs: node seems invalid now. checksum ok = 1" messages, one at 1499 seconds and another just before the crash at 1973 * Just before the crash: btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2261163409408, level 100, generation 4412718571037421157, nritems 538968254. len param 17. debug 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0 * the oopses > > By the way, something seems to be untabifying your patches. I don't > > know if it's on my side or yours, but at least some other patches I > > receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs > > mostly makes them apply cleanly for me. > > Oh, I'm sorry. Should have been on my side. I hope it's better with the > current > diff? Yes. No problem :) [See attachment for dmesg log.] Sami -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Mon, Jul 09, 2012 at 11:05:47AM +0200, Arne Jansen wrote: > > * Just before the crash: > > btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len > > (32768). eb start is 2261163409408, level 100, generation > > 4412718571037421157, nritems 538968254. len param 17. debug > > 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0 > > > > At a first glance: the generation converted to ascii is: "ent() ==", > so someone is patching the memory with ascii text, possibly C source. > It might be interesting to dump the full contents of the eb, to get > a clue on the source of the data. I changed the code to dump the contents of the eb struct at the point where that error (btrfs: invalid parameters...) is printed, at the "checksum mismatch 4" site and at the "node seems invalid now" site. Now I have a big log of 1795 corrupted ebs. So far nothing that looks remotely like ascii text, though. But I have two different versions of the eb that caused that warning, a less corrupted one and a more corrupted one: btrfs: --- start eb contents at 8801b13cc4c8 --- btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00 ...f btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff ...0... btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00 btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc558: 58 c5 3c b1 01 88 ff ff 58 c5 3c b1 01 88 ff ff X.<.X.<. btrfs: 8801b13cc568: 00 00 00 00 00 00 00 00 70 c5 3c b1 01 88 ff ff p.<. btrfs: 8801b13cc578: 70 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00 p.<. btrfs: 8801b13cc588: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc598: 80 5f 9a 06 00 ea ff ff 00 86 9b 06 00 ea ff ff ._.. btrfs: 8801b13cc5a8: 40 4c 9a 06 00 ea ff ff 80 66 9a 06 00 ea ff ff @L...f.. btrfs: 8801b13cc5b8: 80 eb 9b 06 00 ea ff ff 40 05 a2 06 00 ea ff ff @... btrfs: 8801b13cc5c8: 40 e1 9b 06 00 ea ff ff 80 c4 9c 06 00 ea ff ff @... btrfs: 8801b13cc5d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc5e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc5f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc608: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc618: 98 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00 ..<. btrfs: 8801b13cc628: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc638: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc648: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc658: 00 00 00 00 00 00 00 00 btrfs: --- end eb contents at 8801b13cc4c8 --- btrfs: dm-6 checksum verify failed on 2239404212224 wanted B5F632BC found 3579FB59 level 160 btrfs: node seems invalid now. checksum ok = 1 btrfs: --- start eb contents at 8801b13cc4c8 --- [... identical dump to above ...] btrfs: --- end eb contents at 8801b13cc4c8 --- btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2239404212224, level 160, generation 4716553384049587249, nritems 295705211. len param 17. debug 2/989/295705211/4716553384049587249/0x0/0/0x0/0x0 btrfs: --- start eb contents at 8801b13cc4c8 --- btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00 ...f btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff ...0... btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00 btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 btrfs:
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
Thanks for doing this, I'll start looking into it right away. One thing though: it's probably not the contents of struct eb that's being corrupted, but the data the eb points to. See for example read_extent_buffer to see how to access it. Sorry for being unclear on this. Thanks, Arne On 10.07.2012 06:16, Sami Liedes wrote: > On Mon, Jul 09, 2012 at 11:05:47AM +0200, Arne Jansen wrote: >>> * Just before the crash: >>> btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len >>> (32768). eb start is 2261163409408, level 100, generation >>> 4412718571037421157, nritems 538968254. len param 17. debug >>> 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0 >>> >> >> At a first glance: the generation converted to ascii is: "ent() ==", >> so someone is patching the memory with ascii text, possibly C source. >> It might be interesting to dump the full contents of the eb, to get >> a clue on the source of the data. > > I changed the code to dump the contents of the eb struct at the point > where that error (btrfs: invalid parameters...) is printed, at the > "checksum mismatch 4" site and at the "node seems invalid now" site. > Now I have a big log of 1795 corrupted ebs. So far nothing that looks > remotely like ascii text, though. But I have two different versions of > the eb that caused that warning, a less corrupted one and a more > corrupted one: > > > btrfs: --- start eb contents at 8801b13cc4c8 --- > btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00 > ...f > btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff > ...0... > btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00 > > btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc558: 58 c5 3c b1 01 88 ff ff 58 c5 3c b1 01 88 ff ff > X.<.X.<. > btrfs: 8801b13cc568: 00 00 00 00 00 00 00 00 70 c5 3c b1 01 88 ff ff > p.<. > btrfs: 8801b13cc578: 70 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00 > p.<. > btrfs: 8801b13cc588: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc598: 80 5f 9a 06 00 ea ff ff 00 86 9b 06 00 ea ff ff > ._.. > btrfs: 8801b13cc5a8: 40 4c 9a 06 00 ea ff ff 80 66 9a 06 00 ea ff ff > @L...f.. > btrfs: 8801b13cc5b8: 80 eb 9b 06 00 ea ff ff 40 05 a2 06 00 ea ff ff > @... > btrfs: 8801b13cc5c8: 40 e1 9b 06 00 ea ff ff 80 c4 9c 06 00 ea ff ff > @... > btrfs: 8801b13cc5d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc5e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc5f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc608: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc618: 98 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00 > ..<. > btrfs: 8801b13cc628: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc638: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc648: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc658: 00 00 00 00 00 00 00 00 > > btrfs: --- end eb contents at 8801b13cc4c8 --- > btrfs: dm-6 checksum verify failed on 2239404212224 wanted B5F632BC found > 3579FB59 level 160 > btrfs: node seems invalid now. checksum ok = 1 > btrfs: --- start eb contents at 8801b13cc4c8 --- > [... identical dump to above ...] > btrfs: --- end eb contents at 8801b13cc4c8 --- > btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len > (32768). eb start is 2239404212224, level 160, generation > 4716553384049587249, nritems 295705211. len param 17. debug > 2/989/295705211/4716553384049587249/0x0/0/0x0/0x0 > btrfs: --- start eb contents at 8801b13cc4c8 --- > btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00 > ...f > btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff > ...0... > btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 0
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
Any news on this? I you give me some hints, I can try to reproduce it here. -Arne On 10.07.2012 08:57, Arne Jansen wrote: > On 10.07.2012 06:16, Sami Liedes wrote: >> On Mon, Jul 09, 2012 at 11:05:47AM +0200, Arne Jansen wrote: * Just before the crash: btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len (32768). eb start is 2261163409408, level 100, generation 4412718571037421157, nritems 538968254. len param 17. debug 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0 >>> >>> At a first glance: the generation converted to ascii is: "ent() ==", >>> so someone is patching the memory with ascii text, possibly C source. >>> It might be interesting to dump the full contents of the eb, to get >>> a clue on the source of the data. >> >> I changed the code to dump the contents of the eb struct at the point >> where that error (btrfs: invalid parameters...) is printed, at the >> "checksum mismatch 4" site and at the "node seems invalid now" site. >> Now I have a big log of 1795 corrupted ebs. So far nothing that looks >> remotely like ascii text, though. But I have two different versions of >> the eb that caused that warning, a less corrupted one and a more >> corrupted one: >> >> >> btrfs: --- start eb contents at 8801b13cc4c8 --- >> btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00 >> ...f >> btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff >> ...0... >> btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc558: 58 c5 3c b1 01 88 ff ff 58 c5 3c b1 01 88 ff ff >> X.<.X.<. >> btrfs: 8801b13cc568: 00 00 00 00 00 00 00 00 70 c5 3c b1 01 88 ff ff >> p.<. >> btrfs: 8801b13cc578: 70 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00 >> p.<. >> btrfs: 8801b13cc588: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc598: 80 5f 9a 06 00 ea ff ff 00 86 9b 06 00 ea ff ff >> ._.. >> btrfs: 8801b13cc5a8: 40 4c 9a 06 00 ea ff ff 80 66 9a 06 00 ea ff ff >> @L...f.. >> btrfs: 8801b13cc5b8: 80 eb 9b 06 00 ea ff ff 40 05 a2 06 00 ea ff ff >> @... >> btrfs: 8801b13cc5c8: 40 e1 9b 06 00 ea ff ff 80 c4 9c 06 00 ea ff ff >> @... >> btrfs: 8801b13cc5d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc5e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc5f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc608: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc618: 98 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00 >> ..<. >> btrfs: 8801b13cc628: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc638: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc648: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc658: 00 00 00 00 00 00 00 00 >> > > this one looks good so far. > >> btrfs: --- end eb contents at 8801b13cc4c8 --- >> btrfs: dm-6 checksum verify failed on 2239404212224 wanted B5F632BC found >> 3579FB59 level 160 >> btrfs: node seems invalid now. checksum ok = 1 >> btrfs: --- start eb contents at 8801b13cc4c8 --- >> [... identical dump to above ...] >> btrfs: --- end eb contents at 8801b13cc4c8 --- >> btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len >> (32768). eb start is 2239404212224, level 160, generation >> 4716553384049587249, nritems 295705211. len param 17. debug >> 2/989/295705211/4716553384049587249/0x0/0/0x0/0x0 >> btrfs: --- start eb contents at 8801b13cc4c8 --- >> btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00 >> ...f >> btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff >> ...0... >> btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00 >>
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Mon, Jul 16, 2012 at 10:20:28AM +0200, Arne Jansen wrote: > Any news on this? I you give me some hints, I can try to reproduce > it here. I've been planning for a few days now to try if it's reproducible in a virtual machine with the same filesystem images. However I haven't gotten around to doing this yet... I tried to get KMEMCHECK working on my computer, but failed (and generated a bug report). It seems to work in KVM though. So, currently my idea is to boot the machine with a live USB stick, install kvm and make qemu qcow images backed by the real (2*1.1T) devices, but writing changes to the qcow images (I dare not mess with the actual devices, and don't happen to have quite 2.2T extra space outside of them...), and try to run scrub there. If that succeeds and the bug happens there too, debugging *should* be easier, and it *should* be possible to run it under KMEMCHECK too. If the bug doesn't happen inside a virtual machine, that would be interesting information too. Another idea might be to use LVM snapshots of the actual devices, but recent messages have made be wary of that approach - plus it's somewhat of a pain, because the snapshotting would have to be done before mounting anyway, since I have two devices and I doubt LVM supports snapshotting two separate devices at the exact same moment. However, I'm a bit ill at the moment, so I assume it might be at least a few days until I get to actually implementing this... Sami signature.asc Description: Digital signature
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Tue, Jul 17, 2012 at 12:29:33AM +0300, Sami Liedes wrote: > So, currently my idea is to boot the machine with a live USB stick, > install kvm and make qemu qcow images backed by the real (2*1.1T) > devices, but writing changes to the qcow images (I dare not mess with > the actual devices, and don't happen to have quite 2.2T extra space > outside of them...), and try to run scrub there. If that succeeds and > the bug happens there too, debugging *should* be easier, and it > *should* be possible to run it under KMEMCHECK too. If the bug doesn't > happen inside a virtual machine, that would be interesting information > too. I have now been able to reproduce the bug in KVM with the setup described above. I think it's safe to say now that the bug depends on some kind of interaction between btrfs and dm-crypt. With the following setup, the bug does NOT happen: * kvm, single cpu * sees 3 disks, /dev/vda=root, /dev/vdb=btrfs-dev1, /dev/vdc=btrfs-dev2 * The btrfs devices are essentially snapshots of the real btrfs devices in raid-1 configuration (2*1.1T). As the real devices are encrypted, the decryption is done outside the KVM, i.e. the KVM snapshots are backed by the decrypted devices. With the following setup, the bug DOES happen: * kvm, single cpu * sees 3 disks, /dev/vda=root, /dev/vdb=part1, /dev/vdc=part2, where part[12] is are LUKS containers containing the individual btrfs devices * inside kvm, they are opened using cryptsetup luksOpen /dev/vdb root1 cryptsetup luksOpen /dev/vdc root2 * after this, the filesystem is mounted with mount /dev/mapper/root1 /media -o device=/dev/mapper/root1,device=/dev/mapper/root2 * The devices are snapshots of the actual physical encrypted partitions containing the btrfs devices. I have not yet figured out if this can be reproduced using a pristine, smaller btrfs filesystem in raid-1 configuration inside KVM or if there's something about my specific filesystem that causes this. I can investigate that too; it's easier to do for me than the above testing, as I don't need to have continuous physical access to the computer to do that. Here's the .config of the kernel I used inside KVM to reproduce this: http://www.niksula.hut.fi/~sliedes/btrfs/config.3.4.4 I also ran the same tests with KMEMCHECK. Both with and without crypto, there were quite a number of (of course possibly false) warnings from btrfs code. I doubt any of them are related to this bug - there were no KMEMCHECK warnings during the scrub operation. Here are the logs, anyway: http://www.niksula.hut.fi/~sliedes/btrfs/screenlog.nocrypto.gz http://www.niksula.hut.fi/~sliedes/btrfs/screenlog.crypto.gz Sami signature.asc Description: Digital signature
Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On Sat, Jul 28, 2012 at 03:08:47PM +0300, Sami Liedes wrote: > I have not yet figured out if this can be reproduced using a pristine, > smaller btrfs filesystem in raid-1 configuration inside KVM or if > there's something about my specific filesystem that causes this. I can > investigate that too; it's easier to do for me than the above testing, > as I don't need to have continuous physical access to the computer to > do that. It seems the bug doesn't happen with a new filesystem made with mkfs.btrfs -mraid1 -draid1 /dev/mapper/btrfs_crypt /dev/mapper/btrfs2_crypt and subsequently filled with data... Unfortunate. I started to wonder what's so special about the second device of my filesystem. All the errors always seem to come from that device. The only thing that comes to mind is that that device was not originally part of the filesystem; it started as a single-device filesystem formatted with mkfs.btrfs with default options and was subsequently rebalanced under a 3.4.2 kernel. So I started to play with btrfs fi balance in my KVM instance with my two-device filesystem, and hit this oops, which may or may not be related to my previous bug and/or dm-crypt... [ 15.195342] device fsid a844eb60-eb9c-4e24-ae91-c1627bf2d439 devid 1 transid 176 /dev/mapper/btrfs_crypt [ 15.196606] device fsid a844eb60-eb9c-4e24-ae91-c1627bf2d439 devid 2 transid 176 /dev/mapper/btrfs2_crypt [ 15.197895] device fsid a844eb60-eb9c-4e24-ae91-c1627bf2d439 devid 1 transid 176 /dev/mapper/btrfs_crypt [ 15.200202] btrfs: disk space caching is enabled # btrfs device delete [something...] [ 1462.242456] btrfs: unable to go below two devices on raid1 # btrfs fi balance start -mconvert=dup -dconvert=raid0 /media [ 1895.048075] btrfs: unable to start balance with target metadata profile 32 # btrfs fi balance start -dconvert=raid0 /media [ 1917.106536] btrfs: relocating block group 10229907456 flags 17 [ 1929.188609] btrfs: relocating block group 8887730176 flags 17 [ 1944.690916] btrfs: found 2152 extents [ 1947.016210] btrfs: found 2152 extents [ 1947.421397] btrfs: relocating block group 7545552896 flags 17 [ 2024.225203] btrfs: found 36762 extents [ 2094.983055] btrfs: corrupt node block=8830455808,root=1: generation (197) too new in slot 0 (maximum expected 196) [ 2094.984858] [ cut here ] [ 2094.986076] WARNING: at fs/btrfs/super.c:219 __btrfs_abort_transaction+0xa5/0xc0() [ 2094.987912] Hardware name: Bochs [ 2094.988735] btrfs: Transaction abortedPid: 1623, comm: btrfs-transacti Not tainted 3.4.4 #9 [ 2094.988741] Call Trace: [ 2094.989361] [] warn_slowpath_common+0x75/0xb0 [ 2094.990829] [] warn_slowpath_fmt+0x41/0x50 [ 2094.992243] [] __btrfs_abort_transaction+0xa5/0xc0 [ 2094.993521] [] __btrfs_free_extent+0x213/0x7b0 [ 2094.994176] [] ? run_clustered_refs+0xd7/0xa10 [ 2094.994835] [] ? do_raw_spin_unlock+0x5d/0xb0 [ 2094.995481] [] run_clustered_refs+0x415/0xa10 [ 2094.996143] [] ? find_ref_head+0xb8/0xe0 [ 2094.996806] [] ? btrfs_run_delayed_refs+0x99/0x430 [ 2094.997505] [] btrfs_run_delayed_refs+0x16d/0x430 [ 2094.998188] [] ? next_block_group.isra.65+0x29/0x80 [ 2094.998890] [] ? _raw_spin_unlock+0x26/0x30 [ 2094.999520] [] btrfs_write_dirty_block_groups+0xb0/0x630 [ 2095.000281] [] ? free_extent_buffer+0x1a/0x70 [ 2095.000931] [] commit_cowonly_roots+0xe7/0x1c1 [ 2095.001610] [] btrfs_commit_transaction+0x519/0xa40 [ 2095.002317] [] ? abort_exclusive_wait+0xb0/0xb0 [ 2095.002997] [] transaction_kthread+0x245/0x2c0 [ 2095.003673] [] ? check_leaf.isra.105+0x300/0x300 [ 2095.004372] [] kthread+0x8e/0xa0 [ 2095.004902] [] kernel_thread_helper+0x4/0x10 [ 2095.005549] [] ? kthread_flush_work_fn+0x10/0x10 [ 2095.006222] [] ? gs_change+0x13/0x13 [ 2095.006787] ---[ end trace 8341f112debcf176 ]--- [ 2095.007287] BTRFS error (device dm-1) in __btrfs_free_extent:5134: IO failure [ 2095.008059] btrfs is forced readonly [ 2095.008454] btrfs: run_one_delayed_ref returned -5 [ 2095.008455] BTRFS error (device dm-1) in btrfs_run_delayed_refs:2454: IO failure [ 2176.876382] [ cut here ] [ 2176.877217] kernel BUG at fs/btrfs/relocation.c:3733! [ 2176.878139] invalid opcode: [#1] SMP [ 2176.879049] CPU 5 [ 2176.879435] Pid: 1676, comm: btrfs Tainted: GW3.4.4 #9 Bochs Bochs [ 2176.880383] RIP: 0010:[] [] relocate_block_group+0x643/0x690 [ 2176.880383] RSP: :880002bfdb08 EFLAGS: 00010206 [ 2176.880383] RAX: ffe2 RBX: ffe2 RCX: 880007aade10 [ 2176.880383] RDX: RSI: RDI: 880006fbe800 [ 2176.880383] RBP: 880002bfdb88 R08: 0003 R09: [ 2176.880383] R10: 000100072905 R11: 0001 R12: 8800037a8020 [ 2176.880383] R13: 5ee7 R14: 880006bbae10 R15: 8800037a8000 [ 2176.880383] FS: () GS:880007d4(0063) knlGS:0
Re: *** GMX Spamverdacht *** Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On 07.07.2012 01:44, Sami Liedes wrote: > [Retry: I think this mail didn't make it to the list, probably because > of the 73 kilobyte attached log. Here's a URL to the file:] > >http://www.niksula.hut.fi/~sliedes/btrfs-scrub-debug.log.gz > > Sami > > > > On Fri, Jul 06, 2012 at 05:09:00PM +0200, Jan Schmidt wrote: >> Oh I see. root->node can be NULL during mount. Please add this on top: > > Ok. So, ran it with DEBUG_PAGEALLOC and slub debugging on. This time > it took half an hour to crash, and there's _lots_ of checksum mismatch > [234] messages even before the crash. gzipped dmesg attached. > > At 781 seconds there's an "irq 17: nobody cared". That's a known bug > with this (and other Asus) motherboards and happens every now and > then. I doubt it has anything to do with this. > > I think I might try running it overnight with KMEMCHECK to see if it > reports something. But for now, what there's in the log: > > * lots of checksum mismatch [234], no 1s > > * a fair number of "csum_tree_block: [0-9]+ callbacks suppressed" > lines > > * two "btrfs: node seems invalid now. checksum ok = 1" messages, one > at 1499 seconds and another just before the crash at 1973 > > * Just before the crash: > btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len > (32768). eb start is 2261163409408, level 100, generation > 4412718571037421157, nritems 538968254. len param 17. debug > 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0 > At a first glance: the generation converted to ascii is: "ent() ==", so someone is patching the memory with ascii text, possibly C source. It might be interesting to dump the full contents of the eb, to get a clue on the source of the data. > * the oopses > >>> By the way, something seems to be untabifying your patches. I don't >>> know if it's on my side or yours, but at least some other patches I >>> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs >>> mostly makes them apply cleanly for me. >> >> Oh, I'm sorry. Should have been on my side. I hope it's better with the >> current >> diff? > > Yes. No problem :) > > [See attachment for dmesg log.] > > Sami > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: *** GMX Spamverdacht *** Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2
On 10.07.2012 06:16, Sami Liedes wrote: > On Mon, Jul 09, 2012 at 11:05:47AM +0200, Arne Jansen wrote: >>> * Just before the crash: >>> btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len >>> (32768). eb start is 2261163409408, level 100, generation >>> 4412718571037421157, nritems 538968254. len param 17. debug >>> 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0 >>> >> >> At a first glance: the generation converted to ascii is: "ent() ==", >> so someone is patching the memory with ascii text, possibly C source. >> It might be interesting to dump the full contents of the eb, to get >> a clue on the source of the data. > > I changed the code to dump the contents of the eb struct at the point > where that error (btrfs: invalid parameters...) is printed, at the > "checksum mismatch 4" site and at the "node seems invalid now" site. > Now I have a big log of 1795 corrupted ebs. So far nothing that looks > remotely like ascii text, though. But I have two different versions of > the eb that caused that warning, a less corrupted one and a more > corrupted one: > > > btrfs: --- start eb contents at 8801b13cc4c8 --- > btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00 > ...f > btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff > ...0... > btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00 > > btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc558: 58 c5 3c b1 01 88 ff ff 58 c5 3c b1 01 88 ff ff > X.<.X.<. > btrfs: 8801b13cc568: 00 00 00 00 00 00 00 00 70 c5 3c b1 01 88 ff ff > p.<. > btrfs: 8801b13cc578: 70 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00 > p.<. > btrfs: 8801b13cc588: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc598: 80 5f 9a 06 00 ea ff ff 00 86 9b 06 00 ea ff ff > ._.. > btrfs: 8801b13cc5a8: 40 4c 9a 06 00 ea ff ff 80 66 9a 06 00 ea ff ff > @L...f.. > btrfs: 8801b13cc5b8: 80 eb 9b 06 00 ea ff ff 40 05 a2 06 00 ea ff ff > @... > btrfs: 8801b13cc5c8: 40 e1 9b 06 00 ea ff ff 80 c4 9c 06 00 ea ff ff > @... > btrfs: 8801b13cc5d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc5e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc5f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc608: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc618: 98 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00 > ..<. > btrfs: 8801b13cc628: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc638: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc648: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc658: 00 00 00 00 00 00 00 00 > this one looks good so far. > btrfs: --- end eb contents at 8801b13cc4c8 --- > btrfs: dm-6 checksum verify failed on 2239404212224 wanted B5F632BC found > 3579FB59 level 160 > btrfs: node seems invalid now. checksum ok = 1 > btrfs: --- start eb contents at 8801b13cc4c8 --- > [... identical dump to above ...] > btrfs: --- end eb contents at 8801b13cc4c8 --- > btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len > (32768). eb start is 2239404212224, level 160, generation > 4716553384049587249, nritems 295705211. len param 17. debug > 2/989/295705211/4716553384049587249/0x0/0/0x0/0x0 > btrfs: --- start eb contents at 8801b13cc4c8 --- > btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00 > ...f > btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff > ...0... > btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00 > > btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > btrfs: 8801b13cc528: 00 00