btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-02 Thread Sami Liedes
Hi,

I just got this oops on a computer running 3.4.2.

A few minutes before I had started "btrfs device scrub /" and had a
watcher process running "btrfs scrub status /" every 5 seconds. After
a few gigabytes of scrubbing, I got this crash.

The oops is transcribed from photos, so it may contain some errors. I
tried to be careful, and double checked the backtrace.

Sami


general protection fault:  [#1] SMP
CPU 4
Modules linked in: tcp_diag inet_diag nfnetlink_log nfnetlink ufs qnx4 hfsplus 
hfs minix ntfs vfat msdos fat jfs reiserfs ext3 jbd ext2 ip6_tables ebtable_nat 
ebtables cn rfcomm bnep
parport_pc ppdev lp parport tun cpufreq_userspace cpufreq_stats 
cpufreq_powersave cpufreq_conservative binfmt_misc fuse nfsd nfs nfs_acl 
auth_rpcgss fscache lockd sunrpc iptable_filter ipt_MASQUERADE
ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack 
ip_tables x_tables xfs ext4 jbd2 mbcache radeon drm_kms_helper ttm drm 
i2c_algo_bit loop kvm_intel kvm snd_hda_codec_hdmi
snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio snd_usbmidi_lib 
snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm acpi_cpufreq snd_page_alloc 
snd_seq_midi snd_seq_midi_event snd_rawmi
di ath3k snd_seq snd_seq_device snd_timer iTCO_wdt bluetooth eeepci_wmi 
asus_wmi sparse_keymap crc16 rfkill pcspkr psmouse coretemp serio_raw evdev 
mperf pci_hotplug i2c_i801 i2c_core processor button
intel_agp snd mxm_wmi video wmi intel_gtt microcode soundcore sha256_generic 
dm_crypt dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor 
async_tx raid6_pq md_mod nbd btrfs libcrc32c
zlib_deflate sd_mod crc_t10dif crc32c_intel ghash_cmulni_intel firewire_ohci 
r8196 firewire_core ahci aesni_intel libahci mii crc_itu_t aes_x86_64 libata 
aes_generic cryptd scsi_mod e1000e thermal fa
n thermal_sys [last unloaded: scsi_wait_scan]

Pid: 30863, comm: btrfs-endio-met Tainted: GW3.4.2 #1 System 
manufacturer System Product Name/P8P67 EVO
RIP: 0010:[]  [] memcpy+0xd/0x110
RSP: :88003174dba8  EFLAGS: 00010202
RAX: 88003174dc8f RBX: 0011 RCX: 0002
RDX: 0001 RSI: 00050803 RDI: 88003174dc8f
RBP: 88003174dbf0 R08: 000a R09: 
R10:  R11:  R12: 88003174dca0
R13: 8800659f42b0 R14: 0048 R15: 0011
FS:  () GS:88021ed0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 0973c000 CR3: 000167ef3000 CR4: 000407e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process btrfs-endio-met (pid: 30863, threadinfo 88003174c000, task 
88006f818000)
Stack:
 a026bd6b 8801960f5000 8003 1000
 88003174dc58 03dd 88000ac13c60 88003174dc58
 696f70203a61685f 88003174dc00 a026904d 88003174dcd0
Call Trace:
 [] ? read_extent_buffer+0xbb/0x110 [btrfs]
 [] btrfs_node_key+0x1d/0x20 [btrfs]
 [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs]
 [] btree_readahead_hook+0x1f/0x40 [btrfs]
 [] btree_readpage_end_io_hook+0x111/0x260 [btrfs]
 [] ? find_first_extent_bit_state+0x22/0x80 [btrfs]
 [] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
 [] ? end_workqueue_fn+0x31/0x50 [btrfs]
 [] bio_endio+0x18/0x30
 [] end_workqueue_fn+0x3c/0x50 [btrfs]
 [] worker_loop+0x157/0x560 [btrfs]
 [] ? btrfs_queue_worker+0x310/0x310 [btrfs]
 [] kthread+0x8e/0xa0
 [] kernel_thread_helper+0x4/0x10
 [] ? flush_kthread_worker+0x70/0x70
 [] ? gs_change+0x13/0x13
Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 eb fb ff ff eb e1 90 90 90 
90 90 90 90...
8 4c 8b 56 10 4c
RIP  [] memcpy+0xd/0x110
 RSP 


signature.asc
Description: Digital signature


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2013-03-27 Thread Stefan Behrens
On Tue, 3 Jul 2012 02:01:21 +0300, Sami Liedes wrote:
> Hi,
> 
> I just got this oops on a computer running 3.4.2.
> 
> A few minutes before I had started "btrfs device scrub /" and had a
> watcher process running "btrfs scrub status /" every 5 seconds. After
> a few gigabytes of scrubbing, I got this crash.
> 
> The oops is transcribed from photos, so it may contain some errors. I
> tried to be careful, and double checked the backtrace.
> 
>   Sami
> 
> 
> general protection fault:  [#1] SMP
> CPU 4
> Modules linked in: tcp_diag inet_diag nfnetlink_log nfnetlink ufs qnx4 
> hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext3 jbd ext2 ip6_tables 
> ebtable_nat ebtables cn rfcomm bnep
> parport_pc ppdev lp parport tun cpufreq_userspace cpufreq_stats 
> cpufreq_powersave cpufreq_conservative binfmt_misc fuse nfsd nfs nfs_acl 
> auth_rpcgss fscache lockd sunrpc iptable_filter ipt_MASQUERADE
> ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack 
> ip_tables x_tables xfs ext4 jbd2 mbcache radeon drm_kms_helper ttm drm 
> i2c_algo_bit loop kvm_intel kvm snd_hda_codec_hdmi
> snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio 
> snd_usbmidi_lib snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm 
> acpi_cpufreq snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmi
> di ath3k snd_seq snd_seq_device snd_timer iTCO_wdt bluetooth eeepci_wmi 
> asus_wmi sparse_keymap crc16 rfkill pcspkr psmouse coretemp serio_raw evdev 
> mperf pci_hotplug i2c_i801 i2c_core processor button
> intel_agp snd mxm_wmi video wmi intel_gtt microcode soundcore sha256_generic 
> dm_crypt dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor 
> async_tx raid6_pq md_mod nbd btrfs libcrc32c
> zlib_deflate sd_mod crc_t10dif crc32c_intel ghash_cmulni_intel firewire_ohci 
> r8196 firewire_core ahci aesni_intel libahci mii crc_itu_t aes_x86_64 libata 
> aes_generic cryptd scsi_mod e1000e thermal fa
> n thermal_sys [last unloaded: scsi_wait_scan]
> 
> Pid: 30863, comm: btrfs-endio-met Tainted: GW3.4.2 #1 System 
> manufacturer System Product Name/P8P67 EVO
> RIP: 0010:[]  [] memcpy+0xd/0x110
> RSP: :88003174dba8  EFLAGS: 00010202
> RAX: 88003174dc8f RBX: 0011 RCX: 0002
> RDX: 0001 RSI: 00050803 RDI: 88003174dc8f
> RBP: 88003174dbf0 R08: 000a R09: 
> R10:  R11:  R12: 88003174dca0
> R13: 8800659f42b0 R14: 0048 R15: 0011
> FS:  () GS:88021ed0() knlGS:
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 0973c000 CR3: 000167ef3000 CR4: 000407e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process btrfs-endio-met (pid: 30863, threadinfo 88003174c000, task 
> 88006f818000)
> Stack:
>  a026bd6b 8801960f5000 8003 1000
>  88003174dc58 03dd 88000ac13c60 88003174dc58
>  696f70203a61685f 88003174dc00 a026904d 88003174dcd0
> Call Trace:
>  [] ? read_extent_buffer+0xbb/0x110 [btrfs]
>  [] btrfs_node_key+0x1d/0x20 [btrfs]
>  [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs]
>  [] btree_readahead_hook+0x1f/0x40 [btrfs]
>  [] btree_readpage_end_io_hook+0x111/0x260 [btrfs]
>  [] ? find_first_extent_bit_state+0x22/0x80 [btrfs]
>  [] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
>  [] ? end_workqueue_fn+0x31/0x50 [btrfs]
>  [] bio_endio+0x18/0x30
>  [] end_workqueue_fn+0x3c/0x50 [btrfs]
>  [] worker_loop+0x157/0x560 [btrfs]
>  [] ? btrfs_queue_worker+0x310/0x310 [btrfs]
>  [] kthread+0x8e/0xa0
>  [] kernel_thread_helper+0x4/0x10
>  [] ? flush_kthread_worker+0x70/0x70
>  [] ? gs_change+0x13/0x13
> Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 eb fb ff ff eb e1 90 90 90 
> 90 90 90 90...
> 8 4c 8b 56 10 4c
> RIP  [] memcpy+0xd/0x110
>  RSP 
> 


Tonight my box had the same on a system running 3.7.10. Also with a Btrfs scrub 
running.

general protection fault:  [#1] SMP
Modules linked in: xt_LOG xt_limit xt_multiport iptable_nat nf_nat_ipv4 nf_nat 
ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter 
ip6_tables iptable_mangle ipt_REJECT xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 
xt_state nf_conntrack iptable_filter ip_tables x_tables bnep rfcomm bluetooth 
sp5100_tco edac_core edac_mce_amd k8temp i2c_piix4 kvm_amd kvm e1000e serio_raw 
mac_hid shpchp lp parport btrfs zlib_deflate libcrc32c raid10 raid456 async_pq 
async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 
multipath linear ahci libahci [last unloaded: ipmi_msghandler]
CPU 1
Pid: 10376, comm: btrfs-endio-met Not tainted 3.7.10-030710-generic 
#201302271235 MICRO-STAR INTERNATIONAL CO., LTD MS-96B3/MS-96B3
RIP: 0010:[]  [] memc

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-02 Thread Sami Liedes
On Tue, Jul 03, 2012 at 02:01:21AM +0300, Sami Liedes wrote:
> The oops is transcribed from photos, so it may contain some errors. I
> tried to be careful, and double checked the backtrace.

Forgot to menion, the fs is a two-partition filesystem, with both data
and metadata mirrored.


Label: none  uuid: f26f08b1-89e6-4f2d-b7f9-857010dd2517
Total devices 2 FS bytes used 778.31GB
devid2 size 1.07TB used 1.05TB path /dev/dm-6
devid1 size 1.07TB used 1.05TB path /dev/dm-5


Sami


signature.asc
Description: Digital signature


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-03 Thread Jan Schmidt
> I just got this oops on a computer running 3.4.2.
> 
> A few minutes before I had started "btrfs device scrub /" and had a
> watcher process running "btrfs scrub status /" every 5 seconds. After
> a few gigabytes of scrubbing, I got this crash.
> 
> The oops is transcribed from photos, so it may contain some errors. I

You did *what*? :-) Uploading a photo would be fine, just in case that's easier
for you the next time.

> tried to be careful, and double checked the backtrace.
> 
>   Sami
> 
> 
> general protection fault:  [#1] SMP
> CPU 4
> Modules linked in: tcp_diag inet_diag nfnetlink_log nfnetlink ufs qnx4 
> hfsplus hfs minix ntfs vfat msdos fat jfs reiserfs ext3 jbd ext2 ip6_tables 
> ebtable_nat ebtables cn rfcomm bnep
> parport_pc ppdev lp parport tun cpufreq_userspace cpufreq_stats 
> cpufreq_powersave cpufreq_conservative binfmt_misc fuse nfsd nfs nfs_acl 
> auth_rpcgss fscache lockd sunrpc iptable_filter ipt_MASQUERADE
> ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack 
> ip_tables x_tables xfs ext4 jbd2 mbcache radeon drm_kms_helper ttm drm 
> i2c_algo_bit loop kvm_intel kvm snd_hda_codec_hdmi
> snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio 
> snd_usbmidi_lib snd_hwdep snd_pcm_oss snd_mixer_oss joydev snd_pcm 
> acpi_cpufreq snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmi
> di ath3k snd_seq snd_seq_device snd_timer iTCO_wdt bluetooth eeepci_wmi 
> asus_wmi sparse_keymap crc16 rfkill pcspkr psmouse coretemp serio_raw evdev 
> mperf pci_hotplug i2c_i801 i2c_core processor button
> intel_agp snd mxm_wmi video wmi intel_gtt microcode soundcore sha256_generic 
> dm_crypt dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor xor 
> async_tx raid6_pq md_mod nbd btrfs libcrc32c
> zlib_deflate sd_mod crc_t10dif crc32c_intel ghash_cmulni_intel firewire_ohci 
> r8196 firewire_core ahci aesni_intel libahci mii crc_itu_t aes_x86_64 libata 
> aes_generic cryptd scsi_mod e1000e thermal fa
> n thermal_sys [last unloaded: scsi_wait_scan]
> 
> Pid: 30863, comm: btrfs-endio-met Tainted: GW3.4.2 #1 System 
> manufacturer System Product Name/P8P67 EVO
> RIP: 0010:[]  [] memcpy+0xd/0x110
> RSP: :88003174dba8  EFLAGS: 00010202
> RAX: 88003174dc8f RBX: 0011 RCX: 0002
> RDX: 0001 RSI: 00050803 RDI: 88003174dc8f
> RBP: 88003174dbf0 R08: 000a R09: 
> R10:  R11:  R12: 88003174dca0
> R13: 8800659f42b0 R14: 0048 R15: 0011
> FS:  () GS:88021ed0() knlGS:
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 0973c000 CR3: 000167ef3000 CR4: 000407e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process btrfs-endio-met (pid: 30863, threadinfo 88003174c000, task 
> 88006f818000)
> Stack:
>  a026bd6b 8801960f5000 8003 1000
>  88003174dc58 03dd 88000ac13c60 88003174dc58
>  696f70203a61685f 88003174dc00 a026904d 88003174dcd0
> Call Trace:
>  [] ? read_extent_buffer+0xbb/0x110 [btrfs]
>  [] btrfs_node_key+0x1d/0x20 [btrfs]
>  [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs]
>  [] btree_readahead_hook+0x1f/0x40 [btrfs]
>  [] btree_readpage_end_io_hook+0x111/0x260 [btrfs]
>  [] ? find_first_extent_bit_state+0x22/0x80 [btrfs]
>  [] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
>  [] ? end_workqueue_fn+0x31/0x50 [btrfs]
>  [] bio_endio+0x18/0x30
>  [] end_workqueue_fn+0x3c/0x50 [btrfs]
>  [] worker_loop+0x157/0x560 [btrfs]
>  [] ? btrfs_queue_worker+0x310/0x310 [btrfs]
>  [] kthread+0x8e/0xa0
>  [] kernel_thread_helper+0x4/0x10
>  [] ? flush_kthread_worker+0x70/0x70
>  [] ? gs_change+0x13/0x13
> Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 eb fb ff ff eb e1 90 90 90 
> 90 90 90 90...
> 8 4c 8b 56 10 4c
> RIP  [] memcpy+0xd/0x110
>  RSP 

That's looking strange.

I checked the readahead code again: It deliberately skips locking and uses
btrfs_node_key with a counter variable. This means, we might end up reading a
key that's no longer actually there. However, it only operates on nodes of
trees, not leaves. Node entries have a fixed size, so no matter what changes in
the node, you won't reach behind the end of that node with an index that was
valid the moment before.

As far as I see it, that algorithm is safe. It could miss some keys or do some
extra work that's not strictly required, but it should never reach a GPF from
btrfs_node_key.

If no other ideas come up, I'd try memtesting that machine.

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-03 Thread Sami Liedes
On Tue, Jul 03, 2012 at 02:01:21AM +0300, Sami Liedes wrote:
> I just got this oops on a computer running 3.4.2.

I now repeated this on 3.4.4. Merely running a "btrfs scrub start /"
causes this after a couple of minutes of running. This time I didn't
run "btrfs scrub status /" in a loop, so that's not the cause. And
judging from the two tries and two crashes, this seems very
repeatable.

The backtrace is basically exactly the same as with 3.4.4.

Sami


signature.asc
Description: Digital signature


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-03 Thread Sami Liedes
On Tue, Jul 03, 2012 at 03:11:04PM +0200, Jan Schmidt wrote:
> > The oops is transcribed from photos, so it may contain some errors. I
> 
> You did *what*? :-) Uploading a photo would be fine, just in case that's 
> easier
> for you the next time.

Nah, it's sometimes refreshing to do something that doesn't take too
much thinking for a while. Besides I fault myself for not having had
netconsole logging on that machine and needed to punish myself for
that omission ;)

> That's looking strange.
> 
> I checked the readahead code again: It deliberately skips locking and uses
> btrfs_node_key with a counter variable. This means, we might end up reading a
> key that's no longer actually there. However, it only operates on nodes of
> trees, not leaves. Node entries have a fixed size, so no matter what changes 
> in
> the node, you won't reach behind the end of that node with an index that was
> valid the moment before.
> 
> As far as I see it, that algorithm is safe. It could miss some keys or do some
> extra work that's not strictly required, but it should never reach a GPF from
> btrfs_node_key.
> 
> If no other ideas come up, I'd try memtesting that machine.

Ran a full pass of memtest86+ v4.20, with no errors found. Also the
machine works very well in all other respects under heavy load.

I think I might try setting up that netconsole to see if there are any
interesting console messages before the oops... As I said, I also was
able to reproduce this on 3.4.4, so ATM I assume I'm able to reproduce
this at will.

Sami


signature.asc
Description: Digital signature


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-03 Thread Jan Schmidt
On 03.07.2012 15:58, Sami Liedes wrote:
> I think I might try setting up that netconsole to see if there are any
> interesting console messages before the oops... As I said, I also was
> able to reproduce this on 3.4.4, so ATM I assume I'm able to reproduce
> this at will.

That would be helpful. It's good that it's reproducible. Please paste the entire
output if possible.

If there's nothing immediately useful in there, I'll try to make up some
debugging patches for that problem.

Thanks!
-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-03 Thread Sami Liedes
On Tue, Jul 03, 2012 at 04:35:25PM +0200, Jan Schmidt wrote:
> On 03.07.2012 15:58, Sami Liedes wrote:
> > I think I might try setting up that netconsole to see if there are any
> > interesting console messages before the oops... As I said, I also was
> > able to reproduce this on 3.4.4, so ATM I assume I'm able to reproduce
> > this at will.
> 
> That would be helpful. It's good that it's reproducible. Please paste the 
> entire
> output if possible.
> 
> If there's nothing immediately useful in there, I'll try to make up some
> debugging patches for that problem.

There's certainly something there; unfortunately netconsole eats most
of it. (Is there a better way to capture full dmesg from panic?
Assuming a computer without a serial port.)

I've seen this before: An overly long "Modules linked in:" line causes
a large gap in netconsole output.

Hovever perhaps the most interesting piece is there:

   [30846.822395] WARNING: at fs/btrfs/extent_io.c:4522 
read_extent_buffer+0xe6/0x110 [btrfs]()

Unfortunately, no backtrace for that :( Ideas?

Sami



[30638.258946] netpoll: netconsole: local port 
[30638.258950] netpoll: netconsole: local IP 192.168.1.2
[30638.258952] netpoll: netconsole: interface 'eth0'
[30638.258953] netpoll: netconsole: remote port 1194
[30638.258954] netpoll: netconsole: remote IP 192.168.1.73
[30638.258956] netpoll: netconsole: remote ethernet address 00:1c:10:44:47:2c
[30638.259002] console [netcon0] enabled
[30638.259004] netconsole: network logging started
[30846.822356] [ cut here ]
[30846.822395] WARNING: at fs/btrfs/extent_io.c:4522 
read_extent_buffer+0xe6/0x110 [btrfs]()
[30846.822400] Hardware name: System Product Name
[30846.822402] Modules linked in: netconsole configfs tun cn ip6_tables 
ebtable_nat ebtables parport_pc ppdev lp parport bnep rfcomm cpufreq_userspace 
cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc nfsd nfs 
nfs_acl auth_rpcgss fscache lockd sunrpc iptable_filter ipt_MASQUERADE 
ipt_REDIRECT iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack 
ip_tables x_tables xfs ext4 jbd2 mbcache radeon drm_kms_helper ttm drm 
i2c_algo_bit loop fuse kvm_intel kvm snd_hda_codec_hdmi snd_hda_codec_realtek 
snd_usb_audio snd_hda_intel snd_usbmidi_lib snd_hda_codec snd_hwdep snd_pcm_oss 
joydev snd_mixer_oss snd_pcm snd_page_alloc eeepc_wmi90 90 90 48 89 f8 48 89 90 
48 c1 e9 03 83 e2 07  48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 
10 4c 
[30846.825219] RIP  [] memcpy+0xd/0x110
[30846.825248]  RSP 
 xord1  raid6_pq md_mod nbd btrfs libcrc32c zlib_deflate sd_mod crc_t10dif ahci 
libahci crc32c_intel r8169 pci_hotplug bluetooth snd_seq_device snd_timer snd 
intel_agp acpi_cpufreq mperf mxm_wmi psmouse wmi coretemp soundcore processor 
serio_raw i2c_i801 iTCO_wdt video rfkill microcode crc16 evdev i2c_core button 
intel_gtt[30846.822679]  [] 
btree_readpage_end_io_hook+0x111/0x260 [btrfs]
 sha256_generic dm_crypt dm_mod raid456 async_raid6_recov async_memcpy async_pq 
async_xor xor async_tx[30846.822773]  [] ? gs_change+0x13/0x13
 md_mod nbd btrfs libcrc32c zlib_deflate sd_mod crc_t10dif ahci libahci 
raid6_pq r8169 mii firewire_ohci ghash_clmulni_intel firewire_core aesni_intel 
crc_itu_t aes_x86_64 bnep aes_generic cryptd scsi_mod e1000e fan thermal nfsd 
thermal_sys [last unloaded: netconsole] auth_rpcgss[30846.823704] Pid: 16224, 
comm: btrfs-endio-met Tainted: GW3.4.4 #1 System manufacturer 
System Product Name/P8P67 EVO
[30846.823767] RIP: 0010:[]  [] 
memcpy+0xd/0x110
[30846.823805] RSP: :88009ecebba8  EFLAGS: 00010202
[30846.823830] RAX: 88009ecebc8f RBX: 0011 RCX: 0002
[30846.823865] RDX: 0001 RSI: 00050803 RDI: 88009ecebc8f

[30846.823933] R10: 88001d2395c0 R11: 8801f198c780 R12: 88009ecebca0
[30846.823968] R13: 88008eb01178 R14: 0048 R15: 0011
[30846.824002] FS:  () GS:88021ec8() 
knlGS:
[30846.824042] CS:  0010 DS:  ES:  CR0: 8005003b
[30846.824069] CR2: 0add813c CR3: 000163305000 CR4: 000407e0
[30846.823694] 
[30846.824101] DR0:  DR1:  DR2: 
[30846.824136] DR3:  DR6: 0ff0 DR7: 0400
[30846.824171] Process btrfs-endio-met (pid: 16224, threadinfo 
88009ecea000, task 88020f3142f0)
[30846.824225]  a0275d5b 880167d3 8003 
1000
[30846.824273]  88009ecebc58 03dd[30846.824214] Stack:
 880099cdea20
[30846.824318]  6172745f69706d20 88009ecebc00 a026d03d 
88009ecebc58 88009ecebcd0
[30846.824363] Call Trace:
[30846.824434]  [] btrfs_node_key+0x1d/0x20 [btrfs]
[30846.824474]  [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs]
[30846.824517]  [] btree_readahead_hook+0x1f/0x40 [btrfs]

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-03 Thread Sami Liedes
On Wed, Jul 04, 2012 at 01:47:56AM +0300, Sami Liedes wrote:
> I've seen this before: An overly long "Modules linked in:" line causes
> a large gap in netconsole output.

I managed to capture the entire output using netconsole by modifying
the kernel to not output the list of modules.

Sami



[  125.827919] netpoll: netconsole: local port 
[  125.827946] netpoll: netconsole: local IP 192.168.1.2
[  125.827969] netpoll: netconsole: interface 'eth0'
[  125.827990] netpoll: netconsole: remote port 1194
[  125.828011] netpoll: netconsole: remote IP 192.168.1.73
[  125.828034] netpoll: netconsole: remote ethernet address 00:1c:10:44:47:2c
[  125.828169] console [netcon0] enabled
[  125.828193] netconsole: network logging started
[  247.787472] [ cut here ]
[  247.787536] WARNING: at fs/btrfs/extent_io.c:4522 
read_extent_buffer+0xe6/0x110 [btrfs]()
[  247.787573] Hardware name: System Product Name
[  247.787594] Modules linked in:   [last unloaded: scsi_wait_scan]
[  247.787632] Pid: 1146, comm: btrfs-endio-met Tainted: GW
3.4.4-modded-oops+ #1
[  247.787674] Call Trace:
[  247.787692]  [] warn_slowpath_common+0x7a/0xb0
[  247.787794]  [] warn_slowpath_null+0x15/0x20
[  247.787835]  [] read_extent_buffer+0xe6/0x110 [btrfs]
[  247.787877]  [] btrfs_node_key+0x1d/0x20 [btrfs]
[  247.787917]  [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs]
[  247.787959]  [] btree_readahead_hook+0x1f/0x40 [btrfs]
[  247.787999]  [] btree_readpage_end_io_hook+0x111/0x260 
[btrfs]
[  247.788043]  [] ? find_first_extent_bit_state+0x22/0x80 
[btrfs]
[  247.788087]  [] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
[  247.788129]  [] ? end_workqueue_fn+0x31/0x50 [btrfs]
[  247.788161]  [] bio_endio+0x18/0x30
[  247.788194]  [] end_workqueue_fn+0x3c/0x50 [btrfs]
[  247.788234]  [] worker_loop+0x157/0x560 [btrfs]
[  247.788272]  [] ? btrfs_queue_worker+0x310/0x310 [btrfs]
[  247.788306]  [] kthread+0x8e/0xa0
[  247.788332]  [] kernel_thread_helper+0x4/0x10
[  247.788360]  [] ? flush_kthread_worker+0x70/0x70
[  247.788389]  [] ? gs_change+0x13/0x13
[  247.788413] ---[ end trace e93713a9d40cd06e ]---
[  247.788444] general protection fault:  [#1] SMP 
[  247.788473] CPU 5 
[  247.788484] Modules linked in:   [last unloaded: scsi_wait_scan]
[  247.788523] 
[  247.788533] Pid: 1146, comm: btrfs-endio-met Tainted: GW
3.4.4-modded-oops+ #1 System manufacturer System Product Name/P8P67 EVO
[  247.788595] RIP: 0010:[]  [] 
memcpy+0xd/0x110
[  247.788632] RSP: :8802152f3ba8  EFLAGS: 00010202
[  247.788656] RAX: 8802152f3c8f RBX: 0011 RCX: 0002
[  247.788687] RDX: 0001 RSI: 00050803 RDI: 8802152f3c8f
[  247.788719] RBP: 8802152f3bf0 R08:  R09: 
[  247.788750] R10: 880207dd89c0 R11: 8801f3a90780 R12: 8802152f3ca0
[  247.788782] R13: 8801c1d5b830 R14: 0048 R15: 0011
[  247.788814] FS:  () GS:88021ed4() 
knlGS:
[  247.788850] CS:  0010 DS:  ES:  CR0: 8005003b
[  247.788876] CR2: 038942a0 CR3: 0001c26ce000 CR4: 000407e0
[  247.788907] DR0:  DR1:  DR2: 
[  247.788939] DR3:  DR6: 0ff0 DR7: 0400
[  247.788971] Process btrfs-endio-met (pid: 1146, threadinfo 8802152f2000, 
task 8801f4192ca0)
[  247.789009] Stack:
[  247.789021]  a01d7d5b 8802010b7000 8003 
1000
[  247.789066]  8802152f3c58 03dd 88021032fa20 
8802152f3c58
[  247.789112]  6669646e65230a29 8802152f3c00 a01cf03d 
8802152f3cd0
[  247.789157] Call Trace:
[  247.789184]  [] ? read_extent_buffer+0xbb/0x110 [btrfs]
[  247.789226]  [] btrfs_node_key+0x1d/0x20 [btrfs]
[  247.789265]  [] __readahead_hook.isra.5+0x3c0/0x420 [btrfs]
[  247.789308]  [] btree_readahead_hook+0x1f/0x40 [btrfs]
[  247.789348]  [] btree_readpage_end_io_hook+0x111/0x260 
[btrfs]
[  247.789391]  [] ? find_first_extent_bit_state+0x22/0x80 
[btrfs]
[  247.789434]  [] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
[  247.789475]  [] ? end_workqueue_fn+0x31/0x50 [btrfs]
[  247.789506]  [] bio_endio+0x18/0x30
[  247.789539]  [] end_workqueue_fn+0x3c/0x50 [btrfs]
[  247.789578]  [] worker_loop+0x157/0x560 [btrfs]
[  247.789616]  [] ? btrfs_queue_worker+0x310/0x310 [btrfs]
[  247.789648]  [] kthread+0x8e/0xa0
[  247.789672]  [] kernel_thread_helper+0x4/0x10
[  247.789700]  [] ? flush_kthread_worker+0x70/0x70
[  247.789729]  [] ? gs_change+0x13/0x13
[  247.789753] Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 eb fb ff ff 
eb e1 90 90 90 90 90 90 90 90 90 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07  48 
a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 
[  247.790092] RIP  [] memcpy+0xd/0x110
[  247.790120]  RSP 
[  247.790138] ---

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-04 Thread Jan Schmidt
On 04.07.2012 02:17, Sami Liedes wrote:
> On Wed, Jul 04, 2012 at 01:47:56AM +0300, Sami Liedes wrote:
>> I've seen this before: An overly long "Modules linked in:" line causes
>> a large gap in netconsole output.
> 
> I managed to capture the entire output using netconsole by modifying
> the kernel to not output the list of modules.

Okay, thanks for the output. Can you please apply the patch below and capture
especially the line printed before the "cut here" line?

Thanks!
-Jan

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c9018a0..beabe99 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4519,7 +4519,14 @@ void read_extent_buffer(struct extent_buffer *eb, void 
*dstv,
size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
 
-   WARN_ON(start > eb->len);
+   if (start > eb->len) {
+   printk(KERN_ERR "btrfs: invalid parameters for 
read_extent_buffer: start (%lu) > eb->len (%lu). eb start is %llu, level %d, 
generation %llu, nritems %d. len param %lu. debug %llu/%llu/%llu/%llu\n",
+   start, eb->len, eb->start, btrfs_header_level(eb),
+   btrfs_header_generation(eb), btrfs_header_nritems(eb),
+   len,
+   eb->debug[0], eb->debug[1], eb->debug[2], eb->debug[3]);
+   WARN_ON(1);
+   }
WARN_ON(start + len > eb->start + eb->len);
 
offset = (start_offset + start) & ((unsigned long)PAGE_CACHE_SIZE - 1);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b516c3b..1bbf823 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -164,6 +164,8 @@ struct extent_buffer {
wait_queue_head_t lock_wq;
struct page *inline_pages[INLINE_EXTENT_BUFFER_PAGES];
struct page **pages;
+
+   u64 debug[4];
 };
 
 static inline void extent_set_compress_type(unsigned long *bio_flags,
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index ac5d010..d9c1146 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -168,10 +168,15 @@ static int __readahead_hook(struct btrfs_root *root, 
struct extent_buffer *eb,
struct btrfs_key key;
struct btrfs_key next_key;
 
+   eb->debug[0] = 1;
+   eb->debug[1] = i;
+   eb->debug[2] = nritems;
+   eb->debug[3] = generation;
btrfs_node_key_to_cpu(eb, &key, i);
-   if (i + 1 < nritems)
+   if (i + 1 < nritems) {
+   eb->debug[0] = 2;
btrfs_node_key_to_cpu(eb, &next_key, i + 1);
-   else
+   } else
next_key = re->top;
bytenr = btrfs_node_blockptr(eb, i);
n_gen = btrfs_node_ptr_generation(eb, i);
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-04 Thread Sami Liedes
On Wed, Jul 04, 2012 at 01:26:46PM +0200, Jan Schmidt wrote:
> On 04.07.2012 02:17, Sami Liedes wrote:
> > On Wed, Jul 04, 2012 at 01:47:56AM +0300, Sami Liedes wrote:
> >> I've seen this before: An overly long "Modules linked in:" line causes
> >> a large gap in netconsole output.
> > 
> > I managed to capture the entire output using netconsole by modifying
> > the kernel to not output the list of modules.
> 
> Okay, thanks for the output. Can you please apply the patch below and capture
> especially the line printed before the "cut here" line?

Here you go.

Sami


[  121.524803] netpoll: netconsole: local port 
[  121.524831] netpoll: netconsole: local IP 192.168.1.2
[  121.524853] netpoll: netconsole: interface 'eth0'
[  121.524874] netpoll: netconsole: remote port 1194
[  121.524894] netpoll: netconsole: remote IP 192.168.1.73
[  121.524917] netpoll: netconsole: remote ethernet address 00:1c:10:44:47:2c
[  121.525055] console [netcon0] enabled
[  121.525074] netconsole: network logging started
[  200.980496] btrfs: invalid parameters for read_extent_buffer: start (32771) 
> eb->len (32768). eb start is 2243489562624, level 26, generation 
3144240307695375391, nritems 620178657. len param 17. debug 
2/989/620178657/3144240307695375391
[  200.980594] [ cut here ]
[  200.980644] WARNING: at fs/btrfs/extent_io.c:4528 
read_extent_buffer+0x167/0x1a0 [btrfs]()
[  200.980681] Hardware name: System Product Name
[  200.980701] Modules linked in:   [last unloaded: scsi_wait_scan]
[  200.980739] Pid: 1145, comm: btrfs-endio-met Tainted: GW
3.4.4-modded-oops+ #2
[  200.980774] Call Trace:
[  200.980792]  [] warn_slowpath_common+0x7a/0xb0
[  200.980821]  [] warn_slowpath_null+0x15/0x20
[  200.980860]  [] read_extent_buffer+0x167/0x1a0 [btrfs]
[  200.980902]  [] btrfs_node_key+0x1d/0x20 [btrfs]
[  200.980941]  [] __readahead_hook.isra.5+0x3ff/0x460 [btrfs]
[  200.980982]  [] btree_readahead_hook+0x1f/0x40 [btrfs]
[  200.981022]  [] btree_readpage_end_io_hook+0x111/0x260 
[btrfs]
[  200.981065]  [] ? find_first_extent_bit_state+0x22/0x80 
[btrfs]
[  200.981109]  [] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
[  200.981150]  [] ? end_workqueue_fn+0x31/0x50 [btrfs]
[  200.981182]  [] bio_endio+0x18/0x30
[  200.981214]  [] end_workqueue_fn+0x3c/0x50 [btrfs]
[  200.981253]  [] worker_loop+0x157/0x560 [btrfs]
[  200.981291]  [] ? btrfs_queue_worker+0x310/0x310 [btrfs]
[  200.981323]  [] kthread+0x8e/0xa0
[  200.981348]  [] kernel_thread_helper+0x4/0x10
[  200.981377]  [] ? flush_kthread_worker+0x70/0x70
[  200.981406]  [] ? gs_change+0x13/0x13
[  200.981430] ---[ end trace e93713a9d40cd06e ]---
[  200.981459] general protection fault:  [#1] SMP 
[  200.981487] CPU 2 
[  200.981498] Modules linked in:   [last unloaded: scsi_wait_scan]
[  200.981540] 
[  200.981550] Pid: 1145, comm: btrfs-endio-met Tainted: GW
3.4.4-modded-oops+ #2 System manufacturer System Product Name/P8P67 EVO
[  200.981612] RIP: 0010:[]  [] 
memcpy+0xd/0x110
[  200.981650] RSP: :8801f4bf7b68  EFLAGS: 00010202
[  200.981675] RAX: 8801f4bf7c8f RBX: 0011 RCX: 0002
[  200.981707] RDX: 0001 RSI: 00050803 RDI: 8801f4bf7c8f
[  200.981738] RBP: 8801f4bf7be0 R08:  R09: 
[  200.981769] R10: 8801f4bf7c8f R11: 8801f3930780 R12: 8801f4bf7ca0
[  200.981800] R13: 8801f7286178 R14: 0048 R15: 0011
[  200.981832] FS:  () GS:88021ec8() 
knlGS:
[  200.981868] CS:  0010 DS:  ES:  CR0: 8005003b
[  200.981894] CR2: f773a000 CR3: 00020d0de000 CR4: 000407e0
[  200.981925] DR0:  DR1:  DR2: 
[  200.981956] DR3:  DR6: 0ff0 DR7: 0400
[  200.981988] Process btrfs-endio-met (pid: 1145, threadinfo 8801f4bf6000, 
task 8801f4ff42f0)
[  200.982026] Stack:
[  200.982038]  a01d6d5b 880124f72ce1 0011 
0002
[  200.982084]  03dd 24f72ce1 2ba295e6a4a5d41f 
88020dc65000
[  200.982130]  8801f4bf7c8f 1000 8801f4bf7c58 
03dd
[  200.982175] Call Trace:
[  200.982201]  [] ? read_extent_buffer+0xbb/0x1a0 [btrfs]
[  200.982243]  [] btrfs_node_key+0x1d/0x20 [btrfs]
[  200.982281]  [] __readahead_hook.isra.5+0x3ff/0x460 [btrfs]
[  200.982322]  [] btree_readahead_hook+0x1f/0x40 [btrfs]
[  200.982362]  [] btree_readpage_end_io_hook+0x111/0x260 
[btrfs]
[  200.982405]  [] ? find_first_extent_bit_state+0x22/0x80 
[btrfs]
[  200.982448]  [] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
[  200.982489]  [] ? end_workqueue_fn+0x31/0x50 [btrfs]
[  200.982519]  [] bio_endio+0x18/0x30
[  200.982552]  [] end_workqueue_fn+0x3c/0x50 [btrfs]
[  200.982591]  [] worker_loop+0x157/0x560 [btrfs]
[  200

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-04 Thread Jan Schmidt
On 04.07.2012 18:03, Sami Liedes wrote:
> Here you go.
> 
>   Sami
> [...]
> [  200.980496] btrfs: invalid parameters for read_extent_buffer: start 
> (32771) > eb->len (32768). eb start is 2243489562624, level 26, generation 
> 3144240307695375391, nritems 620178657. len param 17. debug 
> 2/989/620178657/3144240307695375391

Wow, that's strange. Can you repeat your test once or twice and paste that line,
please? I'd like to get a feeling if the values are completely random.

Reading more of the readahead code now...

Thanks,
-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-04 Thread Sami Liedes
On Wed, Jul 04, 2012 at 06:38:00PM +0200, Jan Schmidt wrote:
> > [  200.980496] btrfs: invalid parameters for read_extent_buffer: start 
> > (32771) > eb->len (32768). eb start is 2243489562624, level 26, generation 
> > 3144240307695375391, nritems 620178657. len param 17. debug 
> > 2/989/620178657/3144240307695375391

Let's call this try 1. I ran it three more times, so below we have
tries 2, 3 and 4.

> Wow, that's strange. Can you repeat your test once or twice and paste that 
> line,
> please? I'd like to get a feeling if the values are completely random.

Curiously, it clearly takes longer for it to crash after starting the
scrub each time I run it. Also on try 4 I got an entirely different
crash (backtrace below). Now it scrubs maybe the first 200G or so of
both devices of the (raid-1) 2.2T filesystem before it crashes.

start and eb->len seem to be the same (32771 and 32768) every time. eb
start varies, but there's some pattern if you view them in hex:

  Try 1  20a5a66
  Try 2  20bb0018000
  Try 3  20a8bc28000
  Try 4  (no output, different crash)

The rest of the values seem to me to be completely different every time.

Sami


Try 2:


[12961.870107] btrfs: invalid parameters for read_extent_buffer: start (32771) 
> eb->len (32768). eb start is 2249220784128, level 14, generation 
2242260605927040034, nritems 117835525. len param 17. debug 
2/989/117835525/2242260605927040034
[12961.870204] [ cut here ]
[12961.870264] WARNING: at fs/btrfs/extent_io.c:4528 
read_extent_buffer+0x167/0x1a0 [btrfs]()
[12961.870302] Hardware name: System Product Name
[12961.870322] Modules linked in:   [last unloaded: scsi_wait_scan]
[12961.870367] Pid: 1144, comm: btrfs-endio-met Tainted: GW
3.4.4-modded-oops+ #2
[12961.870403] Call Trace:
[12961.870421]  [] warn_slowpath_common+0x7a/0xb0
[12961.870449]  [] warn_slowpath_null+0x15/0x20
[...]


Try 3:


[  531.770984] btrfs: invalid parameters for read_extent_buffer: start (32771) 
> eb->len (32768). eb start is 2244317708288, level 170, generation 
13639284858109917187, nritems 6171943. len param 17. debug 
2/989/6171943/13639284858109917187
[  531.771081] [ cut here ]
[  531.771133] WARNING: at fs/btrfs/extent_io.c:4528 
read_extent_buffer+0x167/0x1a0 [btrfs]()
[  531.771169] Hardware name: System Product Name
[  531.771191] Modules linked in:   [last unloaded: scsi_wait_scan]
[  531.771229] Pid: 1132, comm: btrfs-endio-met Tainted: GW
3.4.4-modded-oops+ #2
[  531.771265] Call Trace:
[  531.771282]  [] warn_slowpath_common+0x7a/0xb0
[...]


Try 4:


[   95.933108] netconsole: network logging started
[  982.651987] unable to find logical 691402650139365534 len 32768
[  982.652060] [ cut here ]
[  982.652085] kernel BUG at fs/btrfs/volumes.c:3725!
[  982.652109] invalid opcode:  [#1] SMP 
[  982.652138] CPU 4 
[  982.652149] Modules linked in:   [last unloaded: scsi_wait_scan]
[  982.652190] 
[  982.652201] Pid: 1127, comm: btrfs-endio-met Tainted: GW
3.4.4-modded-oops+ #2 System manufacturer System Product Name/P8P67 EVO
[  982.652264] RIP: 0010:[]  [] 
__btrfs_map_block+0x668/0x680 [btrfs]
[  982.652323] RSP: :8801f43cfa70  EFLAGS: 00010286
[  982.652347] RAX: 0049 RBX: 09985b0c0e52109e RCX: 0082
[  982.652379] RDX: 00e8 RSI: 0046 RDI: 0246
[  982.652411] RBP: 8801f43cfb10 R08:  R09: 
[  982.652443] R10: 8801f1fbf680 R11: 8801f3b00780 R12: 8801f0026108
[  982.652475] R13: 8000 R14: 8801f0026fe0 R15: 8801f43cfbb0
[  982.652507] FS:  () GS:88021ed0() 
knlGS:
[  982.652542] CS:  0010 DS:  ES:  CR0: 8005003b
[  982.652568] CR2: f775c000 CR3: 0002110a1000 CR4: 000407e0
[  982.652600] DR0:  DR1:  DR2: 
[  982.652632] DR3:  DR6: 0ff0 DR7: 0400
[  982.652664] Process btrfs-endio-met (pid: 1127, threadinfo 8801f43ce000, 
task 8801f401c2f0)
[  982.652703] Stack:
[  982.652715]  8801f43cfa90 811dc553 88021ec0ce88 
88021ed0ce80
[  982.652761]  8801f43cfb00 0086 8801f43cfaf0 
0086
[  982.652807]  00fdf43cfad0 8801f43cfba8 0024 
0004
[  982.652852] Call Trace:
[  982.652869]  [] ? cpumask_next_and+0x23/0x40
[  982.652897]  [] ? kmem_cache_alloc_trace+0xc2/0x100
[  982.652940]  [] btrfs_map_block+0x9/0x10 [btrfs]
[  982.652979]  [] reada_add_block+0x1c2/0x890 [btrfs]
[  9

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-05 Thread Jan Schmidt


On 04.07.2012 22:24, Sami Liedes wrote:
> On Wed, Jul 04, 2012 at 06:38:00PM +0200, Jan Schmidt wrote:
>>> [  200.980496] btrfs: invalid parameters for read_extent_buffer: start 
>>> (32771) > eb->len (32768). eb start is 2243489562624, level 26, generation 
>>> 3144240307695375391, nritems 620178657. len param 17. debug 
>>> 2/989/620178657/3144240307695375391
> 
> Let's call this try 1. I ran it three more times, so below we have
> tries 2, 3 and 4.
> 
>> Wow, that's strange. Can you repeat your test once or twice and paste that 
>> line,
>> please? I'd like to get a feeling if the values are completely random.
> 
> Curiously, it clearly takes longer for it to crash after starting the
> scrub each time I run it. Also on try 4 I got an entirely different
> crash (backtrace below). Now it scrubs maybe the first 200G or so of
> both devices of the (raid-1) 2.2T filesystem before it crashes.

Can you double check that there's nothing about corrected errors in your
logs? Scrub will correct errors along the way and log that. So maybe
we've only a few tries left to find the root cause.

> start and eb->len seem to be the same (32771 and 32768) every time.

That's the tree block size in your setup.

> eb start varies, but there's some pattern if you view them in hex:
> 
>   Try 1  20a5a66
>   Try 2  20bb0018000
>   Try 3  20a8bc28000
>   Try 4  (no output, different crash)

I fail to identify a real pattern there.

> The rest of the values seem to me to be completely different every time.

Which is itself interesting.

> Try 4:

That one is not that much different. We read some garbage from a tree
block and started the next read ahead cycle for the alledged children.
That way we came to a logical address that's out of bounds, instead of a
logical address with even more garbage.

I'd like to see if you corrupted your trees on disk in a really strange
manner (with matching checksums?), or if data comes from the disk intact
and becomes damaged thereafter.

Could you store the output of
btrfs-debug-tree /dev/[whatever]
before try number 5 and afterwards? It will be quite a lot if you've got
a lot of files in there. Don't send it anywhere right now, just store it
away if possible.

What I'd like to get in the next reply is the output of the attached
patch, a single pass should do this time.

NB: As we've already check_leaf doing exta leaf checks after reading
them, we should probably add something like check_node as a general
manner to make btrfs more robust.

Thank you,
-Jan

---
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..34122c2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -316,6 +316,11 @@ static int csum_tree_block(struct btrfs_root *root, struct 
extent_buffer *buf,
return 0;
 }

+int btrfs_csum_tree_block(struct btrfs_root *root, struct extent_buffer *buf)
+{
+   return csum_tree_block(root, buf, 1);
+}
+
 /*
  * we can't consider a given block up to date unless the transid of the
  * block matches the transid in the parent node's pointer.  This is how we
@@ -471,6 +476,12 @@ static int check_tree_block_fsid(struct btrfs_root *root,
   (unsigned long long)btrfs_header_bytenr(eb), \
   (unsigned long long)root->objectid, slot)

+#define CORRUPT_NODE(root, node, reason, ...)  \
+   printk(KERN_CRIT "btrfs: corrupt node block=%llu,"  \
+  "root=%llu: " reason,\
+  (unsigned long long)btrfs_header_bytenr(node),   \
+  (unsigned long long)root->objectid, ##__VA_ARGS__)
+
 static noinline int check_leaf(struct btrfs_root *root,
   struct extent_buffer *leaf)
 {
@@ -532,6 +543,42 @@ static noinline int check_leaf(struct btrfs_root *root,
return 0;
 }

+static noinline int check_node(struct btrfs_root *root,
+  struct extent_buffer *node)
+{
+   int i;
+   u32 nritems = btrfs_header_nritems(node);
+   u64 generation;
+
+   if (nritems == 0)
+   return 0;
+
+   if (nritems > BTRFS_NODEPTRS_PER_BLOCK(root)) {
+   CORRUPT_NODE(root, node, "nritems (%lu) too large (%lu)\n",
+(unsigned long)nritems,
+BTRFS_NODEPTRS_PER_BLOCK(root));
+   return -EIO;
+   }
+
+   if (node->len > root->nodesize) {
+   CORRUPT_NODE(root, node, "length (%lu) too large (%lu)\n",
+node->len, (unsigned long)root->nodesize);
+   return -EIO;
+   }
+
+   generation = btrfs_super_generation(root->fs_info->super_copy);
+   for (i = 0; i < nritems; i++) {
+   if (btrfs_node_ptr_generation(node, i) > generation) {
+   CORRUPT_NODE(root, node, "generation (%llu) too new in 
slot %d (maximum expected %llu)\n",
+btrfs_node_ptr_generation(node, i), i,
+ 

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-05 Thread Sami Liedes
On Thu, Jul 05, 2012 at 03:41:33PM +0200, Jan Schmidt wrote:
> I'd like to see if you corrupted your trees on disk in a really strange
> manner (with matching checksums?), or if data comes from the disk intact
> and becomes damaged thereafter.
> 
> Could you store the output of
>   btrfs-debug-tree /dev/[whatever]
> before try number 5 and afterwards? It will be quite a lot if you've got
> a lot of files in there. Don't send it anywhere right now, just store it
> away if possible.

Ok, I have it stored. I mounted the fs read-only to do that just to be
safe, because when I did it on a live fs I got some (~15) "parent
transid verify failed" messages.

> Can you double check that there's nothing about corrected errors in your
> logs? Scrub will correct errors along the way and log that. So maybe
> we've only a few tries left to find the root cause.

Nope, definitely nothing there. A few "unlinked orphans" messages at
mount time, but that's all. I'm sure I would have captured any further
messages on my netconsole. Also nothing in any earlier logs that I can
see.

Well, until now (try 6) anyway. I actually ran btrfs scrub / twice now
after running btrfs-debug-tree (tries 5 and 6), because on the first
try I ran on wrong kernel (without the debug patch). On try 5 it
crashed very quickly, probably in less than a second. I also noticed
that it doesn't actually crash the machine or even make the filesystem
unusable in any way (apart from unmount blocking at shutdown), so I
also have the same oopses in my log files stored on the same
filesystem. But no, nothing about corrected errors or checksum
failures. So:

Try 5 (without your new patch, but after running btrfs-debug-tree); no
checksum failures as before, similar crash to before:


[ 2036.512656] btrfs: invalid parameters for read_extent_buffer: start (32771) 
> eb->len (32768). eb start is 2239379931136, level 102, generation 
309534250463128, nritems 56187918. len param 17. debug 
2/989/56187918/309534250463128
[ 2036.512770] [ cut here ]
[ 2036.512834] WARNING: at fs/btrfs/extent_io.c:4528 
read_extent_buffer+0x167/0x1a0 [btrfs]()
[ 2036.512880] Hardware name: System Product Name
[ 2036.512906] Modules linked in:   [last unloaded: netconsole]
[ 2036.512954] Pid: 1135, comm: btrfs-endio-met Tainted: GW
3.4.4-modded-oops+ #2
[...]


Try 6, with your new patch - as a new thing, got checksum errors,
including one very slightly before the crash and with the same "eb
start" as in the "invalid parameters" message. Crashed in a couple of
minutes.

Now that I look at the the log below it actually has two crashes in
it, one of btrfs-endio-met pid 1212 and another one of btrfs-endio-met
pid some 72 seconds later. I didn't notice that until after the fact.

dm-6 (the device with checksum failures) is on a disk on which I don't
have anything else besides this partition, so I suppose it could be
corrupting data without me having noticed. I know it used to work at
one time, but that doesn't mean too much. Anyway, I'm sure I haven't
seen any "checksum verify failed" messages until now. None in the logs
either.

Ah, one thing I have forgotten to mention. I think I've been using
this filesystem for maybe a month now. It started as a single-device
filesystem, but I did restriping to raid-1 on June 12. dm-5 is the
original device, while dm-6 (the one with the checksum failures in try
6) is the added one. I still have logs of the restriping and
everything after that; grep reveals no unusual lines containing
"btrfs" in all the logs since then, apart from the crashes. Besides
the oopses and lines that mention btrfs just because of the kernel
version 3.4.4+btrfsdebug, this is a summary of all the lines that
mention btrfs:

  btrfs: disk space caching is enabled
  btrfs flagging fs with big metadata feature
  btrfs: new size for /dev/mapper/rootvg-scratch_crypt is 751617179648
 * this device is not on the filesystem with problems - although I
   haven't tried scrubbing this filesystem
  btrfs: unlinked 104 orphans (10 times with 4 different numbers)

And during restriping

  btrfs: found 10008 extents (and similar)
  btrfs: relocating block group 1019010351104 flags 1 (and similar)

And two different warnings in the last few days, but I think this is
only because the debug patches changed the code:

  WARNING: at fs/btrfs/extent_io.c:4522 read_extent_buffer+0xe6/0x110 [btrfs]()
  WARNING: at fs/btrfs/extent_io.c:4528 read_extent_buffer+0x167/0x1a0 [btrfs]()

Another thing that comes to mind that I might well be the only one
that uses btrfs with raid-1 on two separate dm-crypted devices
(contrary to the way suggested on Btrfs FAQ - in my experience that
can actually speed up things, especially on computers without hardware
crypto, because the kernel has only one kcryptd thread per dm-crypt
device and therefore only can u

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Jan Schmidt
On 06.07.2012 01:47, Sami Liedes wrote:
> On Thu, Jul 05, 2012 at 03:41:33PM +0200, Jan Schmidt wrote:
>> I'd like to see if you corrupted your trees on disk in a really strange
>> manner (with matching checksums?), or if data comes from the disk intact
>> and becomes damaged thereafter.
>>
>> Could you store the output of
>>  btrfs-debug-tree /dev/[whatever]
>> before try number 5 and afterwards? It will be quite a lot if you've got
>> a lot of files in there. Don't send it anywhere right now, just store it
>> away if possible.
> 
> Ok, I have it stored. I mounted the fs read-only to do that just to be
> safe, because when I did it on a live fs I got some (~15) "parent
> transid verify failed" messages.

You can drop that, your data on disk is fine. Furthermore, I'm getting the
feeling that this isn't really related to readahead, but we will find out
later if that's true.

>> Can you double check that there's nothing about corrected errors in your
>> logs? Scrub will correct errors along the way and log that. So maybe
>> we've only a few tries left to find the root cause.
> 
> Nope, definitely nothing there. A few "unlinked orphans" messages at
> mount time, but that's all. I'm sure I would have captured any further
> messages on my netconsole. Also nothing in any earlier logs that I can
> see.

Okay. At least that fits with the "everything is okay on disk" statement.

> Try 6, with your new patch - as a new thing, got checksum errors,
> including one very slightly before the crash and with the same "eb
> start" as in the "invalid parameters" message. Crashed in a couple of
> minutes.
> 
> Now that I look at the the log below it actually has two crashes in
> it, one of btrfs-endio-met pid 1212 and another one of btrfs-endio-met
> pid some 72 seconds later. I didn't notice that until after the fact.
> 
> dm-6 (the device with checksum failures) is on a disk on which I don't
> have anything else besides this partition, so I suppose it could be
> corrupting data without me having noticed. I know it used to work at
> one time, but that doesn't mean too much. Anyway, I'm sure I haven't
> seen any "checksum verify failed" messages until now. None in the logs
> either.
> 
> Ah, one thing I have forgotten to mention. I think I've been using
> this filesystem for maybe a month now. It started as a single-device
> filesystem, but I did restriping to raid-1 on June 12. dm-5 is the
> original device, while dm-6 (the one with the checksum failures in try
> 6) is the added one. I still have logs of the restriping and
> everything after that; grep reveals no unusual lines containing
> "btrfs" in all the logs since then, apart from the crashes. Besides
> the oopses and lines that mention btrfs just because of the kernel
> version 3.4.4+btrfsdebug, this is a summary of all the lines that
> mention btrfs:
> 
>   btrfs: disk space caching is enabled
>   btrfs flagging fs with big metadata feature
>   btrfs: new size for /dev/mapper/rootvg-scratch_crypt is 751617179648
>  * this device is not on the filesystem with problems - although I
>haven't tried scrubbing this filesystem
>   btrfs: unlinked 104 orphans (10 times with 4 different numbers)
> 
> And during restriping
> 
>   btrfs: found 10008 extents (and similar)
>   btrfs: relocating block group 1019010351104 flags 1 (and similar)
> 
> And two different warnings in the last few days, but I think this is
> only because the debug patches changed the code:
> 
>   WARNING: at fs/btrfs/extent_io.c:4522 read_extent_buffer+0xe6/0x110 
> [btrfs]()
>   WARNING: at fs/btrfs/extent_io.c:4528 read_extent_buffer+0x167/0x1a0 
> [btrfs]()
> 
> Another thing that comes to mind that I might well be the only one
> that uses btrfs with raid-1 on two separate dm-crypted devices
> (contrary to the way suggested on Btrfs FAQ - in my experience that
> can actually speed up things, especially on computers without hardware
> crypto, because the kernel has only one kcryptd thread per dm-crypt
> device and therefore only can use a single core per device. Plus I get
> the benefits of btrfs "smart raid".).
> 
> So, try 6. The three first lines are from me starting btrfs-debug-tree
> again just as a quick way to verify that my netconsole setup was
> working. The first checksum failure came pretty much instantly after
> starting scrub:
> 
> 
> [  168.207151] device fsid f26f08b1-89e6-4f2d-b7f9-857010dd2517 devid 1 
> transid 151087 /dev/dm-5
> [  168.208541] device fsid f26f08b1-89e6-4f2d-b7f9-857010dd2517 devid 2 
> transid 151087 /dev/dm-6
> [  168.284668] device fsid ad75beab-b52b-4f63-9d14-956cc8e95fc7 devid 1 
> transid 5001 /dev/dm-9
> [  188.507789] btrfs: dm-6 checksum verify failed on 2237998628864 wanted 
> C3D64F found DCBF0869 level 88
> [  188.507852] btrfs: node seems invalid now. checksum ok = 1
> [  262.706585] btrfs: dm-6 checksum verify failed on 2246859587584 wanted 
> 3D013CF8 found AD06874A level 77
> [  

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Chris Mason
On Fri, Jul 06, 2012 at 04:42:50AM -0600, Jan Schmidt wrote:
> ... down here in the stack. The warning is printed from two levels above,
> __readahead_hook.
> 
> Either I'm absolutely blind and there's code along the (rather short) road
> between those two that might do this I haven't seen. Or someone else messes
> with our extent buffers or the underlying pages. What really confuses me is
> that it happens so reproducibly.
> 
> I've no good idea at the moment how to go on. It might help to get a feeling 
> if
> it's shifting around at least a little bit or really constant in the timing of
> occurrence. So can you please apply the next patch on top of the other two and
> give it some more failure tries? The "checksum mismatch [1234]" line will be 
> of
> most interest. I'm also curious what the additional debug variables will say 
> in
> the extended version of the very first printk. You can leave out the stack
> traces if you like, they won't matter much anyway.

I would suggest turning on slab debug and CONFIG_DEBUG_PAGEALLOC.
Something really strange is happening here.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Sami Liedes
On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
> I've no good idea at the moment how to go on. It might help to get a feeling 
> if
> it's shifting around at least a little bit or really constant in the timing of
> occurrence. So can you please apply the next patch on top of the other two and
> give it some more failure tries? The "checksum mismatch [1234]" line will be 
> of
> most interest. I'm also curious what the additional debug variables will say 
> in
> the extended version of the very first printk. You can leave out the stack
> traces if you like, they won't matter much anyway.

Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
suggested by Chris Mason.

With those and the latest patch, there's an oops already at boot. I
don't have netconsole yet at that point, but here's the important
parts (sure I can capture it fully if you need).

By the way, something seems to be untabifying your patches. I don't
know if it's on my side or yours, but at least some other patches I
receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
mostly makes them apply cleanly for me.

Sami



btrfs: disk space caching is enabled
BUG: unable to handle kernel NULL pointer dereference at 0150
IP: [] check_node+0x138/0x250 [btrfs]
PGD 0
Oops:  [#1] SMP DEBUG_PAGEALLOC
CPU 6
Modules linked in:  [last unloaded: scsi_wait_scan]

Pid: 1176, comm: btrfs-endio-met Tainted: GW 3.4.4+btrfsdebug2 #2 
System Product Name/P8P67 EVO
RIP: 0010:[]  [] check_node+0x138/0x250 
[btrfs
[...]
Process btrfs-endio-met (pid: 1176, [...])
Call trace:
 [...] btree_readpage_end_io_hook+0x1e5/0x2d0 [btrfs]
 [...] end_bio_extent_readpage+0xcb/0xa30 [btrfs]
 [...] ? end_workqueue_fn+0x31/0x50 [btrfs]
 [...] bio_endio+0x18/0x30
 [...] end_workqueue_fn+0x3c/0x50 [btrfs]
 [...] worker_loop+0x157/0x560 [btrfs]
 [...] ? btrfs_queue_worker+0x310/0x310 [btrfs]
 [...] kthead+0x8e/0xa0
 [...] kernel_thread_helper+0x4/0x10
 [...] ? flush_kthread_worker+0x70/0x70
 [...] ? gs_change+0x13/0x13
Code: [...]
RIP  [] check_node+0x138/0x250 [btrfs]
 RSP 



signature.asc
Description: Digital signature


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Chris Mason
On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote:
> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
> > I've no good idea at the moment how to go on. It might help to get a 
> > feeling if
> > it's shifting around at least a little bit or really constant in the timing 
> > of
> > occurrence. So can you please apply the next patch on top of the other two 
> > and
> > give it some more failure tries? The "checksum mismatch [1234]" line will 
> > be of
> > most interest. I'm also curious what the additional debug variables will 
> > say in
> > the extended version of the very first printk. You can leave out the stack
> > traces if you like, they won't matter much anyway.
> 
> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
> suggested by Chris Mason.
> 
> With those and the latest patch, there's an oops already at boot. I
> don't have netconsole yet at that point, but here's the important
> parts (sure I can capture it fully if you need).
> 
> By the way, something seems to be untabifying your patches. I don't
> know if it's on my side or yours, but at least some other patches I
> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
> mostly makes them apply cleanly for me.
> 
>   Sami
> 
> 
> 
> btrfs: disk space caching is enabled
> BUG: unable to handle kernel NULL pointer dereference at 0150
> IP: [] check_node+0x138/0x250 [btrfs]

This isn't from any of the new debugging.  Can you please try it on an
unpatched kernel?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Jan Schmidt
On Fri, July 06, 2012 at 16:40 (+0200), Chris Mason wrote:
> On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote:
>> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
>>> I've no good idea at the moment how to go on. It might help to get a 
>>> feeling if
>>> it's shifting around at least a little bit or really constant in the timing 
>>> of
>>> occurrence. So can you please apply the next patch on top of the other two 
>>> and
>>> give it some more failure tries? The "checksum mismatch [1234]" line will 
>>> be of
>>> most interest. I'm also curious what the additional debug variables will 
>>> say in
>>> the extended version of the very first printk. You can leave out the stack
>>> traces if you like, they won't matter much anyway.
>>
>> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
>> suggested by Chris Mason.
>>
>> With those and the latest patch, there's an oops already at boot. I
>> don't have netconsole yet at that point, but here's the important
>> parts (sure I can capture it fully if you need).
>>
>> By the way, something seems to be untabifying your patches. I don't
>> know if it's on my side or yours, but at least some other patches I
>> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
>> mostly makes them apply cleanly for me.
>>
>>  Sami
>>
>>
>> 
>> btrfs: disk space caching is enabled
>> BUG: unable to handle kernel NULL pointer dereference at 0150
>> IP: [] check_node+0x138/0x250 [btrfs]
> 
> This isn't from any of the new debugging.  Can you please try it on an
> unpatched kernel?

You're confusing that with check_leaf. I added check_node along the way, see my
mail from Thu, July 05, 2012 at 15:41 (+0200). I'd really like to add something
similar for the 3.6 series.

Checking for the null pointer dereference.

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Jan Schmidt
On Fri, July 06, 2012 at 16:33 (+0200), Sami Liedes wrote:
> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
>> I've no good idea at the moment how to go on. It might help to get a feeling 
>> if
>> it's shifting around at least a little bit or really constant in the timing 
>> of
>> occurrence. So can you please apply the next patch on top of the other two 
>> and
>> give it some more failure tries? The "checksum mismatch [1234]" line will be 
>> of
>> most interest. I'm also curious what the additional debug variables will say 
>> in
>> the extended version of the very first printk. You can leave out the stack
>> traces if you like, they won't matter much anyway.
> 
> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
> suggested by Chris Mason.
> 
> With those and the latest patch, there's an oops already at boot. I
> don't have netconsole yet at that point, but here's the important
> parts (sure I can capture it fully if you need).

Oh I see. root->node can be NULL during mount. Please add this on top:

--
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index df0b347..22838a3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -578,7 +578,8 @@ static noinline int check_node(struct btrfs_root *root,
}
node->debug[5] = node->start;
node->debug[6] = btrfs_header_level(node);
-   node->debug[6] |= btrfs_header_level(root->node) << 16;
+   if (root->node)
+   node->debug[6] |= btrfs_header_level(root->node) << 16;
node->debug[7] = 0xb22f50b22f5;

return 0;
--


> By the way, something seems to be untabifying your patches. I don't
> know if it's on my side or yours, but at least some other patches I
> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
> mostly makes them apply cleanly for me.

Oh, I'm sorry. Should have been on my side. I hope it's better with the current
diff?

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Chris Mason
On Fri, Jul 06, 2012 at 09:02:34AM -0600, Jan Schmidt wrote:
> On Fri, July 06, 2012 at 16:40 (+0200), Chris Mason wrote:
> > On Fri, Jul 06, 2012 at 08:33:51AM -0600, Sami Liedes wrote:
> >> On Fri, Jul 06, 2012 at 12:42:50PM +0200, Jan Schmidt wrote:
> >>> I've no good idea at the moment how to go on. It might help to get a 
> >>> feeling if
> >>> it's shifting around at least a little bit or really constant in the 
> >>> timing of
> >>> occurrence. So can you please apply the next patch on top of the other 
> >>> two and
> >>> give it some more failure tries? The "checksum mismatch [1234]" line will 
> >>> be of
> >>> most interest. I'm also curious what the additional debug variables will 
> >>> say in
> >>> the extended version of the very first printk. You can leave out the stack
> >>> traces if you like, they won't matter much anyway.
> >>
> >> Ok. Also turned on CONFIG_DEBUG_PAGEALLOC and CONFIG_SLUB_DEBUG_ON as
> >> suggested by Chris Mason.
> >>
> >> With those and the latest patch, there's an oops already at boot. I
> >> don't have netconsole yet at that point, but here's the important
> >> parts (sure I can capture it fully if you need).
> >>
> >> By the way, something seems to be untabifying your patches. I don't
> >> know if it's on my side or yours, but at least some other patches I
> >> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
> >> mostly makes them apply cleanly for me.
> >>
> >>Sami
> >>
> >>
> >> 
> >> btrfs: disk space caching is enabled
> >> BUG: unable to handle kernel NULL pointer dereference at 0150
> >> IP: [] check_node+0x138/0x250 [btrfs]
> > 
> > This isn't from any of the new debugging.  Can you please try it on an
> > unpatched kernel?
> 
> You're confusing that with check_leaf. I added check_node along the way, see 
> my
> mail from Thu, July 05, 2012 at 15:41 (+0200). I'd really like to add 
> something
> similar for the 3.6 series.
> 
> Checking for the null pointer dereference.

Sorry, I wasn't clear.  I meant it wasn't from slab debug or
DEBUG_PAGEALLOC, so it must be new in your patches ;)

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Sami Liedes
On Fri, Jul 06, 2012 at 10:59:24PM +0300, Sami Liedes wrote:
> I think I might try running it overnight with KMEMCHECK to see if it
> reports something. But for now, what there's in the log:

My KMEMCHECK kernel didn't even boot (due to some weird KMEMCHECK/ACPI
interaction), so I won't pursue this idea further at the moment...

> * lots of checksum mismatch [234], no 1s

One thing to notice from the logs, too, is that the device seems to
always be dm-6, the second device of the filesystem. This never seems
to happen to dm-5. There are 1583 lines of "btrfs: dm-6 checksum
verify failed".

Sami


signature.asc
Description: Digital signature


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-06 Thread Sami Liedes
[Retry: I think this mail didn't make it to the list, probably because
of the 73 kilobyte attached log. Here's a URL to the file:]

   http://www.niksula.hut.fi/~sliedes/btrfs-scrub-debug.log.gz

Sami



On Fri, Jul 06, 2012 at 05:09:00PM +0200, Jan Schmidt wrote:
> Oh I see. root->node can be NULL during mount. Please add this on top:

Ok. So, ran it with DEBUG_PAGEALLOC and slub debugging on. This time
it took half an hour to crash, and there's _lots_ of checksum mismatch
[234] messages even before the crash. gzipped dmesg attached.

At 781 seconds there's an "irq 17: nobody cared". That's a known bug
with this (and other Asus) motherboards and happens every now and
then. I doubt it has anything to do with this.

I think I might try running it overnight with KMEMCHECK to see if it
reports something. But for now, what there's in the log:

* lots of checksum mismatch [234], no 1s

* a fair number of "csum_tree_block: [0-9]+ callbacks suppressed"
  lines

* two "btrfs: node seems invalid now. checksum ok = 1" messages, one
  at 1499 seconds and another just before the crash at 1973

* Just before the crash:
  btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
(32768). eb start is 2261163409408, level 100, generation 4412718571037421157, 
nritems 538968254. len param 17. debug 
2/989/538968254/4412718571037421157/0x0/0/0x0/0x0

* the oopses

> > By the way, something seems to be untabifying your patches. I don't
> > know if it's on my side or yours, but at least some other patches I
> > receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
> > mostly makes them apply cleanly for me.
> 
> Oh, I'm sorry. Should have been on my side. I hope it's better with the 
> current
> diff?

Yes. No problem :)

[See attachment for dmesg log.]

Sami
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-09 Thread Sami Liedes
On Mon, Jul 09, 2012 at 11:05:47AM +0200, Arne Jansen wrote:
> > * Just before the crash:
> >   btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
> > (32768). eb start is 2261163409408, level 100, generation 
> > 4412718571037421157, nritems 538968254. len param 17. debug 
> > 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0
> > 
> 
> At a first glance: the generation converted to ascii is: "ent() ==",
> so someone is patching the memory with ascii text, possibly C source.
> It might be interesting to dump the full contents of the eb, to get
> a clue on the source of the data.

I changed the code to dump the contents of the eb struct at the point
where that error (btrfs: invalid parameters...) is printed, at the
"checksum mismatch 4" site and at the "node seems invalid now" site.
Now I have a big log of 1795 corrupted ebs. So far nothing that looks
remotely like ascii text, though. But I have two different versions of
the eb that caused that warning, a less corrupted one and a more
corrupted one:


btrfs: --- start eb contents at 8801b13cc4c8 ---
btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00  
...f
btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff   
...0...
btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00  

btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc558: 58 c5 3c b1 01 88 ff ff 58 c5 3c b1 01 88 ff ff  
X.<.X.<.
btrfs: 8801b13cc568: 00 00 00 00 00 00 00 00 70 c5 3c b1 01 88 ff ff  
p.<.
btrfs: 8801b13cc578: 70 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00  
p.<.
btrfs: 8801b13cc588: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc598: 80 5f 9a 06 00 ea ff ff 00 86 9b 06 00 ea ff ff  
._..
btrfs: 8801b13cc5a8: 40 4c 9a 06 00 ea ff ff 80 66 9a 06 00 ea ff ff  
@L...f..
btrfs: 8801b13cc5b8: 80 eb 9b 06 00 ea ff ff 40 05 a2 06 00 ea ff ff  
@...
btrfs: 8801b13cc5c8: 40 e1 9b 06 00 ea ff ff 80 c4 9c 06 00 ea ff ff  
@...
btrfs: 8801b13cc5d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc5e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc5f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc608: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc618: 98 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00  
..<.
btrfs: 8801b13cc628: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc638: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc648: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc658: 00 00 00 00 00 00 00 00  

btrfs: --- end eb contents at 8801b13cc4c8 ---
btrfs: dm-6 checksum verify failed on 2239404212224 wanted B5F632BC found 
3579FB59 level 160
btrfs: node seems invalid now. checksum ok = 1
btrfs: --- start eb contents at 8801b13cc4c8 ---
[... identical dump to above ...]
btrfs: --- end eb contents at 8801b13cc4c8 ---
btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
(32768). eb start is 2239404212224, level 160, generation 4716553384049587249, 
nritems 295705211. len param 17. debug 
2/989/295705211/4716553384049587249/0x0/0/0x0/0x0
btrfs: --- start eb contents at 8801b13cc4c8 ---
btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00  
...f
btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff   
...0...
btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00  

btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00  

btrfs:

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-09 Thread Arne Jansen
Thanks for doing this, I'll start looking into it right away.
One thing though: it's probably not the contents of struct eb
that's being corrupted, but the data the eb points to. See for
example read_extent_buffer to see how to access it. Sorry for
being unclear on this.

Thanks,
Arne

On 10.07.2012 06:16, Sami Liedes wrote:
> On Mon, Jul 09, 2012 at 11:05:47AM +0200, Arne Jansen wrote:
>>> * Just before the crash:
>>>   btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
>>> (32768). eb start is 2261163409408, level 100, generation 
>>> 4412718571037421157, nritems 538968254. len param 17. debug 
>>> 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0
>>>
>>
>> At a first glance: the generation converted to ascii is: "ent() ==",
>> so someone is patching the memory with ascii text, possibly C source.
>> It might be interesting to dump the full contents of the eb, to get
>> a clue on the source of the data.
> 
> I changed the code to dump the contents of the eb struct at the point
> where that error (btrfs: invalid parameters...) is printed, at the
> "checksum mismatch 4" site and at the "node seems invalid now" site.
> Now I have a big log of 1795 corrupted ebs. So far nothing that looks
> remotely like ascii text, though. But I have two different versions of
> the eb that caused that warning, a less corrupted one and a more
> corrupted one:
> 
> 
> btrfs: --- start eb contents at 8801b13cc4c8 ---
> btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00  
> ...f
> btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff   
> ...0...
> btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc558: 58 c5 3c b1 01 88 ff ff 58 c5 3c b1 01 88 ff ff  
> X.<.X.<.
> btrfs: 8801b13cc568: 00 00 00 00 00 00 00 00 70 c5 3c b1 01 88 ff ff  
> p.<.
> btrfs: 8801b13cc578: 70 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00  
> p.<.
> btrfs: 8801b13cc588: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc598: 80 5f 9a 06 00 ea ff ff 00 86 9b 06 00 ea ff ff  
> ._..
> btrfs: 8801b13cc5a8: 40 4c 9a 06 00 ea ff ff 80 66 9a 06 00 ea ff ff  
> @L...f..
> btrfs: 8801b13cc5b8: 80 eb 9b 06 00 ea ff ff 40 05 a2 06 00 ea ff ff  
> @...
> btrfs: 8801b13cc5c8: 40 e1 9b 06 00 ea ff ff 80 c4 9c 06 00 ea ff ff  
> @...
> btrfs: 8801b13cc5d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc5e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc5f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc608: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc618: 98 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00  
> ..<.
> btrfs: 8801b13cc628: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc638: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc648: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc658: 00 00 00 00 00 00 00 00  
> 
> btrfs: --- end eb contents at 8801b13cc4c8 ---
> btrfs: dm-6 checksum verify failed on 2239404212224 wanted B5F632BC found 
> 3579FB59 level 160
> btrfs: node seems invalid now. checksum ok = 1
> btrfs: --- start eb contents at 8801b13cc4c8 ---
> [... identical dump to above ...]
> btrfs: --- end eb contents at 8801b13cc4c8 ---
> btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
> (32768). eb start is 2239404212224, level 160, generation 
> 4716553384049587249, nritems 295705211. len param 17. debug 
> 2/989/295705211/4716553384049587249/0x0/0/0x0/0x0
> btrfs: --- start eb contents at 8801b13cc4c8 ---
> btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00  
> ...f
> btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff   
> ...0...
> btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 0

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-16 Thread Arne Jansen
Any news on this? I you give me some hints, I can try to reproduce
it here.

-Arne

On 10.07.2012 08:57, Arne Jansen wrote:
> On 10.07.2012 06:16, Sami Liedes wrote:
>> On Mon, Jul 09, 2012 at 11:05:47AM +0200, Arne Jansen wrote:
 * Just before the crash:
   btrfs: invalid parameters for read_extent_buffer: start (32771) > 
 eb->len (32768). eb start is 2261163409408, level 100, generation 
 4412718571037421157, nritems 538968254. len param 17. debug 
 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0

>>>
>>> At a first glance: the generation converted to ascii is: "ent() ==",
>>> so someone is patching the memory with ascii text, possibly C source.
>>> It might be interesting to dump the full contents of the eb, to get
>>> a clue on the source of the data.
>>
>> I changed the code to dump the contents of the eb struct at the point
>> where that error (btrfs: invalid parameters...) is printed, at the
>> "checksum mismatch 4" site and at the "node seems invalid now" site.
>> Now I have a big log of 1795 corrupted ebs. So far nothing that looks
>> remotely like ascii text, though. But I have two different versions of
>> the eb that caused that warning, a less corrupted one and a more
>> corrupted one:
>>
>> 
>> btrfs: --- start eb contents at 8801b13cc4c8 ---
>> btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00  
>> ...f
>> btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff   
>> ...0...
>> btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc558: 58 c5 3c b1 01 88 ff ff 58 c5 3c b1 01 88 ff ff  
>> X.<.X.<.
>> btrfs: 8801b13cc568: 00 00 00 00 00 00 00 00 70 c5 3c b1 01 88 ff ff  
>> p.<.
>> btrfs: 8801b13cc578: 70 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00  
>> p.<.
>> btrfs: 8801b13cc588: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc598: 80 5f 9a 06 00 ea ff ff 00 86 9b 06 00 ea ff ff  
>> ._..
>> btrfs: 8801b13cc5a8: 40 4c 9a 06 00 ea ff ff 80 66 9a 06 00 ea ff ff  
>> @L...f..
>> btrfs: 8801b13cc5b8: 80 eb 9b 06 00 ea ff ff 40 05 a2 06 00 ea ff ff  
>> @...
>> btrfs: 8801b13cc5c8: 40 e1 9b 06 00 ea ff ff 80 c4 9c 06 00 ea ff ff  
>> @...
>> btrfs: 8801b13cc5d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc5e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc5f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc608: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc618: 98 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00  
>> ..<.
>> btrfs: 8801b13cc628: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc638: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc648: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc658: 00 00 00 00 00 00 00 00  
>> 
> 
> this one looks good so far.
> 
>> btrfs: --- end eb contents at 8801b13cc4c8 ---
>> btrfs: dm-6 checksum verify failed on 2239404212224 wanted B5F632BC found 
>> 3579FB59 level 160
>> btrfs: node seems invalid now. checksum ok = 1
>> btrfs: --- start eb contents at 8801b13cc4c8 ---
>> [... identical dump to above ...]
>> btrfs: --- end eb contents at 8801b13cc4c8 ---
>> btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
>> (32768). eb start is 2239404212224, level 160, generation 
>> 4716553384049587249, nritems 295705211. len param 17. debug 
>> 2/989/295705211/4716553384049587249/0x0/0/0x0/0x0
>> btrfs: --- start eb contents at 8801b13cc4c8 ---
>> btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00  
>> ...f
>> btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> 
>> btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff   
>> ...0...
>> btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00  
>> 

Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-16 Thread Sami Liedes
On Mon, Jul 16, 2012 at 10:20:28AM +0200, Arne Jansen wrote:
> Any news on this? I you give me some hints, I can try to reproduce
> it here.

I've been planning for a few days now to try if it's reproducible in a
virtual machine with the same filesystem images. However I haven't
gotten around to doing this yet... I tried to get KMEMCHECK working on
my computer, but failed (and generated a bug report). It seems to work
in KVM though.

So, currently my idea is to boot the machine with a live USB stick,
install kvm and make qemu qcow images backed by the real (2*1.1T)
devices, but writing changes to the qcow images (I dare not mess with
the actual devices, and don't happen to have quite 2.2T extra space
outside of them...), and try to run scrub there. If that succeeds and
the bug happens there too, debugging *should* be easier, and it
*should* be possible to run it under KMEMCHECK too. If the bug doesn't
happen inside a virtual machine, that would be interesting information
too.

Another idea might be to use LVM snapshots of the actual devices, but
recent messages have made be wary of that approach - plus it's
somewhat of a pain, because the snapshotting would have to be done
before mounting anyway, since I have two devices and I doubt LVM
supports snapshotting two separate devices at the exact same moment.

However, I'm a bit ill at the moment, so I assume it might be at least
a few days until I get to actually implementing this...

Sami


signature.asc
Description: Digital signature


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-28 Thread Sami Liedes
On Tue, Jul 17, 2012 at 12:29:33AM +0300, Sami Liedes wrote:
> So, currently my idea is to boot the machine with a live USB stick,
> install kvm and make qemu qcow images backed by the real (2*1.1T)
> devices, but writing changes to the qcow images (I dare not mess with
> the actual devices, and don't happen to have quite 2.2T extra space
> outside of them...), and try to run scrub there. If that succeeds and
> the bug happens there too, debugging *should* be easier, and it
> *should* be possible to run it under KMEMCHECK too. If the bug doesn't
> happen inside a virtual machine, that would be interesting information
> too.

I have now been able to reproduce the bug in KVM with the setup
described above.

I think it's safe to say now that the bug depends on some kind of
interaction between btrfs and dm-crypt. With the following setup, the
bug does NOT happen:

* kvm, single cpu

* sees 3 disks, /dev/vda=root, /dev/vdb=btrfs-dev1, /dev/vdc=btrfs-dev2

* The btrfs devices are essentially snapshots of the real btrfs
  devices in raid-1 configuration (2*1.1T). As the real devices are
  encrypted, the decryption is done outside the KVM, i.e. the KVM
  snapshots are backed by the decrypted devices.

With the following setup, the bug DOES happen:

* kvm, single cpu

* sees 3 disks, /dev/vda=root, /dev/vdb=part1, /dev/vdc=part2, where
  part[12] is are LUKS containers containing the individual btrfs
  devices

* inside kvm, they are opened using

cryptsetup luksOpen /dev/vdb root1
cryptsetup luksOpen /dev/vdc root2

* after this, the filesystem is mounted with

mount /dev/mapper/root1 /media -o 
device=/dev/mapper/root1,device=/dev/mapper/root2

* The devices are snapshots of the actual physical encrypted
  partitions containing the btrfs devices.

I have not yet figured out if this can be reproduced using a pristine,
smaller btrfs filesystem in raid-1 configuration inside KVM or if
there's something about my specific filesystem that causes this. I can
investigate that too; it's easier to do for me than the above testing,
as I don't need to have continuous physical access to the computer to
do that.

Here's the .config of the kernel I used inside KVM to reproduce this:

  http://www.niksula.hut.fi/~sliedes/btrfs/config.3.4.4

I also ran the same tests with KMEMCHECK. Both with and without
crypto, there were quite a number of (of course possibly false)
warnings from btrfs code. I doubt any of them are related to this bug
- there were no KMEMCHECK warnings during the scrub operation. Here
are the logs, anyway:

  http://www.niksula.hut.fi/~sliedes/btrfs/screenlog.nocrypto.gz
  http://www.niksula.hut.fi/~sliedes/btrfs/screenlog.crypto.gz

Sami


signature.asc
Description: Digital signature


Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-28 Thread Sami Liedes
On Sat, Jul 28, 2012 at 03:08:47PM +0300, Sami Liedes wrote:
> I have not yet figured out if this can be reproduced using a pristine,
> smaller btrfs filesystem in raid-1 configuration inside KVM or if
> there's something about my specific filesystem that causes this. I can
> investigate that too; it's easier to do for me than the above testing,
> as I don't need to have continuous physical access to the computer to
> do that.

It seems the bug doesn't happen with a new filesystem made with

   mkfs.btrfs -mraid1 -draid1 /dev/mapper/btrfs_crypt /dev/mapper/btrfs2_crypt

and subsequently filled with data... Unfortunate.

I started to wonder what's so special about the second device of my
filesystem. All the errors always seem to come from that device. The
only thing that comes to mind is that that device was not originally
part of the filesystem; it started as a single-device filesystem
formatted with mkfs.btrfs with default options and was subsequently
rebalanced under a 3.4.2 kernel.

So I started to play with btrfs fi balance in my KVM instance with my
two-device filesystem, and hit this oops, which may or may not be
related to my previous bug and/or dm-crypt...


[   15.195342] device fsid a844eb60-eb9c-4e24-ae91-c1627bf2d439 devid 1 transid 
176 /dev/mapper/btrfs_crypt
[   15.196606] device fsid a844eb60-eb9c-4e24-ae91-c1627bf2d439 devid 2 transid 
176 /dev/mapper/btrfs2_crypt
[   15.197895] device fsid a844eb60-eb9c-4e24-ae91-c1627bf2d439 devid 1 transid 
176 /dev/mapper/btrfs_crypt
[   15.200202] btrfs: disk space caching is enabled
# btrfs device delete [something...]
[ 1462.242456] btrfs: unable to go below two devices on raid1
# btrfs fi balance start -mconvert=dup -dconvert=raid0 /media
[ 1895.048075] btrfs: unable to start balance with target metadata profile 32
# btrfs fi balance start -dconvert=raid0 /media
[ 1917.106536] btrfs: relocating block group 10229907456 flags 17
[ 1929.188609] btrfs: relocating block group 8887730176 flags 17
[ 1944.690916] btrfs: found 2152 extents
[ 1947.016210] btrfs: found 2152 extents
[ 1947.421397] btrfs: relocating block group 7545552896 flags 17
[ 2024.225203] btrfs: found 36762 extents
[ 2094.983055] btrfs: corrupt node block=8830455808,root=1: generation (197) 
too new in slot 0 (maximum expected 196)
[ 2094.984858] [ cut here ]
[ 2094.986076] WARNING: at fs/btrfs/super.c:219 
__btrfs_abort_transaction+0xa5/0xc0()
[ 2094.987912] Hardware name: Bochs
[ 2094.988735] btrfs: Transaction abortedPid: 1623, comm: btrfs-transacti Not 
tainted 3.4.4 #9
[ 2094.988741] Call Trace:
[ 2094.989361]  [] warn_slowpath_common+0x75/0xb0
[ 2094.990829]  [] warn_slowpath_fmt+0x41/0x50
[ 2094.992243]  [] __btrfs_abort_transaction+0xa5/0xc0
[ 2094.993521]  [] __btrfs_free_extent+0x213/0x7b0
[ 2094.994176]  [] ? run_clustered_refs+0xd7/0xa10
[ 2094.994835]  [] ? do_raw_spin_unlock+0x5d/0xb0
[ 2094.995481]  [] run_clustered_refs+0x415/0xa10
[ 2094.996143]  [] ? find_ref_head+0xb8/0xe0
[ 2094.996806]  [] ? btrfs_run_delayed_refs+0x99/0x430
[ 2094.997505]  [] btrfs_run_delayed_refs+0x16d/0x430
[ 2094.998188]  [] ? next_block_group.isra.65+0x29/0x80
[ 2094.998890]  [] ? _raw_spin_unlock+0x26/0x30
[ 2094.999520]  [] btrfs_write_dirty_block_groups+0xb0/0x630
[ 2095.000281]  [] ? free_extent_buffer+0x1a/0x70
[ 2095.000931]  [] commit_cowonly_roots+0xe7/0x1c1
[ 2095.001610]  [] btrfs_commit_transaction+0x519/0xa40
[ 2095.002317]  [] ? abort_exclusive_wait+0xb0/0xb0
[ 2095.002997]  [] transaction_kthread+0x245/0x2c0
[ 2095.003673]  [] ? check_leaf.isra.105+0x300/0x300
[ 2095.004372]  [] kthread+0x8e/0xa0
[ 2095.004902]  [] kernel_thread_helper+0x4/0x10
[ 2095.005549]  [] ? kthread_flush_work_fn+0x10/0x10
[ 2095.006222]  [] ? gs_change+0x13/0x13
[ 2095.006787] ---[ end trace 8341f112debcf176 ]---
[ 2095.007287] BTRFS error (device dm-1) in __btrfs_free_extent:5134: IO failure
[ 2095.008059] btrfs is forced readonly
[ 2095.008454] btrfs: run_one_delayed_ref returned -5
[ 2095.008455] BTRFS error (device dm-1) in btrfs_run_delayed_refs:2454: IO 
failure
[ 2176.876382] [ cut here ]
[ 2176.877217] kernel BUG at fs/btrfs/relocation.c:3733!
[ 2176.878139] invalid opcode:  [#1] SMP 
[ 2176.879049] CPU 5 
[ 2176.879435] Pid: 1676, comm: btrfs Tainted: GW3.4.4 #9 Bochs 
Bochs
[ 2176.880383] RIP: 0010:[]  [] 
relocate_block_group+0x643/0x690
[ 2176.880383] RSP: :880002bfdb08  EFLAGS: 00010206
[ 2176.880383] RAX: ffe2 RBX: ffe2 RCX: 880007aade10
[ 2176.880383] RDX:  RSI:  RDI: 880006fbe800
[ 2176.880383] RBP: 880002bfdb88 R08: 0003 R09: 
[ 2176.880383] R10: 000100072905 R11: 0001 R12: 8800037a8020
[ 2176.880383] R13: 5ee7 R14: 880006bbae10 R15: 8800037a8000
[ 2176.880383] FS:  () GS:880007d4(0063) 
knlGS:0

Re: *** GMX Spamverdacht *** Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-09 Thread Arne Jansen
On 07.07.2012 01:44, Sami Liedes wrote:
> [Retry: I think this mail didn't make it to the list, probably because
> of the 73 kilobyte attached log. Here's a URL to the file:]
> 
>http://www.niksula.hut.fi/~sliedes/btrfs-scrub-debug.log.gz
> 
>   Sami
> 
> 
> 
> On Fri, Jul 06, 2012 at 05:09:00PM +0200, Jan Schmidt wrote:
>> Oh I see. root->node can be NULL during mount. Please add this on top:
> 
> Ok. So, ran it with DEBUG_PAGEALLOC and slub debugging on. This time
> it took half an hour to crash, and there's _lots_ of checksum mismatch
> [234] messages even before the crash. gzipped dmesg attached.
> 
> At 781 seconds there's an "irq 17: nobody cared". That's a known bug
> with this (and other Asus) motherboards and happens every now and
> then. I doubt it has anything to do with this.
> 
> I think I might try running it overnight with KMEMCHECK to see if it
> reports something. But for now, what there's in the log:
> 
> * lots of checksum mismatch [234], no 1s
> 
> * a fair number of "csum_tree_block: [0-9]+ callbacks suppressed"
>   lines
> 
> * two "btrfs: node seems invalid now. checksum ok = 1" messages, one
>   at 1499 seconds and another just before the crash at 1973
> 
> * Just before the crash:
>   btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
> (32768). eb start is 2261163409408, level 100, generation 
> 4412718571037421157, nritems 538968254. len param 17. debug 
> 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0
> 

At a first glance: the generation converted to ascii is: "ent() ==",
so someone is patching the memory with ascii text, possibly C source.
It might be interesting to dump the full contents of the eb, to get
a clue on the source of the data.


> * the oopses
> 
>>> By the way, something seems to be untabifying your patches. I don't
>>> know if it's on my side or yours, but at least some other patches I
>>> receive via linux-btrfs contain tabs. Doing a M-x tabify in emacs
>>> mostly makes them apply cleanly for me.
>>
>> Oh, I'm sorry. Should have been on my side. I hope it's better with the 
>> current
>> diff?
> 
> Yes. No problem :)
> 
> [See attachment for dmesg log.]
> 
>   Sami
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: *** GMX Spamverdacht *** Re: btrfs GPF in read_extent_buffer() while scrubbing with kernel 3.4.2

2012-07-09 Thread Arne Jansen
On 10.07.2012 06:16, Sami Liedes wrote:
> On Mon, Jul 09, 2012 at 11:05:47AM +0200, Arne Jansen wrote:
>>> * Just before the crash:
>>>   btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
>>> (32768). eb start is 2261163409408, level 100, generation 
>>> 4412718571037421157, nritems 538968254. len param 17. debug 
>>> 2/989/538968254/4412718571037421157/0x0/0/0x0/0x0
>>>
>>
>> At a first glance: the generation converted to ascii is: "ent() ==",
>> so someone is patching the memory with ascii text, possibly C source.
>> It might be interesting to dump the full contents of the eb, to get
>> a clue on the source of the data.
> 
> I changed the code to dump the contents of the eb struct at the point
> where that error (btrfs: invalid parameters...) is printed, at the
> "checksum mismatch 4" site and at the "node seems invalid now" site.
> Now I have a big log of 1795 corrupted ebs. So far nothing that looks
> remotely like ascii text, though. But I have two different versions of
> the eb that caused that warning, a less corrupted one and a more
> corrupted one:
> 
> 
> btrfs: --- start eb contents at 8801b13cc4c8 ---
> btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00  
> ...f
> btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff   
> ...0...
> btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc528: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc538: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc548: 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc558: 58 c5 3c b1 01 88 ff ff 58 c5 3c b1 01 88 ff ff  
> X.<.X.<.
> btrfs: 8801b13cc568: 00 00 00 00 00 00 00 00 70 c5 3c b1 01 88 ff ff  
> p.<.
> btrfs: 8801b13cc578: 70 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00  
> p.<.
> btrfs: 8801b13cc588: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc598: 80 5f 9a 06 00 ea ff ff 00 86 9b 06 00 ea ff ff  
> ._..
> btrfs: 8801b13cc5a8: 40 4c 9a 06 00 ea ff ff 80 66 9a 06 00 ea ff ff  
> @L...f..
> btrfs: 8801b13cc5b8: 80 eb 9b 06 00 ea ff ff 40 05 a2 06 00 ea ff ff  
> @...
> btrfs: 8801b13cc5c8: 40 e1 9b 06 00 ea ff ff 80 c4 9c 06 00 ea ff ff  
> @...
> btrfs: 8801b13cc5d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc5e8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc5f8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc608: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc618: 98 c5 3c b1 01 88 ff ff 00 00 00 00 00 00 00 00  
> ..<.
> btrfs: 8801b13cc628: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc638: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc648: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc658: 00 00 00 00 00 00 00 00  
> 

this one looks good so far.

> btrfs: --- end eb contents at 8801b13cc4c8 ---
> btrfs: dm-6 checksum verify failed on 2239404212224 wanted B5F632BC found 
> 3579FB59 level 160
> btrfs: node seems invalid now. checksum ok = 1
> btrfs: --- start eb contents at 8801b13cc4c8 ---
> [... identical dump to above ...]
> btrfs: --- end eb contents at 8801b13cc4c8 ---
> btrfs: invalid parameters for read_extent_buffer: start (32771) > eb->len 
> (32768). eb start is 2239404212224, level 160, generation 
> 4716553384049587249, nritems 295705211. len param 17. debug 
> 2/989/295705211/4716553384049587249/0x0/0/0x0/0x0
> btrfs: --- start eb contents at 8801b13cc4c8 ---
> btrfs: 8801b13cc4c8: 00 80 e4 66 09 02 00 00 00 80 00 00 00 00 00 00  
> ...f
> btrfs: 8801b13cc4d8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc4e8: 20 00 00 00 00 00 00 00 30 00 d7 10 02 88 ff ff   
> ...0...
> btrfs: 8801b13cc4f8: 02 02 00 00 03 00 00 00 06 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc508: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc518: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 
> btrfs: 8801b13cc528: 00 00