Hi,

On Fri, Oct 26, 2012 at 08:35:43AM +0200, Jan Schmidt wrote:
> while running extensive qgroup and tree mod log tests I got to the following
> BUG, which is probably not related to tree mod log:

I've left the fsx DIO tester overnight, after Josef identified the fix for the
csum mismatch bug, and found it crashed at the same point as Jan reported:

[80582.012720] ------------[ cut here ]------------
[80582.013045] kernel BUG at fs/btrfs/ctree.c:2945!
[80582.013045] invalid opcode: 0000 [#1] PREEMPT SMP
io_blk virtio_pci sr_mod 8139cp parport microcode virtio_ring floppy virtio sg 
pcspkr button i2c_piix4 cdrom processor thermal_sys scsi_dh_alua scsi_dh_emc 
scsi_dh_rdac scsi_dh_hp_sw scsi_dh ata_ge
neric ata_piix [last unloaded: btrfs]
[80582.013045] CPU 16
[80582.013045] Pid: 13045, comm: btrfs-endio-wri Tainted: G        W    
3.8.0-rc4-desktop+ #54 Bochs Bochs
[80582.013045] RIP: 0010:[<ffffffffa0393a25>]  [<ffffffffa0393a25>] 
btrfs_set_item_key_safe+0x65/0x1e0 [btrfs]
[80582.013045] RSP: 0018:ffff880adb9d9ad8  EFLAGS: 00010206
[80582.013045] RAX: 000000000000006c RBX: ffff880adb9d9c18 RCX: 000000000000012e
[80582.013045] RDX: 0000000000016000 RSI: ffff880adeeee076 RDI: ffff880adb9d9af9
[80582.013045] RBP: ffff880adb9d9b38 R08: ffff880ae4cd9dc0 R09: 0000000000000000
[80582.013045] R10: ffff880bee7c0f40 R11: 0000000000000015 R12: 0000000000000001
[80582.013045] R13: ffff880add976c78 R14: ffff880adb9d9ae8 R15: ffff880be92301c0
[80582.013045] FS:  0000000000000000(0000) GS:ffff880c1f200000(0000) 
knlGS:0000000000000000
[80582.013045] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[80582.013045] CR2: ffffffffff600400 CR3: 0000000be6cb8000 CR4: 00000000000006e0
[80582.013045] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[80582.013045] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[80582.013045] Process btrfs-endio-wri (pid: 13045, threadinfo 
ffff880adb9d8000, task ffff880b301545c0)
[80582.013045] Stack:
[80582.013045]  ffff880bec37a000 ffff880b932a5170 000000000000012e 
000000000160006c
[80582.013045]  ffff880adb9d9b00 ffffffffa03a1ad0 000000000000012e 
0000000000010000
[80582.013045]  ffff880add976c78 ffff880be92301c0 0000000000016000 
0000000000000f96
[80582.013045] Call Trace:
[80582.013045]  [<ffffffffa03a1ad0>] ? btrfs_inc_extent_ref+0x90/0xb0 [btrfs]
[80582.013045]  [<ffffffffa03cdc50>] __btrfs_drop_extents+0x530/0xb70 [btrfs]
[80582.013045]  [<ffffffffa03cecab>] btrfs_drop_extents+0x6b/0x90 [btrfs]
[80582.013045]  [<ffffffffa03bce1a>] insert_reserved_file_extent+0x8a/0x2a0 
[btrfs]
[80582.013045]  [<ffffffffa03c0c6e>] btrfs_finish_ordered_io+0x39e/0x3e0 [btrfs]
[80582.013045]  [<ffffffffa03c0cc0>] finish_ordered_fn+0x10/0x20 [btrfs]
[80582.013045]  [<ffffffffa03e7205>] worker_loop+0x165/0x4c0 [btrfs]
[80582.013045]  [<ffffffffa03e70a0>] ? check_pending_worker_creates+0xe0/0xe0 
[btrfs]
[80582.013045]  [<ffffffffa03e70a0>] ? check_pending_worker_creates+0xe0/0xe0 
[btrfs]
[80582.013045]  [<ffffffff8106d026>] kthread+0xc6/0xd0
[80582.013045]  [<ffffffff8106cf60>] ? kthread_freezable_should_stop+0x70/0x70
[80582.013045]  [<ffffffff815f7a7c>] ret_from_fork+0x7c/0xb0
[80582.013045]  [<ffffffff8106cf60>] ? kthread_freezable_should_stop+0x70/0x70
[80582.013045] Code: 00 00 00 4c 89 ef 48 63 d2 4c 89 f6 48 8d 14 92 48 8d 54 
92 65 e8 6c 0c 04 00 48 8b 0b 48 39 4d b0 48 8b 55 b9 0f b6 45 b8 76 0b <0f> 0b 
eb fe 0f 1f 80 00 00 00 00 72 1e 3a 43
08 77 ee 66 0f 1f
[80582.013045] RIP  [<ffffffffa0393a25>] btrfs_set_item_key_safe+0x65/0x1e0 
[btrfs]
[80582.013045]  RSP <ffff880adb9d9ad8>
[80582.071249] ---[ end trace 5360a95c1506db85 ]---

The lines are shifted, but it's the same code:

2933 void btrfs_set_item_key_safe(struct btrfs_trans_handle *trans,
2934                              struct btrfs_root *root, struct btrfs_path 
*path,
2935                              struct btrfs_key *new_key)
2936 {
2937         struct btrfs_disk_key disk_key;
2938         struct extent_buffer *eb;
2939         int slot;
2940
2941         eb = path->nodes[0];
2942         slot = path->slots[0];
2943         if (slot > 0) {
2944                 btrfs_item_key(eb, &disk_key, slot - 1);
2945                 BUG_ON(comp_keys(&disk_key, new_key) >= 0);
^^^^
2946         }
2947         if (slot < btrfs_header_nritems(eb) - 1) {
2948                 btrfs_item_key(eb, &disk_key, slot + 1);
2949                 BUG_ON(comp_keys(&disk_key, new_key) <= 0);
2950         }
2951
2952         btrfs_cpu_key_to_disk(&disk_key, new_key);
2953         btrfs_set_item_key(eb, &disk_key, slot);
2954         btrfs_mark_buffer_dirty(eb);
2955         if (slot == 0)
2956                 fixup_low_keys(trans, root, path, &disk_key, 1);
2957 }

It was unexpected so I haven't put any debugging into config or code.
According to the timestamps, the test has been running for 16h 58m.

Test setup:
* kvm virtual machine with 48 cpus
* 48G memory (single bit ECC)
* 5x 136GB 15K SAS disks (virtio,cache=none)

filesystem was freshly created before the test as data/raid0 and
meta/radi0 and then converted via balance to data/single meta/single
(that was one of the reproducers of the csum problem).

balance log:

[19441.016374] btrfs: disk space caching is enabled
[19452.900668] btrfs: relocating block group 4332584960 flags 9
[19455.935676] btrfs: found 4 extents
[19456.247759] btrfs: found 4 extents
[19456.471688] btrfs: relocating block group 37683200 flags 12
[19459.345387] btrfs: found 16 extents
[19459.688217] btrfs: relocating block group 20971520 flags 10
[19460.194694] btrfs: found 1 extents
[19460.465852] btrfs: relocating block group 12582912 flags 1
[19463.829548] btrfs: found 1 extents
[19464.123115] btrfs: found 1 extents
[19464.512113] btrfs: relocating block group 4194304 flags 4
[19475.407355] btrfs: found 1 extents

# btrfs fi df /mnt/btrfs/
Data: total=3.00GiB, used=11.29MiB
System, RAID1: total=32.00MiB, used=4.00KiB
System: total=4.00MiB, used=0.00
Metadata, RAID1: total=2.00GiB, used=336.00KiB

# btrfs fi show
Label: none  uuid: a15e1f6c-0e19-420e-93e2-82b4aa072a47
        Total devices 5 FS bytes used 10.41MiB
        devid    1 size 136.70GiB used 1.04GiB path /dev/vda
        devid    2 size 136.70GiB used 2.00GiB path /dev/vdb
        devid    3 size 136.70GiB used 1.00GiB path /dev/vdc
        devid    4 size 136.70GiB used 1.03GiB path /dev/vdd
        devid    5 size 136.70GiB used 2.00GiB path /dev/vde

the sources used were cmason/for-linus 
(0d90060f330ef49a60f56df2d9439ba263c979d9)
plus https://patchwork.kernel.org/patch/2020661/ (Btrfs: put csums on the right 
ordered extent)
plus https://patchwork.kernel.org/patch/1987481/ (Btrfs: fix wrong max device 
number for single profile)

btrfs-debug-tree dumped the filesystem without problems, so it's an in-memory
issue.


david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to