In 2012, I set up a CentOS 6.x machine with a btrfs file system on top
of DRBD. We did some testing before going into production and it seemed
fine, and it has worked fine for a long time. However, we are now
encountering problems, and I was wondering if I could get some help.

[root@ysmha01 tmp]# btrfs fi show
Label: none  uuid: 7a38f3ab-f3b0-4b3d-81c0-28b347b26da1
        Total devices 1 FS bytes used 5.79TB
        devid    1 size 18.19TB used 8.94TB path /dev/drbd0

Btrfs Btrfs v0.20-rc1

While still running the official CentOS
kernel-2.6.32-504.12.2.el6.x86_64, the machine started crashing with a
kernel oops. After that, I tried a few different 2.6.32
kernels with the same result. Yesterday I switched to the elrepo
kernel-lt 3.10.75-1.el6.elrepo.x86_64 and was able to get the
machine up and running. I found some warning messages which led me to
believe things were not too bad after all:

Apr 21 17:28:01 ysmha01 kernel: BTRFS warning (device drbd0): block
group 578776203264 has wrong amount of free space
Apr 21 17:28:01 ysmha01 kernel: BTRFS warning (device drbd0): failed
to load free space cache for block group 578776203264, rebuild it now
Apr 21 17:28:02 ysmha01 kernel: BTRFS warning (device drbd0): block
group 622799618048 has wrong amount of free space
Apr 21 17:28:02 ysmha01 kernel: BTRFS warning (device drbd0): failed
to load free space cache for block group 622799618048, rebuild it now
Apr 21 17:30:32 ysmha01 kernel: perf samples too long (2573 > 2500),
lowering kernel.perf_event_max_sample_rate to 50000
Apr 21 17:54:56 ysmha01 kernel: BTRFS warning (device drbd0): block
group 7255336419328 has wrong amount of free space
Apr 21 17:54:56 ysmha01 kernel: BTRFS warning (device drbd0): failed
to load free space cache for block group 7255336419328, rebuild it now
Apr 21 17:54:56 ysmha01 kernel: BTRFS warning (device drbd0): block
group 7256410161152 has wrong amount of free space
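
To see how many block groups are affected, the warnings above can be summarized with a small filter. This is only a sketch: the function name is made up for illustration, and it assumes the log is read from the standard CentOS syslog file (/var/log/messages) with one message per line, unlike the lines quoted here, which were wrapped by the mail client.

```shell
# Hypothetical helper: list each block group that logged a
# "wrong amount of free space" warning, followed by the number of
# times the warning appeared. Reads syslog-style text on stdin and
# assumes one message per line.
wrong_free_space_groups() {
    grep -o 'block group [0-9]* has wrong amount of free space' |
        awk '{ print $3 }' |
        sort -n | uniq -c |
        awk '{ print $2, $1 }'
}

# Example use (path is an assumption about the syslog setup):
#   wrong_free_space_groups < /var/log/messages
```

Each output line is a block group number and a count, so a short list suggests the stale free space cache is limited to a few block groups.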

The machine was then left up, serving Samba shares, until it
had another kernel oops this morning.

Apr 22 11:18:38 ysmha01 kernel: gb3i libcxgbi cxgb3 mdio libiscsi_tcp
qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi
Apr 22 11:18:38 ysmha01 kernel: CPU: 8 PID: 17465 Comm:
btrfs-endio-wri Not tainted 3.10.75-1.el6.elrepo.x86_64 #1
Apr 22 11:18:38 ysmha01 kernel: Hardware name: Supermicro
X9DR3-F/X9DR3-F, BIOS 1.0c 06/29/2012
Apr 22 11:18:38 ysmha01 kernel: task: ffff880467986e20 ti:
ffff8807d8036000 task.ti: ffff8807d8036000
Apr 22 11:18:38 ysmha01 kernel: RIP: 0010:[<ffffffffa05c3de2>]
[<ffffffffa05c3de2>] __btrfs_drop_extents+0xb52/0xb90 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: RSP: 0018:ffff8807d8037b38  EFLAGS: 00010297
Apr 22 11:18:38 ysmha01 kernel: RAX: 0000000000000007 RBX:
ffff88070a0b36d0 RCX: 000000003f9e1000
Apr 22 11:18:38 ysmha01 kernel: RDX: ffff8803d2bb2001 RSI:
0000000000000e50 RDI: 00000000ffffffff
Apr 22 11:18:38 ysmha01 kernel: RBP: ffff8807d8037c58 R08:
0000000000000000 R09: ffff8807d8037ae0
Apr 22 11:18:38 ysmha01 kernel: R10: 0000000000000000 R11:
0000000000000000 R12: ffff88006304ae20
Apr 22 11:18:38 ysmha01 kernel: R13: 000000003f9e2000 R14:
000000003f9e1000 R15: 0000000000000001
Apr 22 11:18:38 ysmha01 kernel: FS:  0000000000000000(0000)
GS:ffff88087fc40000(0000) knlGS:0000000000000000
Apr 22 11:18:38 ysmha01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Apr 22 11:18:38 ysmha01 kernel: CR2: 0000000000000030 CR3:
0000000001c0c000 CR4: 00000000000427e0
Apr 22 11:18:38 ysmha01 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Apr 22 11:18:38 ysmha01 kernel: DR3: 0000000000000000 DR6:
00000000ffff0ff0 DR7: 0000000000000400
Apr 22 11:18:38 ysmha01 kernel: Stack:
Apr 22 11:18:38 ysmha01 kernel: 0000000000648b05 000000003e800000
ffff880600000000 ffff8805e77a5000
Apr 22 11:18:38 ysmha01 kernel: 00000000012e7000 000000058113a3bb
0000000101037bd8 ffff8800391d0780
Apr 22 11:18:38 ysmha01 kernel: 000006c960852000 00000000011e1000
ffff8805c44acef0 ffffffff00000001
Apr 22 11:18:38 ysmha01 kernel: Call Trace:
Apr 22 11:18:38 ysmha01 kernel: [<ffffffff81182405>] ?
kmem_cache_alloc+0x275/0x280
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05c4923>]
btrfs_drop_extents+0x73/0xa0 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05b5b3c>]
insert_reserved_file_extent.clone.0+0x7c/0x290 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05b111b>] ?
start_transaction+0xab/0x4d0 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05cfc32>] ?
test_range_bit+0x32/0x170 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05ba5a2>]
btrfs_finish_ordered_io+0x3e2/0x4a0 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffff8106b200>] ? usleep_range+0x20/0x50
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05ba675>]
finish_ordered_fn+0x15/0x20 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05dd85c>]
worker_loop+0x15c/0x4b0 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05dd700>] ?
check_pending_worker_creates+0xe0/0xe0 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffffa05dd700>] ?
check_pending_worker_creates+0xe0/0xe0 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: [<ffffffff810821ce>] kthread+0xce/0xe0
Apr 22 11:18:38 ysmha01 kernel: [<ffffffff81082100>] ?
kthread_freezable_should_stop+0x70/0x70
Apr 22 11:18:38 ysmha01 kernel: [<ffffffff815f9448>] ret_from_fork+0x58/0x90
Apr 22 11:18:38 ysmha01 kernel: [<ffffffff81082100>] ?
kthread_freezable_should_stop+0x70/0x70
Apr 22 11:18:38 ysmha01 kernel: Code: 10 21 62 a0 e8 f0 00 fc ff c7 85
38 ff ff ff 01 00 00 00 e9 dc fa ff ff 0f 0b eb fe 0f 0b eb fe 0f 0b
0f 1f 80 00 00 00 00 eb f7 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00
eb f6 0f 0b eb fe 0f
Apr 22 11:18:38 ysmha01 kernel: RIP  [<ffffffffa05c3de2>]
__btrfs_drop_extents+0xb52/0xb90 [btrfs]
Apr 22 11:18:38 ysmha01 kernel: RSP <ffff8807d8037b38>
Apr 22 11:18:38 ysmha01 kernel: ---[ end trace e7607252d1383d86 ]---

At this point the machine was rebooted, and I mounted the file system
with the clear_cache mount option. Later the machine crashed again:

Apr 22 13:41:19 ysmha01 kernel: block drbd0: role( Secondary -> Primary )
Apr 22 13:42:08 ysmha01 kernel: device fsid
7a38f3ab-f3b0-4b3d-81c0-28b347b26da1 devid 1 transid 1699374
/dev/drbd0
Apr 22 13:42:08 ysmha01 kernel: btrfs: force clearing of disk cache
Apr 22 13:42:08 ysmha01 kernel: btrfs: disk space caching is enabled
Apr 22 13:42:45 ysmha01 kernel: SELinux: initialized (dev drbd0, type
btrfs), uses xattr
Apr 22 13:47:39 ysmha01 kernel: INFO: task btrfs-transacti:3175
blocked for more than 120 seconds.
Apr 22 13:47:39 ysmha01 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 13:47:39 ysmha01 kernel: btrfs-transacti D ffffffff81810d00
0  3175      2 0x00000080
Apr 22 13:47:39 ysmha01 kernel: ffff880303301ce8 0000000000000046
ffff880303301fd8 000000000001314
Apr 22 13:47:39 ysmha01 kernel: ffff880303300010 0000000000013140
0000000000013140 0000000000013140
Apr 22 13:47:39 ysmha01 kernel: ffff880303301fd8 0000000000013140
ffff88046988a340 ffff88046bb69360
Apr 22 13:47:39 ysmha01 kernel: Call Trace:
Apr 22 13:47:39 ysmha01 kernel: [<ffffffff815eeed9>] schedule+0x29/0x70
Apr 22 13:47:39 ysmha01 kernel: [<ffffffff815ed085>]
schedule_timeout+0x195/0x220
Apr 22 13:47:39 ysmha01 kernel: [<ffffffff81082c90>] ? prepare_to_wait+0x60/0x90
Apr 22 13:47:39 ysmha01 kernel: [<ffffffffa05b0457>]
btrfs_commit_transaction+0x1f7/0xa40 [btrfs]
Apr 22 13:47:39 ysmha01 kernel: [<ffffffffa05b111b>] ?
start_transaction+0xab/0x4d0 [btrfs]
Apr 22 13:47:39 ysmha01 kernel: [<ffffffff810829e0>] ? wake_up_bit+0x40/0x40
Apr 22 13:47:39 ysmha01 kernel: [<ffffffffa05aac96>]
transaction_kthread+0x1a6/0x220 [btrfs]
Apr 22 13:47:39 ysmha01 kernel: [<ffffffffa05aaaf0>] ?
btree_readpage_end_io_hook+0x2c0/0x2c0 [btrfs]
Apr 22 13:47:39 ysmha01 kernel: [<ffffffffa05aaaf0>] ?
btree_readpage_end_io_hook+0x2c0/0x2c0 [btrfs]
Apr 22 13:47:39 ysmha01 kernel: [<ffffffff810821ce>] kthread+0xce/0xe0
Apr 22 13:47:39 ysmha01 kernel: [<ffffffff81082100>] ?
kthread_freezable_should_stop+0x70/0x70
Apr 22 13:47:39 ysmha01 kernel: [<ffffffff815f9448>] ret_from_fork+0x58/0x90
Apr 22 13:47:39 ysmha01 kernel: [<ffffffff81082100>] ?
kthread_freezable_should_stop+0x70/0x70
Apr 22 13:49:39 ysmha01 kernel: INFO: task btrfs-transacti:3175
blocked for more than 120 seconds.
Apr 22 13:49:39 ysmha01 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 13:49:39 ysmha01 kernel: btrfs-transacti D ffffffff81810d00
0  3175      2 0x00000080
Apr 22 13:49:39 ysmha01 kernel: ffff880303301ce8 0000000000000046
ffff880303301fd8 0000000000013140
Apr 22 13:49:39 ysmha01 kernel: ffff880303300010 0000000000013140
0000000000013140 0000000000013140
Apr 22 13:49:39 ysmha01 kernel: ffff880303301fd8 0000000000013140
ffff88046988a340 ffff88046bb69360
........snip......
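
Since the hung-task report repeats every two minutes, a per-task count may be more useful than the full trace. The following is a rough sketch with a made-up function name; it again assumes the unwrapped syslog file with one message per line.

```shell
# Hypothetical helper: count "blocked for more than N seconds" reports
# per task (name:pid). Reads syslog-style text on stdin, one message
# per line.
hung_task_counts() {
    sed -n 's/.*INFO: task \([^ ]*\) blocked for more than [0-9]* seconds.*/\1/p' |
        sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Example use (path is an assumption about the syslog setup):
#   hung_task_counts < /var/log/messages
```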

When the oops happens, the mount point becomes unusable. What
would be the best path to recovery from here?

What other information may I provide?

Diego