At 11/29/2016 04:21 PM, Chandan Rajendra wrote:
On Tuesday, November 29, 2016 03:55:53 PM Qu Wenruo wrote:
At 11/29/2016 02:36 PM, Chandan Rajendra wrote:
When executing btrfs/126 test on kdave/for-next branch on a ppc64 guest, I
noticed the following call trace.
[ 77.335887] ------------[ cut here ]------------
[ 77.336115] WARNING: CPU: 0 PID: 8325 at
/root/repos/linux/fs/btrfs/qgroup.c:2443 .btrfs_qgroup_free_refroot+0x188/0x220
[ 77.336303] Modules linked in:
[ 77.336393] CPU: 0 PID: 8325 Comm: umount Not tainted
4.9.0-rc5-00062-g6b74e43 #22
[ 77.336526] task: c00000062b8d4880 task.stack: c00000062b9a4000
[ 77.336638] NIP: c0000000005cf018 LR: c0000000005ceff8 CTR: c0000000005390c0
[ 77.336771] REGS: c00000062b9a7450 TRAP: 0700 Not tainted
(4.9.0-rc5-00062-g6b74e43)
[ 77.336908] MSR: 800000000282b032
[ 77.336960] <
[ 77.337027] SF
[ 77.337053] ,VEC
[ 77.337087] ,VSX
[ 77.337114] ,EE
[ 77.337146] ,FP
[ 77.337173] ,ME
[ 77.337207] ,IR
[ 77.337233] ,DR
[ 77.337267] ,RI
[ 77.337294] >
[ 77.337330] CR: 88000842 XER: 00000000
[ 77.337392] CFAR: c0000000005c9b5c
[ 77.337443] SOFTE: 1
[ 77.337477]
GPR00:
[ 77.337517] c0000000005cefcc
[ 77.337575] c00000062b9a76d0
[ 77.337626] c000000001103a00
[ 77.337714] c00000063e058d40
[ 77.337765]
GPR04:
[ 77.337817] c00000062b9a7740
[ 77.337868] 0000000000000008
[ 77.337927] 000000000005f000
[ 77.337978] 000000063eed0000
[ 77.338037]
GPR08:
[ 77.338080] c00000062f9629c8
[ 77.338138] 0000000000000001
[ 77.338191] fffffffffffff000
[ 77.338248] c00000063e058e80
[ 77.338300]
GPR12:
[ 77.338384] 0000000000000000
[ 77.338435] c00000000fe00000
[ 77.338498] 0000000020000000
[ 77.338548] 0000000000000000
[ 77.338605]
GPR16:
[ 77.338645] 0000000000000008
[ 77.338703] 000000004d5906fc
[ 77.338754] 000000004d5a6c08
[ 77.338811] 000000004d54b3d0
[ 77.338866]
GPR20:
[ 77.338921] 000001000bcf8440
[ 77.338972] 0000000000000000
[ 77.339030] 0000000000000000
[ 77.339080] c00000062523b078
[ 77.339138]
GPR24:
[ 77.339178] c00000062523b080
[ 77.339240] c000000000da2b58
[ 77.339290] 0000000000000000
[ 77.339347] c00000062e539600
[ 77.339398]
GPR28:
[ 77.339485] 0000000000060000
[ 77.339536] c00000062523b000
[ 77.339594] c00000062e539600
[ 77.339644] c00000062e539688
[ 77.339740] NIP [c0000000005cf018] .btrfs_qgroup_free_refroot+0x188/0x220
[ 77.339852] LR [c0000000005ceff8] .btrfs_qgroup_free_refroot+0x168/0x220
[ 77.339959] Call Trace:
[ 77.339998] [c00000062b9a76d0] [c0000000005cefcc]
.btrfs_qgroup_free_refroot+0x13c/0x220
[ 77.340123] (unreliable)
[ 77.340193] [c00000062b9a7790] [c00000000054210c]
.commit_fs_roots+0x19c/0x240
[ 77.340355] [c00000062b9a78a0] [c0000000005451a0]
.btrfs_commit_transaction.part.5+0x480/0xbe0
[ 77.340554] [c00000062b9a7970] [c000000000503bd4] .btrfs_sync_fs+0x74/0x1c0
[ 77.340725] [c00000062b9a7a10] [c0000000002d6ce0] .sync_filesystem+0xd0/0x100
[ 77.340891] [c00000062b9a7a90] [c000000000294f88]
.generic_shutdown_super+0x38/0x1a0
[ 77.341052] [c00000062b9a7b10] [c000000000295508] .kill_anon_super+0x18/0x30
[ 77.342243] [c00000062b9a7b90] [c000000000503db8] .btrfs_kill_super+0x18/0xd0
[ 77.342411] [c00000062b9a7c10] [c0000000002958c8]
.deactivate_locked_super+0x98/0xe0
[ 77.342573] [c00000062b9a7c90] [c0000000002be114] .cleanup_mnt+0x54/0xa0
[ 77.342723] [c00000062b9a7d10] [c0000000000d7e74] .task_work_run+0x124/0x180
[ 77.342862] [c00000062b9a7db0] [c00000000001c354] .do_notify_resume+0xa4/0xb0
[ 77.343030] [c00000062b9a7e30] [c00000000000c344]
.ret_from_except_lite+0x70/0x74
[ 77.343187] Instruction dump:
[ 77.343276] 3b400000 e87d0c78 38810070 4bffab1d 60000000 2c230000 4182ff40
e8dfffc8
[ 77.343522] eb630008 7d5c3010 7d294910 7d2900d0 <0b090000> 7fbc3040 41dd0070
f95fffc8
[ 77.343771] ---[ end trace 8130fb71377d4ff8 ]---
[ 77.343857] BTRFS warning (device loop1): qgroup 5 reserved space underflow,
have: 389120, to free: 393216
[ 77.735193] BTRFS info (device loop1): disabling disk space caching
[ 77.736629] BTRFS info (device loop1): has skinny extents
The above call trace happens because of the following:
|--------------------------------+-----------------------------------|
| Task A | Task B |
|--------------------------------+-----------------------------------|
| Write 4k bytes to a file | |
| at offset range [0, 4095] | |
| - __btrfs_buffered_write | |
| -- btrfs_check_data_free_space | |
| --- Set EXTENT_QGROUP_RESERVED | |
| bit for range [0, 64k-1] | |
| -- Copy 4k data from userspace | |
| to page cache | |
| | Write 4k bytes to the file |
| | at offset range [4096, 8191] |
| | - __btrfs_buffered_write |
| | -- btrfs_check_data_free_space |
| | --- Since EXTENT_QGROUP_RESERVED |
| | is already set we don't |
| | reserve qgroup space |
| | -- Assume the call to |
| | btrfs_delalloc_reserve_metadata() |
| | fails |
| | -- btrfs_free_reserved_data_space |
| | --- Clear EXTENT_QGROUP_RESERVED |
| | file range [4096, 8191] |
|--------------------------------+-----------------------------------|
Task B part seems a little strange.
For the lastest for-next branch, for-next-20161125, in
btrfs_delalloc_reserve_metadata(), it will roundup @start and @len.
I was using for-next-20161118. But the problem is still reproducible on
for-next-20161125. I guess, you meant to mention
btrfs_free_reserved_data_space() instead of
btrfs_delalloc_reserve_metadata().
My fault. And yes, I was meant to mention btrfs_free_reserved_data_space().
So for btrfs_qgroup_free_data(), we will free the range of [0, 64k-1],
just like what we reserved in Task A.
So with the new code, Instead of 4k, Task B frees 64k worth of qgroup space
that it had *not* reserved.
For 64K page size, we only support 64K sectorsize. (Without your subpage
size support patches)
And in that case, file extent will be 64K size, so we reserve 64K and
free 64K, which is correct.
The reservation was originally done by Task A. So
when Task A's file extent item is being inserted, it will end up releasing 64k
additional qgroup reservation.
You are right about the case.
If Tasl A's data isn't written into disk(buffered write), then Task B
doesn't need to reserve space, just reuse the space Task A reserved.
And in that case, you're completely right, in error path for Task B, the
free will be deadly.
Two fix seems to be available:
1) Stop buffered write for the reserved space of Task A
But since Task A succeeded in buffered write, it should assume the
data will reach disk, and we will break the assumption.
2) Use accurate free_reserved range in error path
I think it's possible to export the extent_changeset structure and
use them to record what range we reserved, so we can free accurate
amount of space.
But the change may be a little large. Several interface needs to be
changed.
If I misunderstood the problem again, please point it out in case I
missed something again.
BTW, does the 64K sectorsize of PPC64 makes metadata much easier to hit
ENOSPC?
Thanks again for your detailed report,
Qu
Executing "perf record -e btrfs:btrfs_qgroup_release_data -e
btrfs:btrfs_qgroup_reserve_data -a -g ./check btrfs/126" gives the following,
xfs_io 9536 [002] 454.601476: btrfs:btrfs_qgroup_reserve_data: (nil)U:
root=5, ino=257, start=0, len=65536, reserved=65536, op=reserve
xfs_io 9536 [002] 454.601522: btrfs:btrfs_qgroup_reserve_data: (nil)U:
root=5, ino=257, start=0, len=65536, reserved=0, op=reserve
xfs_io 9536 [002] 454.601530: btrfs:btrfs_qgroup_release_data: (nil)U:
root=5, ino=257, start=0, len=65536, reserved=65536, op=free
kthreadd 9171 [003] 454.704145: btrfs:btrfs_qgroup_release_data: (nil)U:
root=5, ino=257, start=0, len=65536, reserved=0, op=release
Please note that in the above perf ouput, Instead of two tasks, the same task
writes two 4k blocks of data.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html