On Jan 15, 2018, 04:17, Johannes Ernst wrote:
> On Jan 13, 2018, at 18:27, Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>> On Jan 14, 2018, 09:36, Johannes Ernst wrote:
>>> Summary: frequent “hangs” of the system, with dmesg saying:
>>>
>>> task systemd:22229 blocked for more than 120 seconds.
>>> [ 2948.928653] Tainted: G C O 4.14.9-1-ARCH #1
>>> [ 2948.928658] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> [ 2948.928665] systemd D 0 22229 22226 0x00000100
>>> [ 2948.928671] Call Trace:
>>> [ 2948.928684]  ? __schedule+0x290/0x890
>>> [ 2948.928690]  schedule+0x2f/0x90
>>> [ 2948.928744]  btrfs_tree_read_lock+0xb6/0x100 [btrfs]
>>> [ 2948.928752]  ? wait_woken+0x80/0x80
>>> [ 2948.928799]  find_parent_nodes+0x341/0xfd0 [btrfs]
>>> [ 2948.928827]  ? btrfs_search_slot+0x84c/0x9f0 [btrfs]
>>> [ 2948.928873]  ? btrfs_find_all_roots_safe+0x9e/0x110 [btrfs]
>>> [ 2948.928912]  btrfs_find_all_roots_safe+0x9e/0x110 [btrfs]
>>> [ 2948.928950]  btrfs_find_all_roots+0x45/0x60 [btrfs]
>>> [ 2948.928987]  btrfs_qgroup_trace_extent_post+0x30/0x60 [btrfs]
>>
>> You're using qgroup, and the timing is to find the old_roots of one
>> extent, which will only search commit roots.
>>
>> Normally this shouldn't cause any problem, especially for commit roots.
>>
>> Is there any special operation happening?
>
> Nope. It appears it happens right when systemd-nspawn begins to run,
> and I am not executing any other commands.
>
> I did not realize qgroups are involved … all I did was mkfs.btrfs and
> running systemd-nspawn :-) Perhaps the default should be qgroups off?
> (But I digress.)
>
>>> That works well … but not for long. Often we don’t make it through the
>>> test suite, and the starting of new containers hangs. Other filesystem
>>> operations also hang. The above stack trace, or something rather
>>> similar, shows up in dmesg (not in the journal, because that hangs,
>>> too!). This is repeated, but I don’t see any other relevant messages.
>>> Only a reboot seems to allow recovery.
>>
>> So qgroup is used to limit the disk usage of each container?
>>
>> Maybe it's related to subvolume deletion?
>>
>> Anyway, if qgroup is not necessary, disabling qgroup should fix your
>> problem.
>>
>> That aside, does it really hang?
>> Qgroup dramatically increases the overhead of deleting a subvolume or
>> balancing the fs.
>> Maybe it's just a little slow?
>
> I have waited for several hours and the system has not recovered (me
> walking away from the running tests, returning hours later).
>
> Update: so I executed "btrfs quota disable" on all relevant file
> systems (right?). Running tests again, this morning I’m getting:
So not a bug in quota, then. (At least I have one less thing to worry
about.)

> INFO: task systemd-journal:20876 blocked for more than 120 seconds.
> [ 5037.962603] Tainted: G C O 4.14.9-1-ARCH #1
> [ 5037.962609] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 5037.962616] systemd-journal D 0 20876 20860 0x00000100
> [ 5037.962622] Call Trace:
> [ 5037.962635]  ? __schedule+0x290/0x890
> [ 5037.962640]  ? __slab_free+0x14e/0x300
> [ 5037.962645]  ? _copy_to_iter+0x8f/0x3d0
> [ 5037.962651]  schedule+0x2f/0x90
> [ 5037.962704]  btrfs_tree_read_lock+0xb6/0x100 [btrfs]

Some tree write is still blocking all incoming reads.

BTW, did your tests include any send or relocation operations?
IIRC there is a bug where send and relocation (I'm not sure, though)
can cause a similar problem.

Thanks,
Qu

> [ 5037.962713]  ? wait_woken+0x80/0x80
> [ 5037.962739]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
> [ 5037.962767]  btrfs_search_slot+0x703/0x9f0 [btrfs]
> [ 5037.962796]  btrfs_insert_empty_items+0x66/0xb0 [btrfs]
> [ 5037.962841]  btrfs_insert_orphan_item+0x66/0xa0 [btrfs]
> [ 5037.962880]  btrfs_orphan_add+0xa1/0x200 [btrfs]
> [ 5037.962919]  btrfs_setattr+0x123/0x3b0 [btrfs]
> [ 5037.962926]  notify_change+0x2fd/0x420
> [ 5037.962933]  do_truncate+0x75/0xc0
> [ 5037.962940]  do_sys_ftruncate.constprop.19+0xe7/0x100
> [ 5037.962947]  do_syscall_64+0x55/0x110
> [ 5037.962952]  entry_SYSCALL64_slow_path+0x25/0x25
> [ 5037.962956] RIP: 0033:0x7fd9423697ba
> [ 5037.962958] RSP: 002b:00007fff1179cc18 EFLAGS: 00000206 ORIG_RAX: 000000000000004d
> [ 5037.962962] RAX: ffffffffffffffda RBX: 000055bd48cbe9f0 RCX: 00007fd9423697ba
> [ 5037.962965] RDX: 000055bd48cbe660 RSI: 0000000000640000 RDI: 000000000000000f
> [ 5037.962966] RBP: 00007fff1179cc50 R08: 000055bd48cbc62c R09: 000055bd48cbea6c
> [ 5037.962968] R10: 000055bd48cbe9f0 R11: 0000000000000206 R12: 00007fff1179cc48
> [ 5037.962970] R13: 0000000000000003 R14: 000562b7234e71a9 R15: 000055bd47749ca0
> and doing a simple “touch /build/tmp/foo” never returns. 10+ hours
> have passed since the previous command was issued.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
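For anyone following along: the workaround discussed above, disabling quotas per filesystem, can be scripted. A minimal sketch, assuming GNU coreutils `stat` and btrfs-progs are available; the mount point is only an example, not a path from the original report, and the actual quota change needs root:

```shell
# Sketch: disable btrfs quota accounting on one mount point, guarding
# against paths that are not btrfs at all.
disable_btrfs_quota() {
    mnt="$1"
    # Only act when the path really is on btrfs and the tool exists.
    if command -v btrfs >/dev/null 2>&1 &&
       [ "$(stat -f -c %T "$mnt" 2>/dev/null)" = "btrfs" ]; then
        btrfs qgroup show "$mnt" || true            # list active qgroups, if any
        btrfs quota disable "$mnt" ||
            echo "quota disable failed on $mnt (not root?)"
    else
        echo "skipping $mnt: not on btrfs (or btrfs-progs not installed)"
    fi
}

disable_btrfs_quota /var/lib/machines   # assumed example mount point
```

With quotas off, the `btrfs_qgroup_trace_extent_post` path in the first trace should not be entered at all, which is consistent with it disappearing from the second trace.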
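When the journal itself is one of the hung tasks, it can still help to pull just the hung-task header lines out of dmesg to see which processes are stuck, without wading through the full backtraces. A small filter sketch (the sample input lines are taken from the traces quoted above):

```shell
# Extract hung-task header lines from kernel-log text.
hung_tasks() {
    grep -oE 'task [[:alnum:]_.:-]+ blocked for more than [0-9]+ seconds'
}

# Sample input drawn from the traces quoted in this thread:
printf '%s\n' \
  'INFO: task systemd-journal:20876 blocked for more than 120 seconds.' \
  '[ 5037.962635]  ? __schedule+0x290/0x890' \
  'INFO: task systemd:22229 blocked for more than 120 seconds.' \
| hung_tasks
```

On a live system the input would simply be `dmesg | hung_tasks`.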