Re: [f2fs-dev] [BUG] watchdog: BUG: soft lockup.. after heavy disk access
2019-05-01 0:35 GMT+02:00, Hagbard Celine:
> 2019-04-30 8:25 GMT+02:00, Chao Yu:
>> On 2019/4/29 21:39, Hagbard Celine wrote:
>>> 2019-04-29 9:36 GMT+02:00, Chao Yu:
>>>> Hi Hagbard,
>>>>
>>>> At a glance, may I ask what's your device size? Since I notice that
>>>> you used the -i option during mkfs, if the size is large enough,
>>>> checkpoint may crash nat/sit_bitmap online.
>>>>
>>>> mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
>>>
>>> Both partitions are 128GB.
>>
>> 128GB is safe. :)
>>
>>>> On 2019/4/29 15:25, Hagbard Celine wrote:
>>>>> First I must admit that I'm not 100% sure if this is an
>>>>> f2fs/NMI/something-else bug, but it only triggers after significant
>>>>> disk access.
>>>>>
>>>>> I've hit this bug several times since kernel 5.0.7 (have not tried
>>>>> earlier kernels) while compiling many packages in batch. But on those
>>>>> occasions it would hang all cores and lose the latest changes to the
>>>>> filesystem, so I could not get any logs.
>>>>> This time it triggered while I was in the process of moving two of my
>>>>> installations to new partitions and locked only one core, so I was
>>>>> able to get the log after rebooting.
>>>>> What I did to trigger it was make a new partition and run the commands:
>>>>> #mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
>>>>> #mount -o
>>>>> "rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
>>
>> Could you revert the below commit to check whether the issue is caused
>> by it?
>
> This did not seem to help. I was not able to reproduce the bug by
> copying partition contents to a newly made partition, but then again I
> was not able to reproduce it that way on an unpatched kernel either.
> Both with and without the mentioned patch reverted, the only way I am
> able to somewhat reliably reproduce it is by compiling a big batch of
> packages in Exherbo (using paludis/sydbox), if the batch contains one
> of a handful of big packages (util-linux is the most common to trigger)
> and that big package is not first in the batch. In those cases it will
> trigger in one of the following ways in most runs:
>
> - NMI watchdog: BUG: soft lockup.. on from 4 to 8 cores.
>
> - NMI watchdog: Watchdog detected hard LOCKUP on cpu.. on 2 to 4 cores.
>
> - no error message at all; everything suddenly comes to a grinding halt.

After some more testing, it seems that this is not quite accurate. I
might be facing at least two different issues. After about a day of
compiling on a pure ext4 install I have not got any of the hangs with
"NMI watchdog" errors, but one of the packages that sometimes hung with
no errors at all (dev-lang/llvm, during the test phase) still exhibits
this behavior.

I'll do another full system compile on ext4 just to make sure it's not
pure luck that I did not get any NMI watchdog errors there. Then I'll
copy the system to an f2fs partition and try dropping mount options
(starting with data_flush), running a full system compile for each, to
see if that makes any difference. (Unless anyone has a better idea on
how to proceed, or comes up with another patch to test.)

> Most of the time I also lose the ability to switch vt by alt+F-key,
> and it does not even respond to the "magic sysrq key".
>
> The few times I've been able to vt-switch to the vt with output
> from syslog, there is usually nothing in the time frame of the crash but
> one or two "kernel: perf: interrupt took too long..."
> ..or some random subsystem messaging because something timed out
> (usually some message from i915). Just to be clear, the i915 is not in
> use during the tests: console is on amdgpudrmfb and X is not running.
> I even tested with a kernel without i915 to completely rule it out.
>> commit 50fa53eccf9f911a5b435248a2b0bd484fd82e5e
>> Author: Chao Yu
>> Date: Thu Aug 2 23:03:19 2018 +0800
>>
>>     f2fs: fix to avoid broken of dnode block list
>>
>> The only risk of reverting this patch is that we may potentially lose
>> fsynced data after a sudden power-cut; although it's rare, anyway please
>> back up your data before trying it.
>>
>> Thanks,

>>>>> /dev/nvme0n1p7 /mnt/gentoo-alt-new
>>>>> #cd /mnt/gentoo-alt
>>>>> #cp -a . /mnt/gentoo-alt-new
>>>>> #umount /mnt/gentoo-alt
>>>>> The bug triggers just after the last command, and the last command
>>>>> was run within 20 seconds after the second-last command returned.
>>>>> "/mnt/gentoo-alt" was already mounted with the same options as
>>>>> "/mnt/gentoo-alt-new", and contained a Gentoo rootfs of 64GB of data
>>>>> (according to df -h).
>>>>>
>>>>> This was on kernel 5.0.10 with "[PATCH] f2fs: fix potential recursive
>>>>> call when enabling data_flush" by Chao Yu.
>>>>>
>>>>> Syslog for the bug follows:
>>>>>
>>>>> Apr 29 07:02:52 40o2 kernel: watchdog: BUG: soft lockup - CPU#4 stuck
>>>>> for 21s! [irq/61-nvme0q5:383]
>>>>> Apr 29 07:02:52 40o2 kernel:
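[Editorial note] Chao Yu's suggested experiment above amounts to running `git revert 50fa53eccf9f911a5b435248a2b0bd484fd82e5e` in the kernel source tree and rebuilding. For readers unfamiliar with revert semantics, here is a minimal self-contained demonstration in a throwaway repository; the directory, file name, and commit messages are made up for illustration:

```shell
# Demonstrate what `git revert` does; for the actual test you would run
# `git revert 50fa53eccf9f` in your kernel tree instead of this sandbox.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you

echo "original" > file.txt
git add file.txt && git commit -qm "base"          # commit 1
echo "suspect change" > file.txt
git commit -qam "suspect commit"                   # commit 2 (the "bad" one)

# Creates a NEW commit that undoes the suspect commit; history is kept.
git revert --no-edit HEAD >/dev/null
cat file.txt                                       # prints "original" again
```

Unlike `git reset`, a revert preserves the suspect commit in history, so the test build can be bisected or un-reverted later.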
Re: [f2fs-dev] [BUG] watchdog: BUG: soft lockup.. after heavy disk access
Hi Hagbard,

At a glance, may I ask what's your device size? Since I notice that you
used the -i option during mkfs, if the size is large enough, checkpoint
may crash nat/sit_bitmap online.

mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7

On 2019/4/29 15:25, Hagbard Celine wrote:
> First I must admit that I'm not 100% sure if this is an
> f2fs/NMI/something-else bug, but it only triggers after significant
> disk access.
>
> I've hit this bug several times since kernel 5.0.7 (have not tried
> earlier kernels) while compiling many packages in batch. But on those
> occasions it would hang all cores and lose the latest changes to the
> filesystem, so I could not get any logs.
> This time it triggered while I was in the process of moving two of my
> installations to new partitions and locked only one core, so I was
> able to get the log after rebooting.
> What I did to trigger it was make a new partition and run the commands:
> #mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
> #mount -o
> "rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
> /dev/nvme0n1p7 /mnt/gentoo-alt-new
> #cd /mnt/gentoo-alt
> #cp -a . /mnt/gentoo-alt-new
> #umount /mnt/gentoo-alt
> The bug triggers just after the last command, and the last command was
> run within 20 seconds after the second-last command returned.
> "/mnt/gentoo-alt" was already mounted with the same options as
> "/mnt/gentoo-alt-new", and contained a Gentoo rootfs of 64GB of data
> (according to df -h).
>
> This was on kernel 5.0.10 with "[PATCH] f2fs: fix potential recursive
> call when enabling data_flush" by Chao Yu.
>
> Syslog for the bug follows:
>
> Apr 29 07:02:52 40o2 kernel: watchdog: BUG: soft lockup - CPU#4 stuck
> for 21s!
> [irq/61-nvme0q5:383]
> Apr 29 07:02:52 40o2 kernel: Modules linked in: ipv6 crc_ccitt 8021q
> garp stp llc nls_cp437 vfat fat sd_mod iTCO_wdt x86_pkg_temp_thermal
> kvm_intel kvm irqbypass ghash_clmulni_intel serio_raw i2c_i801
> firewire_ohci firewire_core igb crc_itu_t uas lpc_ich dca usb_storage
> processor_thermal_device ahci intel_soc_dts_iosf int340x_thermal_zone
> libahci pcc_cpufreq efivarfs
> Apr 29 07:02:52 40o2 kernel: CPU: 4 PID: 383 Comm: irq/61-nvme0q5 Not
> tainted 5.0.10-gentoo #1
> Apr 29 07:02:52 40o2 kernel: Hardware name: To Be Filled By O.E.M. To
> Be Filled By O.E.M./C226 WS, BIOS P3.40 06/25/2018
> Apr 29 07:02:52 40o2 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x40
> Apr 29 07:02:52 40o2 kernel: Code: f6 c7 02 75 1e 56 9d e8 38 a0 69 ff
> bf 01 00 00 00 e8 ce 32 60 ff 65 8b 05 37 8a 70 5b 85 c0 74 0b 5b c3
> e8 3c 9f 69 ff 53 9d e0 e8 bb 58 4f ff 5b c3 90 e8 6b 50 0f 00 b8
> 00 fe ff ff f0 0f
> Apr 29 07:02:52 40o2 kernel: RSP: 0018:99a4003d7d58 EFLAGS: 0246 ORIG_RAX: ff13
> Apr 29 07:02:52 40o2 kernel: RAX: 0202 RBX: 0246 RCX: dead0200
> Apr 29 07:02:52 40o2 kernel: RDX: 8ab7fd8852e0 RSI: RDI: a490c264
> Apr 29 07:02:52 40o2 kernel: RBP: bef2cc242db8 R08: 0002 R09: 00024688
> Apr 29 07:02:52 40o2 kernel: R10: ffb8 R11: ffb8 R12: 8ab7fd885000
> Apr 29 07:02:52 40o2 kernel: R13: 8ab7fd8852d8 R14: 8ab7fef42b40 R15: 6db6db6db6db6db7
> Apr 29 07:02:52 40o2 kernel: FS: () GS:8ab81ed0() knlGS:
> Apr 29 07:02:52 40o2 kernel: CS: 0010 DS: ES: CR0: 80050033
> Apr 29 07:02:52 40o2 kernel: CR2: 7e6278ff9000 CR3: 000792e0c004 CR4: 003606e0
> Apr 29 07:02:52 40o2 kernel: DR0: DR1: DR2:
> Apr 29 07:02:52 40o2 kernel: DR3: DR6: fffe0ff0 DR7: 0400
> Apr 29 07:02:52 40o2 kernel: Call Trace:
> Apr 29 07:02:52 40o2 kernel: ? f2fs_del_fsync_node_entry+0x9f/0xd0
> Apr 29 07:02:52 40o2 kernel: ? f2fs_write_end_io+0xb6/0x1e0
> Apr 29 07:02:52 40o2 kernel: ? blk_update_request+0xc0/0x270
> Apr 29 07:02:52 40o2 kernel: ? blk_mq_end_request+0x1a/0x130
> Apr 29 07:02:52 40o2 kernel: ? blk_mq_complete_request+0x92/0x110
> Apr 29 07:02:52 40o2 kernel: ? irq_finalize_oneshot.part.46+0xe0/0xe0
> Apr 29 07:02:52 40o2 kernel: ? nvme_irq+0xf9/0x260
> Apr 29 07:02:52 40o2 kernel: ? irq_finalize_oneshot.part.46+0xe0/0xe0
> Apr 29 07:02:52 40o2 kernel: ? irq_forced_thread_fn+0x30/0x80
> Apr 29 07:02:52 40o2 kernel: ? irq_thread+0xe7/0x160
> Apr 29 07:02:52 40o2 kernel: ? wake_threads_waitq+0x30/0x30
> Apr 29 07:02:52 40o2 kernel: ? irq_thread_check_affinity+0x80/0x80
> Apr 29 07:02:52 40o2 kernel: ? kthread+0x116/0x130
> Apr 29 07:02:52 40o2 kernel: ? kthread_create_worker_on_cpu+0x70/0x70
> Apr 29 07:02:52 40o2 kernel: ? ret_from_fork+0x3a/0x50
> Apr 29 07:02:55 40o2 kernel: rcu: INFO: rcu_preempt self-detected stall on
[f2fs-dev] [BUG] watchdog: BUG: soft lockup.. after heavy disk access
First I must admit that I'm not 100% sure if this is an
f2fs/NMI/something-else bug, but it only triggers after significant
disk access.

I've hit this bug several times since kernel 5.0.7 (have not tried
earlier kernels) while compiling many packages in batch. But on those
occasions it would hang all cores and lose the latest changes to the
filesystem, so I could not get any logs.
This time it triggered while I was in the process of moving two of my
installations to new partitions and locked only one core, so I was
able to get the log after rebooting.
What I did to trigger it was make a new partition and run the commands:
#mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
#mount -o "rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict" /dev/nvme0n1p7 /mnt/gentoo-alt-new
#cd /mnt/gentoo-alt
#cp -a . /mnt/gentoo-alt-new
#umount /mnt/gentoo-alt
The bug triggers just after the last command, and the last command was
run within 20 seconds after the second-last command returned.
"/mnt/gentoo-alt" was already mounted with the same options as
"/mnt/gentoo-alt-new", and contained a Gentoo rootfs of 64GB of data
(according to df -h).

This was on kernel 5.0.10 with "[PATCH] f2fs: fix potential recursive
call when enabling data_flush" by Chao Yu.

Syslog for the bug follows:

Apr 29 07:02:52 40o2 kernel: watchdog: BUG: soft lockup - CPU#4 stuck
for 21s!
[irq/61-nvme0q5:383]
Apr 29 07:02:52 40o2 kernel: Modules linked in: ipv6 crc_ccitt 8021q
garp stp llc nls_cp437 vfat fat sd_mod iTCO_wdt x86_pkg_temp_thermal
kvm_intel kvm irqbypass ghash_clmulni_intel serio_raw i2c_i801
firewire_ohci firewire_core igb crc_itu_t uas lpc_ich dca usb_storage
processor_thermal_device ahci intel_soc_dts_iosf int340x_thermal_zone
libahci pcc_cpufreq efivarfs
Apr 29 07:02:52 40o2 kernel: CPU: 4 PID: 383 Comm: irq/61-nvme0q5 Not
tainted 5.0.10-gentoo #1
Apr 29 07:02:52 40o2 kernel: Hardware name: To Be Filled By O.E.M. To
Be Filled By O.E.M./C226 WS, BIOS P3.40 06/25/2018
Apr 29 07:02:52 40o2 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x40
Apr 29 07:02:52 40o2 kernel: Code: f6 c7 02 75 1e 56 9d e8 38 a0 69 ff
bf 01 00 00 00 e8 ce 32 60 ff 65 8b 05 37 8a 70 5b 85 c0 74 0b 5b c3
e8 3c 9f 69 ff 53 9d e0 e8 bb 58 4f ff 5b c3 90 e8 6b 50 0f 00 b8
00 fe ff ff f0 0f
Apr 29 07:02:52 40o2 kernel: RSP: 0018:99a4003d7d58 EFLAGS: 0246 ORIG_RAX: ff13
Apr 29 07:02:52 40o2 kernel: RAX: 0202 RBX: 0246 RCX: dead0200
Apr 29 07:02:52 40o2 kernel: RDX: 8ab7fd8852e0 RSI: RDI: a490c264
Apr 29 07:02:52 40o2 kernel: RBP: bef2cc242db8 R08: 0002 R09: 00024688
Apr 29 07:02:52 40o2 kernel: R10: ffb8 R11: ffb8 R12: 8ab7fd885000
Apr 29 07:02:52 40o2 kernel: R13: 8ab7fd8852d8 R14: 8ab7fef42b40 R15: 6db6db6db6db6db7
Apr 29 07:02:52 40o2 kernel: FS: () GS:8ab81ed0() knlGS:
Apr 29 07:02:52 40o2 kernel: CS: 0010 DS: ES: CR0: 80050033
Apr 29 07:02:52 40o2 kernel: CR2: 7e6278ff9000 CR3: 000792e0c004 CR4: 003606e0
Apr 29 07:02:52 40o2 kernel: DR0: DR1: DR2:
Apr 29 07:02:52 40o2 kernel: DR3: DR6: fffe0ff0 DR7: 0400
Apr 29 07:02:52 40o2 kernel: Call Trace:
Apr 29 07:02:52 40o2 kernel: ? f2fs_del_fsync_node_entry+0x9f/0xd0
Apr 29 07:02:52 40o2 kernel: ? f2fs_write_end_io+0xb6/0x1e0
Apr 29 07:02:52 40o2 kernel: ? blk_update_request+0xc0/0x270
Apr 29 07:02:52 40o2 kernel: ? blk_mq_end_request+0x1a/0x130
Apr 29 07:02:52 40o2 kernel: ? blk_mq_complete_request+0x92/0x110
Apr 29 07:02:52 40o2 kernel: ? irq_finalize_oneshot.part.46+0xe0/0xe0
Apr 29 07:02:52 40o2 kernel: ? nvme_irq+0xf9/0x260
Apr 29 07:02:52 40o2 kernel: ? irq_finalize_oneshot.part.46+0xe0/0xe0
Apr 29 07:02:52 40o2 kernel: ? irq_forced_thread_fn+0x30/0x80
Apr 29 07:02:52 40o2 kernel: ? irq_thread+0xe7/0x160
Apr 29 07:02:52 40o2 kernel: ? wake_threads_waitq+0x30/0x30
Apr 29 07:02:52 40o2 kernel: ? irq_thread_check_affinity+0x80/0x80
Apr 29 07:02:52 40o2 kernel: ? kthread+0x116/0x130
Apr 29 07:02:52 40o2 kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Apr 29 07:02:52 40o2 kernel: ? ret_from_fork+0x3a/0x50
Apr 29 07:02:55 40o2 kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Apr 29 07:02:55 40o2 kernel: rcu: 5-: (13949 ticks this GP)
idle=a26/1/0x4002 softirq=191162/191162 fqs=4843
Apr 29 07:02:55 40o2 kernel: rcu: (t=21000 jiffies g=184461 q=674)
Apr 29 07:02:55 40o2 kernel: Sending NMI from CPU 5 to CPUs 4:
Apr 29 07:02:55 40o2 kernel: NMI backtrace for cpu 4
Apr 29 07:02:55 40o2 kernel: CPU: 4 PID: 383 Comm: irq/61-nvme0q5
Tainted: G L 5.0.10-gentoo #1
Apr 29 07:02:55 40o2
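[Editorial note] As background on the "stuck for 21s!" figure in the log above: the soft-lockup detector reports a CPU only after it has been monopolized for more than twice the kernel.watchdog_thresh sysctl (the kernel's get_softlockup_thresh() is watchdog_thresh * 2, and watchdog_thresh defaults to 10 seconds). A sketch of that arithmetic:

```shell
# Assumed default: kernel.watchdog_thresh = 10 (seconds).
# The soft-lockup report threshold is twice that value, so a CPU must be
# held for more than 20s before "BUG: soft lockup" is printed -- hence
# the "stuck for 21s!" in the report above.
watchdog_thresh=10
softlockup_thresh=$((2 * watchdog_thresh))
echo "soft lockup reported after >${softlockup_thresh}s"
```

On a live system the current value can be read with `sysctl kernel.watchdog_thresh`; raising it only widens the reporting window, it does not address the underlying stall.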