Re: [f2fs-dev] [BUG] watchdog: BUG: soft lockup.. after heavy disk access

2019-05-02 Thread Hagbard Celine
2019-05-01 0:35 GMT+02:00, Hagbard Celine :
> 2019-04-30 8:25 GMT+02:00, Chao Yu :
>> On 2019/4/29 21:39, Hagbard Celine wrote:
>>> 2019-04-29 9:36 GMT+02:00, Chao Yu :
 Hi Hagbard,

 At a glance, may I ask what's your device size? Since I notice that you used
 the -i option during mkfs, if the device is large enough, checkpoint may
 corrupt the nat/sit_bitmap online.

 mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
>>>
>>> Both partitions are 128GB.
>>
>> 128GB is safe. :)
>>
>>>
 On 2019/4/29 15:25, Hagbard Celine wrote:
> First I must admit that I'm not 100% sure if this is
>>> an f2fs/NMI/something-else bug, but it only triggers after significant
> disk access.
>
> I've hit this bug several times since kernel 5.0.7 (have not tried
>>> earlier kernels) while compiling many packages in batch. But on those
>>> occasions it would hang all cores and lose the latest changes to the
>>> filesystem, so I could not get any logs.
> This time it triggered while I was in the process of moving two of my
> installations to new partitions and locked only one core, so I was
> able to get the log after rebooting.
>>> What I did to trigger it was to make a new partition and run these commands:
>>> #mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
>>> #mount -o
>>> "rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
>>
>> Could you revert below commit to check whether the issue is caused by it?
>
> This did not seem to help. I was not able to reproduce the bug by
> copying the partition contents to a newly made partition, but then again
> I was not able to reproduce it that way on the unpatched kernel either.
>
> Both with and without the mentioned patch reverted, the only way I am
> able to somewhat reliably reproduce it is by compiling a big batch of
> packages in Exherbo (using paludis/sydbox), where the batch contains one
> of a handful of big packages (util-linux is the most common trigger)
> and that big package is not first in the batch. In those cases it
> triggers in one of the following ways in most runs:
>
> -NMI watchdog: BUG: soft lockup.. on 4 to 8 cores.
>
> -NMI watchdog: Watchdog detected hard LOCKUP on cpu.. on 2 to 4 cores.
>
> -no error message at all, everything suddenly comes to a grinding halt.

After some more testing, it seems that this is not quite accurate. I
might be facing at least two different issues. After about a day of
compiling on a pure ext4 install I have not gotten any of the hangs with
"NMI watchdog" errors, but one of the packages that sometimes hung with
no errors at all (dev-lang/llvm, during its test phase) still exhibits
this behavior.

I'll do another full system compile on ext4 just to make sure it's not
pure luck that I did not get any NMI watchdog errors there. Then I'll copy
the system to an f2fs partition and try dropping mount options one at a
time (starting with data_flush), running a full system compile for each,
to see if that makes any difference (unless anyone has a better idea on
how to proceed, or comes up with another patch to test).
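
Roughly what I have in mind is the sketch below; the device and mountpoint
are placeholders for my actual setup, the options to try after data_flush
are just a first guess, and the compile step stands in for whatever batch
I end up running:

DEV=/dev/nvme0n1p7        # placeholder: partition holding the copied system
MNT=/mnt/gentoo-alt-new   # placeholder mountpoint
FULL_OPTS="rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"

for drop in data_flush flush_merge extent_cache; do
    # remove one option at a time from the full set, then rerun the workload
    opts=$(printf '%s' "$FULL_OPTS" | sed "s/,$drop//")
    mount -o "$opts" "$DEV" "$MNT" || exit 1
    echo "=== full system compile without $drop ==="
    # chroot into the copied system and run the batch compile here
    umount "$MNT"
done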

> Most of the time I also lose the ability to switch VTs with Alt+F-keys,
> and it does not even respond to the "magic SysRq key".
>
> The few times I've been able to switch to the VT with the syslog output,
> there is usually nothing in the time frame of the crash but one or two
> "kernel: perf: interrupt took too long..." messages,
> or some random subsystem message because something timed out
> (usually some message from i915). Just to be clear, i915 is not in
> use during the tests, the console is on amdgpudrmfb and X is not running;
> I even tested with a kernel without i915 to completely rule it out.
>
>> commit 50fa53eccf9f911a5b435248a2b0bd484fd82e5e
>> Author: Chao Yu 
>> Date:   Thu Aug 2 23:03:19 2018 +0800
>>
>> f2fs: fix to avoid broken of dnode block list
>>
>> The only risk of reverting this patch is that we may potentially lose
>> fsynced data after a sudden power cut; although that is rare, please back
>> up your data before trying it.
>>
>> Thanks,
>>
> /dev/nvme0n1p7 /mnt/gentoo-alt-new
> #cd /mnt/gentoo-alt
> #cp -a . /mnt/gentoo-alt-new
> #umount /mnt/gentoo-alt
> The bug triggers just after the last command, and the last command was run
> within 20 seconds after the second-last command returned.
> "/mnt/gentoo-alt" was already mounted with the same options as
> "/mnt/gentoo-alt-new", and contained a Gentoo rootfs of 64GB of data
> (according to df -h).
>
> This was on kernel 5.0.10 with "[PATCH] f2fs: fix potential recursive
> call when enabling data_flush" by Chao Yu applied.
>
> Syslog for bug follows:
>
> Apr 29 07:02:52 40o2 kernel: watchdog: BUG: soft lockup - CPU#4 stuck
> for 21s! [irq/61-nvme0q5:383]
> Apr 29 07:02:52 40o2 kernel: 

Re: [f2fs-dev] [BUG] watchdog: BUG: soft lockup.. after heavy disk access

2019-04-29 Thread Chao Yu
Hi Hagbard,

At a glance, may I ask what's your device size? Since I notice that you used
the -i option during mkfs, if the device is large enough, checkpoint may
corrupt the nat/sit_bitmap online.

mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
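
If it helps, the exact partition size can be checked with something like the
following (assuming the same /dev/nvme0n1p7 from your mkfs command):

blockdev --getsize64 /dev/nvme0n1p7   # size in bytes
lsblk -b -o NAME,SIZE /dev/nvme0n1p7  # same information via lsblk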

On 2019/4/29 15:25, Hagbard Celine wrote:
> First I must admit that I'm not 100% sure if this is
> an f2fs/NMI/something-else bug, but it only triggers after significant
> disk access.
> 
> I've hit this bug several times since kernel 5.0.7 (have not tried
> earlier kernels) while compiling many packages in batch. But on those
> occasions it would hang all cores and lose the latest changes to the
> filesystem, so I could not get any logs.
> This time it triggered while I was in the process of moving two of my
> installations to new partitions and locked only one core, so I was
> able to get the log after rebooting.
> What I did to trigger it was to make a new partition and run these commands:
> #mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
> #mount -o 
> "rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
> /dev/nvme0n1p7 /mnt/gentoo-alt-new
> #cd /mnt/gentoo-alt
> #cp -a . /mnt/gentoo-alt-new
> #umount /mnt/gentoo-alt
> The bug triggers just after the last command, and the last command was run
> within 20 seconds after the second-last command returned.
> "/mnt/gentoo-alt" was already mounted with the same options as
> "/mnt/gentoo-alt-new", and contained a Gentoo rootfs of 64GB of data
> (according to df -h).
> 
> This was on kernel 5.0.10 with "[PATCH] f2fs: fix potential recursive
> call when enabling data_flush" by Chao Yu applied.
> 
> Syslog for bug follows:
> 
> Apr 29 07:02:52 40o2 kernel: watchdog: BUG: soft lockup - CPU#4 stuck
> for 21s! [irq/61-nvme0q5:383]
> Apr 29 07:02:52 40o2 kernel: Modules linked in: ipv6 crc_ccitt 8021q
> garp stp llc nls_cp437 vfat fat sd_mod iTCO_wdt x86_pkg_temp_thermal
> kvm_intel kvm irqbypass ghash_clmulni_intel serio_raw i2c_i801
> firewire_ohci firewire_core igb crc_itu_t uas lpc_ich dca usb_storage
> processor_thermal_device ahci intel_soc_dts_iosf int340x_thermal_zone
> libahci pcc_cpufreq efivarfs
> Apr 29 07:02:52 40o2 kernel: CPU: 4 PID: 383 Comm: irq/61-nvme0q5 Not
> tainted 5.0.10-gentoo #1
> Apr 29 07:02:52 40o2 kernel: Hardware name: To Be Filled By O.E.M. To
> Be Filled By O.E.M./C226 WS, BIOS P3.40 06/25/2018
> Apr 29 07:02:52 40o2 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x40
> Apr 29 07:02:52 40o2 kernel: Code: f6 c7 02 75 1e 56 9d e8 38 a0 69 ff
> bf 01 00 00 00 e8 ce 32 60 ff 65 8b 05 37 8a 70 5b 85 c0 74 0b 5b c3
> e8 3c 9f 69 ff 53 9d  e0 e8 bb 58 4f ff 5b c3 90 e8 6b 50 0f 00 b8
> 00 fe ff ff f0 0f
> Apr 29 07:02:52 40o2 kernel: RSP: 0018:99a4003d7d58 EFLAGS:
> 0246 ORIG_RAX: ff13
> Apr 29 07:02:52 40o2 kernel: RAX: 0202 RBX:
> 0246 RCX: dead0200
> Apr 29 07:02:52 40o2 kernel: RDX: 8ab7fd8852e0 RSI:
>  RDI: a490c264
> Apr 29 07:02:52 40o2 kernel: RBP: bef2cc242db8 R08:
> 0002 R09: 00024688
> Apr 29 07:02:52 40o2 kernel: R10: ffb8 R11:
> ffb8 R12: 8ab7fd885000
> Apr 29 07:02:52 40o2 kernel: R13: 8ab7fd8852d8 R14:
> 8ab7fef42b40 R15: 6db6db6db6db6db7
> Apr 29 07:02:52 40o2 kernel: FS:  ()
> GS:8ab81ed0() knlGS:
> Apr 29 07:02:52 40o2 kernel: CS:  0010 DS:  ES:  CR0: 80050033
> Apr 29 07:02:52 40o2 kernel: CR2: 7e6278ff9000 CR3:
> 000792e0c004 CR4: 003606e0
> Apr 29 07:02:52 40o2 kernel: DR0:  DR1:
>  DR2: 
> Apr 29 07:02:52 40o2 kernel: DR3:  DR6:
> fffe0ff0 DR7: 0400
> Apr 29 07:02:52 40o2 kernel: Call Trace:
> Apr 29 07:02:52 40o2 kernel:  ? f2fs_del_fsync_node_entry+0x9f/0xd0
> Apr 29 07:02:52 40o2 kernel:  ? f2fs_write_end_io+0xb6/0x1e0
> Apr 29 07:02:52 40o2 kernel:  ? blk_update_request+0xc0/0x270
> Apr 29 07:02:52 40o2 kernel:  ? blk_mq_end_request+0x1a/0x130
> Apr 29 07:02:52 40o2 kernel:  ? blk_mq_complete_request+0x92/0x110
> Apr 29 07:02:52 40o2 kernel:  ? irq_finalize_oneshot.part.46+0xe0/0xe0
> Apr 29 07:02:52 40o2 kernel:  ? nvme_irq+0xf9/0x260
> Apr 29 07:02:52 40o2 kernel:  ? irq_finalize_oneshot.part.46+0xe0/0xe0
> Apr 29 07:02:52 40o2 kernel:  ? irq_forced_thread_fn+0x30/0x80
> Apr 29 07:02:52 40o2 kernel:  ? irq_thread+0xe7/0x160
> Apr 29 07:02:52 40o2 kernel:  ? wake_threads_waitq+0x30/0x30
> Apr 29 07:02:52 40o2 kernel:  ? irq_thread_check_affinity+0x80/0x80
> Apr 29 07:02:52 40o2 kernel:  ? kthread+0x116/0x130
> Apr 29 07:02:52 40o2 kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
> Apr 29 07:02:52 40o2 kernel:  ? ret_from_fork+0x3a/0x50
> Apr 29 07:02:55 40o2 kernel: rcu: INFO: rcu_preempt self-detected stall on CPU

[f2fs-dev] [BUG] watchdog: BUG: soft lockup.. after heavy disk access

2019-04-29 Thread Hagbard Celine
First I must admit that I'm not 100% sure if this is
an f2fs/NMI/something-else bug, but it only triggers after significant
disk access.

I've hit this bug several times since kernel 5.0.7 (have not tried
earlier kernels) while compiling many packages in batch. But on those
occasions it would hang all cores and lose the latest changes to the
filesystem, so I could not get any logs.
This time it triggered while I was in the process of moving two of my
installations to new partitions and locked only one core, so I was
able to get the log after rebooting.
What I did to trigger it was to make a new partition and run these commands:
#mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
#mount -o 
"rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
/dev/nvme0n1p7 /mnt/gentoo-alt-new
#cd /mnt/gentoo-alt
#cp -a . /mnt/gentoo-alt-new
#umount /mnt/gentoo-alt
The bug triggers just after the last command, and the last command was run
within 20 seconds after the second-last command returned.
"/mnt/gentoo-alt" was already mounted with the same options as
"/mnt/gentoo-alt-new", and contained a Gentoo rootfs of 64GB of data
(according to df -h).
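
For reference, the steps above boil down to roughly the sketch below; the
device, label and mountpoints are the ones from my setup (with /mnt/gentoo-alt
assumed to already be mounted with the same options), so adjust before use:

set -e
DEV=/dev/nvme0n1p7
SRC=/mnt/gentoo-alt        # existing install, already mounted
DST=/mnt/gentoo-alt-new

mkdir -p "$DST"
mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 "$DEV"
mount -o "rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict" "$DEV" "$DST"
(cd "$SRC" && cp -a . "$DST")
umount "$SRC"    # the lockup hit right after this step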

This was on kernel 5.0.10 with "[PATCH] f2fs: fix potential recursive
call when enabling data_flush" by Chao Yu applied.

Syslog for bug follows:

Apr 29 07:02:52 40o2 kernel: watchdog: BUG: soft lockup - CPU#4 stuck
for 21s! [irq/61-nvme0q5:383]
Apr 29 07:02:52 40o2 kernel: Modules linked in: ipv6 crc_ccitt 8021q
garp stp llc nls_cp437 vfat fat sd_mod iTCO_wdt x86_pkg_temp_thermal
kvm_intel kvm irqbypass ghash_clmulni_intel serio_raw i2c_i801
firewire_ohci firewire_core igb crc_itu_t uas lpc_ich dca usb_storage
processor_thermal_device ahci intel_soc_dts_iosf int340x_thermal_zone
libahci pcc_cpufreq efivarfs
Apr 29 07:02:52 40o2 kernel: CPU: 4 PID: 383 Comm: irq/61-nvme0q5 Not
tainted 5.0.10-gentoo #1
Apr 29 07:02:52 40o2 kernel: Hardware name: To Be Filled By O.E.M. To
Be Filled By O.E.M./C226 WS, BIOS P3.40 06/25/2018
Apr 29 07:02:52 40o2 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x40
Apr 29 07:02:52 40o2 kernel: Code: f6 c7 02 75 1e 56 9d e8 38 a0 69 ff
bf 01 00 00 00 e8 ce 32 60 ff 65 8b 05 37 8a 70 5b 85 c0 74 0b 5b c3
e8 3c 9f 69 ff 53 9d  e0 e8 bb 58 4f ff 5b c3 90 e8 6b 50 0f 00 b8
00 fe ff ff f0 0f
Apr 29 07:02:52 40o2 kernel: RSP: 0018:99a4003d7d58 EFLAGS:
0246 ORIG_RAX: ff13
Apr 29 07:02:52 40o2 kernel: RAX: 0202 RBX:
0246 RCX: dead0200
Apr 29 07:02:52 40o2 kernel: RDX: 8ab7fd8852e0 RSI:
 RDI: a490c264
Apr 29 07:02:52 40o2 kernel: RBP: bef2cc242db8 R08:
0002 R09: 00024688
Apr 29 07:02:52 40o2 kernel: R10: ffb8 R11:
ffb8 R12: 8ab7fd885000
Apr 29 07:02:52 40o2 kernel: R13: 8ab7fd8852d8 R14:
8ab7fef42b40 R15: 6db6db6db6db6db7
Apr 29 07:02:52 40o2 kernel: FS:  ()
GS:8ab81ed0() knlGS:
Apr 29 07:02:52 40o2 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Apr 29 07:02:52 40o2 kernel: CR2: 7e6278ff9000 CR3:
000792e0c004 CR4: 003606e0
Apr 29 07:02:52 40o2 kernel: DR0:  DR1:
 DR2: 
Apr 29 07:02:52 40o2 kernel: DR3:  DR6:
fffe0ff0 DR7: 0400
Apr 29 07:02:52 40o2 kernel: Call Trace:
Apr 29 07:02:52 40o2 kernel:  ? f2fs_del_fsync_node_entry+0x9f/0xd0
Apr 29 07:02:52 40o2 kernel:  ? f2fs_write_end_io+0xb6/0x1e0
Apr 29 07:02:52 40o2 kernel:  ? blk_update_request+0xc0/0x270
Apr 29 07:02:52 40o2 kernel:  ? blk_mq_end_request+0x1a/0x130
Apr 29 07:02:52 40o2 kernel:  ? blk_mq_complete_request+0x92/0x110
Apr 29 07:02:52 40o2 kernel:  ? irq_finalize_oneshot.part.46+0xe0/0xe0
Apr 29 07:02:52 40o2 kernel:  ? nvme_irq+0xf9/0x260
Apr 29 07:02:52 40o2 kernel:  ? irq_finalize_oneshot.part.46+0xe0/0xe0
Apr 29 07:02:52 40o2 kernel:  ? irq_forced_thread_fn+0x30/0x80
Apr 29 07:02:52 40o2 kernel:  ? irq_thread+0xe7/0x160
Apr 29 07:02:52 40o2 kernel:  ? wake_threads_waitq+0x30/0x30
Apr 29 07:02:52 40o2 kernel:  ? irq_thread_check_affinity+0x80/0x80
Apr 29 07:02:52 40o2 kernel:  ? kthread+0x116/0x130
Apr 29 07:02:52 40o2 kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
Apr 29 07:02:52 40o2 kernel:  ? ret_from_fork+0x3a/0x50
Apr 29 07:02:55 40o2 kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Apr 29 07:02:55 40o2 kernel: rcu:     5-: (13949 ticks this GP)
idle=a26/1/0x4002 softirq=191162/191162 fqs=4843
Apr 29 07:02:55 40o2 kernel: rcu:      (t=21000 jiffies g=184461 q=674)
Apr 29 07:02:55 40o2 kernel: Sending NMI from CPU 5 to CPUs 4:
Apr 29 07:02:55 40o2 kernel: NMI backtrace for cpu 4
Apr 29 07:02:55 40o2 kernel: CPU: 4 PID: 383 Comm: irq/61-nvme0q5
Tainted: G L5.0.10-gentoo #1
Apr 29 07:02:55 40o2