Re: [f2fs-dev] BUG: kernel stack overflow when mounting with data_flush

2019-05-20 Thread Hagbard Celine
2019-05-20 11:37 GMT+02:00, Chao Yu :
> On 2019/5/16 1:01, Hagbard Celine wrote:
>> 2019-05-15 18:50 GMT+02:00, Hagbard Celine :
>>> 2019-05-15 10:13 GMT+02:00, Chao Yu :
>>>> On 2019/5/15 16:03, Hagbard Celine wrote:
>>>>> 2019-05-15 4:25 GMT+02:00, Chao Yu :
>>>>>> On 2019/5/15 2:13, Hagbard Celine wrote:
>>>>>>> 2019-04-02 15:31 GMT+02:00, Chao Yu :
>>>>>>>> On 2019-4-2 20:41, Hagbard Celine wrote:
>>>>>>>>> That seems to have fixed it. No more errors in syslog after
>>>>>>>>> extracting my stage3 tarball. Also ran a couple of kernel compiles
>>>>>>>>> on a partition mounted with data_flush and system seems stable.
>>>>>>>>
>>>>>>>> Thanks a lot for your quick test. :)
>>>>>>>
>>>>>>> My test might have been a little too quick, or I found another
>>>>>>> data_flush bug that behaves similar.
>>>>>>
>>>>>> oops...
>>>>>>
>>>>>>>>>>
>>>>>>>>>> -if (is_dir)
>>>>>>>>>> -F2FS_I(inode)->cp_task = current;
>>>>>>>>>> +F2FS_I(inode)->cp_task = current;
>>>>>>
>>>>>> If you're sure that this patch was applied before you tested, I guess
>>>>>> we need an extra barrier here to avoid out-of-order execution.
>>>>>>
>>>>>> smp_mb()
>>>>>>
>>>>>>>>>>
>>>>>>>>>>  filemap_fdatawrite(inode->i_mapping);
>>>>>>>>>>
>>>>>>>>>> -if (is_dir)
>>>>>>>>>> -F2FS_I(inode)->cp_task = NULL;
>>>>>>>>>> +F2FS_I(inode)->cp_task = NULL;
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>> If I did this correctly; it did not get rid of the stack overflow.
>>>>> Here is what I did:
>>>>>
>>>>> Added smp_mb() in checkpoint.c so the relevant part looks like this:
>>>>>
>>>>>   unsigned long cur_ino = inode->i_ino;
>>>>>
>>>>>   F2FS_I(inode)->cp_task = current;
>>>>>
>>>>>   smp_mb();
>>>>>
>>>>>   filemap_fdatawrite(inode->i_mapping);
>>>>>
>>>>>   F2FS_I(inode)->cp_task = NULL;
>>>>>
>>>>>   iput(inode);
>>>>>   
>>>>>
>>>>> Compiled, rebooted and ran my test-scripts again. And got this during
>>>>> copy-phase in second script:
>>>>
>>>> It looks very easy to reproduce this bug, could you add log to track
>>>> F2FS_I(inode)->cp_task's value:
>>> That wasn't so easy: with all the logging from those printk's, the copy
>>> process would hang where it would oops without the printk's.
>>
>> I forgot to mention in my last mail that I also have a log from a hang
>> with both printk's enabled:
>
> Sorry for the delay.
>
> I found another two issues related to data_flush, could you try the fixing
> patch below?
>
> [PATCH] f2fs: fix to avoid deadloop if data_flush is on

I did several runs of my test scripts with this new patch on top of
kernel 5.0.15 with "[PATCH] f2fs: fix potential recursive call when
enabling data_flush" and the extra smp_mb() in checkpoint.c.
When that worked I did the same with this new patch on top of kernel
5.0.15 with "[PATCH] f2fs: fix potential recursive call when enabling
data_flush" and _without_ the extra smp_mb() in checkpoint.c.

In both cases I get no oops or hang.

>
> Thanks,
>
>>
>> [  194.681126] sync_dirty_inodes: inode:590309, cp_task:13327ef9
>> [  194.682258] sync_dirty_inodes: inode:590301, cp_task:13327ef9
>> [  194.682665] sync_dirty_inodes: inode:590311, cp_task:13327ef9
>> [  194.682703] sync_dirty_inodes: inode:590312, cp_task:13327ef9
>> [  194.682791] sync_dirty_inodes: inode:590313, cp_task:13327ef9
>> [  194.683566] sync_dirty_inodes: inode:590314, cp_task:13327ef9
>> [  194.683669] s

Re: [f2fs-dev] BUG: kernel stack overflow when mounting with data_flush

2019-05-15 Thread Hagbard Celine
2019-05-15 18:50 GMT+02:00, Hagbard Celine :
> 2019-05-15 10:13 GMT+02:00, Chao Yu :
>> On 2019/5/15 16:03, Hagbard Celine wrote:
>>> 2019-05-15 4:25 GMT+02:00, Chao Yu :
>>>> On 2019/5/15 2:13, Hagbard Celine wrote:
>>>>> 2019-04-02 15:31 GMT+02:00, Chao Yu :
>>>>>> On 2019-4-2 20:41, Hagbard Celine wrote:
>>>>>>> That seems to have fixed it. No more errors in syslog after
>>>>>>> extracting my stage3 tarball. Also ran a couple of kernel compiles
>>>>>>> on a partition mounted with data_flush and system seems stable.
>>>>>>
>>>>>> Thanks a lot for your quick test. :)
>>>>>
>>>>> My test might have been a little too quick, or I found another
>>>>> data_flush bug that behaves similar.
>>>>
>>>> oops...
>>>>
>>>>>>>>
>>>>>>>> -  if (is_dir)
>>>>>>>> -  F2FS_I(inode)->cp_task = current;
>>>>>>>> +  F2FS_I(inode)->cp_task = current;
>>>>
>>>> If you're sure that this patch was applied before you tested, I guess we
>>>> need an extra barrier here to avoid out-of-order execution.
>>>>
>>>> smp_mb()
>>>>
>>>>>>>>
>>>>>>>>filemap_fdatawrite(inode->i_mapping);
>>>>>>>>
>>>>>>>> -  if (is_dir)
>>>>>>>> -  F2FS_I(inode)->cp_task = NULL;
>>>>>>>> +  F2FS_I(inode)->cp_task = NULL;
>>>>
>>>> Thanks,
>>>>
>>> If I did this correctly; it did not get rid of the stack overflow.
>>> Here is what I did:
>>>
>>> Added smp_mb() in checkpoint.c so the relevant part looks like this:
>>>
>>> unsigned long cur_ino = inode->i_ino;
>>>
>>> F2FS_I(inode)->cp_task = current;
>>>
>>> smp_mb();
>>>
>>> filemap_fdatawrite(inode->i_mapping);
>>>
>>> F2FS_I(inode)->cp_task = NULL;
>>>
>>> iput(inode);
>>> 
>>>
>>> Compiled, rebooted and ran my test-scripts again. And got this during
>>> copy-phase in second script:
>>
>> It looks very easy to reproduce this bug, could you add log to track
>> F2FS_I(inode)->cp_task's value:
> That wasn't so easy: with all the logging from those printk's, the copy
> process would hang where it would oops without the printk's.

I forgot to mention in my last mail that I also have a log from a hang
with both printk's enabled:

[  194.681126] sync_dirty_inodes: inode:590309, cp_task:13327ef9
[  194.682258] sync_dirty_inodes: inode:590301, cp_task:13327ef9
[  194.682665] sync_dirty_inodes: inode:590311, cp_task:13327ef9
[  194.682703] sync_dirty_inodes: inode:590312, cp_task:13327ef9
[  194.682791] sync_dirty_inodes: inode:590313, cp_task:13327ef9
[  194.683566] sync_dirty_inodes: inode:590314, cp_task:13327ef9
[  194.683669] sync_dirty_inodes: inode:590315, cp_task:13327ef9
[  194.684829] sync_dirty_inodes: inode:590316, cp_task:13327ef9
[  194.712860] sync_dirty_inodes: inode:590317, cp_task:13327ef9
[  194.712908] sync_dirty_inodes: inode:590310, cp_task:13327ef9
[  194.713094] sync_dirty_inodes: inode:590319, cp_task:13327ef9
[  194.713348] sync_dirty_inodes: inode:590320, cp_task:13327ef9
[  194.713384] sync_dirty_inodes: inode:590321, cp_task:13327ef9
[  194.714634] sync_dirty_inodes: inode:590322, cp_task:13327ef9
[  194.715349] sync_dirty_inodes: inode:590323, cp_task:13327ef9
[  194.715381] sync_dirty_inodes: inode:590324, cp_task:13327ef9
[  194.718592] sync_dirty_inodes: inode:590326, cp_task:13327ef9
[  194.719217] sync_dirty_inodes: inode:590327, cp_task:13327ef9
[  194.719354] sync_dirty_inodes: inode:590328, cp_task:13327ef9
[  194.719903] sync_dirty_inodes: inode:590329, cp_task:13327ef9
[  194.720859] sync_dirty_inodes: inode:590521, cp_task:13327ef9
[  194.720868] sync_dirty_inodes: inode:590300, cp_task:13327ef9
[  194.720985] sync_dirty_inodes: inode:590523, cp_task:13327ef9
[  194.738075] sync_dirty_inodes: inode:591528, cp_task:13327ef9
[  194.738168] sync_dirty_inodes: inode:591529, cp_task:13327ef9
[  194.738190] sync_dirt
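
Since these logs get long, a small helper can condense them into one line per function and cp_task value. This is only a sketch: the name summarize_cp_task is made up for illustration, and it assumes the debug printk lines keep the "<func>: inode:<ino>, cp_task:<value>" layout shown above.

```shell
#!/bin/bash
# Summarize debug printk lines of the form
#   <func>: inode:<ino>, cp_task:<value>
# read from stdin into "<func> <cp_task value> <count>" lines.
# summarize_cp_task is a hypothetical helper name, not part of any tool.
summarize_cp_task() {
    awk '
        /: inode:.*cp_task:/ {
            fn = $0
            sub(/: inode:.*/, "", fn)       # drop everything after the function name
            sub(/.* /, "", fn)              # drop the timestamp prefix
            val = $0
            sub(/.*cp_task:[ ]*/, "", val)  # keep only the cp_task value
            count[fn " " val]++
        }
        END {
            for (k in count)
                print k, count[k]
        }
    ' | LC_ALL=C sort
}
```

It could be fed from a saved log or from "dmesg | summarize_cp_task", which makes it easier to see at a glance whether one function only ever sees a single cp_task value.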

Re: [f2fs-dev] BUG: kernel stack overflow when mounting with data_flush

2019-05-15 Thread Hagbard Celine
2019-05-15 10:13 GMT+02:00, Chao Yu :
> On 2019/5/15 16:03, Hagbard Celine wrote:
>> 2019-05-15 4:25 GMT+02:00, Chao Yu :
>>> On 2019/5/15 2:13, Hagbard Celine wrote:
>>>> 2019-04-02 15:31 GMT+02:00, Chao Yu :
>>>>> On 2019-4-2 20:41, Hagbard Celine wrote:
>>>>>> That seems to have fixed it. No more errors in syslog after
>>>>>> extracting my stage3 tarball. Also ran a couple of kernel compiles
>>>>>> on a partition mounted with data_flush and system seems stable.
>>>>>
>>>>> Thanks a lot for your quick test. :)
>>>>
>>>> My test might have been a little too quick, or I found another
>>>> data_flush bug that behaves similar.
>>>
>>> oops...
>>>
>>>>>>>
>>>>>>> -   if (is_dir)
>>>>>>> -   F2FS_I(inode)->cp_task = current;
>>>>>>> +   F2FS_I(inode)->cp_task = current;
>>>
>>> If you're sure that this patch was applied before you tested, I guess we
>>> need an extra barrier here to avoid out-of-order execution.
>>>
>>> smp_mb()
>>>
>>>>>>>
>>>>>>> filemap_fdatawrite(inode->i_mapping);
>>>>>>>
>>>>>>> -   if (is_dir)
>>>>>>> -   F2FS_I(inode)->cp_task = NULL;
>>>>>>> +   F2FS_I(inode)->cp_task = NULL;
>>>
>>> Thanks,
>>>
>> If I did this correctly; it did not get rid of the stack overflow.
>> Here is what I did:
>>
>> Added smp_mb() in checkpoint.c so the relevant part looks like this:
>>
>>  unsigned long cur_ino = inode->i_ino;
>>
>>  F2FS_I(inode)->cp_task = current;
>>
>>  smp_mb();
>>
>>  filemap_fdatawrite(inode->i_mapping);
>>
>>  F2FS_I(inode)->cp_task = NULL;
>>
>>  iput(inode);
>>  
>>
>> Compiled, rebooted and ran my test-scripts again. And got this during
>> copy-phase in second script:
>
> It looks very easy to reproduce this bug, could you add log to track
> F2FS_I(inode)->cp_task's value:
That wasn't so easy: with all the logging from those printk's, the copy
process would hang where it would oops without the printk's.
I was able to reproduce the bug with only one of the two printk's enabled
at a time, and even then I had to disable syslog-ng and fcron for it not
to hang.

The following is the log from two runs, one with each of the printk's. I
hope it helps.

--BEGIN log one
<4>[  593.806592] write_data_page: inode:710085, cp_task:  (null)
<4>[  593.806688] write_data_page: inode:710110, cp_task:  (null)
<4>[  593.808558] write_data_page: inode:710321, cp_task:  (null)
<4>[  593.808575] write_data_page: inode:710325, cp_task:  (null)
<4>[  593.808590] write_data_page: inode:710326, cp_task:  (null)
<4>[  593.808606] write_data_page: inode:710332, cp_task:  (null)
<4>[  593.966185] write_data_page: inode:721775, cp_task:  (null)
<4>[  593.966203] write_data_page: inode:721776, cp_task:  (null)
<4>[  593.966219] write_data_page: inode:721777, cp_task:  (null)
<4>[  593.966235] write_data_page: inode:721778, cp_task:  (null)
<4>[  593.966250] write_data_page: inode:721779, cp_task:  (null)
<4>[  593.966266] write_data_page: inode:721780, cp_task:  (null)
<4>[  593.966281] write_data_page: inode:721781, cp_task:  (null)
<4>[  593.966296] write_data_page: inode:721782, cp_task:  (null)
<4>[  593.966311] write_data_page: inode:721783, cp_task:  (null)
<4>[  593.966327] write_data_page: inode:721784, cp_task:  (null)
<4>[  593.966343] write_data_page: inode:721785, cp_task:  (null)
<4>[  593.966359] write_data_page: inode:721786, cp_task:  (null)
<4>[  593.966374] write_data_page: inode:721787, cp_task:  (null)
<4>[  594.340072] write_data_page: inode:746183, cp_task:  (null)
<0>[  594.923096] BUG: stack guard page was hit at 6e7354a5
(stack is 6445beb4..988529ca)
<0>[  594.923108] BUG: stack guard page was hit at d2c9ec98
(stack is b417d4d3..1b88c4fe)
<4>[  594.926975] kernel stack overflow (double-fault):  [#1]
PREEMPT SMP PTI
<4>[  594.934772] CPU: 7 PID: 2158 Comm: cp Not tainted

Re: [f2fs-dev] BUG: kernel stack overflow when mounting with data_flush

2019-05-15 Thread Hagbard Celine
2019-05-15 4:25 GMT+02:00, Chao Yu :
> On 2019/5/15 2:13, Hagbard Celine wrote:
>> 2019-04-02 15:31 GMT+02:00, Chao Yu :
>>> On 2019-4-2 20:41, Hagbard Celine wrote:
>>>> That seems to have fixed it. No more errors in syslog after extracting
>>>> my stage3 tarball. Also ran a couple of kernel compiles on a partition
>>>> mounted with data_flush and system seems stable.
>>>
>>> Thanks a lot for your quick test. :)
>>
>> My test might have been a little too quick, or I found another
>> data_flush bug that behaves similar.
>
> oops...
>
>>>>>
>>>>> - if (is_dir)
>>>>> - F2FS_I(inode)->cp_task = current;
>>>>> + F2FS_I(inode)->cp_task = current;
>
> If you're sure that this patch was applied before you tested, I guess we
> need an extra barrier here to avoid out-of-order execution.
>
> smp_mb()
>
>>>>>
>>>>>   filemap_fdatawrite(inode->i_mapping);
>>>>>
>>>>> - if (is_dir)
>>>>> - F2FS_I(inode)->cp_task = NULL;
>>>>> + F2FS_I(inode)->cp_task = NULL;
>
> Thanks,
>
If I did this correctly; it did not get rid of the stack overflow.
Here is what I did:

Added smp_mb() in checkpoint.c so the relevant part looks like this:

unsigned long cur_ino = inode->i_ino;

F2FS_I(inode)->cp_task = current;

smp_mb();

filemap_fdatawrite(inode->i_mapping);

F2FS_I(inode)->cp_task = NULL;

iput(inode);


Compiled, rebooted and ran my test-scripts again. And got this during
copy-phase in second script:

<5>[ 1215.731077] F2FS-fs (nvme0n1p7): Found nat_bits in checkpoint
<5>[ 1215.812730] F2FS-fs (nvme0n1p7): Mounted with checkpoint version
= 6319b5f3
<5>[ 1215.856781] F2FS-fs (nvme0n1p8): Mounted with checkpoint version
= 7a6b5e6d
<5>[ 1587.552859] F2FS-fs (nvme0n1p7): Found nat_bits in checkpoint
<5>[ 1587.597483] F2FS-fs (nvme0n1p7): Mounted with checkpoint version
= 6319b776
<5>[ 1587.630029] F2FS-fs (nvme0n1p8): Mounted with checkpoint version
= 7a6b5e71
<0>[ 1720.608369] BUG: stack guard page was hit at 33d16c42
(stack is ed3eabe7..ffbe4ff0)
<4>[ 1720.612537] kernel stack overflow (double-fault):  [#1]
PREEMPT SMP PTI
<4>[ 1720.616750] CPU: 3 PID: 1982 Comm: kworker/u16:0 Not tainted
5.0.15-gentoo-f2fsbarr #3
<4>[ 1720.621057] Hardware name: To Be Filled By O.E.M. To Be Filled
By O.E.M./C226 WS, BIOS P3.40 06/25/2018
<4>[ 1720.625465] Workqueue: writeback wb_workfn (flush-259:0)
<4>[ 1720.629881] RIP: 0010:sched_clock_cpu+0x9/0xd0
<4>[ 1720.634283] Code: 08 e8 2b 9b f0 ff 48 89 03 48 03 05 a1 2e 62
01 48 2b 43 08 5b 48 89 05 8d 2e 62 01 c3 0f 1f 40 00 41 54 55 53 0f
1f 44 00 00  02 9b f0 ff 48 03 05 7b 2e 62 01 48 89 c2 5b 48 89 d0
5d 41 5c
<4>[ 1720.639109] RSP: 0018:a661c0364000 EFLAGS: 00010046
<4>[ 1720.643893] RAX: 0003 RBX: 91cf5ecd54c0 RCX:
a661c03640f8
<4>[ 1720.648739] RDX:  RSI: 0003 RDI:
0003
<4>[ 1720.653589] RBP: b16485c0 R08: 0004 R09:
00020e00
<4>[ 1720.658441] R10: b16485c0 R11: 00cb R12:

<4>[ 1720.663255] R13: a661c03640f8 R14: 0046 R15:
91cf3c8a01c0
<4>[ 1720.668069] FS:  ()
GS:91cf5ecc() knlGS:
<4>[ 1720.672971] CS:  0010 DS:  ES:  CR0: 80050033
<4>[ 1720.677885] CR2: a661c0363ff8 CR3: 00069bc0c003 CR4:
003606e0
<4>[ 1720.682859] DR0:  DR1:  DR2:

<4>[ 1720.687839] DR3:  DR6: fffe0ff0 DR7:
0400
<4>[ 1720.692821] Call Trace:
<4>[ 1720.697807]  record_times+0x16/0xb0
<4>[ 1720.702795]  psi_task_change+0xe9/0x210
<4>[ 1720.707795]  activate_task+0xac/0x120
<4>[ 1720.712772]  ttwu_do_activate+0x43/0x80
<4>[ 1720.717768]  try_to_wake_up+0x1ef/0x510
<4>[ 1720.722547]  __queue_work+0xf2/0x3f0
<4>[ 1720.727110]  mod_delayed_work_on+0x59/0xa0
<4>[ 1720.731725]  kblockd_mod_delayed_work_on+0x17/0x20
<4>[ 1720.736403]  blk_mq_run_hw_queue+0x88/0xe0
<4>[ 1720.741094]  blk_mq_flush_plug_list+0x19e/0x300
<4>[ 1720.745810]  blk_flush_plug_list+0xd7/0x100
<4>[ 1720.750534]  io_schedule_prepare+0x3c/0x40
<4>[ 1720.755171]  io_schedule+0xb/0x40
<4>[ 1720.759697]  __lock_page+0x13c/0x240
<4>[ 1720.764214]  ? file_check_and_advance_wb_err+0xe0/0xe0
<4>[ 1720.768762]

Re: [f2fs-dev] BUG: kernel stack overflow when mounting with data_flush

2019-05-14 Thread Hagbard Celine
2019-04-02 15:31 GMT+02:00, Chao Yu :
> On 2019-4-2 20:41, Hagbard Celine wrote:
>> That seems to have fixed it. No more errors in syslog after extracting
>> my stage3 tarball. Also ran a couple of kernel compiles on a partition
>> mounted with data_flush and system seems stable.
>
> Thanks a lot for your quick test. :)

My test might have been a little too quick, or I found another
data_flush bug that behaves similarly.

While trying to find a faster method to trigger the "watchdog: BUG:
soft lockup.. after heavy disk access" issue I reported in another
mail; I got again "stack guard page was hit...", "kernel stack
overflow (double-fault)..." which appear only when mounted with
data_flush.

What I did to trigger this time was I made two scripts:

--BEGIN first script
#!/bin/bash
mkfs.f2fs -a 1 -f -i -l NVME_Exherbo-ts2 -o 7 /dev/nvme0n1p7
mount -o 
"rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
/dev/nvme0n1p7 /mnt/exherbo-2tst/
mount -o 
"rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
/dev/nvme0n1p8 /mnt/exherbo
mkdir /mnt/exherbo-2tst/a
mkdir /mnt/exherbo-2tst/b
mkdir /mnt/exherbo-2tst/c
mkdir /mnt/exherbo-2tst/d
mkdir /mnt/exherbo-2tst/e
mkdir /mnt/exherbo-2tst/f
mkdir /mnt/exherbo-2tst/g
mkdir /mnt/exherbo-2tst/h
cd /mnt/exherbo
cp -a . /mnt/exherbo-2tst/a
cp -a . /mnt/exherbo-2tst/b
cp -a . /mnt/exherbo-2tst/c
cp -a . /mnt/exherbo-2tst/d
cp -a . /mnt/exherbo-2tst/e
cp -a . /mnt/exherbo-2tst/f
cp -a . /mnt/exherbo-2tst/g
cp -a . /mnt/exherbo-2tst/h
cd
df -h
umount /mnt/exherbo
umount /mnt/exherbo-2tst/
--END first script

--BEGIN second script
#!/bin/bash
mount -o 
"rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
/dev/nvme0n1p7 /mnt/exherbo-2tst/
mount -o 
"rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
/dev/nvme0n1p8 /mnt/exherbo
cd /mnt/exherbo-2tst/
rm -r *
cd
mkdir /mnt/exherbo-2tst/a
mkdir /mnt/exherbo-2tst/b
mkdir /mnt/exherbo-2tst/c
mkdir /mnt/exherbo-2tst/d
mkdir /mnt/exherbo-2tst/e
mkdir /mnt/exherbo-2tst/f
mkdir /mnt/exherbo-2tst/g
mkdir /mnt/exherbo-2tst/h
cd /mnt/exherbo
cp -a . /mnt/exherbo-2tst/a
cp -a . /mnt/exherbo-2tst/b
cp -a . /mnt/exherbo-2tst/c
cp -a . /mnt/exherbo-2tst/d
cp -a . /mnt/exherbo-2tst/e
cp -a . /mnt/exherbo-2tst/f
cp -a . /mnt/exherbo-2tst/g
cp -a . /mnt/exherbo-2tst/h
cd
df -h
umount /mnt/exherbo
umount /mnt/exherbo-2tst/
--END second script

I ran these in order, with /dev/nvme0n1p8 (the source partition) formatted
with the same options as used on /dev/nvme0n1p7 in the script and containing
an exherbo-install of 17GB according to "df -h".
When running the second script the bug triggers during copying every
time. If I remove data_flush from mount options in scripts, bug does
not trigger. Both partitions used are 128GB in size.
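
For anyone who wants to repeat this, the shared part of the two scripts can be condensed into a couple of functions. The following is only a dry-run sketch: it prints the commands instead of running them, the function names mount_opts and copy_phase are made up for illustration, and the devices and mount points are simply the ones from my scripts above.

```shell
#!/bin/bash
# Dry-run sketch: nothing is formatted, mounted or copied here, the
# commands are only printed. mount_opts and copy_phase are hypothetical
# helper names; paths are the ones used in the scripts above.

SRC_MNT=/mnt/exherbo
DST_MNT=/mnt/exherbo-2tst

# Build the mount option string; pass "no" to leave out data_flush.
mount_opts() {
    local with_data_flush=${1:-yes}
    local opts="rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache"
    if [ "$with_data_flush" = "yes" ]; then
        opts="$opts,data_flush"
    fi
    opts="$opts,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
    printf '%s\n' "$opts"
}

# Print the copy phase shared by both scripts: eight directories a..h,
# each receiving a full copy of the source partition.
copy_phase() {
    local d
    for d in a b c d e f g h; do
        echo "mkdir $DST_MNT/$d"
    done
    echo "cd $SRC_MNT"
    for d in a b c d e f g h; do
        echo "cp -a . $DST_MNT/$d"
    done
}
```

Calling "mount_opts no" gives the option string for the comparison run without data_flush, which is the case where the bug does not trigger for me.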

This was on kernel-5.0.15 with "[PATCH] f2fs: fix potential recursive
call when enabling data_flush" by Chao Yu

Syslog follows:
<6>[ 1020.669305] EXT4-fs (nvme0n1p2): mounted filesystem with ordered
data mode. Opts: discard
<5>[ 1400.426449] F2FS-fs (nvme0n1p7): Found nat_bits in checkpoint
<5>[ 1400.487987] F2FS-fs (nvme0n1p7): Mounted with checkpoint version
= 7f73ca21
<5>[ 1400.528024] F2FS-fs (nvme0n1p8): Mounted with checkpoint version
= 7a6b5e4a
<5>[ 1678.585243] F2FS-fs (nvme0n1p7): Found nat_bits in checkpoint
<5>[ 1678.629891] F2FS-fs (nvme0n1p7): Mounted with checkpoint version
= 7f73cba1
<5>[ 1678.664250] F2FS-fs (nvme0n1p8): Mounted with checkpoint version
= 7a6b5e4e
<0>[ 1810.859985] BUG: stack guard page was hit at 973394e8
(stack is 5c69b096..5a84ab36)
<4>[ 1810.864326] kernel stack overflow (double-fault):  [#1]
PREEMPT SMP PTI
<4>[ 1810.868562] CPU: 0 PID: 2328 Comm: cp Not tainted 5.0.15-gentoo #2
<4>[ 1810.872779] Hardware name: To Be Filled By O.E.M. To Be Filled
By O.E.M./C226 WS, BIOS P3.40 06/25/2018
<4>[ 1810.877036] RIP: 0010:__alloc_pages_nodemask+0x0/0x230
<4>[ 1810.881238] Code: 83 3c 24 08 0f 84 f7 fa ff ff 8b 4c 24 44 85
c9 0f 85 eb fa ff ff c7 44 24 38 00 00 00 00 e9 df f4 ff ff e8 b2 1c
ee ff 66 90  fb 64 9f 00 41 5

Re: [f2fs-dev] [BUG] watchdog: BUG: soft lockup.. after heavy disk access

2019-05-02 Thread Hagbard Celine
2019-05-01 0:35 GMT+02:00, Hagbard Celine :
> 2019-04-30 8:25 GMT+02:00, Chao Yu :
>> On 2019/4/29 21:39, Hagbard Celine wrote:
>>> 2019-04-29 9:36 GMT+02:00, Chao Yu :
>>>> Hi Hagbard,
>>>>
>>>> At a glance, may I ask what your device size is? I noticed that you used
>>>> the -i option during mkfs; if the size is large enough, checkpoint may
>>>> crash nat/sit_bitmap online.
>>>>
>>>> mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
>>>
>>> Both partitions are 128GB.
>>
>> 128GB is safe. :)
>>
>>>
>>>> On 2019/4/29 15:25, Hagbard Celine wrote:
>>>>> First I must admit that I'm not 100% sure if this is
>>>>> f2fs/NMI/something-else bug, but it only triggers after significant
>>>>> disk access.
>>>>>
>>>>> I've hit this bug several times since kernel 5.0.7 (have not tried
>>>>> earlier kernels) while compiling many packages in batch. But in those
>>>>> occasions it would hang all cores and lose the latest changes to the
>>>>> filesystem so I could not get any logs.
>>>>> This time it triggered while I was in the process of moving two of my
>>>>> installations to new partitions and locked only one core, so I was
>>>>> able to get the log after rebooting.
>>>>> What I did to trigger it was make a new partition and run these commands:
>>>>> #mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
>>>>> #mount -o
>>>>> "rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
>>
>> Could you revert below commit to check whether the issue is caused by it?
>
> This did not seem to help. I was not able to reproduce the bug by
> copying partition contents to a newly made partition, but then again I
> was not able to reproduce it that way on an unpatched kernel either.
>
> Both with and without the mentioned patch reverted, the only way I am
> able to somewhat reliably reproduce it is by compiling a big batch of
> packages in Exherbo (using paludis/sydbox), if the batch contains one
> of a handful of big packages (util-linux is the most common to trigger)
> and that big package is not first in the batch. In those cases it will
> trigger in one of the following ways in most runs:
>
> -NMI watchdog: BUG: soft lockup.. on 4 to 8 cores.
>
> -NMI watchdog: Watchdog detected hard LOCKUP on cpu.. on 2 to 4 cores.
>
> -no error message at all, everything suddenly grinds to a halt.

After some more testing, it seems that this is not quite accurate. I
might be facing at least two different issues. After about a day of
compiling on a pure ext4 install I have not got any of the hangs with
"NMI watchdog" errors, but one of the packages that sometimes hung with
no errors at all (dev-lang/llvm, during test phase) still exhibits
this behavior.

I'll do another full system compile on ext4 just to make sure it's not
pure luck I did not get any NMI watchdog errors there. Then I'll copy
the system to f2fs partition and try dropping mount options (starting
with data_flush), running a full system compile for each, to see if
that makes any difference. (unless anyone has a better idea on how to
proceed, or comes up with another patch to test)
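
To keep the option-dropping runs consistent, the mount option string can be generated rather than edited by hand. This is only a sketch under my own assumptions: drop_opt is a made-up helper name, and it just removes one entry from a comma-separated option list.

```shell
#!/bin/bash
# Remove one option from a comma-separated mount option string.
# drop_opt is a hypothetical helper name used only for this sketch.
drop_opt() {
    local opts=$1 drop=$2 out="" o
    local IFS=,
    # Unquoted on purpose: split the option string at the commas.
    for o in $opts; do
        [ "$o" = "$drop" ] && continue
        out="${out:+$out,}$o"
    done
    printf '%s\n' "$out"
}
```

For the first run that would be something like drop_opt "$OPTS" data_flush, with OPTS holding the full option string used for the f2fs mounts, and then one further option per full system compile.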

> Most of the time I also lose the ability to switch vt by alt+F-key
> and it's not even responding to "magic sysreq key".
>
> The few times I've been able to do vt-switch to the vt with output
> from syslog there is usually nothing in the time-frame of crash but
> one or two "kernel: perf: interrupt took too long..."
> ..or some random subsystem messaging because something timed out
> (usually some message from i915). Just to be clear the i915 is not in
> use during the tests, console is on amdgpudrmfb and X is not running,
> I even tested with a kernel without i915 to completely rule it out.
>
>> commit 50fa53eccf9f911a5b435248a2b0bd484fd82e5e
>> Author: Chao Yu 
>> Date:   Thu Aug 2 23:03:19 2018 +0800
>>
>> f2fs: fix to avoid broken of dnode block list
>>
>> The only risk of reverting this patch is that we may potentially lose
>> fsynced data after a sudden power cut. Although that is rare, please back
>> up your data before trying it.
>>
>> Thanks,
>>
>>>>> /dev/nvme0n1p7 /mnt/gentoo-alt-new
>>>>> #cd /mnt/gentoo-alt
>>>

[f2fs-dev] [BUG] watchdog: BUG: soft lockup.. after heavy disk access

2019-04-29 Thread Hagbard Celine
First I must admit that I'm not 100% sure if this is
f2fs/NMI/something-else bug, but it only triggers after significant
disk access.

I've hit this bug several times since kernel 5.0.7 (have not tried
earlier kernels) while compiling many packages in batch. But in those
occasions it would hang all cores and lose the latest changes to the
filesystem so I could not get any logs.
This time it triggered while I was in the process of moving two of my
installations to new partitions and locked only one core, so I was
able to get the log after rebooting.
What I did to trigger it was make a new partition and run these commands:
#mkfs.f2fs -a 1 -f -i -l NVME_Gentoo-alt -o7 /dev/nvme0n1p7
#mount -o 
"rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
/dev/nvme0n1p7 /mnt/gentoo-alt-new
#cd /mnt/gentoo-alt
#cp -a . /mnt/gentoo-alt-new
#umount /mnt/gentoo-alt
The bug triggers just after the last command, and the last command was run
within 20 seconds after the second-last command returned.
"/mnt/gentoo-alt" was already mounted with same options as
"/mnt/gentoo-alt-new", and contained a Gentoo-rootfs of 64GB data
(according df -h).

This was on kernel 5.0.10 with "[PATCH] f2fs: fix potential recursive
call when enabling data_flush" by Chao Yu

Syslog for bug follows:

Apr 29 07:02:52 40o2 kernel: watchdog: BUG: soft lockup - CPU#4 stuck
for 21s! [irq/61-nvme0q5:383]
Apr 29 07:02:52 40o2 kernel: Modules linked in: ipv6 crc_ccitt 8021q
garp stp llc nls_cp437 vfat fat sd_mod iTCO_wdt x86_pkg_temp_thermal
kvm_intel kvm irqbypass ghash_clmulni_intel serio_raw i2c_i801
firewire_ohci firewire_core igb crc_itu_t uas lpc_ich dca usb_storage
processor_thermal_device ahci intel_soc_dts_iosf int340x_thermal_zone
libahci pcc_cpufreq efivarfs
Apr 29 07:02:52 40o2 kernel: CPU: 4 PID: 383 Comm: irq/61-nvme0q5 Not
tainted 5.0.10-gentoo #1
Apr 29 07:02:52 40o2 kernel: Hardware name: To Be Filled By O.E.M. To
Be Filled By O.E.M./C226 WS, BIOS P3.40 06/25/2018
Apr 29 07:02:52 40o2 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x36/0x40
Apr 29 07:02:52 40o2 kernel: Code: f6 c7 02 75 1e 56 9d e8 38 a0 69 ff
bf 01 00 00 00 e8 ce 32 60 ff 65 8b 05 37 8a 70 5b 85 c0 74 0b 5b c3
e8 3c 9f 69 ff 53 9d  e0 e8 bb 58 4f ff 5b c3 90 e8 6b 50 0f 00 b8
00 fe ff ff f0 0f
Apr 29 07:02:52 40o2 kernel: RSP: 0018:99a4003d7d58 EFLAGS:
0246 ORIG_RAX: ff13
Apr 29 07:02:52 40o2 kernel: RAX: 0202 RBX:
0246 RCX: dead0200
Apr 29 07:02:52 40o2 kernel: RDX: 8ab7fd8852e0 RSI:
 RDI: a490c264
Apr 29 07:02:52 40o2 kernel: RBP: bef2cc242db8 R08:
0002 R09: 00024688
Apr 29 07:02:52 40o2 kernel: R10: ffb8 R11:
ffb8 R12: 8ab7fd885000
Apr 29 07:02:52 40o2 kernel: R13: 8ab7fd8852d8 R14:
8ab7fef42b40 R15: 6db6db6db6db6db7
Apr 29 07:02:52 40o2 kernel: FS:  ()
GS:8ab81ed0() knlGS:
Apr 29 07:02:52 40o2 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Apr 29 07:02:52 40o2 kernel: CR2: 7e6278ff9000 CR3:
000792e0c004 CR4: 003606e0
Apr 29 07:02:52 40o2 kernel: DR0:  DR1:
 DR2: 
Apr 29 07:02:52 40o2 kernel: DR3:  DR6:
fffe0ff0 DR7: 0400
Apr 29 07:02:52 40o2 kernel: Call Trace:
Apr 29 07:02:52 40o2 kernel:  ? f2fs_del_fsync_node_entry+0x9f/0xd0
Apr 29 07:02:52 40o2 kernel:  ? f2fs_write_end_io+0xb6/0x1e0
Apr 29 07:02:52 40o2 kernel:  ? blk_update_request+0xc0/0x270
Apr 29 07:02:52 40o2 kernel:  ? blk_mq_end_request+0x1a/0x130
Apr 29 07:02:52 40o2 kernel:  ? blk_mq_complete_request+0x92/0x110
Apr 29 07:02:52 40o2 kernel:  ? irq_finalize_oneshot.part.46+0xe0/0xe0
Apr 29 07:02:52 40o2 kernel:  ? nvme_irq+0xf9/0x260
Apr 29 07:02:52 40o2 kernel:  ? irq_finalize_oneshot.part.46+0xe0/0xe0
Apr 29 07:02:52 40o2 kernel:  ? irq_forced_thread_fn+0x30/0x80
Apr 29 07:02:52 40o2 kernel:  ? irq_thread+0xe7/0x160
Apr 29 07:02:52 40o2 kernel:  ? wake_threads_waitq+0x30/0x30
Apr 29 07:02:52 40o2 kernel:  ? irq_thread_check_affinity+0x80/0x80
Apr 29 07:02:52 40o2 kernel:  ? kthread+0x116/0x130
Apr 29 07:02:52 40o2 kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
Apr 29 07:02:52 40o2 kernel:  ? ret_from_fork+0x3a/0x50
Apr 29 07:02:55 40o2 kernel: rcu: INFO: rcu_preempt self-detected stall on CPU
Apr 29 07:02:55 40o2 kernel: rcu: \x095-: (13949 ticks this GP)
idle=a26/1/0x4002 softirq=191162/191162 fqs=4843
Apr 29 07:02:55 40o2 kernel: rcu: \x09 (t=21000 jiffies g=184461 q=674)
Apr 29 07:02:55 40o2 kernel: Sending NMI from CPU 5 to CPUs 4:
Apr 29 07:02:55 40o2 kernel: NMI backtrace for cpu 4
Apr 29 07:02:55 40o2 kernel: CPU: 4 PID: 383 Comm: irq/61-nvme0q5
Tainted: G L5.0.10-gentoo #1
Apr 29 07:02:55 40o2 

Re: [f2fs-dev] Possible issues with fsck of f2fs root

2019-04-23 Thread Hagbard Celine
2019-04-23 4:55 GMT+02:00, Chao Yu :
> On 2019/4/22 18:05, Hagbard Celine wrote:
>> 2019-04-22 11:26 GMT+02:00, Chao Yu :
>>> On 2019/4/22 17:05, Hagbard Celine wrote:
>>>> 2019-04-22 9:37 GMT+02:00, Chao Yu :
>>>>> On 2019/4/22 15:11, Hagbard Celine wrote:
>>>>>> With this patch the one problem with opening the device in RO mode is
>>>>>> fixed.
>>>>>
>>>>> Oops, with the default preen mode fsck should not open an ro-mounted
>>>>> image; that's the rule we keep in line with ext4...
>>>>>
>>>>> How about changing to use -f in your scenario (on an RO-mounted root
>>>>> image)?
>>>>
>>>> This was with -f. Without -f it still refuses to open the device.
>>>
>>> What I mean is we'd better keep in line with ext4 and just refuse to open
>>> an ro-mounted device without -f, since triggering fsck and repair on a
>>> mounted device is dangerous: it can easily cause inconsistency between
>>> the in-memory data and the on-disk data of the filesystem. Refusing fsck
>>> without -f is meant to make the user aware of that danger.
>>
>> I am sorry, I've apparently added the -f after my first report. After
>> re-testing it seems that fsck.f2fs is opening the RO partition even
>> without this patch if I use -f. So the part about fsck.f2fs not being
>> able to open RO mounted partition during boot was a user error.
>
> I've sent a patch for your second issue, could you please have a try with
> it?
>
> [PATCH] fsck.f2fs: fix to repair ro mounted device w/ -f
>
> But one concern is that, with this patch, unlike fsck.ext4, fsck.f2fs
> won't show the prompts below asking the user to decide whether to repair
> or not, which may increase the risk of damaging the device.
>
> Do you want to restore lost files into ./lost_found/?
> Do you want to fix this partition? [Y/N]
>
> Jaegeuk, Hagbard,
>
> Any suggestions on this? In the current scenario, how about implementing:
> 1. fsck.f2fs -f ro_mounted_device: check; show the interactive prompts if
> there is corruption;
> 2. fsck.f2fs -f -a ro_mounted_device: check and repair automatically;

I answered this all too quickly and did not think it through properly.
As it stands today, if I run "fsck.f2fs -f /dev/some_unmounted_disk"
it will always do a full fsck.
If, on the other hand, I run "fsck.f2fs -f -a /dev/some_unmounted_disk"
it sometimes only reads the checkpoint state and returns with: "Info:
No errors was reported".
I do not have an ext4 partition with errors to test, but I have a fat
partition that comes up with "Free cluster summary wrong" on every run
of fsck.fat, and there fsck asks for confirmation when run with "-f"
and autofixes without asking when run with "-f -a".

Considering this, I believe the proposed solution would be
counter-intuitive, unless fsck.ext4 already behaves the opposite way
from fsck.fat.
>
> Thanks,
>
>>
>>>
>>> Thanks,
>>>
>>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>>> But as far as I can understand it will still only check the fs, not
>>>>>> fix
>>>>>> it.
>>>>>>
>>>>>>
>>>>>> 2019-04-21 12:27 GMT+02:00, Jaegeuk Kim :
>>>>>>
>>>>>>>
>>>>>>> New version of the patch is:
>>>>>>>
>>>>>>> From 3221692b060649378f1f69b898ed85a814af3dbf Mon Sep 17 00:00:00
>>>>>>> 2001
>>>>>>> From: Jaegeuk Kim 
>>>>>>> Date: Tue, 16 Apr 2019 11:46:31 -0700
>>>>>>> Subject: [PATCH] fsck.f2fs: open ro disk if we want to check fs only
>>>>>>>
>>>>>>> This patch fixes the "open failure" issue on ro disk, reported by
>>>>>>> Hagbard.
>>>>>>>
>>>>>>> "
>>>>>>>  If I boot with kernel option "ro rootfstype=f2fs
>>>>>>>  I get the following halfway trough boot:
>>>>>>>
>>>>>>>   * Checking local filesystems  ...
>>>>>>>  Info: Use default preen mode
>>>>>>>  Info: Mounted device!
>>>>>>>  Info: Check FS only due to RO
>>>>>>>  Error: Failed to open the device!
>>>>>>>   * Filesystems couldn't be

Re: [f2fs-dev] Possible issues with fsck of f2fs root

2019-04-23 Thread Hagbard Celine
2019-04-23 13:59 GMT+02:00, Hagbard Celine :
> 2019-04-23 4:55 GMT+02:00, Chao Yu :
>> On 2019/4/22 18:05, Hagbard Celine wrote:
>>> 2019-04-22 11:26 GMT+02:00, Chao Yu :
>>>> On 2019/4/22 17:05, Hagbard Celine wrote:
>>>>> 2019-04-22 9:37 GMT+02:00, Chao Yu :
>>>>>> On 2019/4/22 15:11, Hagbard Celine wrote:
>>>>>>> With this patch the one problem with opening the device in RO mode
>>>>>>> is
>>>>>>> fixed.
>>>>>>
>>>>>> Oops, with default preen mode fsck should not open ro mounted image,
>>>>>> that's
>>>>>> the
>>>>>> rule we keep line with ext4...
>>>>>>
>>>>>> How about changing to use -f in your scenario ( on RO mounted root
>>>>>> image
>>>>>> )?
>>>>>
>>>>> This was with -f. Without -f it still refuses to open the device.
>>>>
>>>> What I mean is we'd better to keep line with ext4, just refusing to
>>>> open
>>>> ro
>>>> mounted device without -f, since triggering fsck and repair on a
>>>> mounted
>>>> device
>>>> is dangerous, it can easily make inconsistency in between in-memory
>>>> data
>>>> and
>>>> on-disk data of filesystem. Refusing fsck without -f is to make user
>>>> being
>>>> aware
>>>> of such danger.
>>>
>>> I am sorry, I've apparently added the -f after my first report. After
>>> re-testing it seems that fsck.f2fs is opening the RO partition even
>>> without this patch if I use -f. So the part about fsck.f2fs not being
>>> able to open RO mounted partition during boot was a user error.
>>
>> I've sent a patch for your second issue, could you please have a try with
>> it?
>>
>> [PATCH] fsck.f2fs: fix to repair ro mounted device w/ -f
>
> Tested by forcing a sudden_power_off by reset-switch, seems to work.
>
>> But one concern is that, with this patch, not like the fsck.ext4,
>> fsck.f2fs
>> won't show any interaction with below reminding word to remind user to
>> decide
>> repair or not, it may increase the risk of damaging the device.
>>
>> Do you want to restore lost files into ./lost_found/?
>> Do you want to fix this partition? [Y/N]
>>
>> Jaegeuk, Hagbard,
>>
>> Any suggestion on this, in current scenario, how about implement:
>> 1. fsck.f2fs -f ro_mounted_device: check; show interaction words if there
>> is
>> corruption;
>> 2. fsck.f2fs -f -a ro_moutned_device: check and repair automatically;
>
> I guess that would be ok. Just to mention: Gentoo defaults to "fsck -A
> -p" during boot, where I can add the "-f" option by config file. I am
> not up to date on what other distros uses for default
> options in their fsck command during boot.

Please ignore the part about defaults; I misread the script. If I set
-f in the config file, it replaces the default -p (I checked with
"set -o xtrace").

>
>> Thanks,
>>
>>>
>>>>
>>>> Thanks,
>>>>
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>> But as far as I can understand it will still only check the fs, not
>>>>>>> fix
>>>>>>> it.
>>>>>>>
>>>>>>>
>>>>>>> 2019-04-21 12:27 GMT+02:00, Jaegeuk Kim :
>>>>>>>
>>>>>>>>
>>>>>>>> New version of the patch is:
>>>>>>>>
>>>>>>>> From 3221692b060649378f1f69b898ed85a814af3dbf Mon Sep 17 00:00:00
>>>>>>>> 2001
>>>>>>>> From: Jaegeuk Kim 
>>>>>>>> Date: Tue, 16 Apr 2019 11:46:31 -0700
>>>>>>>> Subject: [PATCH] fsck.f2fs: open ro disk if we want to check fs
>>>>>>>> only
>>>>>>>>
>>>>>>>> This patch fixes the "open failure" issue on ro disk, reported by
>>>>>>>> Hagbard.
>>>>>>>>
>>>>>>>> "
>>>>>>>>  If I boot with kernel option "ro rootfstype=f2fs
>>>>>>>>  I get the following halfway trough boot:
>>>>>>>>
>>>>>>>>   * Checking local filesystems  ...

Re: [f2fs-dev] Possible issues with fsck of f2fs root

2019-04-23 Thread Hagbard Celine
2019-04-23 4:55 GMT+02:00, Chao Yu :
> On 2019/4/22 18:05, Hagbard Celine wrote:
>> 2019-04-22 11:26 GMT+02:00, Chao Yu :
>>> On 2019/4/22 17:05, Hagbard Celine wrote:
>>>> 2019-04-22 9:37 GMT+02:00, Chao Yu :
>>>>> On 2019/4/22 15:11, Hagbard Celine wrote:
>>>>>> With this patch the one problem with opening the device in RO mode is
>>>>>> fixed.
>>>>>
>>>>> Oops, with default preen mode fsck should not open ro mounted image,
>>>>> that's
>>>>> the
>>>>> rule we keep line with ext4...
>>>>>
>>>>> How about changing to use -f in your scenario ( on RO mounted root
>>>>> image
>>>>> )?
>>>>
>>>> This was with -f. Without -f it still refuses to open the device.
>>>
>>> What I mean is we'd better to keep line with ext4, just refusing to open
>>> ro
>>> mounted device without -f, since triggering fsck and repair on a mounted
>>> device
>>> is dangerous, it can easily make inconsistency in between in-memory data
>>> and
>>> on-disk data of filesystem. Refusing fsck without -f is to make user
>>> being
>>> aware
>>> of such danger.
>>
>> I am sorry, I've apparently added the -f after my first report. After
>> re-testing it seems that fsck.f2fs is opening the RO partition even
>> without this patch if I use -f. So the part about fsck.f2fs not being
>> able to open RO mounted partition during boot was a user error.
>
> I've sent a patch for your second issue, could you please have a try with
> it?
>
> [PATCH] fsck.f2fs: fix to repair ro mounted device w/ -f

Tested by forcing a sudden_power_off by reset-switch, seems to work.

> But one concern: with this patch, unlike fsck.ext4, fsck.f2fs won't show
> any interactive prompt like the ones below asking the user whether to
> repair, which may increase the risk of damaging the device.
>
> Do you want to restore lost files into ./lost_found/?
> Do you want to fix this partition? [Y/N]
>
> Jaegeuk, Hagbard,
>
> Any suggestions on this? In the current scenario, how about implementing:
> 1. fsck.f2fs -f ro_mounted_device: check; show the interactive prompts if
> there is corruption;
> 2. fsck.f2fs -f -a ro_mounted_device: check and repair automatically;

I guess that would be ok. Just to mention: Gentoo defaults to "fsck -A
-p" during boot, where I can add the "-f" option by config file. I am
not up to date on what other distros use for default options in their
fsck command during boot.

> Thanks,
>
>>
>>>
>>> Thanks,
>>>
>>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>>> But as far as I can understand it will still only check the fs, not
>>>>>> fix
>>>>>> it.
>>>>>>
>>>>>>
>>>>>> 2019-04-21 12:27 GMT+02:00, Jaegeuk Kim :
>>>>>>
>>>>>>>
>>>>>>> New version of the patch is:
>>>>>>>
>>>>>>> From 3221692b060649378f1f69b898ed85a814af3dbf Mon Sep 17 00:00:00
>>>>>>> 2001
>>>>>>> From: Jaegeuk Kim 
>>>>>>> Date: Tue, 16 Apr 2019 11:46:31 -0700
>>>>>>> Subject: [PATCH] fsck.f2fs: open ro disk if we want to check fs only
>>>>>>>
>>>>>>> This patch fixes the "open failure" issue on ro disk, reported by
>>>>>>> Hagbard.
>>>>>>>
>>>>>>> "
>>>>>>>  If I boot with kernel option "ro rootfstype=f2fs
>>>>>>>  I get the following halfway trough boot:
>>>>>>>
>>>>>>>   * Checking local filesystems  ...
>>>>>>>  Info: Use default preen mode
>>>>>>>  Info: Mounted device!
>>>>>>>  Info: Check FS only due to RO
>>>>>>>  Error: Failed to open the device!
>>>>>>>   * Filesystems couldn't be fixed
>>>>>>> "
>>>>>>>
>>>>>>> Reported-by: Hagbard Celine 
>>>>>>> Signed-off-by: Jaegeuk Kim 
>>>>>>> ---
>>>>>>>  lib/libf2fs.c | 25 +
>>>>>>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/lib/libf2fs.c b/lib/libf2fs.c
>>

Re: [f2fs-dev] Possible issues with fsck of f2fs root

2019-04-22 Thread Hagbard Celine
2019-04-22 11:26 GMT+02:00, Chao Yu :
> On 2019/4/22 17:05, Hagbard Celine wrote:
>> 2019-04-22 9:37 GMT+02:00, Chao Yu :
>>> On 2019/4/22 15:11, Hagbard Celine wrote:
>>>> With this patch the one problem with opening the device in RO mode is
>>>> fixed.
>>>
>>> Oops, with default preen mode fsck should not open ro mounted image,
>>> that's
>>> the
>>> rule we keep line with ext4...
>>>
>>> How about changing to use -f in your scenario ( on RO mounted root image
>>> )?
>>
>> This was with -f. Without -f it still refuses to open the device.
>
> What I mean is that we had better stay in line with ext4 and refuse to
> open a ro mounted device without -f, since triggering fsck and repair on a
> mounted device is dangerous: it can easily create inconsistency between
> the in-memory data and the on-disk data of the filesystem. Refusing fsck
> without -f makes the user aware of that danger.

I am sorry, I've apparently added the -f after my first report. After
re-testing it seems that fsck.f2fs is opening the RO partition even
without this patch if I use -f. So the part about fsck.f2fs not being
able to open RO mounted partition during boot was a user error.

>
> Thanks,
>
>>
>>
>>> Thanks,
>>>
>>>> But as far as I can understand it will still only check the fs, not fix
>>>> it.
>>>>
>>>>
>>>> 2019-04-21 12:27 GMT+02:00, Jaegeuk Kim :
>>>>
>>>>>
>>>>> New version of the patch is:
>>>>>
>>>>> From 3221692b060649378f1f69b898ed85a814af3dbf Mon Sep 17 00:00:00 2001
>>>>> From: Jaegeuk Kim 
>>>>> Date: Tue, 16 Apr 2019 11:46:31 -0700
>>>>> Subject: [PATCH] fsck.f2fs: open ro disk if we want to check fs only
>>>>>
>>>>> This patch fixes the "open failure" issue on ro disk, reported by
>>>>> Hagbard.
>>>>>
>>>>> "
>>>>>  If I boot with kernel option "ro rootfstype=f2fs
>>>>>  I get the following halfway trough boot:
>>>>>
>>>>>   * Checking local filesystems  ...
>>>>>  Info: Use default preen mode
>>>>>  Info: Mounted device!
>>>>>  Info: Check FS only due to RO
>>>>>  Error: Failed to open the device!
>>>>>   * Filesystems couldn't be fixed
>>>>> "
>>>>>
>>>>> Reported-by: Hagbard Celine 
>>>>> Signed-off-by: Jaegeuk Kim 
>>>>> ---
>>>>>  lib/libf2fs.c | 25 +
>>>>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/lib/libf2fs.c b/lib/libf2fs.c
>>>>> index d30047f..853e713 100644
>>>>> --- a/lib/libf2fs.c
>>>>> +++ b/lib/libf2fs.c
>>>>> @@ -789,6 +789,15 @@ void get_kernel_uname_version(__u8 *version)
>>>>>  #endif /* APPLE_DARWIN */
>>>>>
>>>>>  #ifndef ANDROID_WINDOWS_HOST
>>>>> +static int open_check_fs(char *path, int flag)
>>>>> +{
>>>>> + if (c.func != FSCK || c.fix_on || c.auto_fix)
>>>>> + return -1;
>>>>> +
>>>>> + /* allow to open ro */
>>>>> + return open(path, O_RDONLY | flag);
>>>>> +}
>>>>> +
>>>>>  int get_device_info(int i)
>>>>>  {
>>>>>   int32_t fd = 0;
>>>>> @@ -810,8 +819,11 @@ int get_device_info(int i)
>>>>>   if (c.sparse_mode) {
>>>>>   fd = open(dev->path, O_RDWR | O_CREAT | O_BINARY, 0644);
>>>>>   if (fd < 0) {
>>>>> - MSG(0, "\tError: Failed to open a sparse file!\n");
>>>>> - return -1;
>>>>> + fd = open_check_fs(dev->path, O_BINARY);
>>>>> + if (fd < 0) {
>>>>> + MSG(0, "\tError: Failed to open a sparse file!\n");
>>>>> + return -1;
>>>>> + }
>>>>>   }
>>>>>   }
>>>>>
>>>>> @@ -825,10 +837,15 @@ int get_device_info(int i)
>>>>>   return -1;
>>>>>   }
>>>>>
>>>>> - if (S_ISBLK(stat_buf->st_mode) && !c.force)
>>>>> + if (S_ISBLK(stat_buf->st_mode) && !c.force) {
>>>>>   fd = open(dev->path, O_RDWR | O_EXCL);
>>>>> - else
>>>>> + if (fd < 0)
>>>>> + fd = open_check_fs(dev->path, O_EXCL);
>>>>> + } else {
>>>>>   fd = open(dev->path, O_RDWR);
>>>>> + if (fd < 0)
>>>>> + fd = open_check_fs(dev->path, 0);
>>>>> + }
>>>>>   }
>>>>>   if (fd < 0) {
>>>>>   MSG(0, "\tError: Failed to open the device!\n");
>>>>> --
>>>>> 2.19.0.605.g01d371f741-goog
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>> .
>>
>


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] Possible issues with fsck of f2fs root

2019-04-22 Thread Hagbard Celine
2019-04-22 9:37 GMT+02:00, Chao Yu :
> On 2019/4/22 15:11, Hagbard Celine wrote:
>> With this patch the one problem with opening the device in RO mode is
>> fixed.
>
> Oops, with the default preen mode fsck should not open a ro mounted image;
> that's the rule we keep in line with ext4...
>
> How about using -f in your scenario (on a RO mounted root image)?

This was with -f. Without -f it still refuses to open the device.


> Thanks,
>
>> But as far as I can understand it will still only check the fs, not fix
>> it.
>>
>>
>> 2019-04-21 12:27 GMT+02:00, Jaegeuk Kim :
>>
>>>
>>> New version of the patch is:
>>>
>>> From 3221692b060649378f1f69b898ed85a814af3dbf Mon Sep 17 00:00:00 2001
>>> From: Jaegeuk Kim 
>>> Date: Tue, 16 Apr 2019 11:46:31 -0700
>>> Subject: [PATCH] fsck.f2fs: open ro disk if we want to check fs only
>>>
>>> This patch fixes the "open failure" issue on ro disk, reported by
>>> Hagbard.
>>>
>>> "
>>>  If I boot with kernel option "ro rootfstype=f2fs
>>>  I get the following halfway trough boot:
>>>
>>>   * Checking local filesystems  ...
>>>  Info: Use default preen mode
>>>  Info: Mounted device!
>>>  Info: Check FS only due to RO
>>>  Error: Failed to open the device!
>>>   * Filesystems couldn't be fixed
>>> "
>>>
>>> Reported-by: Hagbard Celine 
>>> Signed-off-by: Jaegeuk Kim 
>>> ---
>>>  lib/libf2fs.c | 25 +
>>>  1 file changed, 21 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/lib/libf2fs.c b/lib/libf2fs.c
>>> index d30047f..853e713 100644
>>> --- a/lib/libf2fs.c
>>> +++ b/lib/libf2fs.c
>>> @@ -789,6 +789,15 @@ void get_kernel_uname_version(__u8 *version)
>>>  #endif /* APPLE_DARWIN */
>>>
>>>  #ifndef ANDROID_WINDOWS_HOST
>>> +static int open_check_fs(char *path, int flag)
>>> +{
>>> +   if (c.func != FSCK || c.fix_on || c.auto_fix)
>>> +   return -1;
>>> +
>>> +   /* allow to open ro */
>>> +   return open(path, O_RDONLY | flag);
>>> +}
>>> +
>>>  int get_device_info(int i)
>>>  {
>>> int32_t fd = 0;
>>> @@ -810,8 +819,11 @@ int get_device_info(int i)
>>> if (c.sparse_mode) {
>>> fd = open(dev->path, O_RDWR | O_CREAT | O_BINARY, 0644);
>>> if (fd < 0) {
>>> -   MSG(0, "\tError: Failed to open a sparse file!\n");
>>> -   return -1;
>>> +   fd = open_check_fs(dev->path, O_BINARY);
>>> +   if (fd < 0) {
>>> +   MSG(0, "\tError: Failed to open a sparse file!\n");
>>> +   return -1;
>>> +   }
>>> }
>>> }
>>>
>>> @@ -825,10 +837,15 @@ int get_device_info(int i)
>>> return -1;
>>> }
>>>
>>> -   if (S_ISBLK(stat_buf->st_mode) && !c.force)
>>> +   if (S_ISBLK(stat_buf->st_mode) && !c.force) {
>>> fd = open(dev->path, O_RDWR | O_EXCL);
>>> -   else
>>> +   if (fd < 0)
>>> +   fd = open_check_fs(dev->path, O_EXCL);
>>> +   } else {
>>> fd = open(dev->path, O_RDWR);
>>> +   if (fd < 0)
>>> +   fd = open_check_fs(dev->path, 0);
>>> +   }
>>> }
>>> if (fd < 0) {
>>> MSG(0, "\tError: Failed to open the device!\n");
>>> --
>>> 2.19.0.605.g01d371f741-goog
>>>
>>>
>>
>>
>>
>




Re: [f2fs-dev] Possible issues with fsck of f2fs root

2019-04-22 Thread Hagbard Celine
With this patch the one problem with opening the device in RO mode is fixed.
But as far as I can understand it will still only check the fs, not fix it.


2019-04-21 12:27 GMT+02:00, Jaegeuk Kim :

>
> New version of the patch is:
>
> From 3221692b060649378f1f69b898ed85a814af3dbf Mon Sep 17 00:00:00 2001
> From: Jaegeuk Kim 
> Date: Tue, 16 Apr 2019 11:46:31 -0700
> Subject: [PATCH] fsck.f2fs: open ro disk if we want to check fs only
>
> This patch fixes the "open failure" issue on ro disk, reported by Hagbard.
>
> "
>  If I boot with kernel option "ro rootfstype=f2fs
>  I get the following halfway trough boot:
>
>   * Checking local filesystems  ...
>  Info: Use default preen mode
>  Info: Mounted device!
>  Info: Check FS only due to RO
>  Error: Failed to open the device!
>   * Filesystems couldn't be fixed
> "
>
> Reported-by: Hagbard Celine 
> Signed-off-by: Jaegeuk Kim 
> ---
>  lib/libf2fs.c | 25 +
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/lib/libf2fs.c b/lib/libf2fs.c
> index d30047f..853e713 100644
> --- a/lib/libf2fs.c
> +++ b/lib/libf2fs.c
> @@ -789,6 +789,15 @@ void get_kernel_uname_version(__u8 *version)
>  #endif /* APPLE_DARWIN */
>
>  #ifndef ANDROID_WINDOWS_HOST
> +static int open_check_fs(char *path, int flag)
> +{
> + if (c.func != FSCK || c.fix_on || c.auto_fix)
> + return -1;
> +
> + /* allow to open ro */
> + return open(path, O_RDONLY | flag);
> +}
> +
>  int get_device_info(int i)
>  {
>   int32_t fd = 0;
> @@ -810,8 +819,11 @@ int get_device_info(int i)
>   if (c.sparse_mode) {
>   fd = open(dev->path, O_RDWR | O_CREAT | O_BINARY, 0644);
>   if (fd < 0) {
> - MSG(0, "\tError: Failed to open a sparse file!\n");
> - return -1;
> + fd = open_check_fs(dev->path, O_BINARY);
> + if (fd < 0) {
> + MSG(0, "\tError: Failed to open a sparse file!\n");
> + return -1;
> + }
>   }
>   }
>
> @@ -825,10 +837,15 @@ int get_device_info(int i)
>   return -1;
>   }
>
> - if (S_ISBLK(stat_buf->st_mode) && !c.force)
> + if (S_ISBLK(stat_buf->st_mode) && !c.force) {
>   fd = open(dev->path, O_RDWR | O_EXCL);
> - else
> + if (fd < 0)
> + fd = open_check_fs(dev->path, O_EXCL);
> + } else {
>   fd = open(dev->path, O_RDWR);
> + if (fd < 0)
> + fd = open_check_fs(dev->path, 0);
> + }
>   }
>   if (fd < 0) {
>   MSG(0, "\tError: Failed to open the device!\n");
> --
> 2.19.0.605.g01d371f741-goog
>
>
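
For readers following the patch quoted above: open_check_fs() returns -1
whenever c.fix_on or c.auto_fix is set, so the read-only fallback can
only ever be used for a check, never for a repair. That matches the
observation at the top of this mail. The sketch below is a standalone
illustration of the same idea, not the f2fs-tools code; the function
name open_with_ro_fallback and its check_only/read_only parameters are
made up for this example.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from f2fs-tools: try to open `path`
 * read-write; if that fails and the caller only wants to check
 * (no repair requested), retry read-only.
 * On success returns the fd and sets *read_only; returns -1 otherwise.
 */
static int open_with_ro_fallback(const char *path, int check_only,
				 int *read_only)
{
	int fd = open(path, O_RDWR);

	*read_only = 0;
	if (fd >= 0)
		return fd;

	/* A repair needs write access, so do not fall back in that case. */
	if (!check_only)
		return -1;

	fd = open(path, O_RDONLY);
	if (fd >= 0)
		*read_only = 1;
	return fd;
}
```

A caller that gets *read_only == 1 back would have to skip every write
to the device, which is the "check only" behaviour described above.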




[f2fs-dev] Possible issues with fsck of f2fs root

2019-04-02 Thread Hagbard Celine
Hi, I lost the root filesystem on my previous install last winter, after
a few weeks with several power outages. While trying to recover, I
discovered that fsck apparently never ran properly during boot in the
lifetime of that install.
After getting the system installed again a while ago, I have been
trying to work out why.
So far I've found the following two possible issues:

ISSUE 1:
If I boot with kernel option "ro rootfstype=f2fs
rootflags=background_gc=on,heap,disable_ext_identify,discard,user_xattr,inline_xattr,inline_dentry,acl,inline_data,flush_merge,data_flush,extent_cache,whint_mode=fs-based,fsync_mode=strict"
I get the following halfway through boot:

 * Checking local filesystems  ...
Info: Use default preen mode
Info: Mounted device!
Info: Check FS only due to RO
Error: Failed to open the device!
 * Filesystems couldn't be fixed


 [ !! ] * rc: Aborting!

If I, from this state, try to mount another partition:
# mount -o 
"ro,relatime,lazytime,background_gc=on,discard,heap,user_xattr,inline_xattr,acl,disable_ext_identify,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
/dev/nvme0n1p7 /mnt/f2fstest/

I get the same error if I try to run fsck on it:
# fsck.f2fs /dev/nvme0n1p7
Info: Mounted device!
Info: Check FS only due to RO
Error: Failed to open the device!

If I, on the other hand, boot with kernel option "rw rootfstype=f2fs
rootflags=background_gc=on,heap,disable_ext_identify,discard,user_xattr,inline_xattr,inline_dentry,acl,inline_data,flush_merge,data_flush,extent_cache,whint_mode=fs-based,fsync_mode=strict
panic=30 scsi_mod.use_blk_mq=1"

The boot does not hang, and if I try the same test as before, mount the test partition:
# mount -o 
"ro,relatime,lazytime,background_gc=on,discard,heap,user_xattr,inline_xattr,acl,disable_ext_identify,inline_data,inline_dentry,flush_merge,extent_cache,data_flush,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict"
/dev/nvme0n1p7 /mnt/f2fstest/

Run fsck:
# fsck.f2fs  -f /dev/nvme0n1p7
Info: Force to fix corruption
Info: Mounted device!
Info: Check FS only due to RO
Info: Segments per section = 1
Info: Sections per zone = 1
Info: sector size = 512
Info: total sectors = 134101647 (65479 MB)
Info: MKFS version
  "Linux version 5.0.5-gentoof2fsfix (root@40o2) (gcc version 8.2.0
(Gentoo 8.2.0-r6 p1.7)) #2 SMP PREEMPT Mon Apr 1 17:04:41 +01 2019"
Info: FSCK version
  from "Linux version 5.0.5-gentoo (root@40o2) (gcc version 8.2.0
(Gentoo 8.2.0-r6 p1.7)) #2 SMP PREEMPT Tue Apr 2 07:42:40 +01 2019"
to "Linux version 5.0.5-gentoo (root@40o2) (gcc version 8.2.0
(Gentoo 8.2.0-r6 p1.7)) #2 SMP PREEMPT Tue Apr 2 07:42:40 +01 2019"
Info: superblock features = 0 :
Info: superblock encrypt level = 0, salt = 
Info: total FS sectors = 134101640 (65479 MB)
Info: CKPT version = 70e1454a
Info: Checked valid nat_bits in checkpoint
Info: checkpoint state = 4c1 :  large_nat_bitmap nat_bits crc unmount

[FSCK] Unreachable nat entries[Ok..] [0x0]
[FSCK] SIT valid block bitmap checking[Ok..]
[FSCK] Hard link checking for regular file[Ok..] [0x70]
[FSCK] valid_block_count matching with CP [Ok..] [0x1fe244]
[FSCK] valid_node_count matcing with CP (de lookup)   [Ok..] [0x6c487]
[FSCK] valid_node_count matcing with CP (nat lookup)  [Ok..] [0x6c487]
[FSCK] valid_inode_count matched with CP  [Ok..] [0x6c362]
[FSCK] free segment_count matched with CP [Ok..] [0x6c44]
[FSCK] next block offset is free  [Ok..]
[FSCK] fixing SIT types
[FSCK] other corrupted bugs   [Ok..]

Done.

So a system booted with "rw" root can fsck an "ro" filesystem but a
system booted with root "ro" can not.
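
The "Mounted device!" and "Check FS only due to RO" messages above
depend on whether the device is mounted and with which options. As a
rough sketch of how a tool can find that out on Linux, the helper below
reads a mount table with the glibc getmntent() family. The name
mount_state and its return convention are made up for this example, and
this is not claimed to be the exact check fsck.f2fs performs.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look `dev` up in a mount table file such as
 * /proc/mounts.
 * Returns 1 if mounted read-only, 0 if mounted read-write,
 * -1 if the device is not listed or the table cannot be read.
 */
static int mount_state(const char *table, const char *dev)
{
	FILE *fp = setmntent(table, "r");
	struct mntent *ent;
	int state = -1;

	if (!fp)
		return -1;

	while ((ent = getmntent(fp)) != NULL) {
		if (strcmp(ent->mnt_fsname, dev) != 0)
			continue;
		/* hasmntopt() matches "ro" only as a whole option. */
		state = hasmntopt(ent, MNTOPT_RO) ? 1 : 0;
		break;
	}
	endmntent(fp);
	return state;
}
```

For the test partition in this report the call would be
mount_state("/proc/mounts", "/dev/nvme0n1p7"). A root filesystem may be
listed under a different name (for example /dev/root), which this
sketch does not handle.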


ISSUE 2:
Referring to the output from the fsck running against a "ro"
filesystem, especially this line:
Info: Check FS only due to RO

As far as I can tell, this says that, unlike with other filesystems,
running fsck against a "ro" mounted f2fs partition will never fix any
errors.
So I tried running fsck against the same partition mounted "rw":
# mount -o remount,rw /mnt/f2fstest/
# fsck.f2fs  -f /dev/nvme0n1p7
Info: Force to fix corruption
Info: Mounted device!
Error: Not available on mounted device!

I might be misunderstanding something, but all this tells me that
unless one makes a custom initramfs that runs fsck before root is
mounted (something no distribution has, as far as I know), fsck will
never fix an f2fs formatted root partition during boot.
If this is by design and not a bug/unintended behavior, it should be
documented somewhere, lest more people experience system crashes
like mine.

All tests above done with kernel 5.0.5 and f2fs-tools 1.12.0 with
"fsck.f2fs: allow to fsck readonly image w/ -f option"-patch by Chao
Yu.



Re: [f2fs-dev] BUG: kernel stack overflow when mounting with data_flush

2019-04-02 Thread Hagbard Celine
That seems to have fixed it. No more errors in syslog after extracting
my stage3 tarball. Also ran a couple of kernel compiles on a partition
mounted with data_flush and system seems stable.

2019-04-01 10:05 GMT+02:00, Chao Yu :
> On 2019/3/31 2:54, Hagbard Celine wrote:
>> First, yes it is caused by data_flush, this is what I am trying to
>> report. Without that option there is no "stack guard page was hit" and
>> no "kernel stack overflow" and kernel is stable.
>> This time I was using kernel 5.0.3, as can be seen in the log in my first
>> mail.
>> I do not remember exactly what kernel version I tried the first time a
>> saw this bug, but I believe the mount option data_flush was just added
>> when I tried it the first time. The option has always lead to crash
>> here.
>
> Sorry, I was not paying attention at that time and overlooked the
> data_flush keyword...
>
> Could you please try below patch?
>
> From 65edbf14a198d0b50765e10340255e2071f7ae75 Mon Sep 17 00:00:00 2001
> From: Chao Yu 
> Date: Mon, 1 Apr 2019 15:59:16 +0800
> Subject: [PATCH] f2fs: fix potential recursive call when enabling
> data_flush
>
> Signed-off-by: Chao Yu 
> ---
>  fs/f2fs/checkpoint.c | 6 ++
>  fs/f2fs/data.c   | 3 ++-
>  2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> index a98e1b02279e..935ebdb9cf47 100644
> --- a/fs/f2fs/checkpoint.c
> +++ b/fs/f2fs/checkpoint.c
> @@ -1009,13 +1009,11 @@ int f2fs_sync_dirty_inodes(struct f2fs_sb_info *sbi, enum inode_type type)
>   if (inode) {
>   unsigned long cur_ino = inode->i_ino;
>
> - if (is_dir)
> - F2FS_I(inode)->cp_task = current;
> + F2FS_I(inode)->cp_task = current;
>
>   filemap_fdatawrite(inode->i_mapping);
>
> - if (is_dir)
> - F2FS_I(inode)->cp_task = NULL;
> + F2FS_I(inode)->cp_task = NULL;
>
>   iput(inode);
>   /* We need to give cpu to another writers. */
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index d87dfa5aa112..9d3c11e09a03 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -2038,7 +2038,8 @@ static int __write_data_page(struct page *page, bool *submitted,
>   }
>
>   unlock_page(page);
> - if (!S_ISDIR(inode->i_mode) && !IS_NOQUOTA(inode))
> + if (!S_ISDIR(inode->i_mode) && !IS_NOQUOTA(inode) &&
> + !F2FS_I(inode)->cp_task)
>   f2fs_balance_fs(sbi, need_balance_fs);
>
>   if (unlikely(f2fs_cp_error(sbi))) {
> --
> 2.18.0.rc1
>
>
>
>>
>> 2019-03-30 8:29 GMT+01:00, Chao Yu :
>>> Oh, sorry, it's quite possible that bug is caused by data_flush, could
>>> remove that mount option first?
>>>
>>> Thanks,
>>>
>>> On 2019/3/30 11:25, Chao Yu wrote:
>>>> Hi Hagbard,
>>>>
>>>> Sorry for the delay.
>>>>
>>>> On 2019/3/27 21:59, Hagbard Celine wrote:
>>>>> Hi, this is a long standing bug that I've hit before on older kernels,
>>>>> but I was not able to get the syslog saved because of the nature of
>>>>> the bug. This time I had booted form a pen-drive, and was able to save
>>>>> the log to it's efi-partition.
>>>>
>>>> Now which version of kernel do you use? and do you remember what is
>>>> your
>>>> kernel version when this bug occured at first time?
>>>>
>> .
>>
>
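
The patch quoted above is titled "fix potential recursive call when
enabling data_flush": it sets cp_task for every inode being synced and
skips f2fs_balance_fs() in the write path while cp_task is set. The
thread does not spell out the exact kernel call chain, so the toy model
below simply ASSUMES that writing a page can re-enter the sync path
through the balance step. All toy_* names are made up, and this is
userspace code, not kernel code. It only shows how a per-inode marker
bounds the nesting depth.

```c
#include <stddef.h>

/* Toy stand-in for per-inode state; only the field the patch touches. */
struct toy_inode {
	const void *cp_task;	/* non-NULL while a sync pass owns the inode */
	int dirty_pages;
};

static int depth;	/* current nesting of toy_sync_dirty_inode() */
static int max_depth;	/* deepest nesting seen so far */

static void toy_sync_dirty_inode(struct toy_inode *inode, int guard);

/* Assumed behaviour: balancing may start another sync pass. */
static void toy_balance_fs(struct toy_inode *inode, int guard)
{
	if (inode->dirty_pages > 0)
		toy_sync_dirty_inode(inode, guard);
}

/* Write one page, then balance unless the guard says a sync is running. */
static void toy_write_page(struct toy_inode *inode, int guard)
{
	inode->dirty_pages--;
	if (!guard || inode->cp_task == NULL)
		toy_balance_fs(inode, guard);
}

static void toy_sync_dirty_inode(struct toy_inode *inode, int guard)
{
	depth++;
	if (depth > max_depth)
		max_depth = depth;

	inode->cp_task = &depth;	/* any non-NULL marker will do */
	while (inode->dirty_pages > 0)
		toy_write_page(inode, guard);
	inode->cp_task = NULL;

	depth--;
}

/* Returns the deepest nesting reached while syncing `pages` dirty pages. */
static int toy_max_depth(int pages, int guard)
{
	struct toy_inode inode = { NULL, pages };

	depth = 0;
	max_depth = 0;
	toy_sync_dirty_inode(&inode, guard);
	return max_depth;
}
```

In this model the nesting grows with the number of dirty pages when the
guard is off, and stays at one level when it is on. Unbounded nesting
of that kind is what a stack overflow looks like from the inside.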




Re: [f2fs-dev] BUG: kernel stack overflow when mounting with data_flush

2019-03-30 Thread Hagbard Celine
First, yes, it is caused by data_flush; this is what I am trying to
report. Without that option there is no "stack guard page was hit" and
no "kernel stack overflow", and the kernel is stable.
This time I was using kernel 5.0.3, as can be seen in the log in my first mail.
I do not remember exactly what kernel version I tried the first time I
saw this bug, but I believe the mount option data_flush had just been
added when I tried it the first time. The option has always led to a
crash here.

2019-03-30 8:29 GMT+01:00, Chao Yu :
> Oh, sorry, it's quite possible that bug is caused by data_flush, could
> remove that mount option first?
>
> Thanks,
>
> On 2019/3/30 11:25, Chao Yu wrote:
>> Hi Hagbard,
>>
>> Sorry for the delay.
>>
>> On 2019/3/27 21:59, Hagbard Celine wrote:
>>> Hi, this is a long standing bug that I've hit before on older kernels,
>>> but I was not able to get the syslog saved because of the nature of
>>> the bug. This time I had booted form a pen-drive, and was able to save
>>> the log to it's efi-partition.
>>
>> Now which version of kernel do you use? and do you remember what is your
>> kernel version when this bug occured at first time?
>>




[f2fs-dev] BUG: kernel stack overflow when mounting with data_flush

2019-03-27 Thread Hagbard Celine
Hi, this is a long-standing bug that I've hit before on older kernels,
but I was not able to get the syslog saved because of the nature of
the bug. This time I had booted from a pen drive, and was able to save
the log to its EFI partition.
What I did to trigger it was to create a partition and format it as f2fs,
then mount it with options:
"rw,relatime,lazytime,background_gc=on,disable_ext_identify,discard,heap,user_xattr,inline_xattr,acl,inline_data,inline_dentry,flush_merge,data_flush,extent_cache,mode=adaptive,active_logs=6,whint_mode=fs-based,alloc_mode=default,fsync_mode=strict".
Then I unpacked a big .tar.xz to the partition (I used a Gentoo stage3
tarball as I was in the process of installing Gentoo).

The same options, just without data_flush, give no problems.

Syslog from crash follows:
Mar 20 20:20:34 usbgentoo syslog-ng[3644]: syslog-ng starting up;
version='3.17.2'
Mar 20 20:20:34 usbgentoo /usr/sbin/gpm[3674]: *** info
[daemon/startup.c(136)]:
Mar 20 20:20:34 usbgentoo /usr/sbin/gpm[3674]: Started gpm
successfully. Entered daemon mode.
Mar 20 20:20:34 usbgentoo kernel: ip (3771) used greatest stack depth:
12312 bytes left
Mar 20 20:20:34 usbgentoo dhcpcd[3840]: eth0: waiting for carrier
Mar 20 20:20:37 usbgentoo kernel: igb :03:00.0 eth0: igb: eth0 NIC
Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Mar 20 20:20:37 usbgentoo kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0:
link becomes ready
Mar 20 20:20:37 usbgentoo dhcpcd[3840]: eth0: carrier acquired
Mar 20 20:20:38 usbgentoo dhcpcd[3840]: DUID
00:01:00:01:24:24:e8:94:d0:50:99:3b:c9:21
Mar 20 20:20:38 usbgentoo dhcpcd[3840]: eth0: IAID 99:3b:c9:21
Mar 20 20:20:38 usbgentoo dhcpcd[3840]: eth0: adding address
fe80::3c99:2bbf:63bb:8354
Mar 20 20:20:38 usbgentoo dhcpcd[3840]: eth0: rebinding lease of 192.168.1.22
Mar 20 20:20:38 usbgentoo dhcpcd[3840]: eth0: probing address 192.168.1.22/24
Mar 20 20:20:39 usbgentoo dhcpcd[3840]: eth0: soliciting an IPv6 router
Mar 20 20:20:43 usbgentoo dhcpcd[3840]: eth0: leased 192.168.1.22 for
86400 seconds
Mar 20 20:20:43 usbgentoo dhcpcd[3840]: eth0: adding route to 192.168.1.0/24
Mar 20 20:20:43 usbgentoo dhcpcd[3840]: eth0: adding default route via
192.168.1.1
Mar 20 20:20:43 usbgentoo dhcpcd[3840]: forked to background, child pid 3883
Mar 20 20:20:48 usbgentoo login[4003]: pam_unix(login:auth):
authentication failure; logname=LOGIN uid=0 euid=0 tty=/dev/tty1
ruser= rhost=  user=root
Mar 20 20:20:51 usbgentoo dhcpcd[3883]: eth0: no IPv6 Routers available
Mar 20 20:21:06 usbgentoo login[4003]: pam_unix(login:session):
session opened for user root by LOGIN(uid=0)
Mar 20 20:21:06 usbgentoo login[4012]: ROOT LOGIN  on '/dev/tty1'
Mar 20 20:22:30 usbgentoo kernel: EXT4-fs (sdb4): recovery complete
Mar 20 20:22:30 usbgentoo kernel: EXT4-fs (sdb4): mounted filesystem
with ordered data mode. Opts: (null)
Mar 20 20:22:39 usbgentoo kernel: EXT4-fs (sdb5): recovery complete
Mar 20 20:22:39 usbgentoo kernel: EXT4-fs (sdb5): mounted filesystem
with ordered data mode. Opts: (null)
Mar 20 20:22:46 usbgentoo login[4003]: pam_unix(login:session):
session closed for user root
Mar 20 20:22:51 usbgentoo login[4025]: pam_unix(login:session):
session opened for user root by LOGIN(uid=0)
Mar 20 20:22:51 usbgentoo login[4027]: ROOT LOGIN  on '/dev/tty1'
Mar 20 20:23:34 usbgentoo login[4004]: pam_unix(login:session):
session opened for user root by LOGIN(uid=0)
Mar 20 20:23:34 usbgentoo login[4034]: ROOT LOGIN  on '/dev/tty2'
Mar 20 20:30:21 usbgentoo kernel: F2FS-fs (nvme0n1p5): Found nat_bits
in checkpoint
Mar 20 20:30:22 usbgentoo kernel: F2FS-fs (nvme0n1p5): Mounted with
checkpoint version = b9e8e7
Mar 20 20:30:35 usbgentoo login[4025]: pam_unix(login:session):
session closed for user root
Mar 20 20:30:42 usbgentoo login[4061]: pam_unix(login:session):
session opened for user root by LOGIN(uid=0)
Mar 20 20:30:42 usbgentoo login[4063]: ROOT LOGIN  on '/dev/tty1'
Mar 20 20:40:31 usbgentoo kernel: Adding 23984124k swap on
/dev/nvme0n1p6.  Priority:-2 extents:1 across:23984124k SSDsc
Mar 20 20:54:01 usbgentoo kernel: FAT-fs (nvme0n1p4): Volume was not
properly unmounted. Some data may be corrupt. Please run fsck.
Mar 20 21:05:23 usbgentoo kernel: kworker/dying (1588) used greatest
stack depth: 12064 bytes left
Mar 20 21:06:40 usbgentoo kernel: BUG: stack guard page was hit at
a4b0733c (stack is 56016422..96e7463f)
Mar 20 21:06:40 usbgentoo kernel: kernel stack overflow
(double-fault):  [#1] SMP PTI
Mar 20 21:06:40 usbgentoo kernel: CPU: 7 PID: 1606 Comm:
kworker/u16:15 Not tainted 5.0.3-gentoo #6
Mar 20 21:06:40 usbgentoo kernel: Hardware name: To Be Filled By
O.E.M. To Be Filled By O.E.M./C226 WS, BIOS P3.40 06/25/2018
Mar 20 21:06:40 usbgentoo kernel: Workqueue: writeback wb_workfn (flush-259:0)
Mar 20 21:06:40 usbgentoo kernel: RIP: 0010:f2fs_inode_chksum_verify+0x14/0xc0
Mar 20 21:06:40 usbgentoo kernel: Code: 00 00 04 0f 45 f0 e9 3b 85 e8
ff 90 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 48 a8 40 0f 85 9d 00