Re: [f2fs-dev] [PATCH 2/2] fsck.f2fs: check extra_attr feature

2018-07-08 Thread Sheng Yong




On 2018/7/9 10:15, Chao Yu wrote:

Hi Sheng,

On 2018/7/3 18:10, Sheng Yong wrote:

Check extra_attr feature for inode. If it is corrupted, remove the
inode.


Could you check the patch:

[PATCH] fsck.f2fs: fix to do sanity check with extra_attr feature

Would it be better to do the sanity check in fsck_chk_inode_blk, since it is
about the consistency of the inode rather than the nid?



Agree. And the patch has already checked the value of i_extra_isize, so this one
can be dropped :-)

Thanks


Thanks,



Link: https://bugzilla.kernel.org/show_bug.cgi?id=200219
Reported-by: Wen Xu 
Signed-off-by: Sheng Yong 
---
  fsck/fsck.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/fsck/fsck.c b/fsck/fsck.c
index 15264b2..acbe25d 100644
--- a/fsck/fsck.c
+++ b/fsck/fsck.c
@@ -460,6 +460,12 @@ static int sanity_check_nid(struct f2fs_sb_info *sbi, u32 nid,
 			__check_inode_mode(nid, ftype, le32_to_cpu(node_blk->i.i_mode)))
 		return -EINVAL;
 
+	if (ntype == TYPE_INODE &&
+	    ((f2fs_has_extra_isize(&node_blk->i) &&
+	      !(c.feature & F2FS_FEATURE_EXTRA_ATTR)) ||
+	     get_extra_isize(node_blk) >= DEF_ADDRS_PER_INODE))
+		return -EINVAL;
+
 	/* workaround to fix later */
 	if (ftype != F2FS_FT_ORPHAN ||
 			f2fs_test_bit(nid, fsck->nat_area_bitmap) != 0) {








--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] Bug report: some new bugs found by fuzzing

2018-07-08 Thread Chao Yu
I updated a commit; could you try again with the latest f2fs-dev?

On 2018/7/8 10:43, Xu, Wen wrote:
> Is it already fixed by the latest commit in your git tree?
> 
> Thanks,
> Wen
> 
>> On Jul 7, 2018, at 12:32 PM, Chao Yu  wrote:
>>
>> On 2018/7/7 23:48, Xu, Wen wrote:
>>> Sure, I will do it. So you are still willing to fix the issues even if they
>>> may be affected by the CHECK_FS config?
>>
>> Yes, let me figure out the problem.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/chao/linux.git/commit/?h=f2fs-dev&id=ccf5e1525e92010dd2cb8ff5a820283f9bff5c78
>>
>> The above commit seems to break the normal case; let me update it.
>>
>> Thanks,
>>
>>>
>>> Thanks
>>> -Wen
>>>
>>>> On Jul 6, 2018, at 12:22 PM, Chao Yu wrote:
>>>>
>>>> Hi Wen,
>>>>
>>>> I've updated two patches today for these issues, could you please test them?
>>>>
>>>> On 2018/7/6 9:30, Xu, Wen wrote:
>>>>> Thanks very much! I would like to provide any further help or testing.
>>>>>
>>>>> -Wen
>>>>>
>>>>>> On Jul 5, 2018, at 9:13 PM, Chao Yu wrote:
>>>>>>
>>>>>> Hi Wen,
>>>>>>
>>>>>> On 2018/7/6 3:19, Xu, Wen wrote:
>>>>>>> Dear F2FS developers,
>>>>>>>
>>>>>>> By fuzzing, I found some new issues in the Linux f2fs kernel module. Here
>>>>>>> are the links on Bugzilla,
>>>>>>>
>>>>>>> 200419  NULL pointer dereference in __remove_dirty_segment() when
>>>>>>> mounting an f2fs image
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200419
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/chao/linux.git/commit/?h=f2fs-dev&id=ccf5e1525e92010dd2cb8ff5a820283f9bff5c78
>>>>
>>>>>>>
>>>>>>> 200421  Buffer overrun in f2fs_truncate_inline_inode() when umounting
>>>>>>> an f2fs image
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200421
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/chao/linux.git/commit/?h=f2fs-dev&id=ea08202ee4ca67b31b3510591f2a8032ec3ac4cb
>>>>
>>>>>>>
>>>>>>> 200423  Out-of-bound access in f2fs_get_dnode_of_data() when operating
>>>>>>> file on an f2fs image
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200423
>>>>>>>
>>>>>>> 200425  Invalid memory access in f2fs_find_target_dentry() when
>>>>>>> operating files on an f2fs image
>>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200425
>>>>
>>>> Fixed this issue with the above commit.
>>>>
>>>> Thanks,
>>>>
>>>>>>>
>>>>>>> Regarding my testing, they can all be reproduced w/ Chao’s f2fs-dev
>>>>>>> branch. Thanks!
>>>>>>
>>>>>> Alright, I will dig into these issues over the next few days; once I
>>>>>> have a solution, I will let you know.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>>
>>>>>>> -Wen
>>>>>


Re: [f2fs-dev] [PATCH 2/2] fsck.f2fs: check extra_attr feature

2018-07-08 Thread Chao Yu
Hi Sheng,

On 2018/7/3 18:10, Sheng Yong wrote:
> Check extra_attr feature for inode. If it is corrupted, remove the
> inode.

Could you check the patch:

[PATCH] fsck.f2fs: fix to do sanity check with extra_attr feature

Would it be better to do the sanity check in fsck_chk_inode_blk, since it is
about the consistency of the inode rather than the nid?

Thanks,

> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200219
> Reported-by: Wen Xu 
> Signed-off-by: Sheng Yong 
> ---
>  fsck/fsck.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fsck/fsck.c b/fsck/fsck.c
> index 15264b2..acbe25d 100644
> --- a/fsck/fsck.c
> +++ b/fsck/fsck.c
> @@ -460,6 +460,12 @@ static int sanity_check_nid(struct f2fs_sb_info *sbi, u32 nid,
>   __check_inode_mode(nid, ftype, le32_to_cpu(node_blk->i.i_mode)))
>   return -EINVAL;
>  
> + if (ntype == TYPE_INODE &&
> + ((f2fs_has_extra_isize(&node_blk->i) &&
> +   !(c.feature & F2FS_FEATURE_EXTRA_ATTR)) ||
> +  get_extra_isize(node_blk) >= DEF_ADDRS_PER_INODE))
> + return -EINVAL;
> +
>   /* workaround to fix later */
>   if (ftype != F2FS_FT_ORPHAN ||
>   f2fs_test_bit(nid, fsck->nat_area_bitmap) != 0) {
> 




Re: [f2fs-dev] [PATCH 1/2] fsck.f2fs: check extent of inline data/dentry inode

2018-07-08 Thread Chao Yu
On 2018/7/3 18:10, Sheng Yong wrote:
> Check extent for inline data/dentry inode. If an inode contains inline
> data/dentry, it should have no extent.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200175
> Reported-by: Wen Xu 
> Signed-off-by: Sheng Yong 

Reviewed-by: Chao Yu 

Thanks,




[f2fs-dev] [PATCH 5/5] f2fs: fix to propagate error from __get_meta_page()

2018-07-08 Thread Chao Yu
From: Chao Yu 

If the caller of __get_meta_page() can handle an error, let's propagate
the error from __get_meta_page().
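
The propagation relies on the kernel convention of encoding a negative errno in the returned pointer (ERR_PTR / IS_ERR / PTR_ERR). The following is a userspace sketch of that convention, assuming re-implemented helpers with a sketch_ prefix; sketch_get_meta_page() and sketch_read_first_byte() are hypothetical and only mimic the shape of the callers changed in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Userspace re-implementations of the kernel's pointer-encoded errors. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int sketch_is_err(const void *p)
{
	/* the top SKETCH_MAX_ERRNO addresses are reserved for error codes */
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static char sketch_page[16];

/* Hypothetical reader: fails with -EIO when 'fail' is non-zero. */
static void *sketch_get_meta_page(int fail)
{
	if (fail)
		return sketch_err_ptr(-EIO);
	return sketch_page;
}

/* Caller that propagates the error instead of touching a bad page. */
static int sketch_read_first_byte(int fail, char *out)
{
	void *page;

	assert(out != NULL);
	page = sketch_get_meta_page(fail);
	if (sketch_is_err(page))
		return (int)sketch_ptr_err(page);

	*out = ((char *)page)[0];
	return 0;
}
```

The design point is that the caller decides what to do with the failure (return it, jump to a cleanup label, and so on) rather than the low-level reader hitting a bug_on.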

Signed-off-by: Chao Yu 
---
 fs/f2fs/checkpoint.c | 21 +
 fs/f2fs/f2fs.h   |  2 +-
 fs/f2fs/node.c   | 27 +++
 fs/f2fs/recovery.c   |  8 
 fs/f2fs/segment.c| 33 +++--
 5 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 41e6e5769a2c..cae2b5785bbb 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -71,6 +71,7 @@ static struct page *__get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index,
.encrypted_page = NULL,
.is_meta = is_meta,
};
+   int err;
 
if (unlikely(!is_meta))
fio.op_flags &= ~REQ_META;
@@ -85,11 +86,11 @@ static struct page *__get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index,
 
fio.page = page;
 
-   if (f2fs_submit_page_bio(&fio)) {
+   err = f2fs_submit_page_bio(&fio);
+   if (err) {
memset(page_address(page), 0, PAGE_SIZE);
f2fs_stop_checkpoint(sbi, false);
-   f2fs_bug_on(sbi, 1);
-   return page;
+   return ERR_PTR(err);
}
 
lock_page(page);
@@ -106,6 +107,7 @@ static struct page *__get_meta_page(struct f2fs_sb_info *sbi, pgoff_t index,
if (unlikely(!PageUptodate(page))) {
memset(page_address(page), 0, PAGE_SIZE);
f2fs_stop_checkpoint(sbi, false);
+   return ERR_PTR(-EIO);
}
 out:
return page;
@@ -658,9 +660,15 @@ int f2fs_recover_orphan_inodes(struct f2fs_sb_info *sbi)
f2fs_ra_meta_pages(sbi, start_blk, orphan_blocks, META_CP, true);
 
for (i = 0; i < orphan_blocks; i++) {
-   struct page *page = f2fs_get_meta_page(sbi, start_blk + i);
+   struct page *page;
struct f2fs_orphan_block *orphan_blk;
 
+   page = f2fs_get_meta_page(sbi, start_blk + i);
+   if (IS_ERR(page)) {
+   err = PTR_ERR(page);
+   goto out;
+   }
+
orphan_blk = (struct f2fs_orphan_block *)page_address(page);
for (j = 0; j < le32_to_cpu(orphan_blk->entry_count); j++) {
nid_t ino = le32_to_cpu(orphan_blk->ino[j]);
@@ -751,6 +759,9 @@ static int get_checkpoint_version(struct f2fs_sb_info *sbi, block_t cp_addr,
__u32 crc = 0;
 
*cp_page = f2fs_get_meta_page(sbi, cp_addr);
+   if (IS_ERR(*cp_page))
+   return PTR_ERR(*cp_page);
+
*cp_block = (struct f2fs_checkpoint *)page_address(*cp_page);
 
crc_offset = le32_to_cpu((*cp_block)->checksum_offset);
@@ -876,6 +887,8 @@ int f2fs_get_valid_checkpoint(struct f2fs_sb_info *sbi)
unsigned char *ckpt = (unsigned char *)sbi->ckpt;
 
cur_page = f2fs_get_meta_page(sbi, cp_blk_no + i);
+   if (IS_ERR(cur_page))
+   goto free_fail_no_cp;
sit_bitmap_ptr = page_address(cur_page);
memcpy(ckpt + i * blk_size, sit_bitmap_ptr, blk_size);
f2fs_put_page(cur_page, 1);
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 359526d88d3f..76b51a0de3cd 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -2864,7 +2864,7 @@ int f2fs_try_to_free_nids(struct f2fs_sb_info *sbi, int nr_shrink);
 void f2fs_recover_inline_xattr(struct inode *inode, struct page *page);
 int f2fs_recover_xattr_data(struct inode *inode, struct page *page);
 int f2fs_recover_inode_page(struct f2fs_sb_info *sbi, struct page *page);
-void f2fs_restore_node_summary(struct f2fs_sb_info *sbi,
+int f2fs_restore_node_summary(struct f2fs_sb_info *sbi,
unsigned int segno, struct f2fs_summary_block *sum);
 void f2fs_flush_nat_entries(struct f2fs_sb_info *sbi, struct cp_control *cpc);
 int f2fs_build_node_manager(struct f2fs_sb_info *sbi);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index c48e2a2e5e82..248d401bf40a 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -113,25 +113,26 @@ static void clear_node_page_dirty(struct page *page)
 
 static struct page *get_current_nat_page(struct f2fs_sb_info *sbi, nid_t nid)
 {
-   pgoff_t index = current_nat_addr(sbi, nid);
-   return f2fs_get_meta_page(sbi, index);
+   struct page *page;
+
+   page = f2fs_get_meta_page(sbi, current_nat_addr(sbi, nid));
+   BUG_ON(IS_ERR(page));
+   return page;
 }
 
 static struct page *get_next_nat_page(struct f2fs_sb_info *sbi, nid_t nid)
 {
struct page *src_page;
struct page *dst_page;
-   pgoff_t src_off;
pgoff_t dst_off;
void *src_addr;
void *dst_addr;
struct f2fs_nm_info *nm_i = NM_I(sbi);
 
-   src_off = current_nat_addr(sbi, nid);
-   dst_off = next_nat_addr(sbi, src_off);
+   dst_off = n

[f2fs-dev] [PATCH 1/5] f2fs: detect bug_on in f2fs_wait_discard_bios

2018-07-08 Thread Chao Yu
From: Chao Yu 

Add a bug_on to detect a potentially non-empty discard wait list.

Signed-off-by: Chao Yu 
---
 fs/f2fs/segment.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index cbf8f3f9a8e7..e55188975fcc 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1495,6 +1495,8 @@ bool f2fs_wait_discard_bios(struct f2fs_sb_info *sbi)
 
/* just to make sure there is no pending discard commands */
__wait_all_discard_cmd(sbi, NULL);
+
+   f2fs_bug_on(sbi, atomic_read(&dcc->discard_cmd_cnt));
return dropped;
 }
 
-- 
2.16.2.17.g38e79b1fd




[f2fs-dev] [PATCH 2/5] f2fs: clean up with IS_INODE()

2018-07-08 Thread Chao Yu
From: Chao Yu 

Signed-off-by: Chao Yu 
---
 fs/f2fs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 8211f5c288a1..b0ab39d83235 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -122,7 +122,7 @@ static bool f2fs_enable_inode_chksum(struct f2fs_sb_info *sbi, struct page *page
if (!f2fs_sb_has_inode_chksum(sbi->sb))
return false;
 
-   if (!RAW_IS_INODE(F2FS_NODE(page)) || !(ri->i_inline & F2FS_EXTRA_ATTR))
+   if (!IS_INODE(page) || !(ri->i_inline & F2FS_EXTRA_ATTR))
return false;
 
if (!F2FS_FITS_IN_INODE(ri, le16_to_cpu(ri->i_extra_isize),
-- 
2.16.2.17.g38e79b1fd




[f2fs-dev] [PATCH 3/5] f2fs: fix to do sanity check with i_extra_isize

2018-07-08 Thread Chao Yu
From: Chao Yu 

If inode.i_extra_isize was fuzzed to an abnormal value, the inline data
size calculation will overflow, resulting in accesses to an invalid
memory area when operating on inline data.

Let's fix this by doing a sanity check on i_extra_isize during inode
loading.
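
To make the overflow concrete, here is a small standalone sketch of an unsigned size calculation that wraps when i_extra_isize is abnormal, together with a bounds check that rejects such values. The constants and function names are hypothetical; in particular SKETCH_MAX_EXTRA_ISIZE is an assumed upper bound, not the limit used by the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_ADDRS_PER_INODE  923u  /* 32-bit slots in the address area */
#define SKETCH_MAX_EXTRA_ISIZE  36u   /* assumed upper bound, in bytes */

/*
 * Unchecked: slots left for inline data after the extra area.
 * The unsigned subtraction wraps around if extra_isize is too large.
 */
static uint32_t sketch_inline_slots_unchecked(uint16_t extra_isize)
{
	return SKETCH_ADDRS_PER_INODE - extra_isize / 4u;
}

/* Sanity check: bounded and aligned to a 32-bit slot. */
static bool sketch_extra_isize_ok(uint16_t extra_isize)
{
	return extra_isize <= SKETCH_MAX_EXTRA_ISIZE &&
	       extra_isize % sizeof(uint32_t) == 0;
}

/* Checked variant: callers must validate the inode before using it. */
static uint32_t sketch_inline_slots(uint16_t extra_isize)
{
	assert(sketch_extra_isize_ok(extra_isize));
	return sketch_inline_slots_unchecked(extra_isize);
}
```

With a fuzzed value such as 0xffff, the unchecked helper returns a slot count far larger than the inode block can hold, which is the kind of result that later leads to out-of-bounds accesses.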

https://bugzilla.kernel.org/show_bug.cgi?id=200421

- Reproduce

- POC (poc.c)
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 

static void activity(char *mpoint) {

  char *foo_bar_baz;
  char *foo_baz;
  char *xattr;
  int err;

  err = asprintf(&foo_bar_baz, "%s/foo/bar/baz", mpoint);
  err = asprintf(&foo_baz, "%s/foo/baz", mpoint);
  err = asprintf(&xattr, "%s/foo/bar/xattr", mpoint);

  rename(foo_bar_baz, foo_baz);

  char buf2[113];
  memset(buf2, 0, sizeof(buf2));
  listxattr(xattr, buf2, sizeof(buf2));
  removexattr(xattr, "user.mime_type");

}

int main(int argc, char *argv[]) {
  activity(argv[1]);
  return 0;
}

- Kernel message
Unmounting the image leaves the following message:
[ 2910.995489] F2FS-fs (loop0): Mounted with checkpoint version = 2
[ 2918.416465] ==================================================================
[ 2918.416807] BUG: KASAN: slab-out-of-bounds in f2fs_iget+0xcb9/0x1a80
[ 2918.417009] Read of size 4 at addr 88018efc2068 by task a.out/1229

[ 2918.417311] CPU: 1 PID: 1229 Comm: a.out Not tainted 4.17.0+ #1
[ 2918.417314] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 2918.417323] Call Trace:
[ 2918.417366]  dump_stack+0x71/0xab
[ 2918.417401]  print_address_description+0x6b/0x290
[ 2918.417407]  kasan_report+0x28e/0x390
[ 2918.417411]  ? f2fs_iget+0xcb9/0x1a80
[ 2918.417415]  f2fs_iget+0xcb9/0x1a80
[ 2918.417422]  ? f2fs_lookup+0x2e7/0x580
[ 2918.417425]  f2fs_lookup+0x2e7/0x580
[ 2918.417433]  ? __recover_dot_dentries+0x400/0x400
[ 2918.417447]  ? legitimize_path.isra.29+0x5a/0xa0
[ 2918.417453]  __lookup_slow+0x11c/0x220
[ 2918.417457]  ? may_delete+0x2a0/0x2a0
[ 2918.417475]  ? deref_stack_reg+0xe0/0xe0
[ 2918.417479]  ? __lookup_hash+0xb0/0xb0
[ 2918.417483]  lookup_slow+0x3e/0x60
[ 2918.417488]  walk_component+0x3ac/0x990
[ 2918.417492]  ? generic_permission+0x51/0x1e0
[ 2918.417495]  ? inode_permission+0x51/0x1d0
[ 2918.417499]  ? pick_link+0x3e0/0x3e0
[ 2918.417502]  ? link_path_walk+0x4b1/0x770
[ 2918.417513]  ? _raw_spin_lock_irqsave+0x25/0x50
[ 2918.417518]  ? walk_component+0x990/0x990
[ 2918.417522]  ? path_init+0x2e6/0x580
[ 2918.417526]  path_lookupat+0x13f/0x430
[ 2918.417531]  ? trailing_symlink+0x3a0/0x3a0
[ 2918.417534]  ? do_renameat2+0x270/0x7b0
[ 2918.417538]  ? __kasan_slab_free+0x14c/0x190
[ 2918.417541]  ? do_renameat2+0x270/0x7b0
[ 2918.417553]  ? kmem_cache_free+0x85/0x1e0
[ 2918.417558]  ? do_renameat2+0x270/0x7b0
[ 2918.417563]  filename_lookup+0x13c/0x280
[ 2918.417567]  ? filename_parentat+0x2b0/0x2b0
[ 2918.417572]  ? kasan_unpoison_shadow+0x31/0x40
[ 2918.417575]  ? kasan_kmalloc+0xa6/0xd0
[ 2918.417593]  ? strncpy_from_user+0xaa/0x1c0
[ 2918.417598]  ? getname_flags+0x101/0x2b0
[ 2918.417614]  ? path_listxattr+0x87/0x110
[ 2918.417619]  path_listxattr+0x87/0x110
[ 2918.417623]  ? listxattr+0xc0/0xc0
[ 2918.417637]  ? mm_fault_error+0x1b0/0x1b0
[ 2918.417654]  do_syscall_64+0x73/0x160
[ 2918.417660]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2918.417676] RIP: 0033:0x7f2f3a3480d7
[ 2918.417677] Code: f0 ff ff 73 01 c3 48 8b 0d be dd 2b 00 f7 d8 64 89 01 48 
83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 c2 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 91 dd 2b 00 f7 d8 64 89 01 48
[ 2918.417732] RSP: 002b:7fff4095b7d8 EFLAGS: 0206 ORIG_RAX: 
00c2
[ 2918.417744] RAX: ffda RBX:  RCX: 7f2f3a3480d7
[ 2918.417746] RDX: 0071 RSI: 7fff4095b810 RDI: 0126a0c0
[ 2918.417749] RBP: 7fff4095b890 R08: 0126a010 R09: 
[ 2918.417751] R10: 01ab R11: 0206 R12: 004005e0
[ 2918.417753] R13: 7fff4095b990 R14:  R15: 

[ 2918.417853] Allocated by task 329:
[ 2918.418002]  kasan_kmalloc+0xa6/0xd0
[ 2918.418007]  kmem_cache_alloc+0xc8/0x1e0
[ 2918.418023]  mempool_init_node+0x194/0x230
[ 2918.418027]  mempool_init+0x12/0x20
[ 2918.418042]  bioset_init+0x2bd/0x380
[ 2918.418052]  blk_alloc_queue_node+0xe9/0x540
[ 2918.418075]  dm_create+0x2c0/0x800
[ 2918.418080]  dev_create+0xd2/0x530
[ 2918.418083]  ctl_ioctl+0x2a3/0x5b0
[ 2918.418087]  dm_ctl_ioctl+0xa/0x10
[ 2918.418092]  do_vfs_ioctl+0x13e/0x8c0
[ 2918.418095]  ksys_ioctl+0x66/0x70
[ 2918.418098]  __x64_sys_ioctl+0x3d/0x50
[ 2918.418102]  do_syscall_64+0x73/0x160
[ 2918.418106]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 2918.418204] Freed by task 

[f2fs-dev] [PATCH 4/5] f2fs: fix to do sanity check with cp_pack_start_sum

2018-07-08 Thread Chao Yu
From: Chao Yu 

After fuzzing, cp_pack_start_sum could be corrupted, so the current log's
summary info may be wrong because an incorrect summary block is loaded.
Then, if a segment's type in the current log exceeds NR_CURSEG_TYPE, it
can eventually lead to accessing an invalid dirty_i->dirty_segmap bitmap.

Add a sanity check for cp_pack_start_sum to fix this issue.

https://bugzilla.kernel.org/show_bug.cgi?id=200419

- Reproduce

- Kernel message (f2fs-dev w/ KASAN)
[ 3117.578432] F2FS-fs (loop0): Invalid log blocks per segment (8)

[ 3117.578445] F2FS-fs (loop0): Can't find valid F2FS filesystem in 2th 
superblock
[ 3117.581364] F2FS-fs (loop0): invalid crc_offset: 30716
[ 3117.583564] WARNING: CPU: 1 PID: 1225 at fs/f2fs/checkpoint.c:90 
__get_meta_page+0x448/0x4b0
[ 3117.583570] Modules linked in: snd_hda_codec_generic snd_hda_intel 
snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer joydev input_leds 
serio_raw snd soundcore mac_hid i2c_piix4 ib_iser rdma_cm iw_cm ib_cm ib_core 
configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs 
zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 
multipath linear 8139too qxl ttm drm_kms_helper syscopyarea sysfillrect 
sysimgblt fb_sys_fops drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
pcbc aesni_intel psmouse aes_x86_64 8139cp crypto_simd cryptd mii glue_helper 
pata_acpi floppy
[ 3117.584014] CPU: 1 PID: 1225 Comm: mount Not tainted 4.17.0+ #1
[ 3117.584017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 3117.584022] RIP: 0010:__get_meta_page+0x448/0x4b0
[ 3117.584023] Code: 00 49 8d bc 24 84 00 00 00 e8 74 54 da ff 41 83 8c 24 84 
00 00 00 08 4c 89 f6 4c 89 ef e8 c0 d9 95 00 48 89 ef e8 18 e3 00 00 <0f> 0b f0 
80 4d 48 04 e9 0f fe ff ff 0f 0b 48 89 c7 48 89 04 24 e8
[ 3117.584072] RSP: 0018:88018eb678c0 EFLAGS: 00010286
[ 3117.584082] RAX: 88018f0a6a78 RBX: ea0007a46600 RCX: 9314d1b2
[ 3117.584085] RDX: 0001 RSI:  RDI: 88018f0a6a98
[ 3117.584087] RBP: 88018ebe9980 R08: 0002 R09: 0001
[ 3117.584090] R10: 0001 R11: ed00326e4450 R12: 880193722200
[ 3117.584092] R13: 88018ebe9afc R14: 0206 R15: 88018eb67900
[ 3117.584096] FS:  7f5694636840() GS:8801f3b0() 
knlGS:
[ 3117.584098] CS:  0010 DS:  ES:  CR0: 80050033
[ 3117.584101] CR2: 016f21b8 CR3: 000191c22000 CR4: 06e0
[ 3117.584112] Call Trace:
[ 3117.584121]  ? f2fs_set_meta_page_dirty+0x150/0x150
[ 3117.584127]  ? f2fs_build_segment_manager+0xbf9/0x3190
[ 3117.584133]  ? f2fs_npages_for_summary_flush+0x75/0x120
[ 3117.584145]  f2fs_build_segment_manager+0xda8/0x3190
[ 3117.584151]  ? f2fs_get_valid_checkpoint+0x298/0xa00
[ 3117.584156]  ? f2fs_flush_sit_entries+0x10e0/0x10e0
[ 3117.584184]  ? map_id_range_down+0x17c/0x1b0
[ 3117.584188]  ? __put_user_ns+0x30/0x30
[ 3117.584206]  ? find_next_bit+0x53/0x90
[ 3117.584237]  ? cpumask_next+0x16/0x20
[ 3117.584249]  f2fs_fill_super+0x1948/0x2b40
[ 3117.584258]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584279]  ? sget_userns+0x65e/0x690
[ 3117.584296]  ? set_blocksize+0x88/0x130
[ 3117.584302]  ? f2fs_commit_super+0x1a0/0x1a0
[ 3117.584305]  mount_bdev+0x1c0/0x200
[ 3117.584310]  mount_fs+0x5c/0x190
[ 3117.584320]  vfs_kern_mount+0x64/0x190
[ 3117.584330]  do_mount+0x2e4/0x1450
[ 3117.584343]  ? lockref_put_return+0x130/0x130
[ 3117.584347]  ? copy_mount_string+0x20/0x20
[ 3117.584357]  ? kasan_unpoison_shadow+0x31/0x40
[ 3117.584362]  ? kasan_kmalloc+0xa6/0xd0
[ 3117.584373]  ? memcg_kmem_put_cache+0x16/0x90
[ 3117.584377]  ? __kmalloc_track_caller+0x196/0x210
[ 3117.584383]  ? _copy_from_user+0x61/0x90
[ 3117.584396]  ? memdup_user+0x3e/0x60
[ 3117.584401]  ksys_mount+0x7e/0xd0
[ 3117.584405]  __x64_sys_mount+0x62/0x70
[ 3117.584427]  do_syscall_64+0x73/0x160
[ 3117.584440]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3117.584455] RIP: 0033:0x7f5693f14b9a
[ 3117.584456] Code: 48 8b 0d 01 c3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 
0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d ce c2 2b 00 f7 d8 64 89 01 48
[ 3117.584505] RSP: 002b:7fff27346488 EFLAGS: 0206 ORIG_RAX: 
00a5
[ 3117.584510] RAX: ffda RBX: 016e2030 RCX: 7f5693f14b9a
[ 3117.584512] RDX: 016e2210 RSI: 016e3f30 RDI: 016ee040
[ 3117.584514] RBP:  R08:  R09: 0013
[ 3117.584516] R10: c0ed R11: 0206 R12: 016ee040
[ 3117.584519] R13: 016e2210 R14:  R15: 0003
[ 3117.584523] ---[ end trace a8e0d899985faf31 ]---
[ 3117.685663] F2FS-fs (loop0): f2fs_check_nid_range: out-of-range nid=2, run 
fsck to fix.
[ 3117.685673] F2FS-fs (loop0): recover_

[f2fs-dev] [PATCH v2] f2fs: split discard command in prior to block layer

2018-07-08 Thread Chao Yu
From: Chao Yu 

Some devices have a small max_{hw,}discard_sectors, so that in
__blkdev_issue_discard(), one big discard bio can be split
into multiple small discard bios, resulting in heavy load on the IO
scheduler and device, which can hang other sync IO for a long time.

Now, f2fs is trying to control discard commands more carefully,
in order to reduce conflicts between discard IO and user IO
and improve application performance. So in this patch, we split
the discard bio in f2fs, before it reaches the block layer, to avoid
issuing multiple discard bios in a short time.
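
The splitting idea itself is simple and can be sketched independently of the kernel data structures: a range longer than the device limit is cut into pieces no longer than that limit, and each piece can then be submitted and throttled on its own. The types and the function name below are hypothetical and the sketch ignores the rb-tree bookkeeping the real patch has to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_range {
	uint32_t start;  /* first block of the piece */
	uint32_t len;    /* length of the piece, in blocks */
};

/*
 * Split [start, start + len) into pieces of at most max_len blocks.
 * Writes at most out_cap pieces to 'out' and returns how many were written.
 */
static size_t sketch_split_discard(uint32_t start, uint32_t len,
				   uint32_t max_len,
				   struct sketch_range *out, size_t out_cap)
{
	size_t n = 0;

	assert(max_len > 0);
	assert(out != NULL || out_cap == 0);

	while (len > 0 && n < out_cap) {
		uint32_t chunk = len < max_len ? len : max_len;

		out[n].start = start;
		out[n].len = chunk;
		n++;
		start += chunk;
		len -= chunk;
	}
	return n;
}
```

For example, a 1000-block range starting at block 100 with a 512-block limit yields two pieces: 512 blocks at 100 and 488 blocks at 612.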

Signed-off-by: Chao Yu 
---
v2:
- change to split discard command entry before submission.
 fs/f2fs/f2fs.h|  13 +++---
 fs/f2fs/segment.c | 117 +-
 2 files changed, 86 insertions(+), 44 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index c48b655d5d8d..359526d88d3f 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -178,7 +178,6 @@ enum {
 
 #define MAX_DISCARD_BLOCKS(sbi)         BLKS_PER_SEC(sbi)
 #define DEF_MAX_DISCARD_REQUEST         8       /* issue 8 discards per round */
-#define DEF_MAX_DISCARD_LEN             512     /* Max. 2MB per discard */
 #define DEF_MIN_DISCARD_ISSUE_TIME      50      /* 50 ms, if exists */
 #define DEF_MID_DISCARD_ISSUE_TIME      500     /* 500 ms, if device busy */
 #define DEF_MAX_DISCARD_ISSUE_TIME      60000   /* 60 s, if no candidates */
@@ -709,22 +708,22 @@ static inline void set_extent_info(struct extent_info *ei, unsigned int fofs,
 }
 
 static inline bool __is_discard_mergeable(struct discard_info *back,
-   struct discard_info *front)
+   struct discard_info *front, unsigned int max_len)
 {
return (back->lstart + back->len == front->lstart) &&
-   (back->len + front->len < DEF_MAX_DISCARD_LEN);
+   (back->len + front->len <= max_len);
 }
 
 static inline bool __is_discard_back_mergeable(struct discard_info *cur,
-   struct discard_info *back)
+   struct discard_info *back, unsigned int max_len)
 {
-   return __is_discard_mergeable(back, cur);
+   return __is_discard_mergeable(back, cur, max_len);
 }
 
 static inline bool __is_discard_front_mergeable(struct discard_info *cur,
-   struct discard_info *front)
+   struct discard_info *front, unsigned int max_len)
 {
-   return __is_discard_mergeable(cur, front);
+   return __is_discard_mergeable(cur, front, max_len);
 }
 
 static inline bool __is_extent_mergeable(struct extent_info *back,
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index f7f56dd091b4..cbf8f3f9a8e7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -966,17 +966,24 @@ static void __init_discard_policy(struct f2fs_sb_info *sbi,
}
 }
 
-
+static void __update_discard_tree_range(struct f2fs_sb_info *sbi,
+   struct block_device *bdev, block_t lstart,
+   block_t start, block_t len);
 /* this function is copied from blkdev_issue_discard from block/blk-lib.c */
 static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
struct discard_policy *dpolicy,
-   struct discard_cmd *dc)
+   struct discard_cmd *dc,
+   unsigned int *issued)
 {
+   struct block_device *bdev = dc->bdev;
+   struct request_queue *q = bdev_get_queue(bdev);
+   unsigned int max_discard_blocks =
+   SECTOR_TO_BLOCK(q->limits.max_discard_sectors);
struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
struct list_head *wait_list = (dpolicy->type == DPOLICY_FSTRIM) ?
&(dcc->fstrim_list) : &(dcc->wait_list);
-   struct bio *bio = NULL;
int flag = dpolicy->sync ? REQ_SYNC : 0;
+   block_t lstart, start, len, total_len;
 
if (dc->state != D_PREP)
return;
@@ -984,30 +991,63 @@ static void __submit_discard_cmd(struct f2fs_sb_info *sbi,
if (is_sbi_flag_set(sbi, SBI_NEED_FSCK))
return;
 
-   trace_f2fs_issue_discard(dc->bdev, dc->start, dc->len);
-
-   dc->error = __blkdev_issue_discard(dc->bdev,
-   SECTOR_FROM_BLOCK(dc->start),
-   SECTOR_FROM_BLOCK(dc->len),
-   GFP_NOFS, 0, &bio);
-   if (!dc->error) {
-   /* should keep before submission to avoid D_DONE right away */
-   dc->state = D_SUBMIT;
-   atomic_inc(&dcc->issued_discard);
-   atomic_inc(&dcc->issing_discard);
-   if (bio) {
-   bio->bi_private = dc;
-  

[f2fs-dev] [PATCH v2 2/2] f2fs: issue small discard by LBA order

2018-07-08 Thread Chao Yu
From: Chao Yu 

For small-granularity discards whose size is smaller than 64KB, if we
issue them ordered by size, their IOs will be spread across the entire
logical address space, so that in the FTL, the L2P table will be updated
randomly, resulting in a bad wear rate for the table.

In this patch, we choose to issue small discards in LBA order. This
way, we can expect that L2P table updates from adjacent discard IOs can
be merged in the cache, which can reduce lifetime wear of the flash.
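
As a rough illustration of issuing in LBA order with a resumable cursor, the sketch below sorts pending commands by start address and continues from a saved position on each round, resetting the position once the end is reached. It is a simplification under assumptions: the real patch walks an rb-tree and tracks per-command state, whereas this sketch uses a plain array, qsort() and hypothetical names throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_cmd {
	uint32_t lstart;  /* logical start block */
	uint32_t len;     /* length in blocks */
};

static int sketch_cmp_lba(const void *a, const void *b)
{
	const struct sketch_cmd *x = a, *y = b;

	return (x->lstart > y->lstart) - (x->lstart < y->lstart);
}

/*
 * Issue up to max_requests commands in ascending LBA order, resuming at
 * *next_pos. The start address of each issued command is stored in 'out'.
 * *next_pos is advanced past each issued command and reset to 0 once the
 * whole array has been walked.
 */
static size_t sketch_issue_orderly(struct sketch_cmd *cmds, size_t n,
				   uint32_t *next_pos, size_t max_requests,
				   uint32_t *out)
{
	size_t i, issued = 0;

	assert(cmds != NULL && next_pos != NULL && out != NULL);
	qsort(cmds, n, sizeof(*cmds), sketch_cmp_lba);

	for (i = 0; i < n; i++) {
		if (cmds[i].lstart + cmds[i].len <= *next_pos)
			continue;  /* entirely behind the cursor */
		if (issued >= max_requests)
			break;     /* budget for this round is used up */
		out[issued++] = cmds[i].lstart;
		*next_pos = cmds[i].lstart + cmds[i].len;
	}
	if (i == n)
		*next_pos = 0;     /* wrapped: start over next round */
	return issued;
}
```

Because consecutive rounds pick up where the previous one stopped, neighbouring ranges tend to be discarded close together in time, which is the property the commit message relies on for merging L2P updates.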

Signed-off-by: Chao Yu 
---
v2:
- remove unneeded header file.
- don't retry submission if there is queued IO.
 fs/f2fs/f2fs.h|  2 ++
 fs/f2fs/segment.c | 64 +++
 2 files changed, 66 insertions(+)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index bf5f7a336ace..c48b655d5d8d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -297,6 +297,7 @@ struct discard_policy {
unsigned int io_aware_gran; /* minimum granularity discard not be aware of I/O */
bool io_aware;  /* issue discard in idle time */
bool sync;  /* submit discard with REQ_SYNC flag */
+   bool ordered;   /* issue discard by lba order */
unsigned int granularity;   /* discard granularity */
 };
 
@@ -313,6 +314,7 @@ struct discard_cmd_control {
unsigned int max_discards;  /* max. discards to be issued */
unsigned int discard_granularity;   /* discard granularity */
unsigned int undiscard_blks;/* # of undiscard blocks */
+   unsigned int next_pos;  /* next discard position */
atomic_t issued_discard;/* # of issued discard */
atomic_t issing_discard;/* # of issing discard */
atomic_t discard_cmd_cnt;   /* # of cached cmd count */
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index aee198869b1f..f7f56dd091b4 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -936,6 +936,7 @@ static void __init_discard_policy(struct f2fs_sb_info *sbi,
/* common policy */
dpolicy->type = discard_type;
dpolicy->sync = true;
+   dpolicy->ordered = false;
dpolicy->granularity = granularity;
 
dpolicy->max_requests = DEF_MAX_DISCARD_REQUEST;
@@ -947,6 +948,7 @@ static void __init_discard_policy(struct f2fs_sb_info *sbi,
dpolicy->max_interval = DEF_MAX_DISCARD_ISSUE_TIME;
dpolicy->io_aware = true;
dpolicy->sync = false;
+   dpolicy->ordered = true;
if (utilization(sbi) > DEF_DISCARD_URGENT_UTIL) {
dpolicy->granularity = 1;
dpolicy->max_interval = DEF_MIN_DISCARD_ISSUE_TIME;
@@ -1183,6 +1185,63 @@ static int __queue_discard_cmd(struct f2fs_sb_info *sbi,
return 0;
 }
 
+static unsigned int __issue_discard_cmd_orderly(struct f2fs_sb_info *sbi,
+   struct discard_policy *dpolicy)
+{
+   struct discard_cmd_control *dcc = SM_I(sbi)->dcc_info;
+   struct discard_cmd *prev_dc = NULL, *next_dc = NULL;
+   struct rb_node **insert_p = NULL, *insert_parent = NULL;
+   struct discard_cmd *dc;
+   struct blk_plug plug;
+   unsigned int pos = dcc->next_pos;
+   unsigned int issued = 0;
+   bool io_interrupted = false;
+
+   mutex_lock(&dcc->cmd_lock);
+   dc = (struct discard_cmd *)f2fs_lookup_rb_tree_ret(&dcc->root,
+   NULL, pos,
+   (struct rb_entry **)&prev_dc,
+   (struct rb_entry **)&next_dc,
+   &insert_p, &insert_parent, true);
+   if (!dc)
+   dc = next_dc;
+
+   blk_start_plug(&plug);
+
+   while (dc) {
+   struct rb_node *node;
+
+   if (dc->state != D_PREP)
+   goto next;
+
+   if (dpolicy->io_aware && !is_idle(sbi)) {
+   io_interrupted = true;
+   break;
+   }
+
+   dcc->next_pos = dc->lstart + dc->len;
+   __submit_discard_cmd(sbi, dpolicy, dc);
+
+   if (++issued >= dpolicy->max_requests)
+   break;
+next:
+   node = rb_next(&dc->rb_node);
+   dc = rb_entry_safe(node, struct discard_cmd, rb_node);
+   }
+
+   blk_finish_plug(&plug);
+
+   if (!dc)
+   dcc->next_pos = 0;
+
+   mutex_unlock(&dcc->cmd_lock);
+
+   if (!issued && io_interrupted)
+   issued = -1;
+
+   return issued;
+}
+
 static int __issue_discard_cmd(struct f2fs_sb_info *sbi,
struct discard_policy *dpolicy)
 {
@@ -1196,6 +1255,10 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi,
for (i = MAX_PLIST_NUM - 1; i >= 0; i--) {

[f2fs-dev] [PATCH] f2fs: stop issuing discard immediately if there is queued IO

2018-07-08 Thread Chao Yu
From: Chao Yu 

For the background discard policy, even if there is queued user IO, we
still check up to max_requests times for the next discard entry. This is
unnecessary; let's just stop this round of submission immediately.
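
The behavioural change can be sketched as a loop that stops at the first sign of queued user IO instead of continuing to count iterations. The function below is a hypothetical model, not the kernel code: the busy[] array stands in for the is_idle() check at the moment each pending entry is examined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * busy[i] is true when user IO is queued at the moment entry i is examined.
 * Returns the number of discards issued in this round and reports through
 * *io_interrupted whether the round was cut short by user IO.
 */
static size_t sketch_issue_round(const bool *busy, size_t n,
				 size_t max_requests, bool *io_interrupted)
{
	size_t i, issued = 0;

	assert(busy != NULL && io_interrupted != NULL);
	*io_interrupted = false;

	for (i = 0; i < n; i++) {
		if (busy[i]) {
			/* stop right away rather than skipping this entry */
			*io_interrupted = true;
			break;
		}
		if (++issued >= max_requests)
			break;
	}
	return issued;
}
```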

Signed-off-by: Chao Yu 
---
 fs/f2fs/segment.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 99beaf0a2dea..aee198869b1f 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1190,7 +1190,7 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi,
struct list_head *pend_list;
struct discard_cmd *dc, *tmp;
struct blk_plug plug;
-   int i, iter = 0, issued = 0;
+   int i, issued = 0;
bool io_interrupted = false;
 
for (i = MAX_PLIST_NUM - 1; i >= 0; i--) {
@@ -1211,20 +1211,19 @@ static int __issue_discard_cmd(struct f2fs_sb_info *sbi,
if (dpolicy->io_aware && i < dpolicy->io_aware_gran &&
!is_idle(sbi)) {
io_interrupted = true;
-   goto skip;
+   break;
}
 
__submit_discard_cmd(sbi, dpolicy, dc);
-   issued++;
-skip:
-   if (++iter >= dpolicy->max_requests)
+
+   if (++issued >= dpolicy->max_requests)
break;
}
blk_finish_plug(&plug);
 next:
mutex_unlock(&dcc->cmd_lock);
 
-   if (iter >= dpolicy->max_requests)
+   if (issued >= dpolicy->max_requests || io_interrupted)
break;
}
 
-- 
2.16.2.17.g38e79b1fd




[f2fs-dev] [PATCH v3 2/2] f2fs: let checkpoint flush dnode page of regular

2018-07-08 Thread Chao Yu
From: Chao Yu 

The fsyncer waits for writeback of all dnode pages of regular inodes before
flushing. If some async dnode pages are blocked by the IO scheduler, this
may decrease fsync's performance.

In this patch, we let f2fs_balance_fs_bg() trigger a checkpoint to flush
these dnode pages of regular inodes, so that async IO of dnode pages can be
eliminated and the fsyncer only needs to wait for sync IO.
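
The threshold test can be sketched in isolation as follows. This is only a
userspace illustration: struct fs_state and past_early_return() are
hypothetical names, and only the comparison against blocks_per_seg * 8 and
the shape of the early-return condition are taken from the diff below.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the check looks at. */
struct fs_state {
	unsigned int dirty_nodes;	/* count of dirty node pages */
	unsigned int blocks_per_seg;	/* blocks in one segment */
	bool idle;			/* no user IO in flight */
	bool excess_dirty_nats;		/* result of the existing NAT check */
};

/* Same comparison as the helper added to node.h in this patch. */
static bool excess_dirty_nodes(const struct fs_state *s)
{
	return s->dirty_nodes >= s->blocks_per_seg * 8;
}

/*
 * Models only the early-return test in f2fs_balance_fs_bg(): when the
 * filesystem is busy and neither limit is exceeded, nothing is done.
 */
static bool past_early_return(const struct fs_state *s)
{
	if (!s->idle && !s->excess_dirty_nats && !excess_dirty_nodes(s))
		return false;
	return true;
}
```

Taking blocks_per_seg = 512 as an example value, the limit is 4096 dirty
node pages: a busy filesystem with 4095 dirty node pages returns early,
while one with 4096 goes on to the checkpoint condition.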

Signed-off-by: Chao Yu 
---
v3:
- rebase code.
 fs/f2fs/node.c| 8 +++-
 fs/f2fs/node.h| 5 +
 fs/f2fs/segment.c | 4 +++-
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 31dc372c56a0..c48e2a2e5e82 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1453,6 +1453,10 @@ static int __write_node_page(struct page *page, bool atomic, bool *submitted,
if (unlikely(is_sbi_flag_set(sbi, SBI_POR_DOING)))
goto redirty_out;
 
+   if (wbc->sync_mode == WB_SYNC_NONE &&
+   IS_DNODE(page) && is_cold_node(page))
+   goto redirty_out;
+
/* get old block addr of this node page */
nid = nid_of_node(page);
f2fs_bug_on(sbi, page->index != nid);
@@ -1778,10 +1782,12 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
}
 
if (step < 2) {
+   if (wbc->sync_mode == WB_SYNC_NONE && step == 1)
+   goto out;
step++;
goto next_step;
}
-
+out:
if (nwritten)
f2fs_submit_merged_write(sbi, NODE);
 
diff --git a/fs/f2fs/node.h b/fs/f2fs/node.h
index b95e49e4a928..b0da4c26eebb 100644
--- a/fs/f2fs/node.h
+++ b/fs/f2fs/node.h
@@ -135,6 +135,11 @@ static inline bool excess_cached_nats(struct f2fs_sb_info *sbi)
return NM_I(sbi)->nat_cnt >= DEF_NAT_CACHE_THRESHOLD;
 }
 
+static inline bool excess_dirty_nodes(struct f2fs_sb_info *sbi)
+{
+   return get_pages(sbi, F2FS_DIRTY_NODES) >= sbi->blocks_per_seg * 8;
+}
+
 enum mem_type {
FREE_NIDS,  /* indicates the free nid list */
NAT_ENTRIES,/* indicates the cached nat entry */
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 47b6595a078c..99beaf0a2dea 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -503,7 +503,8 @@ void f2fs_balance_fs_bg(struct f2fs_sb_info *sbi)
else
f2fs_build_free_nids(sbi, false, false);
 
-   if (!is_idle(sbi) && !excess_dirty_nats(sbi))
+   if (!is_idle(sbi) &&
+   (!excess_dirty_nats(sbi) && !excess_dirty_nodes(sbi)))
return;
 
/* checkpoint is the only way to shrink partial cached entries */
@@ -511,6 +512,7 @@ void f2fs_balance_fs_bg(struct f2fs_sb_info *sbi)
!f2fs_available_free_memory(sbi, INO_ENTRIES) ||
excess_prefree_segs(sbi) ||
excess_dirty_nats(sbi) ||
+   excess_dirty_nodes(sbi) ||
f2fs_time_over(sbi, CP_TIME)) {
if (test_opt(sbi, DATA_FLUSH)) {
struct blk_plug plug;
-- 
2.16.2.17.g38e79b1fd




[f2fs-dev] [PATCH v3 1/2] f2fs: fix to avoid broken of dnode block list

2018-07-08 Thread Chao Yu
From: Chao Yu 

The f2fs recovery flow relies on the dnode block linked list, which means
that recovery of an fsynced file depends on the previous dnodes in the list
being persistent. So during fsync() we should wait until all dnodes of
regular inodes have been written back before issuing a flush.

This way, we can avoid the dnode block list being broken by out-of-order
IO submission due to the IO scheduler or driver.
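
The idea of waiting only on the necessary dnodes can be sketched with a
sequence-numbered list. This is a simplified userspace model, not the
kernel implementation: struct node_entry, list_append() and
entries_to_wait() are hypothetical, and it assumes that entries are
appended in increasing seq_id order and that a waiter can stop at the
first entry newer than its own id.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry, loosely modelled on struct fsync_node_entry. */
struct node_entry {
	struct node_entry *next;
	unsigned int seq_id;	/* assigned when queued for writeback */
	bool under_writeback;	/* cleared when the IO completes */
};

struct node_list {
	struct node_entry *head, *tail;
	unsigned int last_seq_id;
};

/* Append an entry and hand back the sequence id it was given. */
static unsigned int list_append(struct node_list *l, struct node_entry *e)
{
	e->next = NULL;
	e->seq_id = ++l->last_seq_id;
	e->under_writeback = true;
	if (l->tail)
		l->tail->next = e;
	else
		l->head = e;
	l->tail = e;
	return e->seq_id;
}

/*
 * Count the entries a waiter holding seq_id still has to wait on.
 * Entries queued after the waiter's own dnode are ignored.
 */
static unsigned int entries_to_wait(const struct node_list *l,
				    unsigned int seq_id)
{
	unsigned int n = 0;

	for (const struct node_entry *e = l->head; e; e = e->next) {
		if (e->seq_id > seq_id)
			break;
		if (e->under_writeback)
			n++;
	}
	return n;
}
```

If three entries are appended and a waiter holds the id of the second one,
it waits on at most two entries. The third, queued later, does not delay
it.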

Signed-off-by: Chao Yu 
---
v3:
- add a list linking all dnodes under writeback, so that fsync() only
waits on the necessary dnodes.
 fs/f2fs/checkpoint.c |   2 +
 fs/f2fs/data.c   |   2 +
 fs/f2fs/f2fs.h   |  21 +++-
 fs/f2fs/file.c   |  20 +++
 fs/f2fs/node.c   | 148 +--
 fs/f2fs/super.c  |   4 ++
 6 files changed, 153 insertions(+), 44 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 8b698bd54490..d5e60d76362e 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -1379,6 +1379,8 @@ static int do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)
 
f2fs_release_ino_entry(sbi, false);
 
+   f2fs_reset_fsync_node_info(sbi);
+
if (unlikely(f2fs_cp_error(sbi)))
return -EIO;
 
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 70813a4dda3e..afe76d87575c 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -176,6 +176,8 @@ static void f2fs_write_end_io(struct bio *bio)
page->index != nid_of_node(page));
 
dec_page_count(sbi, WB_DATA_TYPE(page));
+   if (f2fs_in_warm_node_list(sbi, page))
+   f2fs_del_fsync_node_entry(sbi, page);
clear_cold_data(page);
end_page_writeback(page);
}
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index c8c865fa8450..bf5f7a336ace 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -228,6 +228,12 @@ struct inode_entry {
struct inode *inode;/* vfs inode pointer */
 };
 
+struct fsync_node_entry {
+   struct list_head list;  /* list head */
+   struct page *page;  /* warm node page pointer */
+   unsigned int seq_id;/* sequence id */
+};
+
 /* for the bitmap indicate blocks to be discarded */
 struct discard_entry {
struct list_head list;  /* list head */
@@ -1152,6 +1158,11 @@ struct f2fs_sb_info {
 
struct inode_management im[MAX_INO_ENTRY];  /* manage inode cache */
 
+   spinlock_t fsync_node_lock; /* for node entry lock */
+   struct list_head fsync_node_list;   /* node list head */
+   unsigned int fsync_seg_id;  /* sequence id */
+   unsigned int fsync_node_num;/* number of node entries */
+
/* for orphan inode, use 0'th array */
unsigned int max_orphans;   /* max orphan inodes */
 
@@ -2816,6 +2827,10 @@ struct node_info;
 
 int f2fs_check_nid_range(struct f2fs_sb_info *sbi, nid_t nid);
 bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type);
+bool f2fs_in_warm_node_list(struct f2fs_sb_info *sbi, struct page *page);
+void f2fs_init_fsync_node_info(struct f2fs_sb_info *sbi);
+void f2fs_del_fsync_node_entry(struct f2fs_sb_info *sbi, struct page *page);
+void f2fs_reset_fsync_node_info(struct f2fs_sb_info *sbi);
 int f2fs_need_dentry_mark(struct f2fs_sb_info *sbi, nid_t nid);
 bool f2fs_is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid);
 bool f2fs_need_inode_block_update(struct f2fs_sb_info *sbi, nid_t ino);
@@ -2825,7 +2840,8 @@ pgoff_t f2fs_get_next_page_offset(struct dnode_of_data *dn, pgoff_t pgofs);
 int f2fs_get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode);
 int f2fs_truncate_inode_blocks(struct inode *inode, pgoff_t from);
 int f2fs_truncate_xattr_node(struct inode *inode);
-int f2fs_wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino);
+int f2fs_wait_on_node_pages_writeback(struct f2fs_sb_info *sbi,
+   unsigned int seq_id);
 int f2fs_remove_inode_page(struct inode *inode);
 struct page *f2fs_new_inode_page(struct inode *inode);
 struct page *f2fs_new_node_page(struct dnode_of_data *dn, unsigned int ofs);
@@ -2834,7 +2850,8 @@ struct page *f2fs_get_node_page(struct f2fs_sb_info *sbi, pgoff_t nid);
 struct page *f2fs_get_node_page_ra(struct page *parent, int start);
 void f2fs_move_node_page(struct page *node_page, int gc_type);
 int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, struct inode *inode,
-   struct writeback_control *wbc, bool atomic);
+   struct writeback_control *wbc, bool atomic,
+   unsigned int *seq_id);
 int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
struct writeback_control *wbc,
bool do_balance, enum iostat_type io_type);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 5e29d4053748..ddea2bfd4042 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -213,6 +213,7 @@ static int f2fs_do_syn