[PATCH v2] btrfs: use tagged writepage to mitigate livelock of snapshot

2018-11-02 Thread Ethan Lien
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for the
snapshot flusher, as is done in the generic write_cache_pages(): we
quickly tag all dirty pages with a TOWRITE tag, then do the hard work of
writepage only on pages carrying the TOWRITE tag, so we omit pages
dirtied after the snapshot command was issued.
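
As a stand-alone illustration of the idea (a simplified user-space C
model, not the btrfs code; the real change is in the diff below): the
flusher first tags the pages that are dirty right now, then writes back
only tagged pages, so pages dirtied later are left for a future flush
instead of livelocking this one.

#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 8

static bool dirty[NR_PAGES];
static bool towrite[NR_PAGES];	/* models PAGECACHE_TAG_TOWRITE */

/* models tag_pages_for_writeback(): DIRTY -> TOWRITE */
static void tag_dirty_pages(void)
{
	int i;

	for (i = 0; i < NR_PAGES; i++)
		if (dirty[i])
			towrite[i] = true;
}

/* "writepage" only on tagged pages */
static int flush_tagged_pages(void)
{
	int i, written = 0;

	for (i = 0; i < NR_PAGES; i++) {
		if (!towrite[i])
			continue;
		towrite[i] = false;
		dirty[i] = false;
		written++;
	}
	return written;
}

int main(void)
{
	int i;

	for (i = 0; i < 4; i++)
		dirty[i] = true;	/* dirtied before the snapshot command */

	tag_dirty_pages();
	dirty[5] = true;		/* dirtied after tagging: skipped */

	printf("flushed %d pages, page 5 left dirty: %d\n",
	       flush_tagged_pages(), dirty[5]);
	return 0;
}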

We do a simple snapshot speed test on an Intel D-1531 box:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 1m58sec
patched:  6.54sec

This is the best case for this patch: with a sequential write workload,
we omit nearly all pages dirtied after the snapshot command.

For a multi-writer, random write test:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 15.83sec
patched:  10.35sec

The improvement is smaller than in the sequential write case, since we
omit only about half of the pages dirtied after the snapshot command.

Signed-off-by: Ethan Lien 
---

V2:
Add more details to the commit message.
Rename BTRFS_INODE_TAGGED_FLUSH to BTRFS_INODE_SNAPSHOT_FLUSH.
Remove the unnecessary sync_mode check.
Make start_delalloc_inodes() take a boolean argument.

 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/extent_io.c   | 14 --
 fs/btrfs/inode.c   | 10 ++
 fs/btrfs/ioctl.c   |  2 +-
 5 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1343ac57b438..ffc9a1c77375 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -29,6 +29,7 @@ enum {
BTRFS_INODE_IN_DELALLOC_LIST,
BTRFS_INODE_READDIO_NEED_LOCK,
BTRFS_INODE_HAS_PROPS,
+   BTRFS_INODE_SNAPSHOT_FLUSH,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..82682da5a40d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3155,7 +3155,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
   struct inode *inode, u64 new_size,
   u32 min_type);
 
-int btrfs_start_delalloc_inodes(struct btrfs_root *root);
+int btrfs_start_delalloc_snapshot(struct btrfs_root *root);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
  unsigned int extra_bits,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4dd6faab02bb..93f2e413535d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3928,12 +3928,22 @@ static int extent_write_cache_pages(struct address_space *mapping,
range_whole = 1;
scanned = 1;
}
-   if (wbc->sync_mode == WB_SYNC_ALL)
+
+   /*
+* We do the tagged writepage as long as the snapshot flush bit is set
+* and we are the first one to do the filemap_flush() on this inode.
+*/
+   if (range_whole && wbc->nr_to_write == LONG_MAX &&
+   test_and_clear_bit(BTRFS_INODE_SNAPSHOT_FLUSH,
+   &BTRFS_I(inode)->runtime_flags))
+   wbc->tagged_writepages = 1;
+
+   if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
 retry:
-   if (wbc->sync_mode == WB_SYNC_ALL)
+   if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && !nr_to_write_done && (index <= end) &&
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..593445d122ed 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9975,7 +9975,7 @@ static struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode
  * some fairly slow code that needs optimization. This walks the list
  * of all the inodes with pending delalloc and forces them to disk.
  */
-static int start_delalloc_inodes(struct btrfs_root *root, int nr)
+static int start_delalloc_inodes(struct btrfs_root *root, int nr, bool snapshot)
 {
struct btrfs_inode *binode;
struct inode *inode;
@@ -10003,6 +10003,8 @@ static int start_delalloc_inodes(struct btrfs_root *root, int nr)
}
spin_unlock(&root->delalloc_lock);
 
+

[PATCH] btrfs: use tagged writepage to mitigate livelock of snapshot

2018-10-31 Thread Ethan Lien
Snapshot is expected to be fast. But if there are writers steadily
creating dirty pages in our subvolume, the snapshot may take a very long
time to complete. To fix the problem, we use tagged writepage for the
snapshot flusher, as is done in the generic write_cache_pages(), so we
can omit pages dirtied after the snapshot command.

We do a simple snapshot speed test on an Intel D-1531 box:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
--direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 1m58sec
patched:  6.54sec

This is the best case for this patch: with a sequential write workload,
we omit nearly all pages dirtied after the snapshot command.

For a multi-writer, random write test:

fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
--direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
--filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio

original: 15.83sec
patched:  10.35sec

The improvement is smaller than in the sequential write case, since we
omit only about half of the pages dirtied after the snapshot command.

Signed-off-by: Ethan Lien 
---
 fs/btrfs/btrfs_inode.h |  1 +
 fs/btrfs/ctree.h   |  2 +-
 fs/btrfs/extent_io.c   | 16 ++--
 fs/btrfs/inode.c   | 10 ++
 fs/btrfs/ioctl.c   |  2 +-
 5 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1343ac57b438..4182bfbb56be 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -29,6 +29,7 @@ enum {
BTRFS_INODE_IN_DELALLOC_LIST,
BTRFS_INODE_READDIO_NEED_LOCK,
BTRFS_INODE_HAS_PROPS,
+   BTRFS_INODE_TAGGED_FLUSH,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2cddfe7806a4..82682da5a40d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3155,7 +3155,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
   struct inode *inode, u64 new_size,
   u32 min_type);
 
-int btrfs_start_delalloc_inodes(struct btrfs_root *root);
+int btrfs_start_delalloc_snapshot(struct btrfs_root *root);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
  unsigned int extra_bits,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4dd6faab02bb..c21d8a0e010a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3928,12 +3928,24 @@ static int extent_write_cache_pages(struct address_space *mapping,
range_whole = 1;
scanned = 1;
}
-   if (wbc->sync_mode == WB_SYNC_ALL)
+
+   /*
+* We don't care if we are the one who set BTRFS_INODE_TAGGED_FLUSH in
+* start_delalloc_inodes(). We do the tagged writepage as long as we are
+* the first one to do the filemap_flush() on this inode.
+*/
+   if (range_whole && wbc->nr_to_write == LONG_MAX &&
+   wbc->sync_mode == WB_SYNC_NONE &&
+   test_and_clear_bit(BTRFS_INODE_TAGGED_FLUSH,
+   &BTRFS_I(inode)->runtime_flags))
+   wbc->tagged_writepages = 1;
+
+   if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag = PAGECACHE_TAG_TOWRITE;
else
tag = PAGECACHE_TAG_DIRTY;
 retry:
-   if (wbc->sync_mode == WB_SYNC_ALL)
+   if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag_pages_for_writeback(mapping, index, end);
done_index = index;
while (!done && !nr_to_write_done && (index <= end) &&
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..3df3cbbe91c5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9975,7 +9975,7 @@ static struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode
  * some fairly slow code that needs optimization. This walks the list
  * of all the inodes with pending delalloc and forces them to disk.
  */
-static int start_delalloc_inodes(struct btrfs_root *root, int nr)
+static int start_delalloc_inodes(struct btrfs_root *root, int nr, int snapshot)
 {
struct btrfs_inode *binode;
struct inode *inode;
@@ -10003,6 +10003,8 @@ static int start_delalloc_inodes(struct btrfs_root *root, int nr)
}
spin_unlock(&root->delalloc_lock);
 
+   if (snapshot)
+   set_bit(BTRFS_INODE_TAGGED_FLUSH, &binode->runtime_flags);
work = btrfs_alloc_delalloc_work(inode);
if (!work) {
  

[PATCH v3] btrfs: use customized batch size for total_bytes_pinned

2018-07-13 Thread Ethan Lien
In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check
it to know whether we can satisfy the allocation by committing the
current transaction. For every data/metadata extent we are going to
free, we add to total_bytes_pinned in btrfs_free_extent() and
btrfs_free_tree_block(), and release it in unpin_extent_range() when we
finish the transaction. So this is a variable we frequently update but
rarely read - exactly the use case a percpu_counter is meant for. But in
the commit above we update total_bytes_pinned with the default batch
size of 32, which makes essentially every update a spin-lock-protected
update. Since every spin lock/unlock operation involves syncing a
globally used variable and some kind of barrier on an SMP system, this
is more expensive than using total_bytes_pinned as a simple atomic64_t.
So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate a
new chunk, we can use a really large batch size and pay nearly no
penalty in most cases.
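
As a rough illustration of why the batch size matters, here is a small
user-space model (simplified; not the kernel's percpu_counter
implementation, and the names are made up): each CPU accumulates a local
delta and only takes the shared lock to fold that delta into the global
count once it reaches the batch, so with a 128M batch and extent-sized
additions almost no update touches the lock.

#include <stdio.h>
#include <stdlib.h>

struct counter_model {
	long long count;		/* global value, lock-protected in the kernel */
	long long local;		/* one CPU's unflushed delta */
	long long batch;
	long long lock_acquisitions;
};

/* models percpu_counter_add_batch() for a single CPU */
static void counter_add_batch(struct counter_model *c, long long amount)
{
	c->local += amount;
	if (llabs(c->local) >= c->batch) {	/* slow path: take the lock */
		c->count += c->local;
		c->local = 0;
		c->lock_acquisitions++;
	}
}

int main(void)
{
	struct counter_model small = { .batch = 32 };
	struct counter_model large = { .batch = 128 * 1024 * 1024 };
	int i;

	for (i = 0; i < 1000000; i++) {
		counter_add_batch(&small, 4096);	/* pin one 4K extent */
		counter_add_batch(&large, 4096);
	}
	printf("batch 32: %lld lock acquisitions, batch 128M: %lld\n",
	       small.lock_acquisitions, large.lock_acquisitions);
	return 0;
}

In this model the tiny default batch spills every 4K addition into the
locked global count, while the 128M batch folds only about once every
32768 additions.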

[Test]
We test the patch on a 4-core x86 machine:
1. fallocate a 16GiB test file.
2. take a snapshot (so all following writes will be CoW writes).
3. run a 180 sec, 4-job, 4K random write fio on the test file.

We also add a temporary lockdep class on percpu_counter's spin lock used
by total_bytes_pinned to track lock_stat.

[Results]
unpatched:
lock_stat version 0.4
class name: total_bytes_pinned_percpu
  con-bounces:         82        contentions:        82
  waittime-min:      0.21        waittime-max:     0.61
  waittime-total:   29.46        waittime-avg:     0.36
  acq-bounces:     298340        acquisitions:   635973
  holdtime-min:      0.09        holdtime-max:    11.01
  holdtime-total: 173476.25      holdtime-avg:     0.27

patched:
lock_stat version 0.4
class name: total_bytes_pinned_percpu
  con-bounces:          1        contentions:         1
  waittime-min:      0.62        waittime-max:     0.62
  waittime-total:    0.62        waittime-avg:     0.62
  acq-bounces:      13601        acquisitions:    31542
  holdtime-min:      0.14        holdtime-max:     9.61
  holdtime-total: 11016.90       holdtime-avg:     0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (the number of lock acquisitions that had to wait) are low
in both the unpatched and the patched version. But looking at
acquisitions and acq-bounces, we get much lower counts in the patched
version. Here the most important metric is acq-bounces: it counts how
many times the lock is transferred between different CPUs, so the patch
really does reduce cacheline bouncing of the spin lock (and of the
global count of the percpu_counter) on an SMP system.

Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")

Signed-off-by: Ethan Lien 
---

V2:
   Rewrite commit comments.
   Add lock_stat test.
   Pull dirty_metadata_bytes out to a separate patch.

V3:
   Declare batch size as a constant.

 fs/btrfs/extent-tree.c | 47 --
 1 file changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..3788e87c174e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -51,6 +51,14 @@ enum {
CHUNK_ALLOC_FORCE = 2,
 };
 
+/*
+ * Use large batch size to reduce update overhead.
+ * For reader side, we only read it when we are close to ENOSPC and
+ * the read overhead is mostly related to # of cpus, so it is OK to
+ * use arbitrary large value here.
+ */
+#define TOTAL_BYTES_PINNED_BATCH SZ_128M
+
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_node *node, u64 parent,
@@ -758,7 +766,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+   TOTAL_BYTES_PINNED_BATCH);
 }
 
 /*
@@ -2598,8 +2607,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
flags = BTRFS_BLOCK_GROUP_METADATA;
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned,
-  -head->num

[PATCH v2] btrfs: use customized batch size for total_bytes_pinned

2018-07-11 Thread Ethan Lien
In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check
it to know whether we can satisfy the allocation by committing the
current transaction. For every data/metadata extent we are going to
free, we add to total_bytes_pinned in btrfs_free_extent() and
btrfs_free_tree_block(), and release it in unpin_extent_range() when we
finish the transaction. So this is a variable we frequently update but
rarely read - exactly the use case a percpu_counter is meant for. But in
the commit above we update total_bytes_pinned with the default batch
size of 32, which makes essentially every update a spin-lock-protected
update. Since every spin lock/unlock operation involves syncing a
globally used variable and some kind of barrier on an SMP system, this
is more expensive than using total_bytes_pinned as a simple atomic64_t.
So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate a
new chunk, we can use a really large batch size and pay nearly no
penalty in most cases.


[Test]
We test the patch on a 4-core x86 machine:
1. fallocate a 16GiB test file.
2. take a snapshot (so all following writes will be CoW writes).
3. run a 180 sec, 4-job, 4K random write fio on the test file.

We also add a temporary lockdep class on percpu_counter's spin lock used
by total_bytes_pinned to track lock_stat.


[Results]
unpatched:
lock_stat version 0.4
class name: total_bytes_pinned_percpu
  con-bounces:         82        contentions:        82
  waittime-min:      0.21        waittime-max:     0.61
  waittime-total:   29.46        waittime-avg:     0.36
  acq-bounces:     298340        acquisitions:   635973
  holdtime-min:      0.09        holdtime-max:    11.01
  holdtime-total: 173476.25      holdtime-avg:     0.27


patched:
lock_stat version 0.4
class name: total_bytes_pinned_percpu
  con-bounces:          1        contentions:         1
  waittime-min:      0.62        waittime-max:     0.62
  waittime-total:    0.62        waittime-avg:     0.62
  acq-bounces:      13601        acquisitions:    31542
  holdtime-min:      0.14        holdtime-max:     9.61
  holdtime-total: 11016.90       holdtime-avg:     0.35


[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (the number of lock acquisitions that had to wait) are low
in both the unpatched and the patched version. But looking at
acquisitions and acq-bounces, we get much lower counts in the patched
version. Here the most important metric is acq-bounces: it counts how
many times the lock is transferred between different CPUs, so the patch
really does reduce cacheline bouncing of the spin lock (and of the
global count of the percpu_counter) on an SMP system.

Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes")

Signed-off-by: Ethan Lien 
---

V2:
Rewrite commit comments.
Add lock_stat test.
Pull dirty_metadata_bytes out to a separate patch.

 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 46 --
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 * time the transaction commits.
 */
struct percpu_counter total_bytes_pinned;
+   s32 total_bytes_pinned_batch;
 
struct list_head list;
/* Protected by the spinlock 'lock'. */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..937113534ef4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+   space_info->total_bytes_pinned_batch);
 }
 
 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
flags = BTRFS_BLOCK_GROUP_METADATA;
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned,
-  -head->num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned,
+

[PATCH] btrfs: use correct compare function of dirty_metadata_bytes

2018-07-02 Thread Ethan Lien
We use a customized, nodesize-based batch value to update
dirty_metadata_bytes. We should also use the batch version of the
compare function, otherwise we can easily hit the fast path and get a
false result from percpu_counter_compare().
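
A small user-space model of why the compare must use the same batch as
the updates (simplified; the kernel's __percpu_counter_compare() fast
path similarly treats the global count as exact only within roughly
batch * nr_cpus):

#include <stdio.h>
#include <stdlib.h>

/* global: the lock-protected count; unflushed: sum of per-CPU deltas */
static int compare_model(long long global, long long unflushed,
			 long long rhs, long long batch, int nr_cpus)
{
	long long margin = batch * nr_cpus;
	long long exact;

	/* fast path: decide from the global count alone if the gap is
	 * larger than any possible per-CPU error */
	if (llabs(global - rhs) > margin)
		return global > rhs ? 1 : -1;

	/* slow path: fold in the per-CPU deltas for an exact answer */
	exact = global + unflushed;
	return (exact > rhs) - (exact < rhs);
}

int main(void)
{
	/* 4 CPUs each hold one unflushed 16K (nodesize) update */
	long long global = 0, unflushed = 4 * 16384, thresh = 32768;

	printf("compared with batch 32:  %d (fast path, wrong sign)\n",
	       compare_model(global, unflushed, thresh, 32, 4));
	printf("compared with batch 16K: %d (slow path, correct)\n",
	       compare_model(global, unflushed, thresh, 16384, 4));
	return 0;
}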

Signed-off-by: Ethan Lien 
---
 fs/btrfs/disk-io.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..dfed08e70ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping,
 
fs_info = BTRFS_I(mapping->host)->root->fs_info;
/* this is a bit racy, but that's ok */
-   ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-BTRFS_DIRTY_METADATA_THRESH);
+   ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+BTRFS_DIRTY_METADATA_THRESH,
+fs_info->dirty_metadata_batch);
if (ret < 0)
return 0;
}
@@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
if (flush_delayed)
btrfs_balance_delayed_items(fs_info);
 
-   ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-BTRFS_DIRTY_METADATA_THRESH);
+   ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+BTRFS_DIRTY_METADATA_THRESH,
+fs_info->dirty_metadata_batch);
	if (ret > 0) {
		balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
	}
-- 
2.17.1



[PATCH] btrfs: use customized batch size for total_bytes_pinned

2018-06-29 Thread Ethan Lien
We use a percpu_counter to reduce lock contention on a frequently
updated counter. But the default batch size of 32 is suitable only for
inc/dec updates, not for sector-sized updates. The end result is that we
still acquire the spin lock for every percpu_counter_add(). So use a
customized batch size for total_bytes_pinned, as we already do for
dirty_metadata_bytes and delalloc_bytes. Also fix dirty_metadata_bytes
to use __percpu_counter_compare() with the customized batch so we get a
precise value of dirty_metadata_bytes.

Signed-off-by: Ethan Lien 
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c | 10 ++
 fs/btrfs/extent-tree.c | 43 +++---
 3 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..df682a521635 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -422,6 +422,7 @@ struct btrfs_space_info {
 * time the transaction commits.
 */
struct percpu_counter total_bytes_pinned;
+   s32 total_bytes_pinned_batch;
 
struct list_head list;
/* Protected by the spinlock 'lock'. */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 205092dc9390..dfed08e70ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -961,8 +961,9 @@ static int btree_writepages(struct address_space *mapping,
 
fs_info = BTRFS_I(mapping->host)->root->fs_info;
/* this is a bit racy, but that's ok */
-   ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-BTRFS_DIRTY_METADATA_THRESH);
+   ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+BTRFS_DIRTY_METADATA_THRESH,
+fs_info->dirty_metadata_batch);
if (ret < 0)
return 0;
}
@@ -4150,8 +4151,9 @@ static void __btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info,
if (flush_delayed)
btrfs_balance_delayed_items(fs_info);
 
-   ret = percpu_counter_compare(&fs_info->dirty_metadata_bytes,
-BTRFS_DIRTY_METADATA_THRESH);
+   ret = __percpu_counter_compare(&fs_info->dirty_metadata_bytes,
+BTRFS_DIRTY_METADATA_THRESH,
+fs_info->dirty_metadata_batch);
	if (ret > 0) {
		balance_dirty_pages_ratelimited(fs_info->btree_inode->i_mapping);
	}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..a067c047904c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -758,7 +758,8 @@ static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
 
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned, num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
+   space_info->total_bytes_pinned_batch);
 }
 
 /*
@@ -2598,8 +2599,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
flags = BTRFS_BLOCK_GROUP_METADATA;
space_info = __find_space_info(fs_info, flags);
ASSERT(space_info);
-   percpu_counter_add(&space_info->total_bytes_pinned,
-  -head->num_bytes);
+   percpu_counter_add_batch(&space_info->total_bytes_pinned,
+  -head->num_bytes,
+  space_info->total_bytes_pinned_batch);
 
if (head->is_data) {
spin_lock(&delayed_refs->lock);
@@ -4035,6 +4037,10 @@ static int create_space_info(struct btrfs_fs_info *info, u64 flags)
INIT_LIST_HEAD(&space_info->ro_bgs);
INIT_LIST_HEAD(&space_info->tickets);
INIT_LIST_HEAD(&space_info->priority_tickets);
+   if (flags & BTRFS_BLOCK_GROUP_DATA)
+   space_info->total_bytes_pinned_batch = info->delalloc_batch;
+   else
+   space_info->total_bytes_pinned_batch = info->dirty_metadata_batch;
 
ret = kobject_init_and_add(&space_info->kobj, &space_info_ktype,
info->space_info_kobj, "%s",
@@ -4309,9 +4315,10 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
 * allocation, and no removed chunk in current transaction,
 * don't bother committing the transaction.
 */
-   have_pinned_space = percpu_counter_compare(
+   have_pinned_space = __percpu_counter_compare(
&data_

[PATCH v2] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io

2018-05-27 Thread Ethan Lien
[Problem description and how we fix it]
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
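
In the worst case that is 3 * (16KiB + 16KiB) = 96KiB of dirty metadata
for a single 4KiB data write, roughly a 24x amplification (the exact
figure depends on the nodesize and on how many tree levels actually get
COWed).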

Although we do call balance_dirty_pages() on the write side, in the
buffered write path most metadata is dirtied only after we reach the
dirty background limit (which so far only counts dirty data pages) and
wake up the flusher thread. If there are many small, unmergeable random
writes spread over a large btree, we will see a burst of dirty pages
exceeding the dirty_bytes limit after we wake up the flusher thread -
which is not what we expect. On our machine it caused an out-of-memory
problem, since a page cannot be dropped while it is marked dirty.

One may worry that we could sleep in btrfs_btree_balance_dirty_nodelay(),
but since we run btrfs_finish_ordered_io() in a separate worker, it will
not stop the flusher from consuming dirty pages. Also, we use a
different worker for metadata writeback endio, so sleeping in
btrfs_finish_ordered_io() helps us throttle the amount of dirty metadata
pages.

[Reproduce steps]
To reproduce the problem, we need to do 4KiB write randomly spread in a
large btree. In our 2GiB RAM machine:
1) Create 4 subvolumes.
2) Run fio on each subvolume:

   [global]
   direct=0
   rw=randwrite
   ioengine=libaio
   bs=4k
   iodepth=16
   numjobs=1
   group_reporting
   size=128G
   runtime=1800
   norandommap
   time_based
   randrepeat=0

3) Take snapshot on each subvolume and repeat fio on existing files.
4) Repeat step (3) until we get large btrees.
   In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
   metadata in each subvolume tree and 12GiB of metadata in the extent tree.
5) Stop all fio, take snapshot again, and wait until all delayed work is
   completed.
6) Start all fio. A few seconds later we hit OOM when the flusher starts
   to work.

It can be reproduced even when using nocow writes.

Signed-off-by: Ethan Lien 
---

V2:
Replace btrfs_btree_balance_dirty with btrfs_btree_balance_dirty_nodelay.
Add reproduce steps.

 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8e604e7071f1..e54547df24ee 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3158,6 +3158,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
/* once for the tree */
btrfs_put_ordered_extent(ordered_extent);
 
+   btrfs_btree_balance_dirty_nodelay(fs_info);
+
return ret;
 }
 
-- 
2.17.0



[PATCH v2] btrfs: Skip some btrfs_cross_ref_exist() check in nocow path

2018-05-16 Thread Ethan Lien
In the nocow path, we check whether the extent is snapshotted in
btrfs_cross_ref_exist(). We can do a similar check earlier and avoid an
unnecessary search into the extent tree.
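
The reasoning behind the earlier check, as a minimal stand-alone sketch
(illustrative C only; the actual change below compares
btrfs_file_extent_generation() with btrfs_root_last_snapshot()): an
extent written in a transaction no newer than the last snapshot was
already present when that snapshot was taken, so we conservatively treat
it as shared and fall back to CoW without searching the extent tree.

#include <stdbool.h>
#include <stdio.h>

/* Cheap pre-check done before the expensive backref search in
 * btrfs_cross_ref_exist(). */
static bool worth_full_cross_ref_check(unsigned long long extent_gen,
				       unsigned long long last_snapshot)
{
	/* written at or before the last snapshot: assume shared, CoW */
	if (extent_gen <= last_snapshot)
		return false;
	/* newer than the last snapshot: nocow may still be possible */
	return true;
}

int main(void)
{
	printf("extent gen 100, last snapshot 120: %d\n",
	       worth_full_cross_ref_check(100, 120));
	printf("extent gen 130, last snapshot 120: %d\n",
	       worth_full_cross_ref_check(130, 120));
	return 0;
}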

A fio test on an Intel D-1531, 16GB RAM, SSD RAID-5 machine is as follows:

[global]
group_reporting
time_based
thread=1
ioengine=libaio
bs=4k
iodepth=32
size=64G
runtime=180
numjobs=8
rw=randwrite

[file1]
filename=/mnt/nocow/testfile

IOPS result:   unpatched   patched

1 fio round:       46670     46958
snapshot
2 fio round:       51826     54498
3 fio round:       59767     61289

After the snapshot, the first fio round gets about a 5% performance
gain. As we continually write to the same file, all writes return to
nocow mode and eventually we have no performance gain.

Signed-off-by: Ethan Lien 
---

V2:
 Add comment and performance test.

 fs/btrfs/inode.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..177630337108 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1373,6 +1373,13 @@ static noinline int run_delalloc_nocow(struct inode *inode,
btrfs_file_extent_encryption(leaf, fi) ||
btrfs_file_extent_other_encoding(leaf, fi))
goto out_check;
+   /*
+* We can skip the checking of generation of
+* extent item in btrfs_cross_ref_exist().
+*/
+   if (btrfs_file_extent_generation(leaf, fi) <=
+   btrfs_root_last_snapshot(&root->root_item))
+   goto out_check;
if (extent_type == BTRFS_FILE_EXTENT_REG && !force)
goto out_check;
if (btrfs_extent_readonly(fs_info, disk_bytenr))
@@ -7368,6 +7375,14 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
btrfs_file_extent_other_encoding(leaf, fi))
goto out;
 
+   /*
+* We can skip the checking of generation of
+* extent item in btrfs_cross_ref_exist().
+*/
+   if (btrfs_file_extent_generation(leaf, fi) <=
+   btrfs_root_last_snapshot(&root->root_item))
+   goto out;
+
backref_offset = btrfs_file_extent_offset(leaf, fi);
 
if (orig_start) {
-- 
2.17.0



[PATCH] btrfs: Skip some btrfs_cross_ref_exist() check in nocow path

2018-05-16 Thread Ethan Lien
In the nocow path, we check whether the extent is snapshotted in
btrfs_cross_ref_exist(). We can do a similar check earlier and avoid an
unnecessary search into the extent tree.

Signed-off-by: Ethan Lien 
---
 fs/btrfs/inode.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..96927f2ccd4b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1373,6 +1373,9 @@ static noinline int run_delalloc_nocow(struct inode *inode,
btrfs_file_extent_encryption(leaf, fi) ||
btrfs_file_extent_other_encoding(leaf, fi))
goto out_check;
+   if (btrfs_file_extent_generation(leaf, fi) <=
+   btrfs_root_last_snapshot(&root->root_item))
+   goto out_check;
if (extent_type == BTRFS_FILE_EXTENT_REG && !force)
goto out_check;
if (btrfs_extent_readonly(fs_info, disk_bytenr))
@@ -7368,6 +7371,10 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
btrfs_file_extent_other_encoding(leaf, fi))
goto out;
 
+   if (btrfs_file_extent_generation(leaf, fi) <=
+   btrfs_root_last_snapshot(&root->root_item))
+   goto out;
+
backref_offset = btrfs_file_extent_offset(leaf, fi);
 
if (orig_start) {
-- 
2.17.0



[PATCH] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io

2018-04-27 Thread Ethan Lien
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance_dirty_pages() on the write side, in the
buffered write path most metadata is dirtied only after we reach the
dirty background limit (which so far only counts dirty data pages) and
wake up the flusher thread. If there are many small, unmergeable random
writes spread over a large btree, we will see a burst of dirty pages
exceeding the dirty_bytes limit after we wake up the flusher thread -
which is not what we expect. On our machine it caused an out-of-memory
problem, since a page cannot be dropped while it is marked dirty.

One may worry that we could sleep in btrfs_btree_balance_dirty(), but
since we run btrfs_finish_ordered_io() in a separate worker, it will not
stop the flusher from consuming dirty pages. Also, we use a different
worker for metadata writeback endio, so sleeping in
btrfs_finish_ordered_io() helps us throttle the amount of dirty metadata
pages.

Signed-off-by: Ethan Lien 
---
 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..d09433ab126e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3151,6 +3151,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
/* once for the tree */
btrfs_put_ordered_extent(ordered_extent);
 
+   btrfs_btree_balance_dirty(fs_info);
+
return ret;
 }
 
-- 
2.17.0
