[PATCH] btrfs: Simplify the calculation of variables
Fix the following coccicheck warning:

./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is
equivalent to !A || B.

Reported-by: Abaci Robot
Suggested-by: Jiapeng Zhong
Signed-off-by: Abaci Team
---
 fs/btrfs/delayed-inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 70c0340..ec0b50b8 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1154,7 +1154,7 @@ static int __btrfs_run_delayed_items(struct btrfs_trans_handle *trans, int nr)
 
 	delayed_root = fs_info->delayed_root;
 
 	curr_node = btrfs_first_delayed_node(delayed_root);
-	while (curr_node && (!count || (count && nr--))) {
+	while (curr_node && (!count || nr--)) {
 		ret = __btrfs_commit_inode_delayed_items(trans, path,
 							 curr_node);
 		if (ret) {
-- 
1.8.3.1
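For readers unfamiliar with this coccinelle rule: the equivalence also covers
the side effect, since nr-- is only reached when count is non-zero in both
forms. A minimal standalone check (hypothetical userspace test, not kernel
code):

  #include <assert.h>
  #include <stdio.h>

  /* Verify that (!count || (count && nr--)) and (!count || nr--) agree
   * on both the truth value and the side effect on nr. */
  int main(void)
  {
  	for (int count = 0; count <= 1; count++) {
  		for (int start = 0; start <= 2; start++) {
  			int nr1 = start, nr2 = start;
  			int a = (!count || (count && nr1--));
  			int b = (!count || nr2--);
  			assert(a == b);     /* same truth value */
  			assert(nr1 == nr2); /* nr-- short-circuits identically */
  		}
  	}
  	printf("equivalent\n");
  	return 0;
  }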
Re: btrfs becomes read-only
kernel version:

aleksey@host:~$ sudo uname --all
Linux host 4.15.0-132-generic #136~16.04.1-Ubuntu SMP Tue Jan 12 18:22:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

drive make/model: The drive is an external 5-bay HDD enclosure with RAID-5,
connected via USB 3 (made by Orico, https://www.orico.cc/us/product/detail/3622.html),
with 5 WD Red 10 TB drives. We use this drive for backups.

When I try to run btrfs check I get an error message:

aleksey@host:~$ sudo btrfs check --readonly /dev/sdg
Couldn't open file system

aleksey@host:~$ sudo smartctl -x /dev/sdg
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.15.0-132-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     JMicron H/W RAID5
Serial Number:    1IZ6AZCKMIFZYJ8A7V0W
Firmware Version: 0964
User Capacity:    40,003,191,177,216 bytes [40.0 TB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA/ATAPI-7 (minor revision not indicated)
Local Time is:    Wed Jan 27 08:51:02 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Disabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Total time to complete Offline
data collection:		(    0) seconds.
Offline data collection
capabilities:			(0x00)	Offline data collection not supported.
SMART capabilities:            (0x)	Automatic saving of SMART data is not
					implemented.
Error logging capability:        (0x00)	Error logging NOT supported.
					No General Purpose Logging support.

General Purpose Log Directory not supported

SMART Log Directory Version 0
Address    Access  R/W   Size  Description
0x00           SL  R/O      1  Log Directory

SMART Extended Comprehensive Error Log (GP Log 0x03) not supported
SMART Error Log not supported
SMART Extended Self-test Log (GP Log 0x07) not supported
SMART Self-test Log not supported
Selective Self-tests/Logging not supported
SCT Commands not supported
Device Statistics (GP/SMART Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11) not supported

Best regards,
Alexey Isaev, RQC

On 27.01.2021 at 10:38, Chris Murphy wrote:
> On Wed, Jan 27, 2021 at 12:22 AM Alexey Isaev wrote:
>> Hello! The btrfs volume becomes read-only with these messages in dmesg.
>> What can I do to repair the btrfs partition?
>> [Jan25 08:18] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432
>> [  +0.007587] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432
>> [  +0.000132] BTRFS error (device sdg): qgroup scan failed with -5
>> [Jan25 19:52] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432
>> [  +0.009783] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432
>> [  +0.000132] BTRFS: error (device sdg) in __btrfs_cow_block:1176: errno=-5 IO failure
>> [  +0.60] BTRFS info (device sdg): forced readonly
>> [  +0.04] BTRFS info (device sdg): failed to delete reference to ftrace.h, inode 2986197 parent 2989315
>> [  +0.02] BTRFS: error (device sdg) in __btrfs_unlink_inode:4220: errno=-5 IO failure
>> [  +0.006071] BTRFS error (device sdg): pending csums is 430080
>
> What kernel version? What drive make/model?
>
> wanted 132477 found 132432 indicates the drive has lost ~45 transactions;
> that's not good and also weird. There's no crash or any other errors? A
> complete dmesg might be more revealing. And also:
>
> smartctl -x /dev/sdg
> btrfs check --readonly /dev/sdg
>
> After that I suggest https://btrfs.wiki.kernel.org/index.php/Restore
> and try to get any important data out if it's not backed up. You can try
> btrfs-find-root to get a listing of roots, most recent to oldest. Start
> at the top, and plug that address in as 'btrfs restore -t' and see if
> it'll pull anything out. You likely need -i and -v options as well.
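For anyone following the same recovery path, the suggested workflow is
roughly this (a sketch only: /dev/sdg and /mnt/recovery are placeholders,
and ADDR must be taken from the btrfs-find-root output):

  # list candidate tree roots, most recent first
  sudo btrfs-find-root /dev/sdg

  # try pulling data out using one of the listed tree roots;
  # -i ignores errors, -v lists each restored file
  sudo btrfs restore -t ADDR -i -v /dev/sdg /mnt/recovery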
[PATCH] fs: btrfs: Select SHA256 in Kconfig
From: Matthias Brugger

Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms")
btrfs uses the sha256 checksum algorithm. But the Kconfig entry fails to
select it. This leads to compilation errors:

fs/built-in.o: In function `hash_sha256':
fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts'
fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update'
fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish'

Signed-off-by: Matthias Brugger
---
 fs/btrfs/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index f302b1fbef..2a32f42ad1 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -4,6 +4,7 @@ config FS_BTRFS
 	select LZO
 	select ZSTD
 	select RBTREE
+	select SHA256
 	help
 	  This provides a single-device read-only BTRFS support. BTRFS is a
 	  next-generation Linux file system based on the copy-on-write
-- 
2.30.0
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On 2021/1/27 5:42 PM, matthias@kernel.org wrote:
> From: Matthias Brugger
>
> Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms")
> btrfs uses the sha256 checksum algorithm. But the Kconfig entry fails to
> select it. This leads to compilation errors:
>
> fs/built-in.o: In function `hash_sha256':
> fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts'
> fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update'
> fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish'
>
> Signed-off-by: Matthias Brugger

Oh, forgot to select sha256. Thanks for the fix.

Reviewed-by: Qu Wenruo

Thanks,
Qu

> ---
>  fs/btrfs/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
> index f302b1fbef..2a32f42ad1 100644
> --- a/fs/btrfs/Kconfig
> +++ b/fs/btrfs/Kconfig
> @@ -4,6 +4,7 @@ config FS_BTRFS
>  	select LZO
>  	select ZSTD
>  	select RBTREE
> +	select SHA256
>  	help
>  	  This provides a single-device read-only BTRFS support. BTRFS is a
>  	  next-generation Linux file system based on the copy-on-write
[PATCH] btrfs: Avoid calling btrfs_get_chunk_map() twice
From: Michal Rostecki

Before this change, the btrfs_get_io_geometry() function was calling
btrfs_get_chunk_map() to get the extent mapping, necessary for
calculating the I/O geometry. It was using that extent mapping only
internally and freeing the pointer after its execution.

That resulted in calling btrfs_get_chunk_map() de facto twice by the
__btrfs_map_block() function. It was calling btrfs_get_io_geometry()
first and then calling btrfs_get_chunk_map() directly to get the extent
mapping, used by the rest of the function.

This change fixes that by passing the extent mapping to the
btrfs_get_io_geometry() function as an argument.

Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately")
Signed-off-by: Michal Rostecki
---
 fs/btrfs/inode.c   | 37 -
 fs/btrfs/volumes.c | 39 ---
 fs/btrfs/volumes.h |  5 +++--
 3 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0dbe1aaa0b71..a4ce8501ed4d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 logical = bio->bi_iter.bi_sector << 9;
+	struct extent_map *em;
 	u64 length = 0;
 	u64 map_length;
-	int ret;
+	int ret = 0;
 	struct btrfs_io_geometry geom;
 
 	if (bio_flags & EXTENT_BIO_COMPRESSED)
@@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 
 	length = bio->bi_iter.bi_size;
 	map_length = length;
-	ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, map_length,
-				    &geom);
+	em = btrfs_get_chunk_map(fs_info, logical, map_length);
+	if (IS_ERR(em))
+		return PTR_ERR(em);
+	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
+				    map_length, &geom);
 	if (ret < 0)
-		return ret;
+		goto out;
 
-	if (geom.len < length + size)
-		return 1;
-	return 0;
+	if (geom.len < length + size) {
+		ret = 1;
+		goto out;
+	}
+out:
+	free_extent_map(em);
+	return ret;
 }
 
 /*
@@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 	u64 submit_len;
 	int clone_offset = 0;
 	int clone_len;
+	int logical;
 	int ret;
 	blk_status_t status;
 	struct btrfs_io_geometry geom;
 	struct btrfs_dio_data *dio_data = iomap->private;
+	struct extent_map *em;
 
 	dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
 	if (!dip) {
@@ -7970,11 +7980,17 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 	}
 
 	start_sector = dio_bio->bi_iter.bi_sector;
+	logical = start_sector << 9;
 	submit_len = dio_bio->bi_iter.bi_size;
 
 	do {
-		ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio),
-					    start_sector << 9, submit_len,
+		em = btrfs_get_chunk_map(fs_info, logical, submit_len);
+		if (IS_ERR(em)) {
+			status = errno_to_blk_status(ret);
+			goto out_err;
+		}
+		ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
+					    logical, submit_len,
 					    &geom);
 		if (ret) {
 			status = errno_to_blk_status(ret);
@@ -8030,12 +8046,15 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap,
 		clone_offset += clone_len;
 		start_sector += clone_len >> 9;
 		file_offset += clone_len;
+
+		free_extent_map(em);
 	} while (submit_len > 0);
 	return BLK_QC_T_NONE;
 
 out_err:
 	dip->dio_bio->bi_status = status;
 	btrfs_dio_private_put(dip);
+	free_extent_map(em);
 	return BLK_QC_T_NONE;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a8ec8539cd8d..4c753b17c0a2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5940,23 +5940,24 @@ static bool need_full_stripe(enum btrfs_map_op op)
 }
 
 /*
- * btrfs_get_io_geometry - calculates the geomery of a particular (address, len)
  *			   tuple. This information is used to calculate how big a
  *			   particular bio can get before it straddles a stripe.
  *
- * @fs_info - the filesystem
- * @logical - address that we want to figure out the geometry of
- * @len    - the length o
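Condensed, the new calling convention looks like this (a sketch distilled
from the diff above, with a hypothetical wrapper function; the point is
that the caller now owns the extent map reference):

  static int example_map_io(struct btrfs_fs_info *fs_info, struct bio *bio,
  			  u64 logical, u64 map_length)
  {
  	struct btrfs_io_geometry geom;
  	struct extent_map *em;
  	int ret;

  	/* look up the chunk mapping once ... */
  	em = btrfs_get_chunk_map(fs_info, logical, map_length);
  	if (IS_ERR(em))
  		return PTR_ERR(em);

  	/* ... pass it to the geometry calculation ... */
  	ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
  				    map_length, &geom);

  	/* ... reuse em for any further mapping work instead of a second
  	 * btrfs_get_chunk_map() call, then drop the reference. */
  	free_extent_map(em);
  	return ret;
  }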
Re: [PATCH 03/17] blk-crypto: use bio_kmalloc in blk_crypto_clone_bio
On Tue, Jan 26, 2021 at 03:52:33PM +0100, Christoph Hellwig wrote:
> Use bio_kmalloc instead of open coding it.
> 
> Signed-off-by: Christoph Hellwig
> ---
>  block/blk-crypto-fallback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c
> index 50c225398e4d60..e8327c50d7c9f4 100644
> --- a/block/blk-crypto-fallback.c
> +++ b/block/blk-crypto-fallback.c
> @@ -164,7 +164,7 @@ static struct bio *blk_crypto_clone_bio(struct bio *bio_src)
>  	struct bio_vec bv;
>  	struct bio *bio;
>  
> -	bio = bio_alloc_bioset(GFP_NOIO, bio_segments(bio_src), NULL);
> +	bio = bio_kmalloc(GFP_NOIO, bio_segments(bio_src));
>  	if (!bio)
>  		return NULL;
>  	bio->bi_bdev = bio_src->bi_bdev;
> -- 

Looks good,

Reviewed-by: Eric Biggers
[PATCH v4 11/12] btrfs: adjust the flush trace point to include the source
Since we have normal ticketed flushing and preemptive flushing, adjust
the tracepoint so that we know the source of the flushing action to make
it easier to debug problems.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c        | 20 
 include/trace/events/btrfs.h | 10 ++
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 6afb9cac694a..48c2a4eae235 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -667,7 +667,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
  */
 static void flush_space(struct btrfs_fs_info *fs_info,
 			struct btrfs_space_info *space_info, u64 num_bytes,
-			enum btrfs_flush_state state)
+			enum btrfs_flush_state state, bool for_preempt)
 {
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_trans_handle *trans;
@@ -750,7 +750,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 	}
 
 	trace_btrfs_flush_space(fs_info, space_info->flags, num_bytes, state,
-				ret);
+				ret, for_preempt);
 	return;
 }
 
@@ -989,7 +989,8 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 
 	flush_state = FLUSH_DELAYED_ITEMS_NR;
 	do {
-		flush_space(fs_info, space_info, to_reclaim, flush_state);
+		flush_space(fs_info, space_info, to_reclaim, flush_state,
+			    false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1125,7 +1126,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 			to_reclaim >>= 2;
 		if (!to_reclaim)
 			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
-		flush_space(fs_info, space_info, to_reclaim, flush);
+		flush_space(fs_info, space_info, to_reclaim, flush, true);
 		cond_resched();
 		spin_lock(&space_info->lock);
 	}
@@ -1222,7 +1223,8 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
 	spin_unlock(&space_info->lock);
 
 	while (!space_info->full) {
-		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
+		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
+			    false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1235,7 +1237,7 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
 
 	while (flush_state < ARRAY_SIZE(data_flush_states)) {
 		flush_space(fs_info, space_info, U64_MAX,
-			    data_flush_states[flush_state]);
+			    data_flush_states[flush_state], false);
 		spin_lock(&space_info->lock);
 		if (list_empty(&space_info->tickets)) {
 			space_info->flush = 0;
@@ -1308,7 +1310,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 
 	flush_state = 0;
 	do {
-		flush_space(fs_info, space_info, to_reclaim, states[flush_state]);
+		flush_space(fs_info, space_info, to_reclaim, states[flush_state],
+			    false);
 		flush_state++;
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
@@ -1324,7 +1327,8 @@ static void priority_reclaim_data_space(struct btrfs_fs_info *fs_info,
 					struct reserve_ticket *ticket)
 {
 	while (!space_info->full) {
-		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
+		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
+			    false);
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
 			spin_unlock(&space_info->lock);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0cf02dfd4c01..74e5cc247b80 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1113,15 +1113,16 @@ TRACE_EVENT(btrfs_trigger_flush,
 TRACE_EVENT(btrfs_flush_space,
 
 	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 num_bytes,
-		 int state, int ret),
+		 int state, int ret, bool for_preempt),
 
-	TP_ARGS(fs_info, flags, num_bytes, state, ret),
+	TP_ARGS(fs_info, flags, num_bytes, state, ret, for_preempt),
 
 	TP_STRUCT__entry_btrfs(
 		__field(	u64,	flags		)
 		__field(	u64,	num_bytes	)
 		__fi
[PATCH v4 08/12] btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things: it returns
the space currently required for flushing by the tickets, and if there
are no tickets it calculates a value for the preemptive flushing.

However for the normal ticketed flushing we really only care about the
space required for tickets. We will accidentally come in and flush one
time, but as soon as we see there are no tickets we bail out of our
flushing.

Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
what is required for flushing if we have people waiting on space. Then
move the preemptive flushing logic into need_preemptive_reclaim(). We
ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
because if we are in this path then we made our reservation and there
are no pending tickets currently, so we do not need to check it; simply
do the fuzzy logic to check if we're getting low on space.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c | 44 ---
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index c3c586b33b4b..8f3b4cc8b812 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -759,7 +759,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 {
 	u64 used;
 	u64 avail;
-	u64 expected;
 	u64 to_reclaim = space_info->reclaim_size;
 
 	lockdep_assert_held(&space_info->lock);
@@ -777,28 +776,6 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 	if (space_info->total_bytes + avail < used)
 		to_reclaim += used - (space_info->total_bytes + avail);
 
-	if (to_reclaim)
-		return to_reclaim;
-
-	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		return 0;
-
-	used = btrfs_space_info_used(space_info, true);
-
-	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		expected = div_factor_fine(space_info->total_bytes, 95);
-	else
-		expected = div_factor_fine(space_info->total_bytes, 90);
-
-	if (used > expected)
-		to_reclaim = used - expected;
-	else
-		to_reclaim = 0;
-	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
-				     space_info->bytes_reserved);
 	return to_reclaim;
 }
 
@@ -807,6 +784,7 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 				    u64 used)
 {
 	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
+	u64 to_reclaim, expected;
 
 	/* If we're just plain full then async reclaim just slows us down. */
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
@@ -819,7 +797,25 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 	if (space_info->reclaim_size)
 		return 0;
 
-	if (!btrfs_calc_reclaim_metadata_size(fs_info, space_info))
+	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
+	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
+				 BTRFS_RESERVE_FLUSH_ALL))
+		return 0;
+
+	used = btrfs_space_info_used(space_info, true);
+	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
+				 BTRFS_RESERVE_FLUSH_ALL))
+		expected = div_factor_fine(space_info->total_bytes, 95);
+	else
+		expected = div_factor_fine(space_info->total_bytes, 90);
+
+	if (used > expected)
+		to_reclaim = used - expected;
+	else
+		to_reclaim = 0;
+	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
+				     space_info->bytes_reserved);
+	if (!to_reclaim)
 		return 0;
 
 	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
-- 
2.26.2
[PATCH v4 09/12] btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit
unwieldy and confusing. Simplify the logic to start preemptive flushing
if we've reserved more than half of our available free space.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c | 73 ---
 1 file changed, 48 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 8f3b4cc8b812..1c4226f78e27 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -780,11 +780,11 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 }
 
 static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
-				    struct btrfs_space_info *space_info,
-				    u64 used)
+				    struct btrfs_space_info *space_info)
 {
+	u64 ordered, delalloc;
 	u64 thresh = div_factor_fine(space_info->total_bytes, 98);
-	u64 to_reclaim, expected;
+	u64 used;
 
 	/* If we're just plain full then async reclaim just slows us down. */
 	if ((space_info->bytes_used + space_info->bytes_reserved) >= thresh)
@@ -797,26 +797,52 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 	if (space_info->reclaim_size)
 		return 0;
 
-	to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, SZ_16M);
-	if (btrfs_can_overcommit(fs_info, space_info, to_reclaim,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		return 0;
+	/*
+	 * If we have over half of the free space occupied by reservations or
+	 * pinned then we want to start flushing.
+	 *
+	 * We do not do the traditional thing here, which is to say
+	 *
+	 *	if (used >= ((total_bytes + avail) >> 1))
+	 *		return 1;
+	 *
+	 * because this doesn't quite work how we want. If we had more than 50%
+	 * of the space_info used by bytes_used and we had 0 available we'd just
+	 * constantly run the background flusher. Instead we want it to kick in
+	 * if our reclaimable space exceeds 50% of our available free space.
+	 */
+	thresh = calc_available_free_space(fs_info, space_info,
+					   BTRFS_RESERVE_FLUSH_ALL);
+	thresh += (space_info->total_bytes - space_info->bytes_used -
+		   space_info->bytes_reserved - space_info->bytes_readonly);
+	thresh >>= 1;
 
-	used = btrfs_space_info_used(space_info, true);
-	if (btrfs_can_overcommit(fs_info, space_info, SZ_1M,
-				 BTRFS_RESERVE_FLUSH_ALL))
-		expected = div_factor_fine(space_info->total_bytes, 95);
-	else
-		expected = div_factor_fine(space_info->total_bytes, 90);
+	used = space_info->bytes_pinned;
 
-	if (used > expected)
-		to_reclaim = used - expected;
+	/*
+	 * If we have more ordered bytes than delalloc bytes then we're either
+	 * doing a lot of DIO, or we simply don't have a lot of delalloc waiting
+	 * around. Preemptive flushing is only useful in that it can free up
+	 * space before tickets need to wait for things to finish. In the case
+	 * of ordered extents, preemptively waiting on ordered extents gets us
+	 * nothing, if our reservations are tied up in ordered extents we'll
+	 * simply have to slow down writers by forcing them to wait on ordered
+	 * extents.
+	 *
+	 * In the case that ordered is larger than delalloc, only include the
+	 * block reserves that we would actually be able to directly reclaim
+	 * from. In this case if we're heavy on metadata operations this will
+	 * clearly be heavy enough to warrant preemptive flushing. In the case
+	 * of heavy DIO or ordered reservations, preemptive flushing will just
+	 * waste time and cause us to slow down.
+	 */
+	ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
+	delalloc = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
+	if (ordered >= delalloc)
+		used += fs_info->delayed_refs_rsv.reserved +
+			fs_info->delayed_block_rsv.reserved;
 	else
-		to_reclaim = 0;
-	to_reclaim = min(to_reclaim, space_info->bytes_may_use +
-				     space_info->bytes_reserved);
-	if (!to_reclaim)
-		return 0;
+		used += space_info->bytes_may_use;
 
 	return (used >= thresh && !btrfs_fs_closing(fs_info) &&
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
@@ -1013,7 +1039,6 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 	struct btrfs_block_rsv *delayed_refs_rsv;
 	struct btrfs_block_rsv *global_rsv;
 	struct btrfs_block_rsv *trans_rsv;
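To make the new threshold concrete, here is a minimal userspace sketch of
the arithmetic with made-up numbers (every input is an assumption for
illustration, including the calc_available_free_space() result):

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
  	uint64_t total_bytes    = 8ULL << 30; /* 8 GiB of metadata space */
  	uint64_t bytes_used     = 3ULL << 30;
  	uint64_t bytes_reserved = 1ULL << 30;
  	uint64_t bytes_readonly = 0;
  	uint64_t avail          = 2ULL << 30; /* calc_available_free_space() */
  	uint64_t pinned         = 1ULL << 30;
  	uint64_t bytes_may_use  = 2ULL << 30; /* delalloc-heavy case */

  	/* flush once reclaimable space exceeds half the free space */
  	uint64_t thresh = avail + total_bytes - bytes_used -
  			  bytes_reserved - bytes_readonly;
  	thresh >>= 1;

  	/* ordered < delalloc branch: count pinned plus bytes_may_use */
  	uint64_t used = pinned + bytes_may_use;

  	printf("thresh=%llu used=%llu -> %s\n",
  	       (unsigned long long)thresh, (unsigned long long)used,
  	       used >= thresh ? "start preemptive flushing" : "do nothing");
  	return 0;
  }

With these numbers the threshold works out to 3 GiB and the reclaimable
total to 3 GiB, so the background flusher would kick in.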
[PATCH v4 02/12] btrfs: add a trace point for reserve tickets
While debugging an ENOSPC related performance problem I needed to see the
time difference between start and end of a reserve ticket, so add a
trace point to report when we handle a reserve ticket.

I opted to spit out start_ns itself without calculating the difference
because there could be a gap between enabling the tracepoint and setting
start_ns. Doing it this way allows us to filter on 0 start_ns so we
don't get bogus entries, and we can easily calculate the time difference
with bpftrace or something else.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c        | 12 +++-
 include/trace/events/btrfs.h | 29 +
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 975bb109e8b9..af6ab30e36e7 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -1220,6 +1220,8 @@ static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
  * @fs_info:    the filesystem
  * @space_info: space info for the reservation
  * @ticket:     ticket for the reservation
+ * @start_ns:   timestamp when the reservation started
+ * @orig_bytes: amount of bytes originally reserved
  * @flush:      how much we can flush
  *
  * This does the work of figuring out how to flush for the ticket, waiting for
@@ -1228,6 +1230,7 @@ static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
 static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 				 struct btrfs_space_info *space_info,
 				 struct reserve_ticket *ticket,
+				 u64 start_ns, u64 orig_bytes,
 				 enum btrfs_reserve_flush_enum flush)
 {
 	int ret;
@@ -1283,6 +1286,8 @@ static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
 	 * space wasn't reserved at all).
 	 */
 	ASSERT(!(ticket->bytes == 0 && ticket->error));
+	trace_btrfs_reserve_ticket(fs_info, space_info->flags, orig_bytes,
+				   start_ns, flush, ticket->error);
 	return ret;
 }
 
@@ -1317,6 +1322,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 {
 	struct work_struct *async_work;
 	struct reserve_ticket ticket;
+	u64 start_ns = 0;
 	u64 used;
 	int ret = 0;
 	bool pending_tickets;
@@ -1369,6 +1375,9 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 		space_info->reclaim_size += ticket.bytes;
 		init_waitqueue_head(&ticket.wait);
 		ticket.steal = (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL);
+		if (trace_btrfs_reserve_ticket_enabled())
+			start_ns = ktime_get_ns();
+
 		if (flush == BTRFS_RESERVE_FLUSH_ALL ||
 		    flush == BTRFS_RESERVE_FLUSH_ALL_STEAL ||
 		    flush == BTRFS_RESERVE_FLUSH_DATA) {
@@ -1405,7 +1414,8 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 	if (!ret || flush == BTRFS_RESERVE_NO_FLUSH)
 		return ret;
 
-	return handle_reserve_ticket(fs_info, space_info, &ticket, flush);
+	return handle_reserve_ticket(fs_info, space_info, &ticket, start_ns,
+				     orig_bytes, flush);
 }
 
 /**
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b9896fc06160..b0ea2a108be3 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -2026,6 +2026,35 @@ TRACE_EVENT(btrfs_convert_extent_bit,
 		  __print_flags(__entry->clear_bits, "|", EXTENT_FLAGS))
 );
 
+TRACE_EVENT(btrfs_reserve_ticket,
+	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 bytes,
+		 u64 start_ns, int flush, int error),
+
+	TP_ARGS(fs_info, flags, bytes, start_ns, flush, error),
+
+	TP_STRUCT__entry_btrfs(
+		__field(	u64,	flags		)
+		__field(	u64,	bytes		)
+		__field(	u64,	start_ns	)
+		__field(	int,	flush		)
+		__field(	int,	error		)
+	),
+
+	TP_fast_assign_btrfs(fs_info,
+		__entry->flags		= flags;
+		__entry->bytes		= bytes;
+		__entry->start_ns	= start_ns;
+		__entry->flush		= flush;
+		__entry->error		= error;
+	),
+
+	TP_printk_btrfs("flags=%s bytes=%llu start_ns=%llu flush=%s error=%d",
+			__print_flags(__entry->flags, "|", BTRFS_GROUP_FLAGS),
+			__entry->bytes, __entry->start_ns,
+			__print_symbolic(__entry->flush, FLUSH_ACTIONS),
+			__entry->error)
+);
+
 DECLARE_EVENT_CLASS(btrfs_sleep_tree_lock,
 	TP_PROTO(const struct extent_buffer *eb, u64 start_ns),
 
-- 
2.26.2
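As a concrete example of the bpftrace usage mentioned above (a sketch: it
assumes this patch is applied, and that bpftrace's nsecs builtin and the
patch's ktime_get_ns() both read CLOCK_MONOTONIC, which they do), this
one-liner histograms ticket wait times while filtering out the bogus zero
start_ns entries:

  # histogram of reserve ticket wait times, in nanoseconds
  bpftrace -e 'tracepoint:btrfs:btrfs_reserve_ticket
  	/args->start_ns > 0/ { @wait_ns = hist(nsecs - args->start_ns); }'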
[PATCH v4 04/12] btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the
transaction commit without any of the ambiguity of
may_commit_transaction(). This is because may_commit_transaction()
checks tickets and such, and in preemptive flushing we already know
it'll be helpful, so use this to keep the code nice and clean and
straightforward.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h             |  1 +
 fs/btrfs/space-info.c        | 14 ++
 include/trace/events/btrfs.h |  3 ++-
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7d8660227520..90726954b883 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2740,6 +2740,7 @@ enum btrfs_flush_state {
 	ALLOC_CHUNK_FORCE	=	8,
 	RUN_DELAYED_IPUTS	=	9,
 	COMMIT_TRANS		=	10,
+	FORCE_COMMIT_TRANS	=	11,
 };
 
 int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index ff138cef7d0b..94c9534505c5 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -735,6 +735,14 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 	case COMMIT_TRANS:
 		ret = may_commit_transaction(fs_info, space_info);
 		break;
+	case FORCE_COMMIT_TRANS:
+		trans = btrfs_join_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
+		ret = btrfs_commit_transaction(trans);
+		break;
 	default:
 		ret = -ENOSPC;
 		break;
@@ -1037,6 +1045,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
  * For data we start with alloc chunk force, however we could have been full
  * before, and then the transaction commit could have freed new block groups,
  * so if we now have space to allocate do the force chunk allocation.
+ *
+ * FORCE_COMMIT_TRANS
+ *   For use by the preemptive flusher. We use this to bypass the ticketing
+ *   checks in may_commit_transaction, as we have more information about the
+ *   overall state of the system and may want to commit the transaction ahead
+ *   of actual ENOSPC conditions.
  */
 static const enum btrfs_flush_state data_flush_states[] = {
 	FLUSH_DELALLOC_WAIT,
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b0ea2a108be3..0cf02dfd4c01 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -99,7 +99,8 @@ struct btrfs_space_info;
 	EM( ALLOC_CHUNK,		"ALLOC_CHUNK")		\
 	EM( ALLOC_CHUNK_FORCE,		"ALLOC_CHUNK_FORCE")	\
 	EM( RUN_DELAYED_IPUTS,		"RUN_DELAYED_IPUTS")	\
-	EMe(COMMIT_TRANS,		"COMMIT_TRANS")
+	EM( COMMIT_TRANS,		"COMMIT_TRANS")		\
+	EMe(FORCE_COMMIT_TRANS,		"FORCE_COMMIT_TRANS")
 
 /*
  * First define the enums in the above macros to be exported to userspace via
-- 
2.26.2
Re: [PATCH v3 01/12] btrfs: make flush_space take a enum btrfs_flush_state instead of int
On 1/26/21 1:36 PM, David Sterba wrote:
> On Fri, Oct 09, 2020 at 09:28:18AM -0400, Josef Bacik wrote:
>> I got an automated message from somebody who runs clang against our
>> kernels and it's because I used the wrong enum type for what I passed
>> into flush_space. Change the argument to be explicitly the enum we're
>> expecting to make everything consistent. Maybe eventually gcc will
>> catch errors like this.
>
> I can't find any such public report and none of the clang warnings seem
> to be specific about that. Local run with clang 11 does not produce any
> warning.

IDK, it was a private email just to me with the following output from clang:

fs/btrfs/space-info.c:1115:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
                        flush = FLUSH_DELALLOC;
                              ~ ^~~~~~~~~~~~~~
fs/btrfs/space-info.c:1120:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
                        flush = FORCE_COMMIT_TRANS;
                              ~ ^~~~~~~~~~~~~~~~~~
fs/btrfs/space-info.c:1124:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
                        flush = FLUSH_DELAYED_ITEMS_NR;
                              ~ ^~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/space-info.c:1127:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
                        flush = FLUSH_DELAYED_REFS_NR;
                              ~ ^~~~~~~~~~~~~~~~~~~~~

I figure it made sense, might as well fix it. Do we have that option for
our shiny new -W build options? Thanks,

Josef
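For reference, the class of mistake the warning catches reproduces in a
few lines (hypothetical stand-in enums, not the kernel definitions;
compile with clang and -Wenum-conversion):

  enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1 };
  enum btrfs_reserve_flush_enum { BTRFS_RESERVE_NO_FLUSH = 0 };

  int main(void)
  {
  	enum btrfs_reserve_flush_enum flush;

  	/* clang: implicit conversion from enumeration type
  	 * 'enum btrfs_flush_state' to different enumeration type
  	 * 'enum btrfs_reserve_flush_enum' [-Wenum-conversion] */
  	flush = FLUSH_DELAYED_ITEMS_NR;
  	return (int)flush;
  }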
[PATCH v4 00/12] Improve preemptive ENOSPC flushing
v3->v4:
- Added some comments.
- Updated the tracepoint for the preemptive flushing to take a bool
  instead of an int.
- Removed 'inline' from need_preemptive_flushing.

v2->v3:
- Added a cleanup to make sure we pass the right enum around for
  flush_space().
- Fixed a problem reported by a clang build where I used the wrong enum
  in the preemptive flushing for flush_space().
- Fixed up some nits pointed out by Nikolay.

v1->v2:
- Added a FORCE_COMMIT_TRANS flush operation so we can keep the
  flush_space stuff consistent and get all the normal tracepoints.
- Renamed fs_info->dio_bytes to ->ordered_bytes and changed it to count
  all ordered extents that were pending, not just DIO ordered extents
  that were pending.
- Reworked the clamping to not apply if we're not doing a lot of
  delalloc reservations.
- Reworked the preempt flushing loop to be more straightforward.
- Fixed the need_preemptive_flushing() helper to take into account DIO
  heavy workloads.

--- Original email ---

A while ago Nikolay started digging into a problem where they were
seeing an around 20% regression on random writes, and he bisected it
down to

  btrfs: don't end the transaction for delayed refs in throttle

However this wasn't actually the cause of the problem.

This patch removed the code that would preemptively end the transactions
if we were low on space. Because we had just introduced the ticketing
code, this was no longer necessary and was causing a lot of transaction
commits.

And in Nikolay's testing he validated this: we would see like 100x more
transaction commits without that patch than with it, but the write
regression clearly appeared when this patch was applied.

The root cause of this is that the transaction commits were essentially
happening so quickly that we didn't end up needing to wait on space in
the ENOSPC ticketing code as much, and thus were able to write pretty
quickly. With this gone, we now were getting a sawtoothy sort of
behavior where we'd run up, stop while we flushed metadata space, run
some more, stop again, etc.

When I implemented the ticketing infrastructure, I was trying to get us
out of excessively flushing space because we would sometimes over create
block groups, and thus short circuited flushing if we no longer had
tickets. This had the side effect of breaking the preemptive flushing
code, where we attempted to flush space in the background before we were
forced to wait for space.

Enter this patchset. We still have some of this preemption logic
sprinkled everywhere, so I've separated it out of the normal ticketed
flushing code, and made preemptive flushing its own thing.

The preemptive flushing logic is more specialized than the standard
flushing logic. It attempts to flush in whichever pool has the highest
usage. This means that if most of our space is tied up in pinned
extents, we'll commit the transaction. If most of the space is tied up
in delalloc, we'll flush delalloc, etc.

To test this out I used the fio job that Nikolay used; this needs to be
adjusted so the overall IO size is at least 2x the RAM size for the box
you are testing:

fio --direct=0 --ioengine=sync --thread --directory=/mnt/test --invalidate=1 \
	--group_reporting=1 --runtime=300 --fallocate=none --ramp_time=10 \
	--name=RandomWrites-async-64512-4k-4 --new_group --rw=randwrite \
	--size=2g --numjobs=4 --bs=4k --fsync_on_close=0 --end_fsync=0 \
	--filename_format=FioWorkloads.\$jobnum

I got the following results:

misc-next:
  bw=13.4MiB/s (14.0MB/s), 13.4MiB/s-13.4MiB/s (14.0MB/s-14.0MB/s),
  io=4015MiB (4210MB), run=300323-300323msec

pre-throttling:
  bw=16.9MiB/s (17.7MB/s), 16.9MiB/s-16.9MiB/s (17.7MB/s-17.7MB/s),
  io=5068MiB (5314MB), run=300069-300069msec

my patches:
  bw=18.0MiB/s (18.9MB/s), 18.0MiB/s-18.0MiB/s (18.9MB/s-18.9MB/s),
  io=5403MiB (5666MB), run=31-31msec

Thanks,

Josef

Josef Bacik (12):
  btrfs: make flush_space take a enum btrfs_flush_state instead of int
  btrfs: add a trace point for reserve tickets
  btrfs: track ordered bytes instead of just dio ordered bytes
  btrfs: introduce a FORCE_COMMIT_TRANS flush operation
  btrfs: improve preemptive background space flushing
  btrfs: rename need_do_async_reclaim
  btrfs: check reclaim_size in need_preemptive_reclaim
  btrfs: rework btrfs_calc_reclaim_metadata_size
  btrfs: simplify the logic in need_preemptive_flushing
  btrfs: implement space clamping for preemptive flushing
  btrfs: adjust the flush trace point to include the source
  btrfs: add a trace class for dumping the current ENOSPC state

 fs/btrfs/ctree.h             |   4 +-
 fs/btrfs/disk-io.c           |   9 +-
 fs/btrfs/ordered-data.c      |  13 +-
 fs/btrfs/space-info.c        | 295 +--
 fs/btrfs/space-info.h        |   4 +
 include/trace/events/btrfs.h | 104 +++-
 6 files changed, 362 insertions(+), 67 deletions(-)

-- 
2.26.2
[PATCH v4 01/12] btrfs: make flush_space take a enum btrfs_flush_state instead of int
I got an automated message from somebody who runs clang against our
kernels, and it's because I used the wrong enum type for what I passed
into flush_space. Change the argument to be explicitly the enum we're
expecting to make everything consistent. Maybe eventually gcc will catch
errors like this.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/space-info.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index fd8e79e3c10e..975bb109e8b9 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -670,7 +670,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
  */
 static void flush_space(struct btrfs_fs_info *fs_info,
 			struct btrfs_space_info *space_info, u64 num_bytes,
-			int state)
+			enum btrfs_flush_state state)
 {
 	struct btrfs_root *root = fs_info->extent_root;
 	struct btrfs_trans_handle *trans;
@@ -923,7 +923,7 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_space_info *space_info;
 	u64 to_reclaim;
-	int flush_state;
+	enum btrfs_flush_state flush_state;
 	int commit_cycles = 0;
 	u64 last_tickets_id;
 
-- 
2.26.2
Re: [PATCH] btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb
On Mon, Jan 25, 2021 at 08:43:25PM +0100, Roman Anasal wrote:
> btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
> struct send_ctx *sctx but when used in changed_cb it is immediately
> casted to `struct send_ctx *sctx = ctx;`.
> 
> changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
> - full_send_tree already passes a struct send_ctx *sctx
> - btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
> - void *ctx in btrfs_compare_trees is only used to be passed to changed_cb
> 
> So casting to/from void *ctx seems unnecessary and directly using
> struct send_ctx *sctx instead provides better type-safety.
> 
> The original reason for using void *ctx in the first place seems to have
> been dropped with
> 1b51d6f ("btrfs: send: remove indirect callback parameter for changed_cb")
> 
> Signed-off-by: Roman Anasal

Makes sense, added to misc-next, thanks.
Re: [PATCH v3 06/12] btrfs: rename need_do_async_reclaim
On Fri, Oct 09, 2020 at 09:28:23AM -0400, Josef Bacik wrote:
> All of our normal flushing is asynchronous reclaim, so this helper is
> poorly named. This is more checking if we need to preemptively flush
> space, so rename it to need_preemptive_reclaim.
> 
> Reviewed-by: Nikolay Borisov
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/space-info.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 0f84bee57c29..f37ead28bd05 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -799,9 +799,9 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
>  	return to_reclaim;
>  }
>  
> -static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
> -					struct btrfs_space_info *space_info,
> -					u64 used)
> +static inline bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,

Following patches add more code to that function so it's not really
suitable for 'static inline', I'd rather drop it.
Re: [PATCH 02/17] btrfs: use bio_kmalloc in __alloc_device
On 1/26/21 7:04 AM, Christoph Hellwig wrote:
> Use bio_kmalloc instead of open coding it.
> 
> Signed-off-by: Christoph Hellwig

Looks good.

Reviewed-by: Chaitanya Kulkarni
Re: [PATCH 01/17] zonefs: use bio_alloc in zonefs_file_dio_append
On 1/26/21 7:01 AM, Christoph Hellwig wrote:
> Use bio_alloc instead of open coding it.
> 
> Signed-off-by: Christoph Hellwig

Ha ha, I was going to send out a patch for this :P.

Reviewed-by: Chaitanya Kulkarni
Re: [PATCH v3 11/12] btrfs: adjust the flush trace point to include the source
On Fri, Oct 09, 2020 at 09:28:28AM -0400, Josef Bacik wrote:
> Since we have normal ticketed flushing and preemptive flushing, adjust
> the tracepoint so that we know the source of the flushing action to make
> it easier to debug problems.
> 
> Reviewed-by: Nikolay Borisov
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/space-info.c        | 20 
>  include/trace/events/btrfs.h | 10 ++
>  2 files changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 5ee698c12a7b..30474fa30985 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -664,7 +664,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
>   */
>  static void flush_space(struct btrfs_fs_info *fs_info,
>  			struct btrfs_space_info *space_info, u64 num_bytes,
> -			enum btrfs_flush_state state)
> +			enum btrfs_flush_state state, bool for_preempt)
>  {
>  	struct btrfs_root *root = fs_info->extent_root;
>  	struct btrfs_trans_handle *trans;
> @@ -747,7 +747,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
>  	}
>  
>  	trace_btrfs_flush_space(fs_info, space_info->flags, num_bytes, state,
> -				ret);
> +				ret, for_preempt);
>  	return;
>  }
>  
> @@ -973,7 +973,8 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
>  
>  	flush_state = FLUSH_DELAYED_ITEMS_NR;
>  	do {
> -		flush_space(fs_info, space_info, to_reclaim, flush_state);
> +		flush_space(fs_info, space_info, to_reclaim, flush_state,
> +			    false);
>  		spin_lock(&space_info->lock);
>  		if (list_empty(&space_info->tickets)) {
>  			space_info->flush = 0;
> @@ -1109,7 +1110,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
>  			to_reclaim >>= 2;
>  		if (!to_reclaim)
>  			to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
> -		flush_space(fs_info, space_info, to_reclaim, flush);
> +		flush_space(fs_info, space_info, to_reclaim, flush, true);
>  		cond_resched();
>  		spin_lock(&space_info->lock);
>  	}
> @@ -1200,7 +1201,8 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
>  	spin_unlock(&space_info->lock);
>  
>  	while (!space_info->full) {
> -		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
> +		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
> +			    false);
>  		spin_lock(&space_info->lock);
>  		if (list_empty(&space_info->tickets)) {
>  			space_info->flush = 0;
> @@ -1213,7 +1215,7 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work)
>  
>  	while (flush_state < ARRAY_SIZE(data_flush_states)) {
>  		flush_space(fs_info, space_info, U64_MAX,
> -			    data_flush_states[flush_state]);
> +			    data_flush_states[flush_state], false);
>  		spin_lock(&space_info->lock);
>  		if (list_empty(&space_info->tickets)) {
>  			space_info->flush = 0;
> @@ -1286,7 +1288,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
>  
>  	flush_state = 0;
>  	do {
> -		flush_space(fs_info, space_info, to_reclaim, states[flush_state]);
> +		flush_space(fs_info, space_info, to_reclaim, states[flush_state],
> +			    false);
>  		flush_state++;
>  		spin_lock(&space_info->lock);
>  		if (ticket->bytes == 0) {
> @@ -1302,7 +1305,8 @@ static void priority_reclaim_data_space(struct btrfs_fs_info *fs_info,
>  					struct reserve_ticket *ticket)
>  {
>  	while (!space_info->full) {
> -		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE);
> +		flush_space(fs_info, space_info, U64_MAX, ALLOC_CHUNK_FORCE,
> +			    false);
>  		spin_lock(&space_info->lock);
>  		if (ticket->bytes == 0) {
>  			spin_unlock(&space_info->lock);
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index 0a3d35d952c4..6d93637bae02 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -1112,15 +1112,16 @@ TRACE_EVENT(btrfs_trigger_flush,
>  TRACE_EVENT(btrfs_flush_space,
>  
>  	TP_PROTO(const struct btrfs_fs_info *fs_info, u64 flags, u64 num_bytes,
> -		 int state, int ret),
> +		 int state, int ret, int for_preempt),

for_preempt should be bool

> 
> -	TP_ARGS(fs_info, flags, num_bytes, state, ret),
> +	TP_ARGS(fs_info, flags, num_bytes, state, ret, for_pre
Re: [PATCH 05/17] block: use an on-stack bio in blkdev_issue_flush
On 1/26/21 7:07 AM, Christoph Hellwig wrote:
> There is no point in allocating memory for a synchronous flush.
> 
> Signed-off-by: Christoph Hellwig

True, looks good.

Reviewed-by: Chaitanya Kulkarni
Re: [PATCH 16/17] nilfs2: remove cruft in nilfs_alloc_seg_bio
On 1/26/21 7:27 AM, Christoph Hellwig wrote:
> bio_alloc never returns NULL when it can sleep.
> 
> Signed-off-by: Christoph Hellwig
> ---

Looks good.

Reviewed-by: Chaitanya Kulkarni
Re: [PATCH v2 3/4] btrfs: send: fix invalid commands for inodes with changed rdev but same gen
On Monday, 25.01.2021 at 20:51 +0000, Filipe Manana wrote:
> On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal wrote:
> > This is analogous to the preceding patch ("btrfs: send: fix invalid
> > commands for inodes with changed type but same gen") but for changed
> > rdev:
> > 
> > When doing an incremental send, if a new inode has the same number as an
> > inode in the parent subvolume, was created with the same generation but
> > has differing rdev, it will not be detected as changed and thus not
> > recreated. This will lead to incorrect results on the receiver, where the
> > inode will keep the rdev of the inode in the parent subvolume, or even
> > fail when also the ref is unchanged.
> > 
> > This case does not happen when doing incremental sends with snapshots
> > that are kept read-only by the user all the time, but it may happen if
> > - a snapshot was modified in the same transaction as its parent after it
> >   was created
> > - the subvol used as parent was created independently from the sent subvol
> > 
> > Example reproducers:
> > 
> >   # case 1: same ino at same path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mknod subvol1/a c 1 3
> >   mknod subvol2/a c 1 5
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> > 
> > The produced tree state here is:
> >   |-- subvol1
> >   |   `-- a (ino 257, c 1,3)
> >   |
> >   `-- subvol2
> >       `-- a (ino 257, c 1,5)
> > 
> > Where subvol1/a and subvol2/a are character devices with differing minor
> > numbers but same inode number and same generation.
> > 
> > Example output of the receive command:
> >   At subvol subvol2
> >   snapshot ./subvol2 uuid=7513941c-4ef7-f847-b05e-4fdfe003af7b transid=9 parent_uuid=b66f015b-c226-2548-9e39-048c7fdbec99 parent_transid=9
> >   utimes   ./subvol2/ atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
> >   link     ./subvol2/a dest=a
> >   unlink   ./subvol2/a
> >   utimes   ./subvol2/ atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
> >   utimes   ./subvol2/a atime=2021-01-25T17:14:36+0000 mtime=2021-01-25T17:14:36+0000 ctime=2021-01-25T17:14:36+0000
> > 
> > => the `link` command causes the receiver to fail with:
> >    ERROR: link a -> a failed: File exists
> > 
> > Second example:
> > 
> >   # case 2: same ino at different path
> >   btrfs subvolume create subvol1
> >   btrfs subvolume create subvol2
> >   mknod subvol1/a c 1 3
> >   mknod subvol2/b c 1 5
> >   btrfs property set subvol1 ro true
> >   btrfs property set subvol2 ro true
> >   btrfs send -p subvol1 subvol2 | btrfs receive --dump
> 
> As I've told you before for the v1 patchset from a week or two ago,
> this is not a supported scenario for incremental sends.
> Incremental sends are meant to be used on RO snapshots of the same
> subvolume, and those snapshots must never be changed after they were
> created.
> 
> Incremental sends were simply not designed for these cases, and can
> never be guaranteed to work with such cases.
> 
> The bug is not having incremental sends fail right away, with an
> explicit error message, when the send and parent roots aren't RO
> snapshots of the same subvolume.

I am sorry, I didn't want to anger you or to appear to be just stubborn
by posting this. As I wrote in the cover letter, I am aware that this is
not a supported use case and I understand that that makes the patches
likely to be rejected.

As said, the reason I _wrote_ the patches was simply to learn more about
the btrfs code and its internals and see if I would be able to
understand it enough. The reason I _submitted_ them was just to document
what I found out so others could have a look into it, and just in case
it may be useful at a later time.

I also don't want to claim that these will add full support for sending
unrelated roots - they don't! They only handle those very specific edge
cases I found, which are currently _possible_, although still not
supported.

I took a deeper look into the rest to see if it could be supported: the
comparing algorithm actually works fine, even with completely unrelated
subvolumes (i.e. btrfs_compare_trees, changed_cb, changed_inode etc.),
but the processing of the changes (i.e. process_recorded_refs etc.) is
heavily based on (ino, gen) as an identifying handle, which can not be
changed without a high risk of regression - just as you said in your
earlier comments - since the side effects of any changes are hard to see
or understand without a very deep understanding of the whole code; which
is why I didn't even try to touch those parts.

I apologize if I appeared to be stubborn or ignorant of
Re: [PATCH v3 02/12] btrfs: add a trace point for reserve tickets
On Fri, Oct 09, 2020 at 09:28:19AM -0400, Josef Bacik wrote:
> While debugging an ENOSPC related performance problem I needed to see the
> time difference between start and end of a reserve ticket, so add a
> trace point to report when we handle a reserve ticket.
> 
> I opted to spit out start_ns itself without calculating the difference
> because there could be a gap between enabling the tracepoint and setting
> start_ns. Doing it this way allows us to filter on 0 start_ns so we
> don't get bogus entries, and we can easily calculate the time difference
> with bpftrace or something else.
> 
> Reviewed-by: Nikolay Borisov
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/space-info.c        | 10 +-
>  include/trace/events/btrfs.h | 29 +
>  2 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index ba2b72409d46..ac7269cf1904 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -1224,6 +1224,7 @@ static void wait_reserve_ticket(struct btrfs_fs_info *fs_info,
>  static int handle_reserve_ticket(struct btrfs_fs_info *fs_info,
>  				 struct btrfs_space_info *space_info,
>  				 struct reserve_ticket *ticket,
> +				 u64 start_ns, u64 orig_bytes,

Added

  + * @start_ns:   timestamp when the reservation started
  + * @orig_bytes: amount of bytes originally reserved

caught by the recently added kdoc verification.

>  				 enum btrfs_reserve_flush_enum flush)
Re: [PATCH v5 0/8] A variety of lock contention fixes
On Fri, Dec 18, 2020 at 02:24:18PM -0500, Josef Bacik wrote:
> v4->v5:
> - Added "btrfs: remove bogus BUG_ON in alloc_reserved_tree_block", as
>   Nikolay pointed out I needed to explain why we no longer needed one of
>   the delayed ref flushes, which led me down the rabbit hole of trying
>   to understand why it wasn't a problem anymore. Turned out the BUG_ON()
>   is bogus.
> - Added "btrfs: move delayed ref flushing for qgroup into qgroup
>   helper": instead of removing the flushing for qgroups completely, we
>   still sort of need it, even though it's actually still broken, so I've
>   moved it into the qgroup helper.
> - Added Nikolay's rb for the last patch.

Moved from topic branch to misc-next, thanks.
Re: [PATCH v5 2/8] btrfs: only let one thread pre-flush delayed refs in commit
On Tue, Jan 12, 2021 at 11:17:45AM +0200, Nikolay Borisov wrote:
> On 11.01.21 at 23:50, David Sterba wrote:
> > On Mon, Jan 11, 2021 at 10:33:42AM +0200, Nikolay Borisov wrote:
> >> On 8.01.21 at 18:01, David Sterba wrote:
> >>> On Fri, Dec 18, 2020 at 02:24:20PM -0500, Josef Bacik wrote:
> >>>> @@ -2043,23 +2043,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> >>>>  	btrfs_trans_release_metadata(trans);
> >>>>  	trans->block_rsv = NULL;
> >>>>  
> >>>> -	/* make a pass through all the delayed refs we have so far
> >>>> -	 * any runnings procs may add more while we are here
> >>>> -	 */
> >>>> -	ret = btrfs_run_delayed_refs(trans, 0);
> >>>> -	if (ret) {
> >>>> -		btrfs_end_transaction(trans);
> >>>> -		return ret;
> >>>> -	}
> >>>> -
> >>>> -	cur_trans = trans->transaction;
> >>>> -
> >>>>  	/*
> >>>> -	 * set the flushing flag so procs in this transaction have to
> >>>> -	 * start sending their work down.
> >>>> +	 * We only want one transaction commit doing the flushing so we do not
> >>>> +	 * waste a bunch of time on lock contention on the extent root node.
> >>>>  	 */
> >>>> -	cur_trans->delayed_refs.flushing = 1;
> >>>> -	smp_wmb();
> >>>
> >>> This barrier obviously separates the flushing = 1 and the rest of the
> >>> code, now implemented as test_and_set_bit, which implies full barrier.
> >>>
> >>> However, the hunk in btrfs_should_end_transaction removes the barrier
> >>> and I'm not sure whether this is correct:
> >>>
> >>> -	smp_mb();
> >>>  	if (cur_trans->state >= TRANS_STATE_COMMIT_START ||
> >>> -	    cur_trans->delayed_refs.flushing)
> >>> +	    test_bit(BTRFS_DELAYED_REFS_FLUSHING,
> >>> +		     &cur_trans->delayed_refs.flags))
> >>>  		return true;
> >>>
> >>> This is never called under locks so we don't have complete
> >>> synchronization of either the transaction state or the flushing bit.
> >>> btrfs_should_end_transaction is merely a hint and not called in
> >>> critical places so we could probably afford to keep it without a
> >>> barrier, or keep it with comment(s).
> >>
> >> I think the point is moot in this case, because the test_bit either
> >> sees the flag or it doesn't. It's not possible for the flag to be set
> >> AND should_end_transaction to return false - that would be a gross
> >> violation of program correctness.
> >
> > So that's for the flushing part, but what about cur_trans->state?
> 
> Looking at the code, the barrier was there to order the publishing of
> the delayed_ref.flushing (now replaced by the bit flag) against
> surrounding code.
> 
> So independently of this patch, let's reason about trans state. In
> should_end_transaction it's read without holding any locks. (U)
> 
> It's modified in btrfs_cleanup_transaction without holding the
> fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set.
> 
> set in cleanup_transaction under fs_info->trans_lock (L)
> set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock (L)
> set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock (L)
> set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock (L)
> 
> set in btrfs_commit_trans to COMMIT_COMPLETED without locks, but at this
> point the transaction is finished and fs_info->running_trans is NULL (U
> but irrelevant).
> 
> So by the looks of it we can have a concurrent READ race with a WRITE,
> due to reads not taking a lock. In this case what we want to ensure is
> that we either see the new or the old state. I consulted with Will
> Deacon and he said that in such a case we'd want to annotate the
> accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical
> tear; in this case I don't think this could happen, but I imagine at
> some point kcsan would flag such an access as racy (which it is).

Thanks for the analysis, I've copied it to the changelog as there's
probably no shorter way to explain it.
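To make the suggested annotation concrete, it would look roughly like
this (a hypothetical sketch only, not part of the merged patch; it marks
the lockless reader and the locked writers of cur_trans->state):

  /* lockless reader, e.g. in btrfs_should_end_transaction() */
  if (READ_ONCE(cur_trans->state) >= TRANS_STATE_COMMIT_START)
  	return true;

  /* writer side, under fs_info->trans_lock */
  WRITE_ONCE(cur_trans->state, TRANS_STATE_COMMIT_START);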
[PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode
From: Filipe Manana

When we fsync a new file, created in the current transaction, we check
all its ancestor inodes and always log them if they were created in the
current transaction - even if we have already logged them before, which
is a waste of time.

So avoid logging new ancestor inodes if they were already logged before
and have no xattrs added/updated/removed since they were last logged.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana
---
 fs/btrfs/tree-log.c | 35 +--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index be62759f0aac..105cf316ee27 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5272,6 +5272,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	if (S_ISDIR(inode->vfs_inode.i_mode)) {
 		int max_key_type = BTRFS_DIR_LOG_INDEX_KEY;
 
+		clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags);
 		if (inode_only == LOG_INODE_EXISTS)
 			max_key_type = BTRFS_XATTR_ITEM_KEY;
 		ret = drop_objectid_items(trans, log, path, ino, max_key_type);
@@ -5520,6 +5521,34 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * Check if we need to log an inode. This is used in contexts where while
+ * logging an inode we need to log another inode (either that it exists or in
+ * full mode). This is used instead of btrfs_inode_in_log() because the latter
+ * requires the inode to be in the log and have the log transaction committed,
+ * while here we do not care if the log transaction was already committed - our
+ * caller will commit the log later - and we want to avoid logging an inode
+ * multiple times when multiple tasks have joined the same log transaction.
+ */
+static bool need_log_inode(struct btrfs_trans_handle *trans,
+			   struct btrfs_inode *inode)
+{
+	/*
+	 * If this inode does not have new/updated/deleted xattrs since the last
+	 * time it was logged and is flagged as logged in the current transaction,
+	 * we can skip logging it. As for new/deleted names, those are updated in
+	 * the log by link/unlink/rename operations.
+	 * In case the inode was logged and then evicted and reloaded, its
+	 * logged_trans will be 0, in which case we have to fully log it since
+	 * logged_trans is a transient field, not persisted.
+	 */
+	if (inode->logged_trans == trans->transid &&
+	    !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags))
+		return false;
+
+	return true;
+}
+
 struct btrfs_dir_list {
 	u64 ino;
 	struct list_head list;
@@ -5848,7 +5877,8 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
 		if (IS_ERR(inode))
 			return PTR_ERR(inode);
 
-		if (BTRFS_I(inode)->generation >= trans->transid)
+		if (BTRFS_I(inode)->generation >= trans->transid &&
+		    need_log_inode(trans, BTRFS_I(inode)))
 			ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
 					      LOG_INODE_EXISTS, ctx);
 		btrfs_add_delayed_iput(inode);
@@ -5902,7 +5932,8 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans,
 		if (root != inode->root)
 			break;
 
-		if (inode->generation >= trans->transid) {
+		if (inode->generation >= trans->transid &&
+		    need_log_inode(trans, inode)) {
 			ret = btrfs_log_inode(trans, root, inode,
 					      LOG_INODE_EXISTS, ctx);
 			if (ret)
-- 
2.28.0
[PATCH 4/7] btrfs: skip logging directories already logged when logging all parents
From: Filipe Manana Sometimes when we fsync an inode we need to do a full log of all its ancestors (due to unlink, link or rename operations), which can be an expensive operation, especially if the directories are large. However if we find an ancestor directory inode that is already logged in the current transaction, and has no inserted/updated/deleted xattrs since it was last logged, we can skip logging the directory again. We are safe to skip that since we know that for logged directories, any link, unlink or rename operations that implicate the directory will update the log as necessary. So use the helper need_log_inode(), introduced in a previous commit, to detect already logged directories that can be skipped. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 5 + 1 file changed, 5 insertions(+) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 105cf316ee27..c0dce99c2c14 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5826,6 +5826,11 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, goto out; } + if (!need_log_inode(trans, BTRFS_I(dir_inode))) { + btrfs_add_delayed_iput(dir_inode); + continue; + } + if (ctx) ctx->log_new_dentries = false; ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode), -- 2.28.0
[PATCH 0/7] btrfs: more performance improvements for dbench workloads
From: Filipe Manana The following patchset brings one more batch of performance improvements with dbench workloads, or anything that mixes file creation, file writes, renames, unlinks, etc with fsync like dbench does. This patchset is mostly based on avoiding logging directory inodes multiple times when not necessary, falling back to transaction commits less frequently and often waiting less time for transaction commits to complete. Performance results are listed in the change log of the last patch, but in short, I've experienced a reduction of maximum latency up to about -40% and throughput gains up to about +6%. Filipe Manana (7): btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fs/btrfs/file.c | 1 + fs/btrfs/transaction.c | 39 +++-- fs/btrfs/transaction.h | 2 + fs/btrfs/tree-log.c | 195 - 4 files changed, 92 insertions(+), 145 deletions(-) -- 2.28.0
[PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
From: Filipe Manana Often an fsync needs to fall back to a transaction commit for several reasons (to ensure consistency after a power failure, a new block group was allocated or a temporary error such as ENOMEM or ENOSPC happened). In that case the log is marked as needing a full commit and any concurrent tasks attempting to log inodes or commit the log will also fall back to the transaction commit. When this happens they all wait for the task that first started the transaction commit to finish the transaction commit - however they wait until the full transaction commit happens, which is not needed, as they only need to wait for the superblocks to be persisted and not for unpinning all the extents pinned during the transaction's lifetime, which even for short lived transactions can be a few thousand and take some significant amount of time to complete - for dbench workloads I have observed up to 4~5 milliseconds of time spent unpinning extents in the worst cases, and the number of pinned extents was between 2 and 3 thousand. So allow fsync tasks to skip waiting for the unpinning of extents when they call btrfs_commit_transaction() and they were not the task that started the transaction commit (that one has to do it, the alternative would be to offload the transaction commit to another task so that it could avoid waiting for the extent unpinning or offload the extent unpinning to another task). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit After applying the entire patchset, dbench shows improvements with respect to throughput and latency. The script used to measure it is the following:

$ cat dbench-test.sh
#!/bin/bash

DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 300 64

umount $MNT

The test was run on a physical machine with 12 cores (Intel Core i7), 64G of ram, using an NVMe device and a non-debug kernel configuration (Debian's default configuration).
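Before the numbers, a rough sketch of the early-wakeup idea described above. All names below are illustrative assumptions (the real implementation lives in fs/btrfs/transaction.c); the point is only that waiters return at an intermediate "superblocks written" state while the initiating task alone pays for the unpinning:

#include <linux/wait.h>

enum commit_state {
	COMMIT_RUNNING,
	COMMIT_SUPER_WRITTEN,	/* superblocks persisted: fsync is durable */
	COMMIT_COMPLETED,	/* pinned extents unpinned, cleanup done */
};

struct commit_ctl {
	enum commit_state state;
	wait_queue_head_t wait;
};

/* The task that started the commit pays the full cost. */
static void do_commit(struct commit_ctl *ctl)
{
	write_superblocks(ctl);		/* hypothetical helper */
	ctl->state = COMMIT_SUPER_WRITTEN;
	wake_up(&ctl->wait);		/* concurrent fsync tasks return here */
	unpin_extents(ctl);		/* hypothetical helper, the slow part */
	ctl->state = COMMIT_COMPLETED;
	wake_up(&ctl->wait);
}

/* A concurrent fsync task that merely joined the commit. */
static void wait_for_commit_super(struct commit_ctl *ctl)
{
	wait_event(ctl->wait, ctl->state >= COMMIT_SUPER_WRITTEN);
}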
Before applying patchset, 32 clients:

Operation      Count     AvgLat    MaxLat
NTCreateX      9627107    0.153    61.938
Close          7072076    0.001     3.175
Rename          407633    1.222    44.439
Unlink         1943895    0.658    44.440
Deltree            256   17.339   110.891
Mkdir              128    0.003     0.009
Qpathinfo      8725406    0.064    17.850
Qfileinfo      1529516    0.001     2.188
Qfsinfo        1599884    0.002     1.457
Sfileinfo       784200    0.005     3.562
Find           3373513    0.411    30.312
WriteX         4802132    0.053    29.054
ReadX         15089959    0.002     5.801
LockX            31344    0.002     0.425
UnlockX          31344    0.001     0.173
Flush           674724    5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

Operation      Count     AvgLat    MaxLat
NTCreateX      9931568    0.111    25.597
Close          7295730    0.001     2.171
Rename          420549    0.982    49.714
Unlink         2005366    0.497    39.015
Deltree            256   11.149    89.242
Mkdir              128    0.002     0.014
Qpathinfo      9001863    0.049    20.761
Qfileinfo      1577730    0.001     2.546
Qfsinfo        1650508    0.002     3.531
Sfileinfo       809031    0.005     5.846
Find           3480259    0.309    23.977
WriteX         4952505    0.043    41.283
ReadX         15568127    0.002     5.476
LockX            32338    0.002     0.978
UnlockX          32338    0.001     2.032
Flush           696017    7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

--> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

Operation      Count     AvgLat    MaxLat
NTCreateX      8956748    0.342   108.312
Close          6579660    0.001     3.823
Rename          379209    2.396    81.897
Unlink         1808625    1.108   131.148
Deltree            256   25.632   172.176
Mkdir              128    0.003     0.018
Qpathinfo      8117615    0.131    55.916
Qfileinfo      1423495    0.001     2.635
Qfsinfo        1488496    0.002     5.412
Sfileinfo       729472
[PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry
From: Filipe Manana When we remove a directory entry, as part of an unlink operation, if the directory was logged before we must remove the directory index items from the log. We are also updating the inode item of the directory to update its i_size, but that is not necessary because during log replay we do not need it and we correctly adjust the i_size in the inode item of the subvolume as we process directory index items and replay deletes. This is not needed since commit d555438b6e1dad ("Btrfs: drop dir i_size when adding new names on replay"), where we explicitly ignore the i_size of directory inode items on log replay. Before that we used it but it was buggy as mentioned in that commit's change log (i_size got a larger value than it should have). So stop updating the i_size of the directory inode item in the log, as that is a waste of time, adds more lock contention on the log tree and often results in COWing more extent buffers for the log tree. This code path is triggered often during dbench workloads for example. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 39 --- 1 file changed, 4 insertions(+), 35 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 8ee0700a980f..5d87afc6058a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3379,7 +3379,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, struct btrfs_path *path; int ret; int err = 0; - int bytes_del = 0; u64 dir_ino = btrfs_ino(dir); if (!inode_logged(trans, dir)) @@ -3406,7 +3405,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, } if (di) { ret = btrfs_delete_one_dir_name(trans, log, path, di); - bytes_del += name_len; if (ret) { err = ret; goto fail; } @@ -3421,46 +3419,17 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, } if (di) { ret = btrfs_delete_one_dir_name(trans, log, path, di); - bytes_del += name_len; if (ret) { err = ret; goto fail; } } - /* update the directory size in the log to reflect the names -* we have removed + /* +* We do not need to update the size field of the directory's inode item +* because on log replay we update the field to reflect all existing +* entries in the directory (see overwrite_item()).
*/ - if (bytes_del) { - struct btrfs_key key; - - key.objectid = dir_ino; - key.offset = 0; - key.type = BTRFS_INODE_ITEM_KEY; - btrfs_release_path(path); - - ret = btrfs_search_slot(trans, log, &key, path, 0, 1); - if (ret < 0) { - err = ret; - goto fail; - } - if (ret == 0) { - struct btrfs_inode_item *item; - u64 i_size; - - item = btrfs_item_ptr(path->nodes[0], path->slots[0], - struct btrfs_inode_item); - i_size = btrfs_inode_size(path->nodes[0], item); - if (i_size > bytes_del) - i_size -= bytes_del; - else - i_size = 0; - btrfs_set_inode_size(path->nodes[0], item, i_size); - btrfs_mark_buffer_dirty(path->nodes[0]); - } else - ret = 0; - btrfs_release_path(path); - } fail: btrfs_free_path(path); out_unlock: -- 2.28.0
[PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries
From: Filipe Manana When logging new directory entries of a directory, we log the inodes of new dentries and the inodes of dentries pointing to directories that may have been created in past transactions. For the case of directories we log in full mode, which can be particularly expensive for large directories. We do use btrfs_inode_in_log() to skip already logged inodes, however for that helper to return true, it requires the log transaction used to log the inode to be already committed. This means that when we have more than one task using the same log transaction we can end up logging an inode multiple times, which is a waste of time and not necessary since the log will be committed by one of the tasks and the others will wait for the log transaction to be committed before returning to user space. So simply replace the use of btrfs_inode_in_log() with the new helper function need_log_inode(), introduced in a previous commit. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c0dce99c2c14..6dc376a16cf2 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5676,7 +5676,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, goto next_dir_inode; } - if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { + if (!need_log_inode(trans, BTRFS_I(di_inode))) { btrfs_add_delayed_iput(di_inode); break; } -- 2.28.0
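For comparison, the stricter helper being replaced looked roughly like the following in kernels of that era (quoted approximately from fs/btrfs/btrfs_inode.h, so treat the exact fields as an assumption). The two last_log_commit conditions only hold once the log transaction has been committed, which is exactly what makes it too strict here:

static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
{
	bool ret = false;

	spin_lock(&inode->lock);
	if (inode->logged_trans == generation &&
	    inode->last_sub_trans <= inode->last_log_commit &&
	    inode->last_sub_trans <= inode->root->last_log_commit)
		ret = true;
	spin_unlock(&inode->lock);
	return ret;
}

need_log_inode() drops the last_log_commit checks, so an inode logged earlier in the same, still uncommitted log transaction is also skipped.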
[PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging
From: Filipe Manana When we fill an inode item for logging we are setting its nbytes field with the value returned by inode_get_bytes() (a VFS API), however we do not need it because it is not used during log replay. In fact, for fast fsyncs, when we call inode_get_bytes() we may even get an outdated value for nbytes because the nbytes field of the inode is only updated when ordered extents complete, and a fast fsync only waits for writeback to complete, it does not wait for ordered extent completion. So just remove the setup of nbytes and add an explicit comment mentioning why we do not set it. This also avoids adding contention on the inode's i_lock (VFS) with concurrent stat() calls, since that spinlock is used by inode_get_bytes() which is also called by our stat callback (btrfs_getattr()). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 5d87afc6058a..be62759f0aac 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3858,7 +3858,14 @@ static void fill_inode_item(struct btrfs_trans_handle *trans, btrfs_set_token_timespec_nsec(&token, &item->ctime, inode->i_ctime.tv_nsec); - btrfs_set_token_inode_nbytes(&token, item, inode_get_bytes(inode)); + /* +* We do not need to set the nbytes field, in fact during a fast fsync +* its value may not even be correct, since a fast fsync does not wait +* for ordered extent completion, which is where we update nbytes, it +* only waits for writeback to complete. During log replay as we find +* file extent items and replay them, we adjust the nbytes field of the +* inode item in subvolume tree as needed (see overwrite_item()). +*/ btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode)); btrfs_set_token_inode_transid(&token, item, trans->transid); -- 2.28.0
[PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
From: Filipe Manana Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fall back to a transaction commit. However we often do not need to fall back to a transaction commit because: 1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is needed neither to guarantee the consistency of the inode originally asked to be logged (fsynced) nor that of its immediate parent; 2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed. So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fall back immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result in losing the renamed file after a log replay. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 121 ++-- 1 file changed, 15 insertions(+), 106 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6dc376a16cf2..4c7b283ed2b2 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5265,6 +5265,21 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, mutex_lock(&inode->log_mutex); } + /* +* This is for cases where logging a directory could result in losing a +* file after replaying the log. For example, if we move a file from a +* directory A to a directory B, then fsync directory A, we have no way +* to know the file was moved from A to B, so logging just A would +* result in losing the file after a log replay. +*/ + if (S_ISDIR(inode->vfs_inode.i_mode) && + inode_only == LOG_INODE_ALL && + inode->last_unlink_trans >= trans->transid) { + btrfs_set_log_full_commit(trans); + err = 1; + goto out_unlock; + } + /* * a brute force approach to making sure we get the most uptodate * copies of everything. @@ -5428,99 +5443,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, return err; } -/* - * Check if we must fallback to a transaction commit when logging an inode.
- * This must be called after logging the inode and is used only in the context - * when fsyncing an inode requires the need to log some other inode - in which - * case we can't lock the i_mutex of each other inode we need to log as that - * can lead to deadlocks with concurrent fsync against other inodes (as we can - * log inodes up or down in the hierarchy) or rename operations for example. So - * we take the log_mutex of the inode after we have logged it and then check for - * its last_unlink_trans value - this is safe because any task setting - * last_unlink_trans must take the log_mutex and it must do this before it does - * the actual unlink operation, so if we do this check before a concurrent task - * sets last_unlink_trans it means we've logged a consistent version/state of - * all the inode items, otherwise we are not sure and must do a transaction - * commit (the concurrent task might have only updated last_unlink_trans before - * we logged the inode or it might have also done the unlink). - */ -static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans, - struct btrfs_inode *inode) -{ - bool ret = false; - - mutex_lock(&inode->log_mutex); - if (inode->last_unlink_trans >= trans->transid) { - /* -* Make sure any commits to the log are forced to be full -* commits. -*/ - btrfs_set_log_full_commit(trans); - ret = true; -
Re: [PATCH] btrfs: fix a bug that btrfs_invalidatepage() can double account ordered extent for subpage
On Wed, Jan 27, 2021 at 8:29 AM Qu Wenruo wrote: > > Commit dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could > span across a page") made btrfs_invalidatepage() search all ordered > extents. > > The offending code looks like this: > > again: > start = page_start; > ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + > 1); > if (ordered) { > end = min(page_end, > ordered->file_offset + ordered->num_bytes - 1); > > /* Do the cleanup */ > > start = end + 1; > if (start < page_end) > goto again; > } > > The behavior is indeed necessary for the incoming subpage support, but > when it iterates through all the ordered extents, it also resets the > search range @start. > > This means, for the following case, we can double account an ordered > extent, causing its bytes_left to underflow: > > Page offset > 0 16K 32K > |<--- OE 1 --->|<--- OE 2 >| > > As the first iteration will find OE 1, which doesn't cover the full > page, after the cleanup code we need to retry. > But the again label will reset start to page_start, so we get OE 1 again, > which double accounts OE 1 and causes its bytes_left to > underflow. > > The only good news is, this problem can only happen for the subpage case, as > for the regular sectorsize == PAGE_SIZE case, we will always find an OE that ends > at or after the page end, thus no way to trigger the problem. > > This patch moves the again label to after start = page_start, to fix > the bug. > This is just a quick fix, which is easy to backport. Hum? Why does it need to be backported? There are no kernel releases that support subpage sector size yet and this does not affect the case where sector size is the same as the page size. > > There will be more comprehensive rework to convert the open coded loop to > a proper while loop. > > Fixes: dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could span > across a page") > Signed-off-by: Qu Wenruo Anyway, it looks good to me, thanks. Reviewed-by: Filipe Manana > --- > fs/btrfs/inode.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index ef6cb7b620d0..2eea7d22405a 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -8184,8 +8184,9 @@ static void btrfs_invalidatepage(struct page *page, > unsigned int offset, > > if (!inode_evicting) > lock_extent_bits(tree, page_start, page_end, &cached_state); > -again: > + > start = page_start; > +again: > ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + > 1); > if (ordered) { > found_ordered = true; > -- > 2.30.0 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: [PATCH] btrfs: fix a bug that btrfs_invalidatepage() can double account ordered extent for subpage
On 2021/1/27 6:44 PM, Filipe Manana wrote: On Wed, Jan 27, 2021 at 8:29 AM Qu Wenruo wrote: Commit dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could span across a page") made btrfs_invalidatepage() search all ordered extents. The offending code looks like this: again: start = page_start; ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1); if (ordered) { end = min(page_end, ordered->file_offset + ordered->num_bytes - 1); /* Do the cleanup */ start = end + 1; if (start < page_end) goto again; } The behavior is indeed necessary for the incoming subpage support, but when it iterates through all the ordered extents, it also resets the search range @start. This means, for the following case, we can double account an ordered extent, causing its bytes_left to underflow: Page offset 0 16K 32K |<--- OE 1 --->|<--- OE 2 >| As the first iteration will find OE 1, which doesn't cover the full page, after the cleanup code we need to retry. But the again label will reset start to page_start, so we get OE 1 again, which double accounts OE 1 and causes its bytes_left to underflow. The only good news is, this problem can only happen for the subpage case, as for the regular sectorsize == PAGE_SIZE case, we will always find an OE that ends at or after the page end, thus no way to trigger the problem. This patch moves the again label to after start = page_start, to fix the bug. This is just a quick fix, which is easy to backport. Hum? Why does it need to be backported? There are no kernel releases that support subpage sector size yet and this does not affect the case where sector size is the same as the page size. Hmmm, right, there is no need to backport. Hope David can remove that line when merging. Thanks, Qu There will be more comprehensive rework to convert the open coded loop to a proper while loop. Fixes: dbfdb6d1b369 ("Btrfs: Search for all ordered extents that could span across a page") Signed-off-by: Qu Wenruo Anyway, it looks good to me, thanks. Reviewed-by: Filipe Manana --- fs/btrfs/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ef6cb7b620d0..2eea7d22405a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8184,8 +8184,9 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset, if (!inode_evicting) lock_extent_bits(tree, page_start, page_end, &cached_state); -again: + start = page_start; +again: ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1); if (ordered) { found_ordered = true; -- 2.30.0
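As a sketch of the "proper while loop" rework mentioned in the changelog (illustrative only, since that conversion had not been posted at this point), the open coded goto loop in btrfs_invalidatepage() could become:

	u64 start = page_start;

	while (start < page_end) {
		struct btrfs_ordered_extent *ordered;
		u64 end;

		ordered = btrfs_lookup_ordered_range(inode, start,
						     page_end - start + 1);
		if (!ordered)
			break;
		end = min(page_end,
			  ordered->file_offset + ordered->num_bytes - 1);

		/* Do the per-ordered-extent cleanup for [start, end] here. */

		btrfs_put_ordered_extent(ordered);
		/* Never rewind: continue right after the handled range. */
		start = end + 1;
	}

Keeping the search start strictly advancing inside the loop makes the double accounting structurally impossible, which is the same property the quick fix achieves by moving the again label.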
Re: [PATCH v2 3/4] btrfs: send: fix invalid commands for inodes with changed rdev but same gen
On Tue, Jan 26, 2021 at 7:19 PM Roman Anasal | BDSU wrote: > > On Monday, 25.01.2021 at 20:51, Filipe Manana wrote: > > On Mon, Jan 25, 2021 at 7:51 PM Roman Anasal > > wrote: > > > This is analogous to the preceding patch ("btrfs: send: fix invalid > > > commands for inodes with changed type but same gen") but for changed > > > rdev: > > > > > > When doing an incremental send, if a new inode has the same number as an > > > inode in the parent subvolume, was created with the same generation but > > > has differing rdev it will not be detected as changed and thus not > > > recreated. This will lead to incorrect results on the receiver where the > > > inode will keep the rdev of the inode in the parent subvolume or even > > > fail when also the ref is unchanged. > > > > > > This case does not happen when doing incremental sends with snapshots > > > that are kept read-only by the user all the time, but it may happen if > > > - a snapshot was modified in the same transaction as its parent after it > > > was created > > > - the subvol used as parent was created independently from the sent > > > subvol > > > > > > Example reproducers: > > > > > > # case 1: same ino at same path > > > btrfs subvolume create subvol1 > > > btrfs subvolume create subvol2 > > > mknod subvol1/a c 1 3 > > > mknod subvol2/a c 1 5 > > > btrfs property set subvol1 ro true > > > btrfs property set subvol2 ro true > > > btrfs send -p subvol1 subvol2 | btrfs receive --dump > > > > > > The produced tree state here is: > > > |-- subvol1 > > > | `-- a (ino 257, c 1,3) > > > | > > > `-- subvol2 > > > `-- a (ino 257, c 1,5) > > > > > > Where subvol1/a and subvol2/a are character devices with differing minor > > > numbers but same inode number and same generation. > > > > > > Example output of the receive command: > > > At subvol subvol2 > > > snapshot ./subvol2 uuid=7513941c-4ef7-f847-b05e-4fdfe003af7b transid=9 parent_uuid=b66f015b-c226-2548-9e39-048c7fdbec99 parent_transid=9 > > > utimes ./subvol2/ atime=2021-01-25T17:14:36+ mtime=2021-01-25T17:14:36+ ctime=2021-01-25T17:14:36+ > > > link ./subvol2/a dest=a > > > unlink ./subvol2/a > > > utimes ./subvol2/ atime=2021-01-25T17:14:36+ mtime=2021-01-25T17:14:36+ ctime=2021-01-25T17:14:36+ > > > utimes ./subvol2/a atime=2021-01-25T17:14:36+ mtime=2021-01-25T17:14:36+ ctime=2021-01-25T17:14:36+ > > > > > > => the `link` command causes the receiver to fail with: > > >ERROR: link a -> a failed: File exists > > > > > > Second example: > > > # case 2: same ino at different path > > > btrfs subvolume create subvol1 > > > btrfs subvolume create subvol2 > > > mknod subvol1/a c 1 3 > > > mknod subvol2/b c 1 5 > > > btrfs property set subvol1 ro true > > > btrfs property set subvol2 ro true > > > btrfs send -p subvol1 subvol2 | btrfs receive --dump > > > > As I've told you before for the v1 patchset from a week or two ago, > > this is not a supported scenario for incremental sends. > > Incremental sends are meant to be used on RO snapshots of the same > > subvolume, and those snapshots must never be changed after they were > > created. > > > > Incremental sends were simply not designed for these cases, and can > > never be guaranteed to work with such cases. > > > > The bug is not having incremental sends fail right away, with an > > explicit error message, when the send and parent roots aren't RO > > snapshots of the same subvolume.
> > I am sorry, I didn't want to anger you or to appear to be just stubborn > by posting this. > > As I wrote in the cover letter I am aware that this is not a supported > use case and I understand that that makes the patches likely to be > rejected. Ok, now I got the cover letter and the remaining v2 patches. Vger has been having some lag this week, only got the mails during the last evening. Thanks. > As said the reason I _wrote_ the patches was simply to learn more about > the btrfs code and its internals and see if I would be able to > understand it enough. The reason I _submitted_ them was just to > document what I found out so others could have a look into it and just > in case it maybe useful at a later time. > > I also don't want to claim that these will add full support for sending > unrelated roots - they don't! They only handle those very specific edge > cases I found, which are currently _possible_, although still not > supported. > > I took a deeper look into the rest to see if it could be supported: > the comparing algorithm actually works fine, even with completely > unrelated subvolumes (i.e. btrfs_compare_trees, changed_cb, > changed_inode etc.), but the processing of
Re: [PATCH] btrfs: Avoid calling btrfs_get_chunk_map() twice
On Wed, Jan 27, 2021 at 9:59 AM Michal Rostecki wrote: > > From: Michal Rostecki > > Before this change, the btrfs_get_io_geometry() function was calling > btrfs_get_chunk_map() to get the extent mapping, necessary for > calculating the I/O geometry. It was using that extent mapping only > internally and freeing the pointer after its execution. > > That resulted in calling btrfs_get_chunk_map() de facto twice by the > __btrfs_map_block() function. It was calling btrfs_get_io_geometry() > first and then calling btrfs_get_chunk_map() directly to get the extent > mapping, used by the rest of the function. > > This change fixes that by passing the extent mapping to the > btrfs_get_io_geometry() function as an argument. > > Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately") Generally we only use the Fixes tag for bug fixes or serious performance regressions. Have you seen here a serious performance regression? > Signed-off-by: Michal Rostecki > --- > fs/btrfs/inode.c | 37 - > fs/btrfs/volumes.c | 39 --- > fs/btrfs/volumes.h | 5 +++-- > 3 files changed, 47 insertions(+), 34 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0dbe1aaa0b71..a4ce8501ed4d 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t > size, struct bio *bio, > struct inode *inode = page->mapping->host; > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > u64 logical = bio->bi_iter.bi_sector << 9; > + struct extent_map *em; > u64 length = 0; > u64 map_length; > - int ret; > + int ret = 0; > struct btrfs_io_geometry geom; > > if (bio_flags & EXTENT_BIO_COMPRESSED) > @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, > size_t size, struct bio *bio, > > length = bio->bi_iter.bi_size; > map_length = length; > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, > map_length, > - &geom); > + em = btrfs_get_chunk_map(fs_info, logical, map_length); > + if (IS_ERR(em)) > + return PTR_ERR(em); > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, > + map_length, &geom); > if (ret < 0) > - return ret; > + goto out; > > - if (geom.len < length + size) > - return 1; > - return 0; > + if (geom.len < length + size) { > + ret = 1; > + goto out; > + } > +out: > + free_extent_map(em); > + return ret; > } > > /* > @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > u64 submit_len; > int clone_offset = 0; > int clone_len; > + int logical; > int ret; > blk_status_t status; > struct btrfs_io_geometry geom; > struct btrfs_dio_data *dio_data = iomap->private; > + struct extent_map *em; > > dip = btrfs_create_dio_private(dio_bio, inode, file_offset); > if (!dip) { > @@ -7970,11 +7980,17 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > } > > start_sector = dio_bio->bi_iter.bi_sector; > + logical = start_sector << 9; > submit_len = dio_bio->bi_iter.bi_size; > > do { > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio), > - start_sector << 9, submit_len, > + em = btrfs_get_chunk_map(fs_info, logical, submit_len); > + if (IS_ERR(em)) { > + status = errno_to_blk_status(ret); > + goto out_err; em must be set to NULL before going to "out_err", otherwise we get a crash due to an invalid memory access. Also, status should be set to "errno_to_blk_status(PTR_ERR(em))". The value of ret at this point is undefined. Other than that, it looks good. Thanks. 
> + } > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), > + logical, submit_len, > &geom); > if (ret) { > status = errno_to_blk_status(ret); > @@ -8030,12 +8046,15 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > clone_offset += clone_len; > start_sector += clone_len >> 9; > file_offset += clone_len; > + > + free_extent_map(em); > } while (submit_len > 0); > return BLK_QC_T_NONE; > > out_err: > dip->dio_bio->bi_status = status; > btrfs_dio_private_put(dip); > + free_extent_map(em); > return BLK_QC_T_NONE; > }
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On Wed, Jan 27, 2021 at 10:42:30AM +0100, matthias@kernel.org wrote: > From: Matthias Brugger > > Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms") > btrfs uses the sha256 checksum algorithm. But Kconfig fails to select > it. This leads to compilation errors: > fs/built-in.o: In function `hash_sha256': > fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts' > fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update' > fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish' > > Signed-off-by: Matthias Brugger So this is a fix for u-boot; it got me confused, and not for the first time, as there's a Kconfig and the same fs/btrfs/ directory structure.
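The shape of the fix, as the changelog describes it, is a one-line select in U-Boot's fs/btrfs/Kconfig. A sketch, under the assumption that the symbol is FS_BTRFS and that the surrounding lines look something like this (only the added select matters):

config FS_BTRFS
	bool "Enable BTRFS filesystem support"
	select SHA256
	help
	  Read-only btrfs support for U-Boot, used e.g. for loading
	  kernels from btrfs filesystems.

With SHA256 selected, the generic sha256_starts()/sha256_update()/sha256_finish() implementations get linked in and the undefined references quoted above go away.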
Re: [PATCH] btrfs: Avoid calling btrfs_get_chunk_map() twice
On Wed, Jan 27, 2021 at 11:20:55AM +, Filipe Manana wrote: > On Wed, Jan 27, 2021 at 9:59 AM Michal Rostecki wrote: > > > > From: Michal Rostecki > > > > Before this change, the btrfs_get_io_geometry() function was calling > > btrfs_get_chunk_map() to get the extent mapping, necessary for > > calculating the I/O geometry. It was using that extent mapping only > > internally and freeing the pointer after its execution. > > > > That resulted in calling btrfs_get_chunk_map() de facto twice by the > > __btrfs_map_block() function. It was calling btrfs_get_io_geometry() > > first and then calling btrfs_get_chunk_map() directly to get the extent > > mapping, used by the rest of the function. > > > > This change fixes that by passing the extent mapping to the > > btrfs_get_io_geometry() function as an argument. > > > > Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately") > > Generally we only use the Fixes tag for bug fixes or serious > performance regressions. > Have you seen here a serious performance regression? > No, I didn't see any big difference in terms of performance (at least in seq reads). I will remove the tag in v2. > > Signed-off-by: Michal Rostecki > > --- > > fs/btrfs/inode.c | 37 - > > fs/btrfs/volumes.c | 39 --- > > fs/btrfs/volumes.h | 5 +++-- > > 3 files changed, 47 insertions(+), 34 deletions(-) > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index 0dbe1aaa0b71..a4ce8501ed4d 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, > > size_t size, struct bio *bio, > > struct inode *inode = page->mapping->host; > > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > > u64 logical = bio->bi_iter.bi_sector << 9; > > + struct extent_map *em; > > u64 length = 0; > > u64 map_length; > > - int ret; > > + int ret = 0; > > struct btrfs_io_geometry geom; > > > > if (bio_flags & EXTENT_BIO_COMPRESSED) > > @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, > > size_t size, struct bio *bio, > > > > length = bio->bi_iter.bi_size; > > map_length = length; > > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, > > map_length, > > - &geom); > > + em = btrfs_get_chunk_map(fs_info, logical, map_length); > > + if (IS_ERR(em)) > > + return PTR_ERR(em); > > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, > > + map_length, &geom); > > if (ret < 0) > > - return ret; > > + goto out; > > > > - if (geom.len < length + size) > > - return 1; > > - return 0; > > + if (geom.len < length + size) { > > + ret = 1; > > + goto out; > > + } > > +out: > > + free_extent_map(em); > > + return ret; > > } > > > > /* > > @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode > > *inode, struct iomap *iomap, > > u64 submit_len; > > int clone_offset = 0; > > int clone_len; > > + int logical; > > int ret; > > blk_status_t status; > > struct btrfs_io_geometry geom; > > struct btrfs_dio_data *dio_data = iomap->private; > > + struct extent_map *em; > > > > dip = btrfs_create_dio_private(dio_bio, inode, file_offset); > > if (!dip) { > > @@ -7970,11 +7980,17 @@ static blk_qc_t btrfs_submit_direct(struct inode > > *inode, struct iomap *iomap, > > } > > > > start_sector = dio_bio->bi_iter.bi_sector; > > + logical = start_sector << 9; > > submit_len = dio_bio->bi_iter.bi_size; > > > > do { > > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio), > > - start_sector << 9, submit_len, > > + em = btrfs_get_chunk_map(fs_info, logical, submit_len); 
> > + if (IS_ERR(em)) { > > + status = errno_to_blk_status(ret); > > + goto out_err; > > em must be set to NULL before going to "out_err", otherwise we get a > crash due to an invalid memory access. > > Also, status should be set to "errno_to_blk_status(PTR_ERR(em))". The > value of ret at this point is undefined. > > Other than that, it looks good. > > Thanks. > Will fix in v2. Thanks! > > + } > > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), > > + logical, submit_len, > > &geom); > > if (ret) { > > status = errno_to_blk_status(ret); > > @@ -8030,12 +8046,15 @@ static blk_qc_t btrfs_submit_direct(struc
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On 2021/1/27 8:01 PM, David Sterba wrote: On Wed, Jan 27, 2021 at 10:42:30AM +0100, matthias@kernel.org wrote: From: Matthias Brugger Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms") btrfs uses the sha256 checksum algorithm. But Kconfig fails to select it. This leads to compilation errors: fs/built-in.o: In function `hash_sha256': fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts' fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update' fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish' Signed-off-by: Matthias Brugger So this is a fix for u-boot; it got me confused, and not for the first time, as there's a Kconfig and the same fs/btrfs/ directory structure. Well, sometimes a too-unified file structure/code base can also be a problem. Considering I'm also going to continue cross-porting more code to U-boot, any recommendations on this? Using a different prefix? Thanks, Qu
Re: Only one subvolume can be mounted after replace/balance
Thank you Chris, it's resolved now, see below. On 25.01.21 at 23:47, Chris Murphy wrote: On Sat, Jan 23, 2021 at 7:50 AM Jakob Schöttl wrote: Hi, In short: When mounting a second subvolume from a pool, I get this error: "mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda, missing code page or helper program, or other." dmesg | grep BTRFS only shows this error: info (device sda): disk space caching is enabled error (device sda): Remounting read-write after error is not allowed It went read-only before this because it's confused. You need to unmount it before it can be mounted rw. In some cases a reboot is needed. Oh, I didn't notice that the pool was already mounted (via fstab). The filesystem was out of space and I had to resize both disks separately. And I had to mount with -o skip_balance for that. Now it works again. What happened: In my RAID1 pool with two disks, I successfully replaced one disk with btrfs replace start 2 /dev/sdx After that, I mounted the pool and did I don't understand this sequence. In order to do a replace, the file system is already mounted. That was what I did before my actual problem occurred. But it's resolved now. btrfs fi show /mnt which showed WARNINGs about "filesystems with multiple block group profiles detected" (don't remember exactly) I thought it was a good idea to do btrfs balance start /mnt which finished without errors. Balance alone does not convert block groups to a new profile. You have to explicitly select a conversion filter, e.g. btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt I didn't want to convert to a new profile. I thought btrfs replace automatically uses the same profile as the pool? Regards, Jakob
Re: btrfs becomes read-only
I managed to run btrfs check, but it didn't solve the problem: aleksey@host:~$ sudo btrfs check --repair /dev/sdg [sudo] password for aleksey: enabling repair mode Checking filesystem on /dev/sdg UUID: 070ce9af-6511-4b89-a501-0823514320c1 checking extents parent transid verify failed on 52180048330752 wanted 132477 found 132432 parent transid verify failed on 52180048330752 wanted 132477 found 132432 parent transid verify failed on 52180048330752 wanted 132477 found 132432 parent transid verify failed on 52180048330752 wanted 132477 found 132432 Ignoring transid failure leaf parent key incorrect 52180048330752 bad block 52180048330752 Errors found in extent allocation tree or chunk allocation parent transid verify failed on 52180048330752 wanted 132477 found 132432 Ignoring transid failure Fixed 0 roots. checking free space cache parent transid verify failed on 52180048330752 wanted 132477 found 132432 Ignoring transid failure There is no free space entry for 5100308647936-5101376765952 cache appears valid but isnt 5100303024128 block group 5143252697088 has wrong amount of free space failed to load free space cache for block group 5143252697088 block group 10289697259520 has wrong amount of free space failed to load free space cache for block group 10289697259520 block group 10447537307648 has wrong amount of free space failed to load free space cache for block group 10447537307648 block group 10856632942592 has wrong amount of free space failed to load free space cache for block group 10856632942592 block group 10877034037248 has wrong amount of free space failed to load free space cache for block group 10877034037248 block group 10901730099200 has wrong amount of free space failed to load free space cache for block group 10901730099200 block group 37427448119296 has wrong amount of free space failed to load free space cache for block group 37427448119296 block group 40151531126784 has wrong amount of free space failed to load free space cache for block group 40151531126784 block group 40155826094080 has wrong amount of free space failed to load free space cache for block group 40155826094080 block group 45729619902464 has wrong amount of free space failed to load free space cache for block group 45729619902464 block group 45751094738944 has wrong amount of free space failed to load free space cache for block group 45751094738944 block group 45775790800896 has wrong amount of free space failed to load free space cache for block group 45775790800896 block group 45797265637376 has wrong amount of free space failed to load free space cache for block group 45797265637376 block group 45839141568512 has wrong amount of free space failed to load free space cache for block group 45839141568512 found 31266648615205 bytes used err is -22 total csum bytes: 0 total tree bytes: 2949169152 total fs tree bytes: 0 total extent tree bytes: 2927755264 btree space waste bytes: 813795673 file data blocks allocated: 9188147200 referenced 9188147200 Best regards, Alexey Isaev, RQC On 27.01.2021 10:38, Chris Murphy wrote: On Wed, Jan 27, 2021 at 12:22 AM Alexey Isaev wrote: Hello! BTRFS volume becomes read-only with these messages in dmesg. What can I do to repair the btrfs partition?
[Jan25 08:18] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432 [ +0.007587] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432 [ +0.000132] BTRFS error (device sdg): qgroup scan failed with -5 [Jan25 19:52] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432 [ +0.009783] BTRFS error (device sdg): parent transid verify failed on 52180048330752 wanted 132477 found 132432 [ +0.000132] BTRFS: error (device sdg) in __btrfs_cow_block:1176: errno=-5 IO failure [ +0.60] BTRFS info (device sdg): forced readonly [ +0.04] BTRFS info (device sdg): failed to delete reference to ftrace.h, inode 2986197 parent 2989315 [ +0.02] BTRFS: error (device sdg) in __btrfs_unlink_inode:4220: errno=-5 IO failure [ +0.006071] BTRFS error (device sdg): pending csums is 430080 What kernel version? What drive make/model? wanted 132477 found 132432 indicates the drive has lost ~45 transactions, that's not good and also weird. There's no crash or any other errors? A complete dmesg might be more revealing. And also smartctl -x /dev/sdg btrfs check --readonly /dev/sdg After that I suggest https://btrfs.wiki.kernel.org/index.php/Restore And try to get any important data out if it's not backed up. You can try btrfs-find-root to get a listing of roots, most recent to oldest. Start at the top, and plug that address in as 'btrfs restore -t' and see if it'll pull anything out. You likely need -i and -v options as well.
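Spelled out as commands, the suggested recovery flow looks like this (the bytenr placeholder comes from the btrfs-find-root listing, and /mnt/recovery is an assumed destination directory with enough free space):

# list candidate tree roots, most recent first
sudo btrfs-find-root /dev/sdg

# try pulling data out read-only from a given root; -t takes a bytenr
# from the listing above, -i ignores errors, -v prints what is restored
sudo btrfs restore -t <bytenr> -i -v /dev/sdg /mnt/recovery

If the newest root fails, retry with the next bytenr down the listing.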
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On Wed, Jan 27, 2021 at 08:14:31PM +0800, Qu Wenruo wrote: > > > On 2021/1/27 下午8:01, David Sterba wrote: > > On Wed, Jan 27, 2021 at 10:42:30AM +0100, matthias@kernel.org wrote: > >> From: Matthias Brugger > >> > >> Since commit 565a4147d17a ("fs: btrfs: Add more checksum algorithms") > >> btrfs uses the sha256 checksum algorithm. But Kconfig lacks to select > >> it. This leads to compilation errors: > >> fs/built-in.o: In function `hash_sha256': > >> fs/btrfs/crypto/hash.c:25: undefined reference to `sha256_starts' > >> fs/btrfs/crypto/hash.c:26: undefined reference to `sha256_update' > >> fs/btrfs/crypto/hash.c:27: undefined reference to `sha256_finish' > >> > >> Signed-off-by: Matthias Brugger > > > > So this is a fix for u-boot, got me confused and not for the first time > > as there's Kconfig and the same fs/btrfs/ directory structure. > > > Well, sometimes too unified file structure/code base can also be a problem. > > Considering I'm also going to continue cross-porting more code to > U-boot, any recommendation on this? > Using different prefix? If it's a series then please mention u-boot in the cover letter, no need to change the patches, I'll go check CC if I'm too confused about the patch.
Re: [PATCH] fs: btrfs: Select SHA256 in Kconfig
On Wed, 27 Jan 2021 14:47:58 +0100 David Sterba wrote: > If it's a series then please mention u-boot in the cover letter, no need > to change the patches, I'll go check CC if I'm too confused about the > patch. I would suggest using subject prefix [PATCH u-boot]
[PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice
From: Michal Rostecki Before this change, the btrfs_get_io_geometry() function was calling btrfs_get_chunk_map() to get the extent mapping, necessary for calculating the I/O geometry. It was using that extent mapping only internally and freeing the pointer after its execution. That resulted in calling btrfs_get_chunk_map() de facto twice by the __btrfs_map_block() function. It was calling btrfs_get_io_geometry() first and then calling btrfs_get_chunk_map() directly to get the extent mapping, used by the rest of the function. This change fixes that by passing the extent mapping to the btrfs_get_io_geometry() function as an argument. v2: When btrfs_get_chunk_map() returns an error in btrfs_submit_direct(): - Use errno_to_blk_status(PTR_ERR(em)) as the status - Set em to NULL Signed-off-by: Michal Rostecki --- fs/btrfs/inode.c | 38 +- fs/btrfs/volumes.c | 39 --- fs/btrfs/volumes.h | 5 +++-- 3 files changed, 48 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0dbe1aaa0b71..e2ee3a9c1140 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio, struct inode *inode = page->mapping->host; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); u64 logical = bio->bi_iter.bi_sector << 9; + struct extent_map *em; u64 length = 0; u64 map_length; - int ret; + int ret = 0; struct btrfs_io_geometry geom; if (bio_flags & EXTENT_BIO_COMPRESSED) @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio, length = bio->bi_iter.bi_size; map_length = length; - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, map_length, - &geom); + em = btrfs_get_chunk_map(fs_info, logical, map_length); + if (IS_ERR(em)) + return PTR_ERR(em); + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, + map_length, &geom); if (ret < 0) - return ret; + goto out; - if (geom.len < length + size) - return 1; - return 0; + if (geom.len < length + size) { + ret = 1; + goto out; + } +out: + free_extent_map(em); + return ret; } /* @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, u64 submit_len; int clone_offset = 0; int clone_len; + int logical; int ret; blk_status_t status; struct btrfs_io_geometry geom; struct btrfs_dio_data *dio_data = iomap->private; + struct extent_map *em; dip = btrfs_create_dio_private(dio_bio, inode, file_offset); if (!dip) { @@ -7970,11 +7980,18 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, } start_sector = dio_bio->bi_iter.bi_sector; + logical = start_sector << 9; submit_len = dio_bio->bi_iter.bi_size; do { - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio), - start_sector << 9, submit_len, + em = btrfs_get_chunk_map(fs_info, logical, submit_len); + if (IS_ERR(em)) { + status = errno_to_blk_status(PTR_ERR(em)); + em = NULL; + goto out_err; + } + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), + logical, submit_len, &geom); if (ret) { status = errno_to_blk_status(ret); @@ -8030,12 +8047,15 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, clone_offset += clone_len; start_sector += clone_len >> 9; file_offset += clone_len; + + free_extent_map(em); } while (submit_len > 0); return BLK_QC_T_NONE; out_err: dip->dio_bio->bi_status = status; btrfs_dio_private_put(dip); + free_extent_map(em); return BLK_QC_T_NONE; } diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 
a8ec8539cd8d..4c753b17c0a2 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5940,23 +5940,24 @@ static bool need_full_stripe(enum btrfs_map_op op) } /* - * btrfs_get_io_geometry - calculates the geomery of a particular (address, len) + * btrfs_get_io_geometry - calculates the geometry of a particular (address, len) *tuple. This information is used to calculate how big a *particular bio can get before it straddles a stripe. * - * @fs_info
Re: [PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice
On 27.01.21 г. 15:57 ч., Michal Rostecki wrote: > From: Michal Rostecki > > Before this change, the btrfs_get_io_geometry() function was calling > btrfs_get_chunk_map() to get the extent mapping, necessary for > calculating the I/O geometry. It was using that extent mapping only > internally and freeing the pointer after its execution. > > That resulted in calling btrfs_get_chunk_map() de facto twice by the > __btrfs_map_block() function. It was calling btrfs_get_io_geometry() > first and then calling btrfs_get_chunk_map() directly to get the extent > mapping, used by the rest of the function. > > This change fixes that by passing the extent mapping to the > btrfs_get_io_geometry() function as an argument. > > v2: > When btrfs_get_chunk_map() returns an error in btrfs_submit_direct(): > - Use errno_to_blk_status(PTR_ERR(em)) as the status > - Set em to NULL > > Signed-off-by: Michal Rostecki > --- > fs/btrfs/inode.c | 38 +- > fs/btrfs/volumes.c | 39 --- > fs/btrfs/volumes.h | 5 +++-- > 3 files changed, 48 insertions(+), 34 deletions(-) So this adds more code but for what benefit? In your reply to Filipe you said you didn't observe this being a performance-affecting change so > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 0dbe1aaa0b71..e2ee3a9c1140 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t > size, struct bio *bio, > struct inode *inode = page->mapping->host; > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > u64 logical = bio->bi_iter.bi_sector << 9; > + struct extent_map *em; > u64 length = 0; > u64 map_length; > - int ret; > + int ret = 0; > struct btrfs_io_geometry geom; > > if (bio_flags & EXTENT_BIO_COMPRESSED) > @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, > size_t size, struct bio *bio, > > length = bio->bi_iter.bi_size; > map_length = length; > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, map_length, > - &geom); > + em = btrfs_get_chunk_map(fs_info, logical, map_length); > + if (IS_ERR(em)) > + return PTR_ERR(em); > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, > + map_length, &geom); > if (ret < 0) > - return ret; > + goto out; > > - if (geom.len < length + size) > - return 1; > - return 0; > + if (geom.len < length + size) { > + ret = 1; > + goto out; > + } this could be simply if (geom.len +out: > + free_extent_map(em); > + return ret; > } > > /* > @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > u64 submit_len; > int clone_offset = 0; > int clone_len; > + int logical; > int ret; > blk_status_t status; > struct btrfs_io_geometry geom; > struct btrfs_dio_data *dio_data = iomap->private; > + struct extent_map *em; > > dip = btrfs_create_dio_private(dio_bio, inode, file_offset); > if (!dip) { > @@ -7970,11 +7980,18 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > } > > start_sector = dio_bio->bi_iter.bi_sector; > + logical = start_sector << 9; > submit_len = dio_bio->bi_iter.bi_size; > > do { > - ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio), > - start_sector << 9, submit_len, > + em = btrfs_get_chunk_map(fs_info, logical, submit_len); > + if (IS_ERR(em)) { > + status = errno_to_blk_status(PTR_ERR(em)); > + em = NULL; > + goto out_err; > + } > + ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio), > + logical, submit_len, > &geom); > if (ret) { > status = errno_to_blk_status(ret); > @@ -8030,12 
+8047,15 @@ static blk_qc_t btrfs_submit_direct(struct inode > *inode, struct iomap *iomap, > clone_offset += clone_len; > start_sector += clone_len >> 9; > file_offset += clone_len; > + > + free_extent_map(em); > } while (submit_len > 0); > return BLK_QC_T_NONE; > > out_err: > dip->dio_bio->bi_status = status; > btrfs_dio_private_put(dip); > + free_extent_map(em); > return BLK_QC_T_NONE; > } For example in this function you increase complexity by having to deal with free_extent_map as well so I'm not sure this is a net-win. > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index a8ec8539cd8d..4c753b17
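In outline, the change under discussion is this (a sketch of the call shape taken from the diff above, not the complete patch): the caller now performs the chunk-map lookup once and passes the extent mapping down, becoming responsible for freeing it.

    /*
     * Before: btrfs_get_io_geometry() called btrfs_get_chunk_map() itself,
     * so callers that also needed the extent mapping did a second lookup.
     *
     * After (sketch): one lookup, the map is passed in and freed by the caller.
     */
    em = btrfs_get_chunk_map(fs_info, logical, map_length);
    if (IS_ERR(em))
            return PTR_ERR(em);
    ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
                                map_length, &geom);
    /* ... use geom ... */
    free_extent_map(em);

The review comments in the thread weigh exactly this trade: one saved lookup versus every caller now having to manage the extent map's lifetime.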
[PATCH] btrfs: remove wrong comment for can_nocow_extent()
From: Filipe Manana The comment for can_nocow_extent() says that the function will flush ordered extents, however that never happens and was never true before the comment was added in commit e4ecaf90bc13 ("btrfs: add comments for btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for the function btrfs_check_can_nocow(), which after that commit was renamed to check_can_nocow(). So just remove that part of the comment. Signed-off-by: Filipe Manana --- fs/btrfs/inode.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0dbe1aaa0b71..589030cefd90 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7105,9 +7105,6 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode, * @strict:if true, omit optimizations that might force us into unnecessary * cow. e.g., don't trust generation number. * - * This function will flush ordered extents in the range to ensure proper - * nocow checks for (nowait == false) case. - * * Return: * >0 and update @len if we can do nocow write * 0 if we can't do nocow write -- 2.28.0
Re: [PATCH v4 00/12] Improve preemptive ENOSPC flushing
On Tue, Jan 26, 2021 at 04:24:24PM -0500, Josef Bacik wrote: > v3->v4: > - Added some comments. > - Updated the tracepoint for the preemptive flushing to take a bool instead of > an int. > - Removed 'inline' from need_preemptive_flushing. Thanks, I'll add it to misc-next.
Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
On 1/27/21 5:34 AM, fdman...@kernel.org wrote: From: Filipe Manana

Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fall back to a transaction commit.

However we often do not need to fall back to a transaction commit because:

1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is not needed either to guarantee the consistency of the inode originally asked to be logged (fsynced) or its immediate parent;

2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed.

So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fall back immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result in losing the renamed file after a log replay.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana

I'm having a hard time with this one. Previously we would commit the transaction if the inode was a regular file that was created in the current transaction and had been renamed. Now with this patch you're only committing the transaction if we are a directory and were renamed ourselves. Before if you already had directories A and B and then did something like

  echo "foo" > /mnt/test/A/blah
  fsync(/mnt/test/A/blah);
  fsync(/mnt/test/A);
  mv /mnt/test/A/blah /mnt/test/B
  fsync(/mnt/test/B/blah);

we would commit the transaction on this second fsync, but with your patch we are not. I suppose that's keeping in line with how fsync is allowed to work, but it's definitely a change in behavior from what we used to do. Not sure if that's good or not, I'll have to think about it. Thanks, Josef
Re: [PATCH] btrfs: remove wrong comment for can_nocow_extent()
On 1/27/21 10:05 AM, fdman...@kernel.org wrote: From: Filipe Manana The comment for can_nocow_extent() says that the function will flush ordered extents, however that never happens and was never true before the comment was added in commit e4ecaf90bc13 ("btrfs: add comments for btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for the function btrfs_check_can_nocow(), which after that commit was renamed to check_can_nocow(). So just remove that part of the comment. Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v3 01/12] btrfs: make flush_space take a enum btrfs_flush_state instead of int
On Tue, Jan 26, 2021 at 03:32:36PM -0500, Josef Bacik wrote:
> On 1/26/21 1:36 PM, David Sterba wrote:
> > On Fri, Oct 09, 2020 at 09:28:18AM -0400, Josef Bacik wrote:
> >> I got an automated message from somebody who runs clang against our
> >> kernels and it's because I used the wrong enum type for what I passed
> >> into flush_space. Change the argument to be explicitly the enum we're
> >> expecting to make everything consistent. Maybe eventually gcc will
> >> catch errors like this.
> >
> > I can't find any such public report and none of the clang warnings seem
> > to be specific about that. Local run with clang 11 does not produce any
> > warning.
>
> IDK, it was a private email just to me with the following output from clang
>
> fs/btrfs/space-info.c:1115:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
>         flush = FLUSH_DELALLOC;
> fs/btrfs/space-info.c:1120:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
>         flush = FORCE_COMMIT_TRANS;
> fs/btrfs/space-info.c:1124:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
>         flush = FLUSH_DELAYED_ITEMS_NR;
> fs/btrfs/space-info.c:1127:12: warning: implicit conversion from enumeration type 'enum btrfs_flush_state' to different enumeration type 'enum btrfs_reserve_flush_enum' [-Wenum-conversion]
>         flush = FLUSH_DELAYED_REFS_NR;
>
> I figure it made sense, might as well fix it. Do we have that option for our shiny new -W build options? Thanks,

We can add -Wenum-conversion sure, but neither gcc (10.2.1) nor clang (11.0.1) reproduces the warning for me so I wonder how useful it would be. I've also tried adding -Wextra and -Wall and the build is clean.

Per gcc documentation, -Wenum-conversion is not exactly what the code did: "Warn when a value of enumerated type is implicitly converted to a different enumerated type. This warning is enabled by -Wextra."

And now that I read the warning carefully it does not apply to our code, "btrfs_reserve_flush_enum = btrfs_flush_state" does not happen so it must have been some older revision of the code.
Changing

@@ -1073,7 +1073,7 @@ static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
 	spin_lock(&space_info->lock);
 	while (need_preemptive_reclaim(fs_info, space_info)) {
-		enum btrfs_flush_state flush;
+		enum btrfs_reserve_flush_enum flush;

---

gcc says (on misc-next without this patchset)

fs/btrfs/space-info.c: In function ‘btrfs_preempt_reclaim_metadata_space’:
fs/btrfs/space-info.c:1112:10: warning: implicit conversion from ‘enum btrfs_flush_state’ to ‘enum btrfs_reserve_flush_enum’ [-Wenum-conversion]
 1112 |   flush = FLUSH_DELALLOC;
  CC [M]  fs/btrfs/block-group.o
fs/btrfs/space-info.c:1117:10: warning: implicit conversion from ‘enum btrfs_flush_state’ to ‘enum btrfs_reserve_flush_enum’ [-Wenum-conversion]
 1117 |   flush = FORCE_COMMIT_TRANS;
fs/btrfs/space-info.c:1121:10: warning: implicit conversion from ‘enum btrfs_flush_state’ to ‘enum btrfs_reserve_flush_enum’ [-Wenum-conversion]
 1121 |   flush = FLUSH_DELAYED_ITEMS_NR;
fs/btrfs/space-info.c:1124:10: warning: implicit conversion from ‘enum btrfs_flush_state’ to ‘enum btrfs_reserve_flush_enum’ [-Wenum-conversion]
 1124 |   flush = FLUSH_DELAYED_REFS_NR;
fs/btrfs/space-info.c:1135:48: warning: implicit conversion from ‘enum btrfs_reserve_flush_enum’ to ‘enum btrfs_flush_state’ [-Wenum-conversion]
 1135 |  flush_space(fs_info, space_info, to_reclaim, flush, true);

---

So you weren't fixing the reported problem by this patch. Assigning an enum to an int should be natural so there's no reason to fix that, but for clarity I don't mind, the patch stays.
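For anyone who wants to reproduce the warning outside the btrfs tree, here is a minimal standalone example that triggers -Wenum-conversion on both gcc and clang (the names mirror the btrfs enums, but this is not kernel code):

    enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1, FLUSH_DELALLOC };
    enum btrfs_reserve_flush_enum { BTRFS_RESERVE_NO_FLUSH, BTRFS_RESERVE_FLUSH_ALL };

    void example(void)
    {
            enum btrfs_reserve_flush_enum flush;

            /* implicit conversion between two distinct enumerated types */
            flush = FLUSH_DELALLOC;
            (void)flush;
    }

Compiled with "gcc -c -Wenum-conversion example.c" (or clang, where the warning is on by default), this produces the same implicit-conversion diagnostic quoted above.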
Re: [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
On 1/27/21 5:35 AM, fdman...@kernel.org wrote: From: Filipe Manana

Often an fsync needs to fall back to a transaction commit for several reasons (to ensure consistency after a power failure, a new block group was allocated, or a temporary error such as ENOMEM or ENOSPC happened). In that case the log is marked as needing a full commit and any concurrent tasks attempting to log inodes or commit the log will also fall back to the transaction commit. When this happens they all wait for the task that first started the transaction commit to finish the transaction commit - however they wait until the full transaction commit happens, which is not needed, as they only need to wait for the superblocks to be persisted and not for unpinning all the extents pinned during the transaction's lifetime, which even for short lived transactions can be a few thousand and take some significant amount of time to complete - for dbench workloads I have observed up to 4~5 milliseconds of time spent unpinning extents in the worst cases, and the number of pinned extents was between 2 and 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when they call btrfs_commit_transaction() and they were not the task that started the transaction commit (that one has to do it, the alternative would be to offload the transaction commit to another task so that it could avoid waiting for the extent unpinning or offload the extent unpinning to another task).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements with respect to throughput and latency. The script used to measure it is the following:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a physical machine with 12 cores (Intel corei7), 64G of RAM, using an NVMe device and a non-debug kernel configuration (Debian's default configuration).
Before applying patchset, 32 clients:

Operation      Count    AvgLat    MaxLat
NTCreateX    9627107     0.153    61.938
Close        7072076     0.001     3.175
Rename        407633     1.222    44.439
Unlink       1943895     0.658    44.440
Deltree          256    17.339   110.891
Mkdir            128     0.003     0.009
Qpathinfo    8725406     0.064    17.850
Qfileinfo    1529516     0.001     2.188
Qfsinfo      1599884     0.002     1.457
Sfileinfo     784200     0.005     3.562
Find         3373513     0.411    30.312
WriteX       4802132     0.053    29.054
ReadX       15089959     0.002     5.801
LockX          31344     0.002     0.425
UnlockX        31344     0.001     0.173
Flush         674724     5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

Operation      Count    AvgLat    MaxLat
NTCreateX    9931568     0.111    25.597
Close        7295730     0.001     2.171
Rename        420549     0.982    49.714
Unlink       2005366     0.497    39.015
Deltree          256    11.149    89.242
Mkdir            128     0.002     0.014
Qpathinfo    9001863     0.049    20.761
Qfileinfo    1577730     0.001     2.546
Qfsinfo      1650508     0.002     3.531
Sfileinfo     809031     0.005     5.846
Find         3480259     0.309    23.977
WriteX       4952505     0.043    41.283
ReadX       15568127     0.002     5.476
LockX          32338     0.002     0.978
UnlockX        32338     0.001     2.032
Flush         696017     7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

--> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

Operation      Count    AvgLat    MaxLat
NTCreateX    8956748     0.342   108.312
Close        6579660     0.001     3.823
Rename        379209     2.396    81.897
Unlink       1808625     1.108   131.148
Deltree          256    25.632   172.176
Mkdir            128     0.003     0.018
Qpathinfo    8117615     0.131    55
Re: [PATCH] btrfs: Simplify the calculation of variables
On 1/27/21 3:11 AM, Abaci Team wrote: Fix the following coccicheck warnings: ./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot Suggested-by: Jiapeng Zhong Signed-off-by: Abaci Team Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik wrote: > > On 1/27/21 5:34 AM, fdman...@kernel.org wrote: > > From: Filipe Manana > > > > Whenever we fsync an inode, if it is a directory, a regular file that was > > created in the current transaction or has last_unlink_trans set to the > > generation of the current transaction, we check if any of its ancestor > > inodes (and the inode itself if it is a directory) can not be logged and > > need a fallback to a full transaction commit - if so, we return with a > > value of 1 in order to fall back to a transaction commit. > > > > However we often do not need to fall back to a transaction commit because: > > > > 1) The ancestor inode is not an immediate parent, and therefore there is > > not an explicit request to log it and it is not needed either to > > guarantee the consistency of the inode originally asked to be logged > > (fsynced) or its immediate parent; > > > > 2) The ancestor inode was already logged before, in which case any link, > > unlink or rename operation updates the log as needed. > > > > So for these two cases we can avoid an unnecessary transaction commit. > > Therefore remove check_parent_dirs_for_sync() and add a check at the top > > of btrfs_log_inode() to make us fall back immediately to a transaction > > commit when we are logging a directory inode that can not be logged and > > needs a full transaction commit. All we need to protect is the case where > > after renaming a file someone fsyncs only the old directory, which would > > result in losing the renamed file after a log replay. > > > > This patch is part of a patchset comprised of the following patches: > > > >btrfs: remove unnecessary directory inode item update when deleting dir > > entry > >btrfs: stop setting nbytes when filling inode item for logging > >btrfs: avoid logging new ancestor inodes when logging new inode > >btrfs: skip logging directories already logged when logging all parents > >btrfs: skip logging inodes already logged when logging new entries > >btrfs: remove unnecessary check_parent_dirs_for_sync() > >btrfs: make concurrent fsyncs wait less when waiting for a transaction > > commit > > > > Performance results, after applying all patches, are mentioned in the > > change log of the last patch. > > > > Signed-off-by: Filipe Manana > > I'm having a hard time with this one. > > Previously we would commit the transaction if the inode was a regular file > that > was created in the current transaction and had been renamed. Now with this > patch you're only committing the transaction if we are a directory and were > renamed ourselves. Before if you already had directories A and B and then did > something like > > echo "foo" > /mnt/test/A/blah > fsync(/mnt/test/A/blah); > fsync(/mnt/test/A); > mv /mnt/test/A/blah /mnt/test/B > fsync(/mnt/test/B/blah); > > we would commit the transaction on this second fsync, but with your patch we > are > not. I suppose that's keeping in line with how fsync is allowed to work, but > it's definitely a change in behavior from what we used to do. Not sure if > that's good or not, I'll have to think about it. Thanks, Yes. Because of the rename (or a link), we will set last_unlink_trans to the current transaction, and when logging the file that will cause logging of all its old parents (A).
That was added several years ago to fix corruptions, and it turned out to be needed later as well to ensure we have a behaviour similar to xfs and ext4 (and others) regarding strictly ordered metadata updates (I added several tests to fstests over the years for all the cases). There's also the fact that on replay we will delete any inode refs that aren't in the log (that one was added in commit 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")). For that example we also have A updated in the log by the rename. So we know the log is consistent. So that's why the whole check_parent_dirs_for_sync() is not needed. Thanks. > > Josef
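For context, the fallback the patch keeps at the top of btrfs_log_inode() amounts to something like the following sketch (the exact condition in the posted patch may differ):

    /*
     * Sketch: a directory that had entries unlinked or renamed away in
     * the current transaction cannot be logged and forces the caller
     * to fall back to a full transaction commit.
     */
    if (S_ISDIR(inode->vfs_inode.i_mode) &&
        inode->last_unlink_trans >= trans->transid)
            return 1;   /* caller falls back to a transaction commit */

Everything else that check_parent_dirs_for_sync() used to inspect is, per the two points in the commit message, either unnecessary or already kept consistent by the log updates done on link, unlink and rename.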
Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
On 1/27/21 10:36 AM, Filipe Manana wrote: On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik wrote: On 1/27/21 5:34 AM, fdman...@kernel.org wrote: From: Filipe Manana Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fall back to a transaction commit. However we often do not need to fall back to a transaction commit because: 1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is not needed either to guarantee the consistency of the inode originally asked to be logged (fsynced) or its immediate parent; 2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed. So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fall back immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result in losing the renamed file after a log replay. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana I'm having a hard time with this one. Previously we would commit the transaction if the inode was a regular file that was created in the current transaction and had been renamed. Now with this patch you're only committing the transaction if we are a directory and were renamed ourselves. Before if you already had directories A and B and then did something like echo "foo" > /mnt/test/A/blah fsync(/mnt/test/A/blah); fsync(/mnt/test/A); mv /mnt/test/A/blah /mnt/test/B fsync(/mnt/test/B/blah); we would commit the transaction on this second fsync, but with your patch we are not. I suppose that's keeping in line with how fsync is allowed to work, but it's definitely a change in behavior from what we used to do. Not sure if that's good or not, I'll have to think about it. Thanks, Yes. Because of the rename (or a link), we will set last_unlink_trans to the current transaction, and when logging the file that will cause logging of all its old parents (A). That was added several years ago to fix corruptions, and it turned out to be needed later as well to ensure we have a behaviour similar to xfs and ext4 (and others) regarding strictly ordered metadata updates (I added several tests to fstests over the years for all the cases).
There's also the fact that on replay we will delete any inode refs that aren't in the log (that one was added in commit 1f250e929a9c ("Btrfs: fix log replay failure after unlink and link combination")). For that example we also have A updated in the log by the rename. So we know the log is consistent. So that's why the whole check_parent_dirs_for_sync() is not needed. Ok that's reasonable, thanks, Josef
Re: [PATCH 0/7] btrfs: more performance improvements for dbench workloads
On 1/27/21 5:34 AM, fdman...@kernel.org wrote: From: Filipe Manana

The following patchset brings one more batch of performance improvements with dbench workloads, or anything that mixes file creation, file writes, renames, unlinks, etc. with fsync like dbench does. This patchset is mostly based on avoiding logging directory inodes multiple times when not necessary, falling back to transaction commits less frequently and often waiting less time for transaction commits to complete. Performance results are listed in the change log of the last patch, but in short, I've experienced a reduction of maximum latency up to about -40% and throughput gains up to about +6%.

Filipe Manana (7):
  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

 fs/btrfs/file.c        |   1 +
 fs/btrfs/transaction.c |  39 +++--
 fs/btrfs/transaction.h |   2 +
 fs/btrfs/tree-log.c    | 195 -
 4 files changed, 92 insertions(+), 145 deletions(-)

You can add Reviewed-by: Josef Bacik to the whole series, thanks, Josef
Re: [PATCH v5 02/18] btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
On 1/26/21 3:33 AM, Qu Wenruo wrote: For the incoming subpage support, an UNMAPPED extent buffer will have different behavior in btrfs_release_extent_buffer(). This means we need to set the UNMAPPED bit early, before calling btrfs_release_extent_buffer(). Currently there is only one caller which relies on btrfs_release_extent_buffer() in its error path while setting the UNMAPPED bit late: - btrfs_clone_extent_buffer() Make it subpage compatible by setting the UNMAPPED bit early, and since we're here, also move the UPTODATE bit early. There is another caller, __alloc_dummy_extent_buffer(), setting the UNMAPPED bit late, but that function cleans up the allocated page manually, thus no modification is needed there. Signed-off-by: Qu Wenruo Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 04/18] btrfs: make attach_extent_buffer_page() handle subpage case
On 1/26/21 3:33 AM, Qu Wenruo wrote: For subpage case, we need to allocate additional memory for each metadata page. So we need to: - Allow attach_extent_buffer_page() to return int to indicate allocation failure - Allow manually pre-allocate subpage memory for alloc_extent_buffer() As we don't want to use GFP_ATOMIC under spinlock, we introduce btrfs_alloc_subpage() and btrfs_free_subpage() functions for this purpose. (The simple wrap for btrfs_free_subpage() is for later convert to kmem_cache. Already internally tested without problem) - Preallocate btrfs_subpage structure for alloc_extent_buffer() We don't want to call memory allocation with spinlock held, so do preallocation before we acquire mapping->private_lock. - Handle subpage and regular case differently in attach_extent_buffer_page() For regular case, no change, just do the usual thing. For subpage case, allocate new memory or use the preallocated memory. For future subpage metadata, we will make use of radix tree to grab extent buffer. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 01/18] btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
On 1/26/21 3:33 AM, Qu Wenruo wrote: PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in __process_pages_contig(), to let the function know to clear the page dirty bit and then set page writeback. However the page writeback and dirty bits are conflicting (at least for the sector size == PAGE_SIZE case), which means these two always have to be updated together. This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK into PAGE_START_WRITEBACK. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 00/18] btrfs: add read-only support for subpage sector size
On 1/26/21 3:33 AM, Qu Wenruo wrote: Patches can be fetched from github: https://github.com/adam900710/linux/tree/subpage Currently the branch also contains partial RW data support (still some ordered extent and data csum mismatch problems). Great thanks to David/Nikolay/Josef for their effort reviewing and merging the preparation patches into misc-next.

=== What works ===

Just from the patchset:
- Data read
  Both regular and compressed data, with csum check.
- Metadata read

This means, with this patchset, 64K page systems can at least mount btrfs with 4K sector size read-only. This should provide the ability to migrate data at least. While on the github branch there is already experimental RW support, there are still ordered extent related bugs for me to fix. Thus only the RO part is sent for review and testing.

=== Patchset structure ===

Patch 01~02: Preparation patches which don't have functional change
Patch 03~12: Subpage metadata allocation and freeing
Patch 13~15: Subpage metadata read path
Patch 16~17: Subpage data read path
Patch 18:    Enable subpage RO support

=== Changelog ===

v1:
- Separate the main implementation from the previous huge patchset
  A huge patchset doesn't make much sense.
- Use bitmap implementation
  Now page::private will be a pointer to the btrfs_subpage structure, which contains bitmaps for various page statuses.

v2:
- Use page::private as btrfs_subpage for extra info
  This replaces the old extent io tree based solution, which reduces latency and doesn't require memory allocation for its operations.
- Cherry-pick new preparation patches from RW development
  Those new preparation patches improve readability on their own.

v3:
- Make dummy extent buffers follow the same subpage accessors
  Fsstress exposed several ASSERT()s for dummy extent buffers. It turns out we need to make dummy extent buffers own the same btrfs_subpage structure so that the eb accessors work properly.
- Two new small __process_pages_contig() related preparation patches
  One to enhance the error handling path for locked_page in __process_pages_contig(), one to merge one macro.
- Extent buffer refs count update
  Except for try_release_extent_buffer(), all other eb users will try to increase the ref count of the eb. For try_release_extent_buffer(), the eb refs check will happen inside the rcu critical section to avoid the eb being freed.
- Comment updates
  Addressing the comments from the mailing list.

v4:
- Get rid of btrfs_subpage::tree_block_bitmap
  This is to reduce lock complexity (no need to bother with an extra subpage lock for metadata, all locks are existing locks). Now eb lookup mostly depends on the radix tree, with small help from btrfs_subpage::under_alloc. Now I haven't experienced metadata related problems any more during my local fsstress tests.
- Fix a race where the metadata page dirty bit can race
  Fixed in the metadata RW patchset though.
- Rebased to latest misc-next branch
  With 4 patches removed, as they are already in misc-next.

v5:
- Use the updated version from David as base
  Most comment/commit message updates should be kept as is.
- A new separate patch to move UNMAPPED bit set timing
- New comment on why we need to prealloc subpage inside a loop
  Mostly for further 16K page size support, where we can have an eb across multiple pages.
- Remove one patch which is too RW specific
  Since it introduces a functional change which only makes sense for RW support, it's not a good idea to include it in RO support.
- Error handling fixes
  Great thanks to Josef.
- Refactor btrfs_subpage allocation/freeing
  Now we have btrfs_alloc_subpage() and btrfs_free_subpage() helpers to do all the allocation/freeing. It's pretty easy to convert to kmem_cache using the above helpers. (Already internally tested using kmem_cache without problems; in fact it was the problems found in the kmem_cache test that led to the new interface.)
- Use btrfs_subpage::eb_refs to replace the old under_alloc
  This makes checking whether the page has any eb left much easier.

Qu Wenruo (18):
  btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
  btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
  btrfs: introduce the skeleton of btrfs_subpage structure
  btrfs: make attach_extent_buffer_page() handle subpage case
  btrfs: make grab_extent_buffer_from_page() handle subpage case
  btrfs: support subpage for extent buffer page release

I don't have this patch in my inbox so I can't reply to it directly, but you include refcount.h, but then use normal atomics. Please use the actual refcount_t, as it gets us all the debugging stuff that makes finding problems much easier. Thanks, Josef
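To make the request concrete, refcount_t has essentially the same cost as atomic_t but saturates and warns on overflow and underflow. A sketch of what the conversion would look like (assuming eb_refs never legitimately needs to go back up from zero; if it does, a bare atomic_t is the right tool after all):

    #include <linux/refcount.h>

    struct btrfs_subpage {
            spinlock_t lock;
            refcount_t eb_refs;     /* instead of atomic_t */
            /* ... */
    };

    /* first eb attached to the page: */
    refcount_set(&subpage->eb_refs, 1);
    /* another eb in the same page: */
    refcount_inc(&subpage->eb_refs);
    /* eb detached; free on last reference: */
    if (refcount_dec_and_test(&subpage->eb_refs))
            btrfs_free_subpage(subpage);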
Re: [PATCH v5 05/18] btrfs: make grab_extent_buffer_from_page() handle subpage case
On 1/26/21 3:33 AM, Qu Wenruo wrote: For the subpage case, grab_extent_buffer() can't really get an extent buffer just from btrfs_subpage. We have the radix tree lock protecting us from inserting the same eb into the tree. Thus we don't really need to do the extra hassle, just let alloc_extent_buffer() handle the existing eb in the radix tree. Now if two ebs are being allocated at the same time, one will fail with -EEXIST when inserting into the radix tree. So for grab_extent_buffer(), just always return NULL for the subpage case. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 06/18] btrfs: support subpage for extent buffer page release
On 1/26/21 3:33 AM, Qu Wenruo wrote: In btrfs_release_extent_buffer_pages(), we need to add extra handling for subpage. Introduce a helper, detach_extent_buffer_page(), to do different handling for regular and subpage cases. For subpage case, handle detaching page private. For unmapped (dummy or cloned) ebs, we can detach the page private immediately as the page can only be attached to one unmapped eb. For mapped ebs, we have to ensure there are no eb in the page range before we delete it, as page->private is shared between all ebs in the same page. But there is a subpage specific race, where we can race with extent buffer allocation, and clear the page private while new eb is still being utilized, like this: Extent buffer A is the new extent buffer which will be allocated, while extent buffer B is the last existing extent buffer of the page. T1 (eb A)| T2 (eb B) ---+-- alloc_extent_buffer() | btrfs_release_extent_buffer_pages() |- p = find_or_create_page() | | |- attach_extent_buffer_page() | | | | |- detach_extent_buffer_page() | ||- if (!page_range_has_eb()) | || No new eb in the page range yet | || As new eb A hasn't yet been | || inserted into radix tree. | ||- btrfs_detach_subpage() | | |- detach_page_private(); |- radix_tree_insert()| Then we have a metadata eb whose page has no private bit. To avoid such race, we introduce a subpage metadata-specific member, btrfs_subpage::eb_refs. In alloc_extent_buffer() we increase eb_refs in the critical section of private_lock. Then page_range_has_eb() will return true for detach_extent_buffer_page(), and will not detach page private. The section is marked by: - btrfs_page_inc_eb_refs() - btrfs_page_dec_eb_refs() Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 94 +--- fs/btrfs/subpage.c | 42 fs/btrfs/subpage.h | 13 +- 3 files changed, 133 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 16a29f63cfd1..118874926179 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4993,25 +4993,39 @@ int extent_buffer_under_io(const struct extent_buffer *eb) test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)); } -/* - * Release all pages attached to the extent buffer. - */ -static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb) +static bool page_range_has_eb(struct btrfs_fs_info *fs_info, struct page *page) { - int i; - int num_pages; - int mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags); + struct btrfs_subpage *subpage; - BUG_ON(extent_buffer_under_io(eb)); + lockdep_assert_held(&page->mapping->private_lock); - num_pages = num_extent_pages(eb); - for (i = 0; i < num_pages; i++) { - struct page *page = eb->pages[i]; + if (PagePrivate(page)) { + subpage = (struct btrfs_subpage *)page->private; + if (atomic_read(&subpage->eb_refs)) + return true; + } + return false; +} - if (!page) - continue; +static void detach_extent_buffer_page(struct extent_buffer *eb, struct page *page) +{ + struct btrfs_fs_info *fs_info = eb->fs_info; + const bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags); + + /* +* For mapped eb, we're going to change the page private, which should +* be done under the private_lock. 
+*/ + if (mapped) + spin_lock(&page->mapping->private_lock); + + if (!PagePrivate(page)) { if (mapped) - spin_lock(&page->mapping->private_lock); + spin_unlock(&page->mapping->private_lock); + return; + } + + if (fs_info->sectorsize == PAGE_SIZE) { /* * We do this since we'll remove the pages after we've * removed the eb from the radix tree, so we could race @@ -5030,9 +5044,49 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb) */ detach_page_private(page); } - if (mapped) spin_unlock(&page->mapping->private_lock); + return; + } + + /* +* For subpage, we can have dummy eb with page private. In this case, +* we can directly detach the private as such page is only attached to +* one dummy eb, no sharing. +*/ +
Re: [PATCH v5 07/18] btrfs: attach private to dummy extent buffer pages
On 1/26/21 3:33 AM, Qu Wenruo wrote: There are locations where we allocate dummy extent buffers for temporary usage, like in tree_mod_log_rewind() or get_old_root(). These dummy extent buffers will be handled by the same eb accessors, and if they don't have page::private, the subpage eb accessors could fail. To address such problems, make __alloc_dummy_extent_buffer() attach page private for dummy extent buffers too. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Yup I'm slow, Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 08/18] btrfs: introduce helpers for subpage uptodate status
On 1/26/21 3:33 AM, Qu Wenruo wrote: Introduce the following functions to handle subpage uptodate status: - btrfs_subpage_set_uptodate() - btrfs_subpage_clear_uptodate() - btrfs_subpage_test_uptodate() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_uptodate() - btrfs_page_clear_uptodate() - btrfs_page_test_uptodate() These helpers can handle both regular sector size and subpage. Although caller should still ensure that the range is inside the page. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
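These helpers share a common shape; roughly (a sketch assuming a 16-bit per-sector bitmap for 4K sectors in a 64K page, protected by the subpage spinlock - details may differ from the posted patch):

    static void btrfs_subpage_set_uptodate(const struct btrfs_fs_info *fs_info,
                                           struct page *page, u64 start, u32 len)
    {
            struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
            const int nbits = len >> fs_info->sectorsize_bits;
            const int start_bit = (start - page_offset(page)) >>
                                  fs_info->sectorsize_bits;
            const u16 tmp = ((1 << nbits) - 1) << start_bit;
            unsigned long flags;

            spin_lock_irqsave(&subpage->lock, flags);
            subpage->uptodate_bitmap |= tmp;
            /* only flip the page flag once every sector is uptodate */
            if (subpage->uptodate_bitmap == U16_MAX)
                    SetPageUptodate(page);
            spin_unlock_irqrestore(&subpage->lock, flags);
    }

The clear/test variants work the same way on the bitmap, and the btrfs_page_*() wrappers simply dispatch to the plain page flag when sectorsize == PAGE_SIZE.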
Re: [PATCH v5 09/18] btrfs: introduce helpers for subpage error status
On 1/26/21 3:33 AM, Qu Wenruo wrote: Introduce the following functions to handle subpage error status: - btrfs_subpage_set_error() - btrfs_subpage_clear_error() - btrfs_subpage_test_error() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_error() - btrfs_page_clear_error() - btrfs_page_test_error() These helpers can handle both regular sector size and subpage without problem. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 10/18] btrfs: support subpage in set/clear_extent_buffer_uptodate()
On 1/26/21 3:33 AM, Qu Wenruo wrote: To support subpage in set_extent_buffer_uptodate and clear_extent_buffer_uptodate we only need to use the subpage-aware helpers to update the page bits. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 11/18] btrfs: support subpage in btrfs_clone_extent_buffer
On 1/26/21 3:33 AM, Qu Wenruo wrote: For btrfs_clone_extent_buffer(), it's mostly the same code as __alloc_dummy_extent_buffer(), except it has an extra page copy. So to make it subpage compatible, we only need to: - Call set_extent_buffer_uptodate() instead of SetPageUptodate() This will set the correct uptodate bit for both the subpage and regular sector size cases. Since we're calling set_extent_buffer_uptodate(), which will also set the EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit either. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 12/18] btrfs: support subpage in try_release_extent_buffer()
On 1/26/21 3:33 AM, Qu Wenruo wrote: Unlike the original try_release_extent_buffer(), try_release_subpage_extent_buffer() will iterate through all the ebs in the page, and try to release each. We can release the full page only after there's no private attached, which means all ebs of that page have been released as well. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 13/18] btrfs: introduce read_extent_buffer_subpage()
On 1/26/21 3:33 AM, Qu Wenruo wrote: Introduce a helper, read_extent_buffer_subpage(), to do the subpage extent buffer read. The differences between the regular and subpage routines are: - No page locking Here we completely rely on extent locking. Page locking can reduce the concurrency greatly, as if we lock one page to read one extent buffer, all the other extent buffers in the same page will have to wait. - Extent uptodate condition Besides the existing PageUptodate() and EXTENT_BUFFER_UPTODATE checks, we also need to check btrfs_subpage::uptodate_bitmap. - No page iteration Just one page, no need to loop; this greatly simplifies the subpage routine. This patch only implements the bio submit part, no endio support yet. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 14/18] btrfs: support subpage in endio_readpage_update_page_status()
On 1/26/21 3:33 AM, Qu Wenruo wrote: To handle subpage status update, add the following: - Use btrfs_page_*() subpage-aware helpers to update page status Now we can handle both cases well. - No page unlock for subpage metadata Since subpage metadata doesn't utilize page locking at all, skip it. For subpage data locking, it's handled in later commits. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v5 15/18] btrfs: introduce subpage metadata validation check
On 1/26/21 3:33 AM, Qu Wenruo wrote: For the subpage metadata validation check, there are some differences: - Read must finish in one bvec Since we're just reading one subpage range in one page, it should never be split into two bios nor two bvecs. - How to grab the existing eb Instead of grabbing the eb using page->private, we have to search the radix tree, as we don't have any direct pointer at hand. Signed-off-by: Qu Wenruo Reviewed-by: David Sterba Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
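The radix tree lookup mentioned here looks roughly like this (a sketch; it assumes the buffer radix tree is indexed by bytenr divided by sectorsize in the subpage world, which may not match the posted patch exactly):

    struct extent_buffer *eb;

    rcu_read_lock();
    eb = radix_tree_lookup(&fs_info->buffer_radix,
                           bytenr >> fs_info->sectorsize_bits);
    /* a reference on eb must still be taken before use, as in the real code */
    rcu_read_unlock();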
Re: misc bio allocation cleanups
On 1/26/21 7:52 AM, Christoph Hellwig wrote: > Hi Jens, > > this series contains various cleanups for how bios are allocated or > initialized plus related fallout. Applied, thanks. -- Jens Axboe
Re: [PATCH v5 16/18] btrfs: introduce btrfs_subpage for data inodes
On 1/26/21 3:34 AM, Qu Wenruo wrote: To support subpage sector size, data also needs extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get the btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing of when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after the page is mapped to an inode. One call site attaches pages to an inode manually, thus we have to modify the timing of set_page_extent_mapped() a little. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simplify the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch to iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page; data balance requires this feature). So we will stick to the btrfs specific bitmaps for now. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/compression.c | 10 ++-- fs/btrfs/extent_io.c| 46 + fs/btrfs/extent_io.h| 3 ++- fs/btrfs/file.c | 24 --- fs/btrfs/free-space-cache.c | 15 +--- fs/btrfs/inode.c| 14 +++ fs/btrfs/ioctl.c| 8 ++- fs/btrfs/reflink.c | 5 +++- fs/btrfs/relocation.c | 11 +++-- 9 files changed, 103 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 5ae3fa0386b7..6d203acfdeb3 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -542,13 +542,19 @@ static noinline int add_ra_bio_pages(struct inode *inode, goto next; } - end = last_offset + PAGE_SIZE - 1; /* * at this point, we have a locked page in the page cache * for these bytes in the file. But, we have to make * sure they map to this compressed extent on disk.
*/ - set_page_extent_mapped(page); + ret = set_page_extent_mapped(page); + if (ret < 0) { + unlock_page(page); + put_page(page); + break; + } + + end = last_offset + PAGE_SIZE - 1; lock_extent(tree, last_offset, end); read_lock(&em_tree->lock); em = lookup_extent_mapping(em_tree, last_offset, diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 139a8a77ed72..3213daaa 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3190,10 +3190,39 @@ static int attach_extent_buffer_page(struct extent_buffer *eb, return ret; } -void set_page_extent_mapped(struct page *page) +int set_page_extent_mapped(struct page *page) { + struct btrfs_fs_info *fs_info; + + ASSERT(page->mapping); + + if (PagePrivate(page)) + return 0; + + fs_info = btrfs_sb(page->mapping->host->i_sb); + + if (fs_info->sectorsize < PAGE_SIZE) + return btrfs_attach_subpage(fs_info, page, BTRFS_SUBPAGE_DATA); + + attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE); + return 0; + +} + +void clear_page_extent_mapped(struct page *page) +{ + struct btrfs_fs_info *fs_info; + + ASSERT(page->mapping); + if (!PagePrivate(page)) - attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE); + return; + + fs_info = btrfs_sb(page->mapping->host->i_sb); + if (fs_info->sectorsize < PAGE_SIZE) + return btrfs_detach_subpage(fs_info, page); + + detach_page_private(page); } static struct extent_map * @@ -3250,7 +3279,12 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached, unsigned long this_bio_flag = 0; struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; - set_page_extent_mapped(page); + ret = set_page_extent_mapped(page); + if (ret < 0) { + unlock_extent(tree, start, end); + SetPageError(page); + goto out; + } if (!PageUptodate(page)) { if (cleancache_get_page(page) == 0) { @@ -3690,7 +3724,11 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc, flush_dcache_page(page); } - set_page_extent_mapped(page); + ret = set_page_extent_mapped(page); + if (ret < 0) { + SetPageError(page); + goto done; + } if (!epd->extent_locked) { ret = writepage_delall
Re: [PATCH v5 17/18] btrfs: integrate page status update for data read path into begin/end_page_read()
On 1/26/21 3:34 AM, Qu Wenruo wrote: In the btrfs data page read path, the page status update is handled in two different locations: btrfs_do_read_page() { while (cur <= end) { /* No need to read from disk */ if (HOLE/PREALLOC/INLINE){ memset(); set_extent_uptodate(); continue; } /* Read from disk */ ret = submit_extent_page(end_bio_extent_readpage); } end_bio_extent_readpage() { endio_readpage_update_page_status(); } This is fine for the sectorsize == PAGE_SIZE case, as for the above loop we should only hit one branch and then exit. But for subpage, there is more work to be done in the page status update: - Page unlock condition Unlike the regular page size == sectorsize case, we can no longer just unlock a page without a brain. Only the last reader of the page can unlock the page. This means we can unlock the page either in the while() loop, or in the endio function. - Page uptodate condition Since we have multiple sectors to read for a page, we can only mark the full page uptodate if all sectors are uptodate. To handle both subpage and regular cases, introduce a pair of functions to help handle the page status update: - begin_page_read() For the regular case, it does nothing. For the subpage case, it updates the reader counters so that later end_page_read() can know who is the last one to unlock the page. - end_page_read() This is just endio_readpage_update_page_status() renamed. The original name is a little too long and too specific for endio. The only new trick added is the condition for page unlock. Now for subpage data, we unlock the page if we're the last reader. This not only provides the basis for subpage data read, but also hides the special handling of page reads from the main read loop. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 38 -- fs/btrfs/subpage.c | 56 ++-- fs/btrfs/subpage.h | 8 +++ 3 files changed, 78 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 3213daaa..7fc2c62d4eb9 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2839,8 +2839,17 @@ static void endio_readpage_release_extent(struct processed_extent *processed, processed->uptodate = uptodate; } -static void endio_readpage_update_page_status(struct page *page, bool uptodate, - u64 start, u32 len) +static void begin_data_page_read(struct btrfs_fs_info *fs_info, struct page *page) +{ + ASSERT(PageLocked(page)); + if (fs_info->sectorsize == PAGE_SIZE) + return; + + ASSERT(PagePrivate(page)); + btrfs_subpage_start_reader(fs_info, page, page_offset(page), PAGE_SIZE); +} + +static void end_page_read(struct page *page, bool uptodate, u64 start, u32 len) { struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb); @@ -2856,7 +2865,12 @@ static void endio_readpage_update_page_status(struct page *page, bool uptodate, if (fs_info->sectorsize == PAGE_SIZE) unlock_page(page); - /* Subpage locking will be handled in later patches */ + else if (is_data_inode(page->mapping->host)) + /* +* For subpage data, unlock the page if we're the last reader. +* For subpage metadata, page lock is not utilized for read.
+*/ + btrfs_subpage_end_reader(fs_info, page, start, len); } /* @@ -2993,7 +3007,7 @@ static void end_bio_extent_readpage(struct bio *bio) bio_offset += len; /* Update page status and unlock */ - endio_readpage_update_page_status(page, uptodate, start, len); + end_page_read(page, uptodate, start, len); endio_readpage_release_extent(&processed, BTRFS_I(inode), start, end, uptodate); } @@ -3263,6 +3277,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached, unsigned int read_flags, u64 *prev_em_start) { struct inode *inode = page->mapping->host; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); u64 start = page_offset(page); const u64 end = start + PAGE_SIZE - 1; u64 cur = start; @@ -3306,6 +3321,7 @@ int btrfs_do_readpage(struct page *page, struct extent_map **em_cached, kunmap_atomic(userpage); } } You have two error cases above this ret = set_page_extent_mapped(page); if (ret < 0) { unlock_extent(tree, start, end); SetPageError(page); goto out; } and if (!PageUptodat
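The reader bookkeeping that begin_page_read()/end_page_read() rely on boils down to a per-page counter in the subpage structure; a sketch of the two helpers the diff calls into (the names come from the diff, the bodies are an assumption):

    static void btrfs_subpage_start_reader(struct btrfs_fs_info *fs_info,
                                           struct page *page, u64 start, u32 len)
    {
            struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
            const int nbits = len >> fs_info->sectorsize_bits;

            /* one "reader" per sector submitted for read */
            atomic_add(nbits, &subpage->readers);
    }

    static void btrfs_subpage_end_reader(struct btrfs_fs_info *fs_info,
                                         struct page *page, u64 start, u32 len)
    {
            struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
            const int nbits = len >> fs_info->sectorsize_bits;

            /* the last reader of the page is the one that unlocks it */
            if (atomic_sub_and_test(nbits, &subpage->readers))
                    unlock_page(page);
    }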
Re: [PATCH v5 18/18] btrfs: allow RO mount of 4K sector size fs on 64K page system
On 1/26/21 3:34 AM, Qu Wenruo wrote: This adds the basic RO mount ability for 4K sector size on 64K page system. Currently we only plan to support 4K and 64K page system. Signed-off-by: Qu Wenruo Signed-off-by: David Sterba Reviewed-by: Josef Bacik Thanks, Josef
ENOSPC in btrfs_run_delayed_refs with 5.10.8 + zstd
Hi, seems 5.10.8 still has the ENOSPC issue when compression is used (compress-force=zstd,space_cache=v2): Jan 27 11:02:14 kernel: [248571.569840] [ cut here ] Jan 27 11:02:14 kernel: [248571.569843] BTRFS: Transaction aborted (error -28) Jan 27 11:02:14 kernel: [248571.569845] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569848] BTRFS info (device dm-0): forced readonly Jan 27 11:02:14 kernel: [248571.569851] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569852] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569854] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130 Jan 27 11:02:14 kernel: [248571.569913] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569939] Modules linked in: Jan 27 11:02:14 kernel: [248571.569966] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.569992] bfq zram bcache crc64 loop dm_crypt xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter bridge stp llc intel_powerclamp coretemp k$ Jan 27 11:02:14 kernel: [248571.570075] CPU: 3 PID: 21255 Comm: kworker/u50:22 Tainted: G I 5.10.8 #1 Jan 27 11:02:14 kernel: [248571.570076] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.13.0 03/02/2018 Jan 27 11:02:14 kernel: [248571.570079] Workqueue: events_unbound btrfs_async_reclaim_metadata_space Jan 27 11:02:14 kernel: [248571.570081] RIP: 0010:add_to_free_space_tree+0xe8/0x130 Jan 27 11:02:14 kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 00 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 44 24 04 e8 8a 99 4a 00 <0f> 0b 8$ Jan 27 11:02:14 kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: 00010282 Jan 27 11:02:14 kernel: [248571.570084] RAX: RBX: 4000 RCX: 0027 Jan 27 11:02:14 kernel: [248571.570085] RDX: 0027 RSI: 0004 RDI: 888617a58b88 Jan 27 11:02:14 kernel: [248571.570086] RBP: 8889ecb874e0 R08: 888617a58b80 R09: Jan 27 11:02:14 kernel: [248571.570087] R10: 0001 R11: 822372e0 R12: 00574151 Jan 27 11:02:14 kernel: [248571.570087] R13: 8884e05727e0 R14: 88815ae4fc00 R15: 88815ae4fdd8 Jan 27 11:02:14 kernel: [248571.570088] FS: () GS:888617a4() knlGS: Jan 27 11:02:14 kernel: [248571.570089] CS: 0010 DS: ES: CR0: 80050033 Jan 27 11:02:14 kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: 0260a005 CR4: 000206e0 Jan 27 11:02:14 kernel: [248571.570091] Call Trace: Jan 27 11:02:14 kernel: [248571.570097] __btrfs_free_extent.isra.0+0x56a/0xa10 Jan 27 11:02:14 kernel: [248571.570100] __btrfs_run_delayed_refs+0x659/0xf20 Jan 27 11:02:14 kernel: [248571.570102] btrfs_run_delayed_refs+0x73/0x200 Jan 27 11:02:14 kernel: [248571.570103] flush_space+0x4e8/0x5e0 Jan 27 11:02:14 kernel: [248571.570105] ? btrfs_get_alloc_profile+0x66/0x1b0 Jan 27 11:02:14 kernel: [248571.570106] ? btrfs_get_alloc_profile+0x66/0x1b0 Jan 27 11:02:14 kernel: [248571.570107] btrfs_async_reclaim_metadata_space+0x107/0x3a0 Jan 27 11:02:14 kernel: [248571.570111] process_one_work+0x1b6/0x350 Jan 27 11:02:14 kernel: [248571.570112] worker_thread+0x50/0x3b0 Jan 27 11:02:14 kernel: [248571.570114] ? 
process_one_work+0x350/0x350 Jan 27 11:02:14 kernel: [248571.570116] kthread+0xfe/0x140 Jan 27 11:02:14 kernel: [248571.570117] ? kthread_park+0x90/0x90 Jan 27 11:02:14 kernel: [248571.570120] ret_from_fork+0x22/0x30 Jan 27 11:02:14 kernel: [248571.570122] ---[ end trace 568d2f30de65b1c0 ]--- Jan 27 11:02:14 kernel: [248571.570123] BTRFS: error (device dm-0) in add_to_free_space_tree:1039: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.570151] BTRFS: error (device dm-0) in __btrfs_free_extent:3270: errno=-28 No space left Jan 27 11:02:14 kernel: [248571.570178] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2191: errno=-28 No space left

btrfs fi usage:

Overall:
    Device size:                 931.49GiB
    Device allocated:            931.49GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Used:                        786.39GiB
    Free (estimated):            107.69GiB  (min: 107.69GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB  (used: 0.00B)
    Multiple profiles:                  no

Data,sing
Re: [PATCH v14 12/42] btrfs: calculate allocation offset for conventional zones
On 1/25/21 9:24 PM, Naohiro Aota wrote: Conventional zones do not have a write pointer, so we cannot use it to determine the allocation offset if a block group contains a conventional zone. But instead, we can consider the end of the last allocated extent in the block group as the allocation offset. For a new block group, we cannot calculate the allocation offset by consulting the extent tree, because doing so can cause a deadlock by taking an extent buffer lock after the chunk mutex (which is already taken in btrfs_make_block_group()). Since it is a new block group, we can simply set the allocation offset to 0 anyway. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v14 13/42] btrfs: track unusable bytes for zones
On 1/25/21 9:24 PM, Naohiro Aota wrote: In zoned btrfs a region that was once written and then freed is not usable until we reset the underlying zone. So we need to distinguish such unusable space from usable free space. Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure, to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But when an allocated region is returned before being used, e.g. when the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For that situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on regular btrfs. On zoned btrfs, it rewinds the allocation offset. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik Thanks, Josef
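Conceptually, the new helper only differs from btrfs_add_free_space() on zoned filesystems; a sketch of the idea (not the actual implementation - the rewind condition here is an assumption):

    int btrfs_add_free_space_unused(struct btrfs_block_group *bg,
                                    u64 bytenr, u64 size)
    {
            if (!btrfs_is_zoned(bg->fs_info))
                    return btrfs_add_free_space(bg, bytenr, size);

            /*
             * The region was allocated but never written, so it still
             * sits at the tip of the zone: rewind the allocation offset
             * so the space stays usable instead of becoming
             * zone_unusable until a zone reset.
             */
            if (bytenr + size == bg->start + bg->alloc_offset)
                    bg->alloc_offset -= size;

            return btrfs_add_free_space(bg, bytenr, size);
    }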
Re: [PATCH v14 41/42] btrfs: serialize log transaction on ZONED mode
On 1/25/21 9:25 PM, Naohiro Aota wrote: This is the 2/3 patch to enable tree-log on ZONED mode. Since we can start more than one log transactions per subvolume simultaneously, nodes from multiple transactions can be allocated interleaved. Such mixed allocation results in non-sequential writes at the time of log transaction commit. The nodes of the global log root tree (fs_info->log_root_tree), also have the same mixed allocation problem. This patch serializes log transactions by waiting for a committing transaction when someone tries to start a new transaction, to avoid the mixed allocation problem. We must also wait for running log transactions from another subvolume, but there is no easy way to detect which subvolume root is running a log transaction. So, this patch forbids starting a new log transaction when other subvolumes already allocated the global log root tree. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik Thanks, Josef
Re: [PATCH v14 22/42] btrfs: split ordered extent when bio is sent
On 1/25/21 9:25 PM, Naohiro Aota wrote:
For a zone append write, the device decides the location the data is written to, so we cannot ensure that two bios are written consecutively on the device. In order to ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. This commit implements splitting of the ordered extent and extent map on bio submission to adhere to that rule.
[testbot] made extract_ordered_extent static
Reported-by: kernel test robot
Signed-off-by: Naohiro Aota

Reviewed-by: Josef Bacik

Thanks,
Josef
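A hedged sketch of the split operation in plain C; names and types are simplified, and the real extract_ordered_extent() also splits the extent map rather than just a byte range:

struct ordered_extent {
        unsigned long long start;   /* logical start */
        unsigned long long len;     /* byte length */
};

/* Split 'oe' at 'pre' bytes so the front part matches exactly the bio
 * being submitted; the remainder becomes a new ordered extent 'tail'.
 * This preserves the "one bio == one ordered extent" invariant. */
static int split_ordered_extent(struct ordered_extent *oe,
                                unsigned long long pre,
                                struct ordered_extent *tail)
{
        if (pre == 0 || pre >= oe->len)
                return -1;      /* nothing to split off */

        tail->start = oe->start + pre;
        tail->len = oe->len - pre;
        oe->len = pre;
        return 0;
}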
Re: [PATCH] btrfs: remove wrong comment for can_nocow_extent()
On Wed, Jan 27, 2021 at 03:05:41PM +, fdman...@kernel.org wrote: > From: Filipe Manana > > The comment for can_nocow_extent() says that the function will flush > ordered extents, however that never happens and was never true before the > comment was added in commit e4ecaf90bc13 ("btrfs: add comments for > btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for > the function btrfs_check_can_nocow(), which after that commit was renamed > to check_can_nocow(). So just remove that part of the comment. > > Signed-off-by: Filipe Manana Added to misc-next, thanks.
Re: [PATCH v4 04/18] btrfs: make attach_extent_buffer_page() to handle subpage case
On Tue, Jan 26, 2021 at 03:29:17PM +0800, Qu Wenruo wrote:
> On 2021/1/20 6:35 AM, David Sterba wrote:
> > On Tue, Jan 19, 2021 at 04:54:28PM -0500, Josef Bacik wrote:
> >> On 1/16/21 2:15 AM, Qu Wenruo wrote:
> >>> +/* For rare cases where we need to pre-allocate a btrfs_subpage structure */
> >>> +static inline int btrfs_alloc_subpage(struct btrfs_fs_info *fs_info,
> >>> +                                      struct btrfs_subpage **ret)
> >>> +{
> >>> +        if (fs_info->sectorsize == PAGE_SIZE)
> >>> +                return 0;
> >>> +
> >>> +        *ret = kzalloc(sizeof(struct btrfs_subpage), GFP_NOFS);
> >>> +        if (!*ret)
> >>> +                return -ENOMEM;
> >>> +        return 0;
> >>> +}
> >>
> >> We're allocating these for every metadata page, that deserves a dedicated kmem_cache. Thanks,
> >
> > I'm not opposed to that idea, but for the first implementation I'm OK with using the default slabs. As the subpage support depends on the filesystem, creating the cache unconditionally would waste resources, and creating it on demand would need some care. Either way I'd rather see it in a separate patch.
>
> Well, too late for me to see this comment...
>
> As I have already converted to a kmem cache.
>
> But the good news is that the latest version has extra refactoring of the memory allocation/freeing, so now we only need to change two lines to change how we allocate memory for subpage (although the cache allocation code would still need to be removed).
>
> Will convert it back to the default slab, but will also keep the refactoring there to make a later conversion to kmem_cache easier.
>
> So don't be too surprised to see a function like this in the next version:
>
>         btrfs_free_subpage(struct btrfs_subpage *subpage)
>         {
>                 kfree(subpage);
>         }

I had hoped to save you the time of converting it to the kmem slabs, so no need to revert it back to kmalloc, keep what you have. Switching via the helper would be easier should we need to reconsider it for some reason.
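The value of confining allocation and freeing to one helper pair, as discussed above, is that switching the backing allocator later only touches the helper bodies. A hedged userspace sketch of that shape (not the kernel code; the struct and function names are invented):

#include <stdlib.h>

struct subpage { unsigned long bitmap; };   /* placeholder fields */

/* Only these two functions know how subpage memory is obtained, so a
 * later switch to kmem_cache_alloc()/kmem_cache_free() is a two-line
 * change confined to their bodies. */
static struct subpage *subpage_alloc(void)
{
        return calloc(1, sizeof(struct subpage));
}

static void subpage_free(struct subpage *s)
{
        free(s);
}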
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 1:57 AM Alexey Isaev wrote: > > kernel version: > > aleksey@host:~$ sudo uname --all > Linux host 4.15.0-132-generic #136~16.04.1-Ubuntu SMP Tue Jan 12 > 18:22:20 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux This is an old and EOL kernel. It could be a long-fixed Btrfs bug that caused this problem, I'm not sure. I suggest 5.4.93+ if you need a longterm kernel; otherwise 5.10.11 is the current stable kernel. > > drive make/model: > > Drive is an external 5 bay HDD enclosure with raid-5 connected via usb-3 > (made by Orico https://www.orico.cc/us/product/detail/3622.html) > with 5 WD Red 10 TB. We use this drive for backups. > > When I try to run btrfs check I get an error message: > > aleksey@host:~$ sudo btrfs check --readonly /dev/sdg > Couldn't open file system OK, is it now on some other dev node? A relatively recent btrfs-progs is also recommended; 5.10 is current and I probably wouldn't use anything older than 5.6.1. > aleksey@host:~$ sudo smartctl -x /dev/sdg Yeah, that probably won't work since it's behind a raid5 controller. I think there are smartctl commands to enable passthrough and get information for each drive, so that you don't have to put it in JBOD mode, but I'm not familiar with how to do that. Anyway, it's a good idea to find out if SMART is reporting any problems on any drive, but it's not urgent. -- Chris Murphy
Re: btrfs becomes read-only
On Wed, Jan 27, 2021 at 6:05 AM Alexey Isaev wrote: > > I managed to run btrfs check, but it didn't solve the problem: > > aleksey@host:~$ sudo btrfs check --repair /dev/sdg OK, it's risky to run --repair without a developer giving a go-ahead, in particular with older versions of btrfs-progs. There are warnings in the man page about it. > [sudo] password for aleksey: > enabling repair mode > Checking filesystem on /dev/sdg > UUID: 070ce9af-6511-4b89-a501-0823514320c1 > checking extents > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > parent transid verify failed on 52180048330752 wanted 132477 found 132432 > Ignoring transid failure > leaf parent key incorrect 52180048330752 > bad block 52180048330752 > Errors found in extent allocation tree or chunk allocation > parent transid verify failed on 52180048330752 wanted 132477 found 132432 Yeah, it's not finding what it's expecting to find there. Any power failure or crash in the history of the file system? What do you get for: btrfs insp dump-s -f /dev/sdg -- Chris Murphy
Re: Only one subvolume can be mounted after replace/balance
On Wed, Jan 27, 2021 at 6:10 AM Jakob Schöttl wrote: > > Thank you Chris, it's resolved now, see below. > > On 25.01.21 at 23:47, Chris Murphy wrote: > > On Sat, Jan 23, 2021 at 7:50 AM Jakob Schöttl wrote: > >> Hi, > >> > >> In short: > >> When mounting a second subvolume from a pool, I get this error: > >> "mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sda, > >> missing code page or helper program, or other." > >> dmesg | grep BTRFS only shows this error: > >> info (device sda): disk space caching is enabled > >> error (device sda): Remounting read-write after error is not allowed > > It went read-only before this because it's confused. You need to > > unmount it before it can be mounted rw. In some cases a reboot is > > needed. > Oh, I didn't notice that the pool was already mounted (via fstab). > The filesystem was out of space and I had to resize both disks > separately. And I had to mount with -o skip_balance for that. Now it > works again. > > >> What happened: > >> > >> In my RAID1 pool with two disks, I successfully replaced one disk with > >> > >> btrfs replace start 2 /dev/sdx > >> > >> After that, I mounted the pool and did > > I don't understand this sequence. In order to do a replace, the file > > system is already mounted. > That was what I did before my actual problem occurred. But it's > resolved now. > > >> btrfs fi show /mnt > >> > >> which showed WARNINGs about > >> "filesystems with multiple block group profiles detected" > >> (don't remember exactly) > >> > >> I thought it was a good idea to do > >> > >> btrfs balance start /mnt > >> > >> which finished without errors. > > Balance alone does not convert block groups to a new profile. You have > > to explicitly select a conversion filter, e.g. > > > > btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt > I didn't want to convert to a new profile. I thought btrfs replace > automatically uses the same profile as the pool? Btrfs replace does not change the profile. But you reported mixed profile block groups, which means a conversion is indicated to make sure they're all the same. Please post: sudo btrfs fi us /mnt Let's see what the block groups are and what you want them to be, and then see what conversion command might be indicated. -- Chris Murphy
Re: ENOSPC in btrfs_run_delayed_refs with 5.10.8 + zstd
On Wed, Jan 27, 2021 at 10:27 AM Martin Raiber wrote: > > Hi, > > seems 5.10.8 still has the ENOSPC issue when compression is used > (compress-force=zstd,space_cache=v2): > > Jan 27 11:02:14 kernel: [248571.569840] [ cut here ] > Jan 27 11:02:14 kernel: [248571.569843] BTRFS: Transaction aborted (error > -28) > Jan 27 11:02:14 kernel: [248571.569845] BTRFS: error (device dm-0) in > add_to_free_space_tree:1039: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569848] BTRFS info (device dm-0): forced > readonly > Jan 27 11:02:14 kernel: [248571.569851] BTRFS: error (device dm-0) in > add_to_free_space_tree:1039: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569852] BTRFS: error (device dm-0) in > __btrfs_free_extent:3270: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569854] BTRFS: error (device dm-0) in > btrfs_run_delayed_refs:2191: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569898] WARNING: CPU: 3 PID: 21255 at > fs/btrfs/free-space-tree.c:1039 add_to_free_space_tree+0xe8/0x130 > Jan 27 11:02:14 kernel: [248571.569913] BTRFS: error (device dm-0) in > __btrfs_free_extent:3270: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569939] Modules linked in: > Jan 27 11:02:14 kernel: [248571.569966] BTRFS: error (device dm-0) in > btrfs_run_delayed_refs:2191: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.569992] bfq zram bcache crc64 loop dm_crypt > xfs dm_mod st sr_mod cdrom nf_tables nfnetlink iptable_filter bridge stp llc > intel_powerclamp coretemp k$ > Jan 27 11:02:14 kernel: [248571.570075] CPU: 3 PID: 21255 Comm: > kworker/u50:22 Tainted: G I 5.10.8 #1 > Jan 27 11:02:14 kernel: [248571.570076] Hardware name: Dell Inc. PowerEdge > R510/0DPRKF, BIOS 1.13.0 03/02/2018 > Jan 27 11:02:14 kernel: [248571.570079] Workqueue: events_unbound > btrfs_async_reclaim_metadata_space > Jan 27 11:02:14 kernel: [248571.570081] RIP: > 0010:add_to_free_space_tree+0xe8/0x130 > Jan 27 11:02:14 kernel: [248571.570082] Code: 55 50 f0 48 0f ba aa 40 0a 00 > 00 02 72 22 83 f8 fb 74 4c 83 f8 e2 74 47 89 c6 48 c7 c7 b8 39 49 82 89 44 24 > 04 e8 8a 99 4a 00 <0f> 0b 8$ > Jan 27 11:02:14 kernel: [248571.570083] RSP: 0018:c90009c57b88 EFLAGS: > 00010282 > Jan 27 11:02:14 kernel: [248571.570084] RAX: RBX: > 4000 RCX: 0027 > Jan 27 11:02:14 kernel: [248571.570085] RDX: 0027 RSI: > 0004 RDI: 888617a58b88 > Jan 27 11:02:14 kernel: [248571.570086] RBP: 8889ecb874e0 R08: > 888617a58b80 R09: > Jan 27 11:02:14 kernel: [248571.570087] R10: 0001 R11: > 822372e0 R12: 00574151 > Jan 27 11:02:14 kernel: [248571.570087] R13: 8884e05727e0 R14: > 88815ae4fc00 R15: 88815ae4fdd8 > Jan 27 11:02:14 kernel: [248571.570088] FS: () > GS:888617a4() knlGS: > Jan 27 11:02:14 kernel: [248571.570089] CS: 0010 DS: ES: CR0: > 80050033 > Jan 27 11:02:14 kernel: [248571.570090] CR2: 7eb4a3a4f00a CR3: > 0260a005 CR4: 000206e0 > Jan 27 11:02:14 kernel: [248571.570091] Call Trace: > Jan 27 11:02:14 kernel: [248571.570097] > __btrfs_free_extent.isra.0+0x56a/0xa10 > Jan 27 11:02:14 kernel: [248571.570100] __btrfs_run_delayed_refs+0x659/0xf20 > Jan 27 11:02:14 kernel: [248571.570102] btrfs_run_delayed_refs+0x73/0x200 > Jan 27 11:02:14 kernel: [248571.570103] flush_space+0x4e8/0x5e0 > Jan 27 11:02:14 kernel: [248571.570105] ? btrfs_get_alloc_profile+0x66/0x1b0 > Jan 27 11:02:14 kernel: [248571.570106] ? 
btrfs_get_alloc_profile+0x66/0x1b0 > Jan 27 11:02:14 kernel: [248571.570107] > btrfs_async_reclaim_metadata_space+0x107/0x3a0 > Jan 27 11:02:14 kernel: [248571.570111] process_one_work+0x1b6/0x350 > Jan 27 11:02:14 kernel: [248571.570112] worker_thread+0x50/0x3b0 > Jan 27 11:02:14 kernel: [248571.570114] ? process_one_work+0x350/0x350 > Jan 27 11:02:14 kernel: [248571.570116] kthread+0xfe/0x140 > Jan 27 11:02:14 kernel: [248571.570117] ? kthread_park+0x90/0x90 > Jan 27 11:02:14 kernel: [248571.570120] ret_from_fork+0x22/0x30 > Jan 27 11:02:14 kernel: [248571.570122] ---[ end trace 568d2f30de65b1c0 ]--- > Jan 27 11:02:14 kernel: [248571.570123] BTRFS: error (device dm-0) in > add_to_free_space_tree:1039: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.570151] BTRFS: error (device dm-0) in > __btrfs_free_extent:3270: errno=-28 No space left > Jan 27 11:02:14 kernel: [248571.570178] BTRFS: error (device dm-0) in > btrfs_run_delayed_refs:2191: errno=-28 No space left > > > btrfs fi usage: > > Overall: > Device size: 931.49GiB > Device allocated:931.49GiB > Device unallocated:1.00MiB > Device missing: 0.00B > Used:786.39GiB > Free (estimated):
Re: [PATCH v5 00/18] btrfs: add read-only support for subpage sector size
On 2021/1/28 12:17 AM, Josef Bacik wrote:
On 1/26/21 3:33 AM, Qu Wenruo wrote:
Patches can be fetched from github:
https://github.com/adam900710/linux/tree/subpage
Currently the branch also contains partial RW data support (still some ordered extent and data csum mismatch problems).

Great thanks to David/Nikolay/Josef for their effort reviewing and merging the preparation patches into misc-next.

=== What works ===

Just from the patchset:
- Data read
  Both regular and compressed data, with csum check.
- Metadata read

This means that with this patchset, 64K page systems can at least mount btrfs with 4K sector size read-only. This should provide the ability to migrate data at least. While the github branch already has experimental RW support, there are still ordered extent related bugs for me to fix, thus only the RO part is sent for review and testing.

=== Patchset structure ===

Patch 01~02: Preparation patches which don't have functional changes
Patch 03~12: Subpage metadata allocation and freeing
Patch 13~15: Subpage metadata read path
Patch 16~17: Subpage data read path
Patch 18:    Enable subpage RO support

=== Changelog ===

v1:
- Separate the main implementation from the previous huge patchset
  A huge patchset doesn't make much sense.
- Use a bitmap implementation
  Now page::private will be a pointer to a btrfs_subpage structure, which contains bitmaps for the various page statuses.

v2:
- Use page::private as btrfs_subpage for extra info
  This replaces the old extent io tree based solution, which reduces latency and doesn't require memory allocation for its operations.
- Cherry-pick new preparation patches from RW development
  Those new preparation patches improve readability on their own.

v3:
- Make dummy extent buffers follow the same subpage accessors
  Fsstress exposed several ASSERT()s for dummy extent buffers. It turns out we need to make dummy extent buffers own the same btrfs_subpage structure for the eb accessors to work properly.
- Two new small __process_pages_contig() related preparation patches
  One to make __process_pages_contig() enhance the error handling path for locked_page, one to merge one macro.
- Extent buffer refs count update
  Except for try_release_extent_buffer(), all other eb users will try to increase the ref count of the eb. For try_release_extent_buffer(), the eb refs check will happen inside the rcu critical section to avoid the eb being freed.
- Comment updates
  Addressing the comments from the mailing list.

v4:
- Get rid of btrfs_subpage::tree_block_bitmap
  This is to reduce lock complexity (no need to bother with an extra subpage lock for metadata; all locks are existing locks). Now eb lookup mostly depends on the radix tree, with small help from btrfs_subpage::under_alloc. I haven't experienced metadata related problems any more during my local fsstress tests.
- Fix a race on the metadata page dirty bit
  Fixed in the metadata RW patchset though.
- Rebased to the latest misc-next branch
  With 4 patches removed, as they are already in misc-next.

v5:
- Use the updated version from David as the base
  Most comment/commit message updates should be kept as is.
- A new separate patch to move the UNMAPPED bit set timing
- New comment on why we need to prealloc subpage inside a loop
  Mostly for future 16K page size support, where we can have an eb across multiple pages.
- Remove one patch which is too RW specific
  Since it introduces a functional change which only makes sense for RW support, it's not a good idea to include it in RO support.
- Error handling fixes
  Great thanks to Josef.
- Refactor btrfs_subpage allocation/freeing
  Now we have btrfs_alloc_subpage() and btrfs_free_subpage() helpers to do all the allocation/freeing. It's pretty easy to convert to kmem_cache using the above helpers. (Already internally tested using kmem_cache without problems; in fact, it was the problems found in the kmem_cache test that led to the new interface.)
- Use btrfs_subpage::eb_refs to replace the old under_alloc
  This makes checking whether the page has any ebs left much easier.

Qu Wenruo (18):
  btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
  btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
  btrfs: introduce the skeleton of btrfs_subpage structure
  btrfs: make attach_extent_buffer_page() handle subpage case
  btrfs: make grab_extent_buffer_from_page() handle subpage case
  btrfs: support subpage for extent buffer page release

I don't have this patch in my inbox so I can't reply to it directly, but you include refcount.h and then use normal atomics. Please use the actual refcount_t, as it gets us all the debugging stuff that makes finding problems much easier. Thanks,

My bad, my initial plan was to use refcount, but the use case has a valid 0 refcount usage, thus refcount is not good here. I'll remove the remaini
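For context on the refcount_t point: refcount_t deliberately saturates and warns when a counter is incremented from zero, since for object lifetimes zero means "already freed". A counter where zero is a valid live state, as described for the subpage eb references here, therefore has to stay a plain atomic. An illustrative kernel-style comparison (written as a comment since it is not compilable standalone; the sp/eb_refs names are made up):

/* With refcount_t, this pattern triggers a warning:
 *
 *      refcount_set(&sp->refs, 0);     // page attached, no ebs yet
 *      refcount_inc(&sp->refs);        // WARN: "addition on 0; use-after-free"
 *
 * With a plain atomic the same pattern is fine:
 *
 *      atomic_set(&sp->eb_refs, 0);    // 0 is a legitimate state here
 *      atomic_inc(&sp->eb_refs);       // first eb attached to the page
 *      atomic_dec(&sp->eb_refs);       // back to 0; the page may stay around
 */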