Re: read time tree block corruption detected

2021-04-20 Thread Qu Wenruo



On 2021/4/20 10:19 PM, Gervais, Francois wrote:

On 2021/4/19 10:56 PM, Gervais, Francois wrote:

My bad, wrong number.

The correct command is:
# btrfs ins dump-tree -b 790151168 /dev/loop0p3



root@debug:~# btrfs ins dump-tree -b 790151168 /dev/loop0p3
btrfs-progs v5.7

[...]

     item 4 key (5007 INODE_ITEM 0) itemoff 15760 itemsize 160
     generation 294 transid 219603 size 0 nbytes 18446462598731726987


The nbytes looks very strange.

It's 0xfffeffffffef008b, which definitely looks awful for an empty inode.


     block group 0 mode 100600 links 1 uid 1000 gid 1000 rdev 0
     sequence 476091 flags 0x0(none)
     atime 1610373772.750632843 (2021-01-11 14:02:52)
     ctime 1617477826.205928110 (2021-04-03 19:23:46)
     mtime 1617477826.205928110 (2021-04-03 19:23:46)
     otime 0.0 (1970-01-01 00:00:00)
     item 5 key (5007 INODE_REF 4727) itemoff 15732 itemsize 28
     index 0 namelen 0 name:
     index 0 namelen 0 name:
     index 0 namelen 294 name:


Definitely corrupted. I'm afraid the tree-checker is correct.

The log tree is corrupted.
And the check to detect such a corrupted inode ref was only introduced in
the v5.5 kernel, so no wonder the v5.4 kernel didn't catch it at runtime.


Would detecting it at runtime with a newer kernel have helped in any way with
the corruption?


Yes, a newer kernel will reject the write, so such damaged metadata won't 
reach the disk.


But that's just more graceful than a corrupted fs.
It will still cause errors like an aborted transaction.





I don't have any idea why this could happen, as it doesn't look like an
obvious bit flip.


The test engineer says that the last thing he did was remove power from the
device.

Could power loss be the cause of this issue?


It shouldn't.
The log tree can only be exposed by power loss, but it's not designed to 
have such corrupted data on-disk.


This normally means some code is wrong when generating the log tree.

Thanks,
Qu





Maybe Filipe could have some clue on this?

Thanks,
Qu


     item 6 key (5041 INODE_ITEM 0) itemoff 15572 itemsize 160
     generation 295 transid 219603 size 4096 nbytes 4096
     block group 0 mode 100600 links 1 uid 1000 gid 1000 rdev 0
     sequence 321954 flags 0x0(none)
     atime 1610373832.763235044 (2021-01-11 14:03:52)
     ctime 1617477815.541863825 (2021-04-03 19:23:35)
     mtime 1617477815.541863825 (2021-04-03 19:23:35)
     otime 0.0 (1970-01-01 00:00:00)
     item 7 key (5041 INODE_REF 4727) itemoff 15544 itemsize 28
     index 12 namelen 18 name: health_metrics.txt
     item 8 key (5041 EXTENT_DATA 0) itemoff 15491 itemsize 53
     generation 219603 type 1 (regular)
     extent data disk byte 12746752 nr 4096
     extent data offset 0 nr 4096 ram 4096
     extent compression 0 (none)
     item 9 key (EXTENT_CSUM EXTENT_CSUM 12746752) itemoff 15487 itemsize 4
     range start 12746752 end 12750848 length 4096



[PATCH v3] btrfs-progs: mkfs: only output the warning if the sectorsize is not supported

2021-04-20 Thread Qu Wenruo
Currently mkfs.btrfs will output a warning message if the sectorsize is
not the same as the page size:
  WARNING: the filesystem may not be mountable, sectorsize 4096 doesn't match page size 65536

But since btrfs subpage support for 64K page size is coming, this
output is polluting the golden output of fstests, causing tons of false
alerts.

This patch teaches mkfs.btrfs to check
/sys/fs/btrfs/features/supported_sectorsizes to see whether the sector
size is supported.

The warning message above is then only output if the sector size is not
supported.

This patch will also introduce a new helper,
sysfs_open_global_feature_file() to make it more obvious which global
feature file we're opening.

Signed-off-by: Qu Wenruo 
---
changelog:
v2:
- Introduce a new helper to open global feature files
- Extract the supported sectorsize check into its own function
- Do a proper token check instead of strstr()
- Fix the bug that we're passing @page_size to the check
v3:
- Fix the wrong delimiter for the subsequent runs of strtok_r()
- Also check the terminating '\0' to handle cases like "4096" and "40960"
  This is not strictly needed, as is_power_of_2() has already ruled
  out such cases, but just to be extra sure.
---
 common/fsfeatures.c | 54 -
 common/utils.c  | 15 +
 common/utils.h  |  1 +
 3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 569208a9e5b1..0c638b3af15e 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -327,8 +327,55 @@ u32 get_running_kernel_version(void)
 
return version;
 }
+
+/*
+ * The buffer size should be strlen("4096 8192 16384 32768 65536"),
+ * which is 28, then we just round it up to 32.
+ */
+#define SUPPORTED_SECTORSIZE_BUF_SIZE  32
+
+/*
+ * Check if the current kernel supports given sectorsize.
+ *
+ * Return true if the sectorsize is supported.
+ * Return false otherwise.
+ */
+static bool check_supported_sectorsize(u32 sectorsize)
+{
+   char supported_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   char sectorsize_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   char *this_char;
+   char *save_ptr = NULL;
+   int fd;
+   int ret;
+
+   fd = sysfs_open_global_feature_file("supported_sectorsizes");
+   if (fd < 0)
+   return false;
+   ret = sysfs_read_file(fd, supported_buf, SUPPORTED_SECTORSIZE_BUF_SIZE);
+   close(fd);
+   if (ret < 0)
+   return false;
+   snprintf(sectorsize_buf, SUPPORTED_SECTORSIZE_BUF_SIZE,
+"%u", sectorsize);
+
+   for (this_char = strtok_r(supported_buf, " ", &save_ptr);
+this_char != NULL;
+this_char = strtok_r(NULL, " ", &save_ptr)) {
+   /*
+* Also check the terminal '\0' to handle cases like
+* "4096" and "40960".
+*/
+   if (!strncmp(this_char, sectorsize_buf,
+strlen(sectorsize_buf) + 1))
+   return true;
+   }
+   return false;
+}
+
 int btrfs_check_sectorsize(u32 sectorsize)
 {
+   bool sectorsize_checked = false;
u32 page_size = (u32)sysconf(_SC_PAGESIZE);
 
if (!is_power_of_2(sectorsize)) {
@@ -340,7 +387,12 @@ int btrfs_check_sectorsize(u32 sectorsize)
  sectorsize);
return -EINVAL;
}
-   if (page_size != sectorsize)
+   if (page_size == sectorsize)
+   sectorsize_checked = true;
+   else
+   sectorsize_checked = check_supported_sectorsize(sectorsize);
+
+   if (!sectorsize_checked)
warning(
 "the filesystem may not be mountable, sectorsize %u doesn't match page size 
%u",
sectorsize, page_size);
diff --git a/common/utils.c b/common/utils.c
index 57e41432c8fb..e8b35879f19f 100644
--- a/common/utils.c
+++ b/common/utils.c
@@ -2205,6 +2205,21 @@ int sysfs_open_fsid_file(int fd, const char *filename)
return open(sysfs_file, O_RDONLY);
 }
 
+/*
+ * Open a file in global btrfs features directory and return the file
+ * descriptor or error.
+ */
+int sysfs_open_global_feature_file(const char *feature_name)
+{
+   char path[PATH_MAX];
+   int ret;
+
+   ret = path_cat_out(path, "/sys/fs/btrfs/features", feature_name);
+   if (ret < 0)
+   return ret;
+   return open(path, O_RDONLY);
+}
+
 /*
  * Read up to @size bytes to @buf from @fd
  */
diff --git a/common/utils.h b/common/utils.h
index c38bdb08077c..d2f6416a9b5a 100644
--- a/common/utils.h
+++ b/common/utils.h
@@ -169,6 +169,7 @@ char *btrfs_test_for_multiple_profiles(int fd);
 int btrfs_warn_multiple_profiles(int fd);
 
 int sysfs_open_fsid_file(int fd, const char *filename);
+int sysfs_open_global_feature_file(const char *feature_name);
 int sysfs_read_file(int fd, char *buf, size_t size);
 
 #endif
-- 
2.31.1



Re: read time tree block corruption detected

2021-04-19 Thread Qu Wenruo




On 2021/4/19 10:56 PM, Gervais, Francois wrote:

My bad, wrong number.

The correct command is:
# btrfs ins dump-tree -b 790151168 /dev/loop0p3



root@debug:~# btrfs ins dump-tree -b 790151168 /dev/loop0p3
btrfs-progs v5.7

[...]

item 4 key (5007 INODE_ITEM 0) itemoff 15760 itemsize 160
generation 294 transid 219603 size 0 nbytes 18446462598731726987


The nbytes looks very strange.

It's 0xfffeffffffef008b, which definitely looks awful for an empty inode.


block group 0 mode 100600 links 1 uid 1000 gid 1000 rdev 0
sequence 476091 flags 0x0(none)
atime 1610373772.750632843 (2021-01-11 14:02:52)
ctime 1617477826.205928110 (2021-04-03 19:23:46)
mtime 1617477826.205928110 (2021-04-03 19:23:46)
otime 0.0 (1970-01-01 00:00:00)
item 5 key (5007 INODE_REF 4727) itemoff 15732 itemsize 28
index 0 namelen 0 name:
index 0 namelen 0 name:
index 0 namelen 294 name:


Definitely corrupted. I'm afraid the tree-checker is correct.

The log tree is corrupted.
And the check to detect such a corrupted inode ref was only introduced in
the v5.5 kernel, so no wonder the v5.4 kernel didn't catch it at runtime.

I don't have any idea why this could happen, as it doesn't look like an
obvious bit flip.

Maybe Filipe could have some clue on this?

Thanks,
Qu


item 6 key (5041 INODE_ITEM 0) itemoff 15572 itemsize 160
generation 295 transid 219603 size 4096 nbytes 4096
block group 0 mode 100600 links 1 uid 1000 gid 1000 rdev 0
sequence 321954 flags 0x0(none)
atime 1610373832.763235044 (2021-01-11 14:03:52)
ctime 1617477815.541863825 (2021-04-03 19:23:35)
mtime 1617477815.541863825 (2021-04-03 19:23:35)
otime 0.0 (1970-01-01 00:00:00)
item 7 key (5041 INODE_REF 4727) itemoff 15544 itemsize 28
index 12 namelen 18 name: health_metrics.txt
item 8 key (5041 EXTENT_DATA 0) itemoff 15491 itemsize 53
generation 219603 type 1 (regular)
extent data disk byte 12746752 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
item 9 key (EXTENT_CSUM EXTENT_CSUM 12746752) itemoff 15487 itemsize 4
range start 12746752 end 12750848 length 4096



Re: [PATCH v2] btrfs-progs: mkfs: only output the warning if the sectorsize is not supported

2021-04-19 Thread Qu Wenruo




On 2021/4/20 12:31 AM, Boris Burkov wrote:

On Mon, Apr 19, 2021 at 02:45:12PM +0800, Qu Wenruo wrote:

Currently mkfs.btrfs will output a warning message if the sectorsize is
not the same as the page size:
   WARNING: the filesystem may not be mountable, sectorsize 4096 doesn't match page size 65536

But since btrfs subpage support for 64K page size is coming, this
output is polluting the golden output of fstests, causing tons of false
alerts.

This patch teaches mkfs.btrfs to check
/sys/fs/btrfs/features/supported_sectorsizes to see whether the sector
size is supported.

The warning message above is then only output if the sector size is not
supported.

This patch will also introduce a new helper,
sysfs_open_global_feature_file() to make it more obvious which global
feature file we're opening.

Signed-off-by: Qu Wenruo 
---
changelog:
v2:
- Introduce a new helper to open global feature files
- Extract the supported sectorsize check into its own function
- Do a proper token check instead of strstr()
- Fix the bug that we're passing @page_size to the check
---
  common/fsfeatures.c | 49 -
  common/utils.c  | 15 ++
  common/utils.h  |  1 +
  3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 569208a9e5b1..6641c44dfa45 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -327,8 +327,50 @@ u32 get_running_kernel_version(void)

return version;
  }
+
+/*
+ * The buffer size should be strlen("4096 8192 16384 32768 65536"),
+ * which is 28, then we just round it up to 32.
+ */
+#define SUPPORTED_SECTORSIZE_BUF_SIZE  32
+
+/*
+ * Check if the current kernel supports given sectorsize.
+ *
+ * Return true if the sectorsize is supported.
+ * Return false otherwise.
+ */
+static bool check_supported_sectorsize(u32 sectorsize)
+{
+   char supported_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   char sectorsize_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   char *this_char;
+   char *save_ptr = NULL;
+   int fd;
+   int ret;
+
+   fd = sysfs_open_global_feature_file("supported_sectorsizes");
+   if (fd < 0)
+   return false;
+   ret = sysfs_read_file(fd, supported_buf, SUPPORTED_SECTORSIZE_BUF_SIZE);
+   close(fd);
+   if (ret < 0)
+   return false;
+   snprintf(sectorsize_buf, SUPPORTED_SECTORSIZE_BUF_SIZE,
+"%u", sectorsize);
+
+   for (this_char = strtok_r(supported_buf, " ", &save_ptr);
+this_char != NULL;
+this_char = strtok_r(NULL, ",", &save_ptr)) {


Based on the example file contents in the comment, I would expect " " as
the delimiter for looping through the supported sizes, not ",".


What am I doing? (facepalm...)

Thanks for pointing this out,
Qu



+   if (!strncmp(this_char, sectorsize_buf, strlen(sectorsize_buf)))
+   return true;
+   }
+   return false;
+}
+
  int btrfs_check_sectorsize(u32 sectorsize)
  {
+   bool sectorsize_checked = false;
u32 page_size = (u32)sysconf(_SC_PAGESIZE);

if (!is_power_of_2(sectorsize)) {
@@ -340,7 +382,12 @@ int btrfs_check_sectorsize(u32 sectorsize)
  sectorsize);
return -EINVAL;
}
-   if (page_size != sectorsize)
+   if (page_size == sectorsize)
+   sectorsize_checked = true;
+   else
+   sectorsize_checked = check_supported_sectorsize(sectorsize);
+
+   if (!sectorsize_checked)
warning(
  "the filesystem may not be mountable, sectorsize %u doesn't match page size 
%u",
sectorsize, page_size);
diff --git a/common/utils.c b/common/utils.c
index 57e41432c8fb..e8b35879f19f 100644
--- a/common/utils.c
+++ b/common/utils.c
@@ -2205,6 +2205,21 @@ int sysfs_open_fsid_file(int fd, const char *filename)
return open(sysfs_file, O_RDONLY);
  }

+/*
+ * Open a file in global btrfs features directory and return the file
+ * descriptor or error.
+ */
+int sysfs_open_global_feature_file(const char *feature_name)
+{
+   char path[PATH_MAX];
+   int ret;
+
+   ret = path_cat_out(path, "/sys/fs/btrfs/features", feature_name);
+   if (ret < 0)
+   return ret;
+   return open(path, O_RDONLY);
+}
+
  /*
   * Read up to @size bytes to @buf from @fd
   */
diff --git a/common/utils.h b/common/utils.h
index c38bdb08077c..d2f6416a9b5a 100644
--- a/common/utils.h
+++ b/common/utils.h
@@ -169,6 +169,7 @@ char *btrfs_test_for_multiple_profiles(int fd);
  int btrfs_warn_multiple_profiles(int fd);

  int sysfs_open_fsid_file(int fd, const char *filename);
+int sysfs_open_global_feature_file(const char *feature_name);
  int sysfs_read_file(int fd, char *buf, size_t size);

  #endif
--
2.31.1



Re: read time tree block corruption detected

2021-04-19 Thread Qu Wenruo




On 2021/4/19 9:20 PM, Gervais, Francois wrote:

Please provide the following dump:
   # btrfs ins dump-tree -b 18446744073709551610 /dev/loop0p3

I'm wondering why the write-time tree-checker didn't catch it.

Thanks,
Qu


I get:

root@debug:~# btrfs ins dump-tree -b 18446744073709551610 /dev/loop0p3
btrfs-progs v5.7
ERROR: tree block bytenr 18446744073709551610 is not aligned to sectorsize 4096


My bad, wrong number.

The correct command is:
# btrfs ins dump-tree -b 790151168 /dev/loop0p3

Thanks,
Qu


We have an unusual partition table due to a hardware (CPU) requirement.
Could this be the source of the error?

Disk /dev/loop0: 40763392 sectors, 19.4 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): A18E4543-634A-4E8C-B55D-DA1E217C4D98
Partition table holds up to 24 entries
Main partition table begins at sector 2 and ends at sector 7
First usable sector is 8, last usable sector is 40763384
Partitions will be aligned on 8-sector boundaries
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size        Code  Name
   1               8           32775   16.0 MiB    8300
   2           32776          237575   100.0 MiB   8300
   3          237576        40763384   19.3 GiB    8300






Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-19 Thread Qu Wenruo

[...]


diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 45ec3f5ef839..49f78d643392 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1341,7 +1341,17 @@ static int prepare_uptodate_page(struct inode *inode,

    unlock_page(page);
    return -EIO;
    }
-   if (page->mapping != inode->i_mapping) {
+
+   /*
+    * Since btrfs_readpage() will get the page unlocked, we have
+    * a window where fadvise() can try to release the page.
+    * Here we check both the inode mapping and PagePrivate() to
+    * make sure the page is not released.
+    *
+    * The private flag check is essential for subpage as we need
+    * to store an extra bitmap using page->private.
+    */
+   if (page->mapping != inode->i_mapping || PagePrivate(page)) {

  ^ Obviously it should be !PagePrivate(page).


Hi Ritesh,

Would you mind giving generic/095 another try?

This time the branch is updated with the following commit at top:

commit d700b16dced6f2e2b47e1ca5588a92216ce84dfb (HEAD -> subpage, github/subpage)
Author: Qu Wenruo 
Date:   Mon Apr 19 13:41:31 2021 +0800

    btrfs: fix a crash caused by race between prepare_pages() and
    btrfs_releasepage()

The fix uses the PagePrivate() check to avoid the problem, and survives 
several generic/auto loops without any sign of a crash.


But considering I have always had difficulty reproducing the bug with 
the previous improper fix, your verification would be very helpful.


Thanks,
Qu




    unlock_page(page);
    return -EAGAIN;
    }





-ritesh





Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-19 Thread Qu Wenruo




On 2021/4/19 2:16 PM, Qu Wenruo wrote:



On 2021/4/19 1:59 PM, riteshh wrote:

On 21/04/16 10:22PM, riteshh wrote:

On 21/04/16 02:14PM, Qu Wenruo wrote:



On 2021/4/16 1:50 PM, riteshh wrote:

On 21/04/16 09:34AM, Qu Wenruo wrote:



On 2021/4/16 7:34 AM, Qu Wenruo wrote:



On 2021/4/16 7:19 AM, Qu Wenruo wrote:



On 2021/4/15 10:52 PM, riteshh wrote:

On 21/04/15 09:14AM, riteshh wrote:

On 21/04/12 07:33PM, Qu Wenruo wrote:
Good news, you can fetch the subpage branch for better test results.

Now the branch should pass all generic tests, except defrag and known
failures.
And no more random crashes during the tests.


Thanks, let me test it on PPC64 box.


I do see some failures remaining with the patch series.
However the one which is blocking my testing is tests/generic/095.
I see a kernel BUG hitting with the below signature.


That's pretty different from my tests.

As I haven't seen such BUG_ON() for a while.




Please let me know if this is a known failure.


#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION   -- btrfs_4k
FSTYP -- btrfs
PLATFORM  -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73 SMP Thu Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3


I see you're using -n 4096, not the default -n 16K, let me see if I can
reproduce that.

But from the backtrace, it doesn't look like the case,
as it happens in the data path, which means it's only related to sectorsize.



MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch



[ 6057.560580] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector size 4096 with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector size 4096 with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:171
[ 6061.378296] [ cut here ]
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c000260d7490]
   pc: c0a9370c: assertfail.constprop.11+0x34/0x48
   lr: c0a93708: assertfail.constprop.11+0x30/0x48
   sp: c000260d7730
      msr: 8282b033
     current = 0xc000260c0080
     paca    = 0xc0003fff8a00   irqmask: 0x03   irq_happened: 0x01
   pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@) (gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
enter ? for help
[c000260d7790] c0a90280 btrfs_subpage_assert.isra.9+0x70/0x110
[c000260d77b0] c0a91064 btrfs_subpage_set_uptodate+0x54/0x110
[c000260d7800] c09c6d0c btrfs_dirty_pages+0x1bc/0x2c0


This is very strange.
As in btrfs_dirty_pages(), the pages passed in are already prepared by
prepare_pages(), which means all of them should have Private set.

Can you reproduce the bug reliably?


Yes, almost reliably on my PPC box.



OK, I got it reproduced.

It's not a reliable BUG_ON(), but it can be reproduced.
The test gets skipped on all my boards as it requires the fio tool, thus I
didn't get it triggered in any previous runs.

I'll take a look into the case.


This exposed an interesting race window in btrfs_buffered_write():

  Writer                           | fadvise
 ----------------------------------+------------------------------
  btrfs_buffered_write()           |
  |- prepare_pages()               |
  |  |- Now all pages involved get |
  |     Private set                |
  |                                | btrfs_release_page()
  |                                | |- Clear page Private
  |- lock_extent()                 |
  |  |- This would prevent         |
  |     btrfs_release_page() to    |
  |     clear the page Private     |
  |
  |- btrfs_dirty_page()
     |- Will trigger the BUG_ON()



Sorry about the silly query, but help me understand: how is the above race possible?
Won't prepare_pages() lock all the pages first? The same requirement
of a locked page should apply to btrfs_releasepage() too, no?


A releasepage() call can easily get a locked page and release it.

For call sites like btrfs_invalidatepage(), the page is already locked.

btrfs_releasepage() will not try to release the page if the extent is
locked.

[PATCH v2] btrfs-progs: mkfs: only output the warning if the sectorsize is not supported

2021-04-18 Thread Qu Wenruo
Currently mkfs.btrfs will output a warning message if the sectorsize is
not the same as the page size:
  WARNING: the filesystem may not be mountable, sectorsize 4096 doesn't match page size 65536

But since btrfs subpage support for 64K page size is coming, this
output is polluting the golden output of fstests, causing tons of false
alerts.

This patch teaches mkfs.btrfs to check
/sys/fs/btrfs/features/supported_sectorsizes to see whether the sector
size is supported.

The warning message above is then only output if the sector size is not
supported.

This patch will also introduce a new helper,
sysfs_open_global_feature_file() to make it more obvious which global
feature file we're opening.

Signed-off-by: Qu Wenruo 
---
changelog:
v2:
- Introduce a new helper to open global feature files
- Extract the supported sectorsize check into its own function
- Do a proper token check instead of strstr()
- Fix the bug that we're passing @page_size to the check
---
 common/fsfeatures.c | 49 -
 common/utils.c  | 15 ++
 common/utils.h  |  1 +
 3 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 569208a9e5b1..6641c44dfa45 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -327,8 +327,50 @@ u32 get_running_kernel_version(void)
 
return version;
 }
+
+/*
+ * The buffer size should be strlen("4096 8192 16384 32768 65536"),
+ * which is 28, then we just round it up to 32.
+ */
+#define SUPPORTED_SECTORSIZE_BUF_SIZE  32
+
+/*
+ * Check if the current kernel supports given sectorsize.
+ *
+ * Return true if the sectorsize is supported.
+ * Return false otherwise.
+ */
+static bool check_supported_sectorsize(u32 sectorsize)
+{
+   char supported_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   char sectorsize_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   char *this_char;
+   char *save_ptr = NULL;
+   int fd;
+   int ret;
+
+   fd = sysfs_open_global_feature_file("supported_sectorsizes");
+   if (fd < 0)
+   return false;
+   ret = sysfs_read_file(fd, supported_buf, SUPPORTED_SECTORSIZE_BUF_SIZE);
+   close(fd);
+   if (ret < 0)
+   return false;
+   snprintf(sectorsize_buf, SUPPORTED_SECTORSIZE_BUF_SIZE,
+"%u", sectorsize);
+
+   for (this_char = strtok_r(supported_buf, " ", &save_ptr);
+this_char != NULL;
+this_char = strtok_r(NULL, ",", &save_ptr)) {
+   if (!strncmp(this_char, sectorsize_buf, strlen(sectorsize_buf)))
+   return true;
+   }
+   return false;
+}
+
 int btrfs_check_sectorsize(u32 sectorsize)
 {
+   bool sectorsize_checked = false;
u32 page_size = (u32)sysconf(_SC_PAGESIZE);
 
if (!is_power_of_2(sectorsize)) {
@@ -340,7 +382,12 @@ int btrfs_check_sectorsize(u32 sectorsize)
  sectorsize);
return -EINVAL;
}
-   if (page_size != sectorsize)
+   if (page_size == sectorsize)
+   sectorsize_checked = true;
+   else
+   sectorsize_checked = check_supported_sectorsize(sectorsize);
+
+   if (!sectorsize_checked)
warning(
 "the filesystem may not be mountable, sectorsize %u doesn't match page size 
%u",
sectorsize, page_size);
diff --git a/common/utils.c b/common/utils.c
index 57e41432c8fb..e8b35879f19f 100644
--- a/common/utils.c
+++ b/common/utils.c
@@ -2205,6 +2205,21 @@ int sysfs_open_fsid_file(int fd, const char *filename)
return open(sysfs_file, O_RDONLY);
 }
 
+/*
+ * Open a file in global btrfs features directory and return the file
+ * descriptor or error.
+ */
+int sysfs_open_global_feature_file(const char *feature_name)
+{
+   char path[PATH_MAX];
+   int ret;
+
+   ret = path_cat_out(path, "/sys/fs/btrfs/features", feature_name);
+   if (ret < 0)
+   return ret;
+   return open(path, O_RDONLY);
+}
+
 /*
  * Read up to @size bytes to @buf from @fd
  */
diff --git a/common/utils.h b/common/utils.h
index c38bdb08077c..d2f6416a9b5a 100644
--- a/common/utils.h
+++ b/common/utils.h
@@ -169,6 +169,7 @@ char *btrfs_test_for_multiple_profiles(int fd);
 int btrfs_warn_multiple_profiles(int fd);
 
 int sysfs_open_fsid_file(int fd, const char *filename);
+int sysfs_open_global_feature_file(const char *feature_name);
 int sysfs_read_file(int fd, char *buf, size_t size);
 
 #endif
-- 
2.31.1



Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-18 Thread Qu Wenruo




On 2021/4/19 1:59 PM, riteshh wrote:

On 21/04/16 10:22PM, riteshh wrote:

On 21/04/16 02:14PM, Qu Wenruo wrote:



On 2021/4/16 1:50 PM, riteshh wrote:

On 21/04/16 09:34AM, Qu Wenruo wrote:



On 2021/4/16 7:34 AM, Qu Wenruo wrote:



On 2021/4/16 7:19 AM, Qu Wenruo wrote:



On 2021/4/15 10:52 PM, riteshh wrote:

On 21/04/15 09:14AM, riteshh wrote:

On 21/04/12 07:33PM, Qu Wenruo wrote:

Good news, you can fetch the subpage branch for better test results.

Now the branch should pass all generic tests, except defrag and known
failures.
And no more random crash during the tests.


Thanks, let me test it on PPC64 box.


I do see some failures remaining with the patch series.
However the one which is blocking my testing is tests/generic/095.
I see a kernel BUG hitting with the below signature.


That's pretty different from my tests.

As I haven't seen such BUG_ON() for a while.




Please let me know if this is a known failure.


#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION   -- btrfs_4k
FSTYP -- btrfs
PLATFORM  -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
SMP Thu Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3


I see you're using -n 4096, not the default -n 16K, let me see if I can
reproduce that.

But from the backtrace, it doesn't look like the case,
as it happens in the data path, which means it's only related to sectorsize.


MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch



[ 6057.560580] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector
size 4096 with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) && page->private,
in fs/btrfs/subpage.c:171
[ 6061.378296] [ cut here ]
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c000260d7490]
   pc: c0a9370c: assertfail.constprop.11+0x34/0x48
   lr: c0a93708: assertfail.constprop.11+0x30/0x48
   sp: c000260d7730
      msr: 8282b033
     current = 0xc000260c0080
     paca    = 0xc0003fff8a00   irqmask: 0x03   irq_happened: 0x01
   pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@) (gcc
(Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
enter ? for help
[c000260d7790] c0a90280
btrfs_subpage_assert.isra.9+0x70/0x110
[c000260d77b0] c0a91064
btrfs_subpage_set_uptodate+0x54/0x110
[c000260d7800] c09c6d0c btrfs_dirty_pages+0x1bc/0x2c0


This is very strange.
As in btrfs_dirty_pages(), the pages passed in are already prepared by
prepare_pages(), which means all of them should have Private set.

Can you reproduce the bug reliably?


Yes, almost reliably on my PPC box.



OK, I got it reproduced.

It's not a reliable BUG_ON(), but it can be reproduced.
The test gets skipped on all my boards as it requires the fio tool, thus I
didn't get it triggered in any previous runs.

I'll take a look into the case.


This exposed an interesting race window in btrfs_buffered_write():

  Writer                           | fadvise
 ----------------------------------+------------------------------
  btrfs_buffered_write()           |
  |- prepare_pages()               |
  |  |- Now all pages involved get |
  |     Private set                |
  |                                | btrfs_release_page()
  |                                | |- Clear page Private
  |- lock_extent()                 |
  |  |- This would prevent         |
  |     btrfs_release_page() to    |
  |     clear the page Private     |
  |
  |- btrfs_dirty_page()
     |- Will trigger the BUG_ON()



Sorry about the silly query, but help me understand: how is the above race possible?
Won't prepare_pages() lock all the pages first? The same requirement
of a locked page should apply to btrfs_releasepage() too, no?


A releasepage() call can easily get a locked page and release it.

For call sites like btrfs_invalidatepage(), the page is already locked.

btrfs_releasepage() will not try to release the page if the extent is
locked (any extent range inside the page has EXTENT_LOCKED set).

[PATCH U-boot v2] fs: btrfs: fix the false alert of decompression failure

2021-04-17 Thread Qu Wenruo
There are some cases where decompressed sectors can have padding zeros.

In kernel code, we have lines to address such a situation:

/*
 * btrfs_getblock is doing a zero on the tail of the page too,
 * but this will cover anything missing from the decompressed
 * data.
 */
if (bytes < destlen)
memset(kaddr+bytes, 0, destlen-bytes);
kunmap_local(kaddr);

But not in the U-Boot code, thus we have some reports of U-Boot failing to
read compressed files on btrfs.

Fix it by doing the same thing as the kernel, for both inline and
regular compressed extents.

Reported-by: Matwey Kornilov 
Link: https://bugzilla.suse.com/show_bug.cgi?id=1183717
Fixes: a26a6bedafcf ("fs: btrfs: Introduce btrfs_read_extent_inline() and 
btrfs_read_extent_reg()")
Signed-off-by: Qu Wenruo 
---
Changelog:
v2:
- Fix the bug for regular and inline compressed extents
---
 fs/btrfs/inode.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 019d532a1a4b..2c2379303d74 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -390,10 +390,16 @@ int btrfs_read_extent_inline(struct btrfs_path *path,
   csize);
ret = btrfs_decompress(btrfs_file_extent_compression(leaf, fi),
   cbuf, csize, dbuf, dsize);
-   if (ret < 0 || ret != dsize) {
+   if (ret == (u32)-1) {
ret = -EIO;
goto out;
}
+   /*
+* The compressed data ends before the sector boundary; the remainder
+* needs to be zeroed out.
+*/
+   if (ret < dsize)
+   memset(dbuf + ret, 0, dsize - ret);
memcpy(dest, dbuf, dsize);
ret = dsize;
 out:
@@ -494,10 +500,16 @@ int btrfs_read_extent_reg(struct btrfs_path *path,
 
ret = btrfs_decompress(btrfs_file_extent_compression(leaf, fi), cbuf,
   csize, dbuf, dsize);
-   if (ret != dsize) {
+   if (ret == (u32)-1) {
ret = -EIO;
goto out;
}
+   /*
+* The compressed data ends before the sector boundary; the remainder
+* needs to be zeroed out.
+*/
+   if (ret < dsize)
+   memset(dbuf + ret, 0, dsize - ret);
/* Then copy the needed part */
memcpy(dest, dbuf + btrfs_file_extent_offset(leaf, fi), len);
ret = len;
-- 
2.31.1



[PATCH UBoot] fs: btrfs: fix the false alert of decompression failure

2021-04-17 Thread Qu Wenruo
There are some cases where decompressed sectors can have padding zeros.

In the kernel code, we have lines to address such a situation:

/*
 * btrfs_getblock is doing a zero on the tail of the page too,
 * but this will cover anything missing from the decompressed
 * data.
 */
if (bytes < destlen)
memset(kaddr+bytes, 0, destlen-bytes);
kunmap_local(kaddr);

But not in the U-Boot code, thus we have some reports of U-Boot failing to
read compressed files on btrfs.

Fix it by doing the same thing as the kernel.

Reported-by: Matwey Kornilov 
Link: https://bugzilla.suse.com/show_bug.cgi?id=1183717
Fixes: a26a6bedafcf ("fs: btrfs: Introduce btrfs_read_extent_inline() and 
btrfs_read_extent_reg()")
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 019d532a1a4b..f780c53d5250 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -390,10 +390,16 @@ int btrfs_read_extent_inline(struct btrfs_path *path,
   csize);
ret = btrfs_decompress(btrfs_file_extent_compression(leaf, fi),
   cbuf, csize, dbuf, dsize);
-   if (ret < 0 || ret != dsize) {
+   if (ret == (u32)-1) {
ret = -EIO;
goto out;
}
+   /*
+* The compressed data ends before the sector boundary; the remainder
+* needs to be zeroed out.
+*/
+   if (ret < dsize)
+   memset(dbuf + ret, 0, dsize - ret);
memcpy(dest, dbuf, dsize);
ret = dsize;
 out:
-- 
2.31.1



Re: read time tree block corruption detected

2021-04-16 Thread Qu Wenruo




On 2021/4/17 上午3:35, Gervais, Francois wrote:

We are using btrfs on one of our embedded devices and we got filesystem 
corruption on one of them.

This product undergoes a lot of testing on our side and apparently this is the
first time it has happened, so it seems to be a pretty rare occurrence. However,
we still want to get to the bottom of this to ensure it doesn't happen in the future.

Some background:
- The corruption happened on kernel v5.4.72.
- On the debug device I'm on master (v5.12.0-rc7) hoping it might help to have 
all the latest patches.

Here is what kernel v5.12.0-rc7 tells me when trying to mount the partition:

Apr 16 19:31:45 buildroot kernel: BTRFS info (device loop0p3): disk space 
caching is enabled
Apr 16 19:31:45 buildroot kernel: BTRFS info (device loop0p3): has skinny 
extents
Apr 16 19:31:45 buildroot kernel: BTRFS info (device loop0p3): start tree-log 
replay
Apr 16 19:31:45 buildroot kernel: BTRFS critical (device loop0p3): corrupt 
leaf: root=18446744073709551610 block=790151168 slot=5 ino=5007, inode ref 
overflow, ptr 15853 end 15861 namelen 294


Please provide the following dump:
 #btrfs ins dump-tree -b 18446744073709551610 /dev/loop0p3

I'm wondering why the write-time tree-checker didn't catch it.

Thanks,
Qu

Apr 16 19:31:45 buildroot kernel: BTRFS error (device loop0p3): block=790151168 
read time tree block corruption detected
Apr 16 19:31:45 buildroot kernel: BTRFS critical (device loop0p3): corrupt 
leaf: root=18446744073709551610 block=790151168 slot=5 ino=5007, inode ref 
overflow, ptr 15853 end 15861 namelen 294
Apr 16 19:31:45 buildroot kernel: BTRFS error (device loop0p3): block=790151168 
read time tree block corruption detected
Apr 16 19:31:45 buildroot kernel: BTRFS: error (device loop0p3) in 
btrfs_recover_log_trees:6246: errno=-5 IO failure (Couldn't read tree log root.)
Apr 16 19:31:45 buildroot kernel: BTRFS: error (device loop0p3) in 
btrfs_replay_log:2341: errno=-5 IO failure (Failed to recover log tree)
Apr 16 19:31:45 buildroot e512c123daaa[468]: mount: /root/mnt: can't read 
superblock on /dev/loop0p3.
Apr 16 19:31:45 buildroot kernel: BTRFS error (device loop0p3): open_ctree 
failed: -5

Any suggestions?



Re: [PATCH] btrfs-progs: mkfs: only output the warning if the sectorsize is not supported

2021-04-16 Thread Qu Wenruo




On 2021/4/17 上午2:14, Boris Burkov wrote:

On Thu, Apr 15, 2021 at 01:30:11PM +0800, Qu Wenruo wrote:

Currently mkfs.btrfs will output a warning message if the sectorsize is
not the same as page size:
   WARNING: the filesystem may not be mountable, sectorsize 4096 doesn't match 
page size 65536

But since btrfs subpage support for 64K page size is coming, this
output is polluting the golden output of fstests, causing tons of false
alerts.

This patch teaches mkfs.btrfs to check
/sys/fs/btrfs/features/supported_sectorsizes, and compare whether the
sector size is supported.

Then only output above warning message if the sector size is not
supported.

Signed-off-by: Qu Wenruo 
---
  common/fsfeatures.c | 36 +++-
  1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 569208a9e5b1..13b775da9c72 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -16,6 +16,8 @@

  #include "kerncompat.h"
  #include 
+#include 
+#include 
  #include 
  #include 
  #include "common/fsfeatures.h"
@@ -327,8 +329,15 @@ u32 get_running_kernel_version(void)

return version;
  }
+
+/*
+ * The buffer size if strlen("4096 8192 16384 32768 65536"),
+ * which is 28, then round up to 32.


I think there is a typo in this comment, because it doesn't quite parse.


I mean the strlen() is 28, then the value is rounded up to 32.

Any better alternative?





+ */
+#define SUPPORTED_SECTORSIZE_BUF_SIZE  32
  int btrfs_check_sectorsize(u32 sectorsize)
  {
+   bool sectorsize_checked = false;
u32 page_size = (u32)sysconf(_SC_PAGESIZE);

if (!is_power_of_2(sectorsize)) {
@@ -340,7 +349,32 @@ int btrfs_check_sectorsize(u32 sectorsize)
  sectorsize);
return -EINVAL;
}
-   if (page_size != sectorsize)
+   if (page_size == sectorsize) {
+   sectorsize_checked = true;
+   } else {
+   /*
+* Check if the sector size is supported
+*/
+   char supported_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   char sectorsize_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   int fd;
+   int ret;
+
+   fd = open("/sys/fs/btrfs/features/supported_sectorsizes",
+ O_RDONLY);
+   if (fd < 0)
+   goto out;
+   ret = read(fd, supported_buf, sizeof(supported_buf));
+   close(fd);
+   if (ret < 0)
+   goto out;
+   snprintf(sectorsize_buf, SUPPORTED_SECTORSIZE_BUF_SIZE,
+"%u", page_size);
+   if (strstr(supported_buf, sectorsize_buf))
+   sectorsize_checked = true;


Two comments here.
1: I think we should be checking sectorsize against the file rather than
page_size.


Damn it, all my bad.


2: strstr seems too permissive, since it doesn't have a notion of
tokens. If not for the power_of_2 check above, we would admit all kinds
of silly things like 409. But even with it, we would permit "4" now and
with your example from the comment, "8", "16", and "32".


Indeed I took a shortcut here.

It's indeed not elegant; I'll change it to use ' ' as the token separator,
parse each value, and compare it to the sector size.

Thanks,
Qu



+   }
+out:
+   if (!sectorsize_checked)
warning(
  "the filesystem may not be mountable, sectorsize %u doesn't match page size 
%u",
sectorsize, page_size);


Do you have plans to change the contents of this string to match the new
meaning of the check, or is that too harmful to testing/automation?


--
2.31.1



Re: [PATCH 3/3] btrfs-progs: misc-tests: add test to ensure the restored image can be mounted

2021-04-16 Thread Qu Wenruo




On 2021/4/17 上午1:46, David Sterba wrote:

On Fri, Mar 26, 2021 at 08:50:47PM +0800, Qu Wenruo wrote:

This new test case is to make sure the restored image file has been
properly enlarged so that newer kernels won't complain.

Signed-off-by: Qu Wenruo 
---
  .../047-image-restore-mount/test.sh   | 19 +++
  1 file changed, 19 insertions(+)
  create mode 100755 tests/misc-tests/047-image-restore-mount/test.sh

diff --git a/tests/misc-tests/047-image-restore-mount/test.sh 
b/tests/misc-tests/047-image-restore-mount/test.sh
new file mode 100755
index ..7f12afa2bab6
--- /dev/null
+++ b/tests/misc-tests/047-image-restore-mount/test.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Verify that the restored image of an empty btrfs can still be mounted

 ^

I've seen that in patches and comments the use of the word 'btrfs' instead
of 'filesystem' sounds a bit inappropriate to me, so I change it
whenever I see it. It's perhaps a matter of taste and style; one can write
it also as 'btrfs filesystem', but that may belong to some more polished
documentation, so you can go with just 'filesystem'.


Thanks for pointing this out.

I'll use 'filesystem' from now on.

Thanks,
Qu


Re: [PATCH 2/3] btrfs-progs: image: enlarge the output file if no tree modification is needed for restore

2021-04-16 Thread Qu Wenruo




On 2021/4/17 上午1:40, David Sterba wrote:

On Fri, Mar 26, 2021 at 08:50:46PM +0800, Qu Wenruo wrote:

[BUG]
If restoring a dumped image into a new file, in most cases the kernel will
reject it:

  # mkfs.btrfs -f /dev/test/test
  # btrfs-image /dev/test/test /tmp/dump
  # btrfs-image -r /tmp/dump ~/test.img
  # mount ~/test.img /mnt/btrfs
  mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, 
missing codepage or helper program, or other error.
  # dmesg -t | tail -n 7
  loop0: detected capacity change from 10592 to 0
  BTRFS info (device loop0): disk space caching is enabled
  BTRFS info (device loop0): has skinny extents
  BTRFS info (device loop0): flagging fs with big metadata feature
  BTRFS error (device loop0): device total_bytes should be at most 5423104 but 
found 10737418240
  BTRFS error (device loop0): failed to read chunk tree: -22
  BTRFS error (device loop0): open_ctree failed

[CAUSE]
When btrfs-image restores an image into a file and the source image
contains only a single device, we don't need to modify the
chunk/device tree, as we can reuse the existing chunk/dev tree without
any problem.

This also means that, for such a restore, we won't enlarge the target
file. This behavior itself was fine, as at that time the kernel didn't
check whether the device was smaller than the device size recorded in the
device tree.

But the later kernel commit 3a160a933111 ("btrfs: drop never met disk total
bytes check in verify_one_dev_extent") introduced a new check on device
size at mount time, rejecting any loop file which is smaller than the
original device size.

[FIX]
Enlarge the target file for a single-device restore as well.

Reported-by: Nikolay Borisov 
Signed-off-by: Qu Wenruo 
---
  image/main.c | 43 +++
  1 file changed, 43 insertions(+)

diff --git a/image/main.c b/image/main.c
index 24393188e5e3..9933f69d0fdb 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2706,6 +2706,49 @@ static int restore_metadump(const char *input, FILE 
*out, int old_restore,
close_ctree(info->chunk_root);
if (ret)
goto out;
+   } else {
+   struct btrfs_root *root;
+   struct stat st;
+   u64 dev_size;
+
+   if (!info) {
+   root = open_ctree_fd(fileno(out), target, 0, 0);
+   if (!root) {
+   error("open ctree failed in %s", target);
+   ret = -EIO;
+   goto out;
+   }
+
+   info = root->fs_info;
+
+   dev_size = btrfs_stack_device_total_bytes(
+   &info->super_copy->dev_item);
+   close_ctree(root);
+   info = NULL;
+   } else {
+   dev_size = btrfs_stack_device_total_bytes(
+   &info->super_copy->dev_item);
+   }
+
+   /*
+* We don't need extra tree modification, but if the output is
+* a file, we need to enlarge the output file so that
+* newer kernel won't report error.
+*/
+   ret = fstat(fileno(out), &st);
+   if (ret < 0) {
+   error("failed to stat result image: %m");
+   ret = -errno;
+   goto out;
+   }
+   if (S_ISREG(st.st_mode)) {
+   ret = ftruncate64(fileno(out), dev_size);


This truncates the file unconditionally, so if the file is larger than
required, I don't think it's necessary to do it.


Indeed, I'll update the patchset to do conditional truncation.

Thanks,
Qu


Re: [PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage()

2021-04-16 Thread Qu Wenruo



On 2021/4/16 下午11:13, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

Just like the page read path, for subpage support we only require sector
size alignment.

So change the error message condition to only require sector alignment.

This should not affect existing code, as for regular sectorsize ==
PAGE_SIZE case, we are still requiring page alignment.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/extent_io.c | 29 -
  1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 53ac22e3560f..94f8b3ffe6a7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2779,25 +2779,20 @@ static void end_bio_extent_writepage(struct 
bio *bio)

  struct page *page = bvec->bv_page;
  struct inode *inode = page->mapping->host;
  struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+    const u32 sectorsize = fs_info->sectorsize;
-    /* We always issue full-page reads, but if some block
- * in a page fails to read, blk_update_request() will
- * advance bv_offset and adjust bv_len to compensate.
- * Print a warning for nonzero offsets, and an error
- * if they don't add up to a full page.  */
-    if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-    if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-    btrfs_err(fs_info,
-   "partial page write in btrfs with offset %u and 
length %u",

-    bvec->bv_offset, bvec->bv_len);
-    else
-    btrfs_info(fs_info,
-   "incomplete page write in btrfs with offset %u and 
length %u",

-    bvec->bv_offset, bvec->bv_len);
-    }
+    /* Btrfs read write should always be sector aligned. */
+    if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+    btrfs_err(fs_info,
+    "partial page write in btrfs with offset %u and length %u",
+  bvec->bv_offset, bvec->bv_len);
+    else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+    btrfs_info(fs_info,
+    "incomplete page write with offset %u and length %u",
+   bvec->bv_offset, bvec->bv_len);
-    start = page_offset(page);
-    end = start + bvec->bv_offset + bvec->bv_len - 1;
+    start = page_offset(page) + bvec->bv_offset;
+    end = start + bvec->bv_len - 1;


Does this bit work out for you now?


At least the generic test group passes here on my arm board.

  Because before start was just the 
page offset.  Clearly the way it was before is a bug (I think?), because 
it gets used in btrfs_writepage_endio_finish_ordered() with the 
start+len, so you really do want start = page_offset(page) + bv_offset.  


Not a bug before: for the sectorsize == PAGE_SIZE case, every bvec has 
bv_offset == 0 and bv_len == PAGE_SIZE.


Thanks,
Qu

But this is a behavior change that warrants a patch of its own, as it's 
unrelated to the sectorsize change. (Yes, I realize I'm asking for more 
patches in an already huge series; yes, I'm insane.) Thanks,


Josef


Re: [PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig()

2021-04-16 Thread Qu Wenruo



On 2021/4/16 下午10:58, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

As a preparation for incoming subpage support, we need bytenr passed to
__process_pages_contig() directly, not the current page index.

So change the parameter and all callers to pass bytenr in.

With the modification, here we need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.

Normally we follow the inclusive return value, meaning @processed_end
should be the last byte we processed.

If parameter @start is 0, and we failed to lock any page, then we would
return @processed_end as -1, causing more problems for
__unlock_for_delalloc().

So here for @processed_end, we use two different return value patterns:
if we have locked any page, @processed_end will be the last byte of the
locked pages; otherwise it will be @start.

This change will impact lock_delalloc_pages(), so it needs to check
@processed_end to only unlock the range if we have locked any pages.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/extent_io.c | 57 
  1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ac01f29b00c9..ff24db8513b4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1807,8 +1807,8 @@ bool btrfs_find_delalloc_range(struct 
extent_io_tree *tree, u64 *start,

  static int __process_pages_contig(struct address_space *mapping,
    struct page *locked_page,
-  pgoff_t start_index, pgoff_t end_index,
-  unsigned long page_ops, pgoff_t *index_ret);
+  u64 start, u64 end, unsigned long page_ops,
+  u64 *processed_end);
  static noinline void __unlock_for_delalloc(struct inode *inode,
 struct page *locked_page,
@@ -1821,7 +1821,7 @@ static noinline void 
__unlock_for_delalloc(struct inode *inode,

  if (index == locked_page->index && end_index == index)
  return;
-    __process_pages_contig(inode->i_mapping, locked_page, index, 
end_index,

+    __process_pages_contig(inode->i_mapping, locked_page, start, end,
 PAGE_UNLOCK, NULL);
  }
@@ -1831,19 +1831,19 @@ static noinline int lock_delalloc_pages(struct 
inode *inode,

  u64 delalloc_end)
  {
  unsigned long index = delalloc_start >> PAGE_SHIFT;
-    unsigned long index_ret = index;
  unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+    u64 processed_end = delalloc_start;
  int ret;
  ASSERT(locked_page);
  if (index == locked_page->index && index == end_index)
  return 0;
-    ret = __process_pages_contig(inode->i_mapping, locked_page, index,
- end_index, PAGE_LOCK, &index_ret);
-    if (ret == -EAGAIN)
+    ret = __process_pages_contig(inode->i_mapping, locked_page, 
delalloc_start,

+ delalloc_end, PAGE_LOCK, &processed_end);
+    if (ret == -EAGAIN && processed_end > delalloc_start)
  __unlock_for_delalloc(inode, locked_page, delalloc_start,
-  (u64)index_ret << PAGE_SHIFT);
+  processed_end);
  return ret;
  }
@@ -1938,12 +1938,14 @@ noinline_for_stack bool 
find_lock_delalloc_range(struct inode *inode,

  static int __process_pages_contig(struct address_space *mapping,
    struct page *locked_page,
-  pgoff_t start_index, pgoff_t end_index,
-  unsigned long page_ops, pgoff_t *index_ret)
+  u64 start, u64 end, unsigned long page_ops,
+  u64 *processed_end)
  {
+    pgoff_t start_index = start >> PAGE_SHIFT;
+    pgoff_t end_index = end >> PAGE_SHIFT;
+    pgoff_t index = start_index;
  unsigned long nr_pages = end_index - start_index + 1;
  unsigned long pages_processed = 0;
-    pgoff_t index = start_index;
  struct page *pages[16];
  unsigned ret;
  int err = 0;
@@ -1951,17 +1953,19 @@ static int __process_pages_contig(struct 
address_space *mapping,

  if (page_ops & PAGE_LOCK) {
  ASSERT(page_ops == PAGE_LOCK);
-    ASSERT(index_ret && *index_ret == start_index);
+    ASSERT(processed_end && *processed_end == start);
  }
  if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
  mapping_set_error(mapping, -EIO);
  while (nr_pages > 0) {
-    ret = find_get_pages_contig(mapping, index,
+    int found_pages;
+
+    found_pages = find_get_pages_contig(mapping, index,
   min_t(unsigned long,
   nr_pages, ARRAY_SIZE(pages)), pages);
-    if (ret == 0) {
+    if (found_pages == 0) {
  /*
   * Only if we're going to lock these pages,
   * can we find nothing at @index.
@@ -2004,13 +2008,27 @@ stat

Re: [PATCH 11/42] btrfs: refactor btrfs_invalidatepage()

2021-04-16 Thread Qu Wenruo



On 2021/4/16 下午10:42, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

This patch will refactor btrfs_invalidatepage() for the incoming subpage
support.

The involved modifications are:
- Use a while() loop instead of "goto again;"
- Use a single variable to determine whether to delete extent states
   Each branch also has comments on why we can or cannot delete the
   extent states
- Do qgroup freeing and extent state deletion per-loop
   The current code can only work for the PAGE_SIZE == sectorsize case.

This refactor also makes it clear what we do for different sectors:
- Sectors without an ordered extent
   We're completely safe to remove all extent states for the sector(s)

- Sectors with an ordered extent, but no Private2 bit
   This means the endio has already been executed; we can't remove all
   extent states for the sector(s).

- Sectors with an ordered extent that still has the Private2 bit
   This means we need to decrease the ordered extent accounting.
   And then it comes down to two different variants:
   * We have finished and removed the ordered extent
 Then it's the same as "sectors without an ordered extent"
   * We didn't finish the ordered extent
 We can remove some extent states, but not all.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/inode.c | 173 +--
  1 file changed, 94 insertions(+), 79 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4c894de2e813..93bb7c0482ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8320,15 +8320,12 @@ static void btrfs_invalidatepage(struct page 
*page, unsigned int offset,

  {
  struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
  struct extent_io_tree *tree = &inode->io_tree;
-    struct btrfs_ordered_extent *ordered;
  struct extent_state *cached_state = NULL;
  u64 page_start = page_offset(page);
  u64 page_end = page_start + PAGE_SIZE - 1;
-    u64 start;
-    u64 end;
+    u64 cur;
+    u32 sectorsize = inode->root->fs_info->sectorsize;
  int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
-    bool found_ordered = false;
-    bool completed_ordered = false;
  /*
   * We have page locked so no new ordered extent can be created on
@@ -8352,96 +8349,114 @@ static void btrfs_invalidatepage(struct page 
*page, unsigned int offset,

  if (!inode_evicting)
  lock_extent_bits(tree, page_start, page_end, &cached_state);
-    start = page_start;
-again:
-    ordered = btrfs_lookup_ordered_range(inode, start, page_end - 
start + 1);

-    if (ordered) {
-    found_ordered = true;
-    end = min(page_end,
-  ordered->file_offset + ordered->num_bytes - 1);
+    cur = page_start;
+    while (cur < page_end) {
+    struct btrfs_ordered_extent *ordered;
+    bool delete_states = false;
+    u64 range_end;
+
+    /*
+ * Here we can't pass "file_offset = cur" and
+ * "len = page_end + 1 - cur", as btrfs_lookup_ordered_range()
+ * may not return the first ordered extent after @file_offset.
+ *
+ * Here we want to iterate through the range in byte order.
+ * This is slower but definitely correct.
+ *
+ * TODO: Make btrfs_lookup_ordered_range() to return the
+ * first ordered extent in the range to reduce the number
+ * of loops.
+ */
+    ordered = btrfs_lookup_ordered_range(inode, cur, sectorsize);


How does it not find the first ordered extent after file_offset?  
Looking at the code it just loops through and returns the first thing it 
finds that overlaps our range.  Is there a bug in 
btrfs_lookup_ordered_range()?


btrfs_lookup_ordered_range() does two search:
node = tree_search(tree, file_offset);
if (!node) {
node = tree_search(tree, file_offset + len);
}

That means for the following seach pattern, it will not return the first OE:
start   end
|   |///|   |///|   |




We should add some self tests to make sure these helpers are doing the 
right thing if there is in fact a bug.


It's not a bug, as most call sites of btrfs_lookup_ordered_range() will 
wait for the ordered extent to finish, then re-search until all ordered 
extents are exhausted.


In that case, they don't care about the order of the returned OEs.

It's really the first time we need a specific order.

Since you're already complaining, I guess I'll either add a new function 
or make the existing one follow the order.





+    if (!ordered) {
+    range_end = cur + sectorsize - 1;
+    /*
+ * No ordered extent covering this sector, we are safe
+ * to delete all extent states in the range.
+ */
+    delete_states = true;
+    goto next;
+    }
+
+    range_end = min(ordered->file_offset +

Re: [PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions

2021-04-16 Thread Qu Wenruo



On 2021/4/16 下午10:09, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

Btrfs has two endio functions to mark a certain IO range as finished for
ordered extents:
- __endio_write_update_ordered()
   This is for direct IO

- btrfs_writepage_endio_finish_ordered()
   This for buffered IO.

However, they take different routes to handle ordered extent IO:
- Whether to iterate through all ordered extents
   __endio_write_update_ordered() will, but
   btrfs_writepage_endio_finish_ordered() will not.

   In fact, iterating through all ordered extents will benefit the later
   subpage support, while for the current PAGE_SIZE == sectorsize requirement
   the behavior makes no difference.

- Whether to update the page Private2 flag
   __endio_write_update_ordered() won't update the page Private2 flag, as
   for iomap direct IO the page may not even be mapped.
   While btrfs_writepage_endio_finish_ordered() clears Private2 to
   prevent double accounting against btrfs_invalidatepage().

Those differences are pretty small, and the ordered extent iteration
code in the callers makes the code much harder to read.

So this patch will introduce a new function,
btrfs_mark_ordered_io_finished(), to do the heavy lifting work:
- Iterate through all ordered extents in the range
- Do the ordered extent accounting
- Queue the work for finished ordered extent

This function has two new features:
- Proper underflow detection and recovery
   The old underflow detection would only detect the problem, then
   continue.
   It gave no proper info like root/inode/ordered extent details, nor was
   it noisy enough to be caught by fstests.

   Furthermore, when an underflow happened, the ordered extent would never
   finish.

   The new error detection resets bytes_left to 0, does a proper
   kernel warning, and outputs extra info including the root, ino, ordered
   extent range, and the underflow value.
- Prevent double accounting based on the Private2 flag
   Now if we find a range without the Private2 flag, we skip to the next
   range, as that means someone else has already finished the accounting
   of the ordered extent.
   This makes no difference for the current code, but will be a critical
   part of the incoming subpage support.

Now both endio functions only need to call that new function.

And since the only caller of btrfs_dec_test_first_ordered_pending() is
removed, also remove btrfs_dec_test_first_ordered_pending() completely.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/inode.c    |  55 +---
  fs/btrfs/ordered-data.c | 179 +++-
  fs/btrfs/ordered-data.h |   8 +-
  3 files changed, 129 insertions(+), 113 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 752f0c78e1df..645097bff5a0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3063,25 +3063,11 @@ void 
btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,

    struct page *page, u64 start,
    u64 end, int uptodate)
  {
-    struct btrfs_fs_info *fs_info = inode->root->fs_info;
-    struct btrfs_ordered_extent *ordered_extent = NULL;
-    struct btrfs_workqueue *wq;
-
  ASSERT(end + 1 - start < U32_MAX);
  trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
-    ClearPagePrivate2(page);
-    if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
-    end - start + 1, uptodate))
-    return;
-
-    if (btrfs_is_free_space_inode(inode))
-    wq = fs_info->endio_freespace_worker;
-    else
-    wq = fs_info->endio_write_workers;
-
-    btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, 
NULL);

-    btrfs_queue_work(wq, &ordered_extent->work);
+    btrfs_mark_ordered_io_finished(inode, page, start, end + 1 - start,
+   finish_ordered_fn, uptodate);
  }
  /*
@@ -7959,42 +7945,9 @@ static void __endio_write_update_ordered(struct 
btrfs_inode *inode,

   const u64 offset, const u64 bytes,
   const bool uptodate)
  {
-    struct btrfs_fs_info *fs_info = inode->root->fs_info;
-    struct btrfs_ordered_extent *ordered = NULL;
-    struct btrfs_workqueue *wq;
-    u64 ordered_offset = offset;
-    u64 ordered_bytes = bytes;
-    u64 last_offset;
-
-    if (btrfs_is_free_space_inode(inode))
-    wq = fs_info->endio_freespace_worker;
-    else
-    wq = fs_info->endio_write_workers;
-
  ASSERT(bytes < U32_MAX);
-    while (ordered_offset < offset + bytes) {
-    last_offset = ordered_offset;
-    if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
- &ordered_offset,
- ordered_bytes,
- uptodate)) {
-    btrfs_init_work(&ordered->work, finish_ordered_fn, NULL,
-    NULL);
-    btrfs_queue_work(wq, &ordered->work);
-    }
-
-    /* No ordered extent found in the range

Re: [PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()

2021-04-16 Thread Qu Wenruo



On 2021/4/16 下午9:58, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().

It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.

Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered() and
make the @page parameter optional.

By this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.

Also, to accommodate this modification, replace the @page parameter of
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes the page_index info, the existing start/len should be
enough for most usage.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/compression.c   |  4 +---
  fs/btrfs/ctree.h |  3 ++-
  fs/btrfs/extent_io.c | 16 ++--
  fs/btrfs/inode.c |  9 +
  include/trace/events/btrfs.h | 19 ---
  5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2600703fab83..4fbe3e12be71 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -343,11 +343,9 @@ static void end_compressed_bio_write(struct bio 
*bio)

   * call back into the FS and do all the end_io operations
   */
  inode = cb->inode;
-    cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
-    btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0],
+    btrfs_writepage_endio_finish_ordered(BTRFS_I(inode), NULL,
  cb->start, cb->start + cb->len - 1,
  bio->bi_status == BLK_STS_OK);
-    cb->compressed_pages[0]->mapping = NULL;
  end_compressed_writeback(inode, cb);
  /* note, our inode could be gone now */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c858d5349c8..505bc6674bcc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3175,7 +3175,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode 
*inode, struct page *locked_page
  u64 start, u64 end, int *page_started, unsigned long 
*nr_written,

  struct writeback_control *wbc);
  int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
-void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
+void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
+  struct page *page, u64 start,
    u64 end, int uptodate);
  extern const struct dentry_operations btrfs_dentry_operations;
  extern const struct iomap_ops btrfs_dio_iomap_ops;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7d1fca9b87f0..6d712418b67b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2711,10 +2711,13 @@ blk_status_t btrfs_submit_read_repair(struct inode *inode,
   void end_extent_writepage(struct page *page, int err, u64 start, u64 end)

  {
+    struct btrfs_inode *inode;
  int uptodate = (err == 0);
  int ret = 0;
-    btrfs_writepage_endio_finish_ordered(page, start, end, uptodate);
+    ASSERT(page && page->mapping);
+    inode = BTRFS_I(page->mapping->host);
+    btrfs_writepage_endio_finish_ordered(inode, page, start, end, uptodate);

  if (!uptodate) {
  ClearPageUptodate(page);
@@ -3739,7 +3742,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,

  u32 iosize;
  if (cur >= i_size) {
-    btrfs_writepage_endio_finish_ordered(page, cur, end, 1);
+    btrfs_writepage_endio_finish_ordered(inode, page, cur,
+ end, 1);
  break;
  }
  em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
@@ -3777,8 +3781,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,

  if (compressed)
  nr++;
  else
-    btrfs_writepage_endio_finish_ordered(page, cur,
-    cur + iosize - 1, 1);
+    btrfs_writepage_endio_finish_ordered(inode,
+    page, cur, cur + iosize - 1, 1);
  cur += iosize;
  continue;
  }
@@ -4842,8 +4846,8 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,

  if (clear_page_dirty_for_io(page))
  ret = __extent_writepage(page, &wbc_writepages, &epd);
  else {
-    btrfs_writepage_endio_finish_ordered(page, start,
-    start + PAGE_SIZE - 1, 1);
+    btrfs_writepage_endio_finish_ordered(BTRFS_I(inode),
+    page, start, start + PAGE_SIZE - 1, 1);
  unlock_page(page);
  }
  put_page(page);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 554effbf307e..752f0c78e1df 100644
--- a/fs/btrfs/in

Re: [PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent

2021-04-16 Thread Qu Wenruo



On 2021/4/16 下午9:54, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

Unlike btrfs_file_extent_item, btrfs_ordered_extent has its length
limit (BTRFS_MAX_EXTENT_SIZE), which is far smaller than U32_MAX.

Using u64 for those length-related members is just a waste of memory.

This patch will make the following members u32:
- num_bytes
- disk_num_bytes
- bytes_left
- truncated_len

This will save 16 bytes for btrfs_ordered_extent structure.

The btrfs_add_ordered_extent*() call sites are mostly deep inside other
functions passing u64.
Thus this patch keeps those parameters as u64, but adds internal
ASSERT()s to ensure the correct length values are passed in.

For btrfs_dec_test_.*_ordered_extent() call sites, length-related
parameters are converted to u32, with extra ASSERT()s added to ensure
correct values are passed in.

A special conversion is needed in btrfs_remove_ordered_extent(), which
needs s64; using "-entry->num_bytes" directly on a u32 would cause an
underflow.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/inode.c    | 11 ---
  fs/btrfs/ordered-data.c | 21 ++---
  fs/btrfs/ordered-data.h | 25 ++---
  3 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 74ee34fc820d..554effbf307e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3066,6 +3066,7 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,

  struct btrfs_ordered_extent *ordered_extent = NULL;
  struct btrfs_workqueue *wq;
+    ASSERT(end + 1 - start < U32_MAX);
  trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
  ClearPagePrivate2(page);
@@ -7969,6 +7970,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,

  else
  wq = fs_info->endio_write_workers;
+    ASSERT(bytes < U32_MAX);
  while (ordered_offset < offset + bytes) {
  last_offset = ordered_offset;
  if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
@@ -8415,10 +8417,13 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,

  if (TestClearPagePrivate2(page)) {
  spin_lock_irq(&inode->ordered_tree.lock);
  set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-    ordered->truncated_len = min(ordered->truncated_len,
- start - ordered->file_offset);
+    ASSERT(start - ordered->file_offset < U32_MAX);
+    ordered->truncated_len = min_t(u32,
+    ordered->truncated_len,
+    start - ordered->file_offset);
  spin_unlock_irq(&inode->ordered_tree.lock);
+    ASSERT(end - start + 1 < U32_MAX);
  if (btrfs_dec_test_ordered_pending(inode, &ordered,
 start,
 end - start + 1, 1)) {
@@ -8937,7 +8942,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
  break;
  else {
  btrfs_err(root->fs_info,
-  "found ordered extent %llu %llu on inode cleanup",
+  "found ordered extent %llu %u on inode cleanup",
    ordered->file_offset, ordered->num_bytes);
  btrfs_remove_ordered_extent(inode, ordered);
  btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 07b0b4218791..8e6d9d906bdd 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -160,6 +160,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset

  struct btrfs_ordered_extent *entry;
  int ret;
+    /*
+ * Basic size check, all length related members should be smaller
+ * than U32_MAX.
+ */
+    ASSERT(num_bytes < U32_MAX && disk_num_bytes < U32_MAX);
+
  if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {

  /* For nocow write, we can release the qgroup rsv right now */
  ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
@@ -186,7 +192,7 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset

  entry->bytes_left = num_bytes;
  entry->inode = igrab(&inode->vfs_inode);
  entry->compress_type = compress_type;
-    entry->truncated_len = (u64)-1;
+    entry->truncated_len = (u32)-1;
  entry->qgroup_rsv = ret;
  entry->physical = (u64)-1;
  entry->disk = NULL;
@@ -320,7 +326,7 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,

   */
  bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
 struct btrfs_ordered_extent **finished_ret,
-   u64 *file_offset, u64 io_size, int uptodate)
+   u64 *file_offset, u32 io_size, int uptodate)
 

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-15 Thread Qu Wenruo




On 2021/4/16 下午1:50, riteshh wrote:

On 21/04/16 09:34AM, Qu Wenruo wrote:



On 2021/4/16 上午7:34, Qu Wenruo wrote:



On 2021/4/16 上午7:19, Qu Wenruo wrote:



On 2021/4/15 下午10:52, riteshh wrote:

On 21/04/15 09:14AM, riteshh wrote:

On 21/04/12 07:33PM, Qu Wenruo wrote:

Good news, you can fetch the subpage branch for better test results.

Now the branch should pass all generic tests, except defrag and known
failures.
And no more random crashes during the tests.


Thanks, let me test it on PPC64 box.


I do see some failures remaining with the patch series.
However, the one which is blocking my testing is tests/generic/095;
I see a kernel BUG hitting with the below signature.


That's pretty different from my tests.

As I haven't seen such BUG_ON() for a while.




Please let me know if this is a known failure?


#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION   -- btrfs_4k
FSTYP -- btrfs
PLATFORM  -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
SMP Thu Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3


I see you're using -n 4096, not the default -n 16K, let me see if I can
reproduce that.

But from the backtrace, it doesn't look like that's the case,
as it happens in the data path, which means it's only related to the sectorsize.


MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch



[ 6057.560580] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector
size 4096 with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) && page->private,
in fs/btrfs/subpage.c:171
[ 6061.378296] [ cut here ]
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c000260d7490]
  pc: c0a9370c: assertfail.constprop.11+0x34/0x48
  lr: c0a93708: assertfail.constprop.11+0x30/0x48
  sp: c000260d7730
     msr: 8282b033
    current = 0xc000260c0080
    paca    = 0xc0003fff8a00   irqmask: 0x03   irq_happened: 0x01
  pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@) (gcc
(Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
enter ? for help
[c000260d7790] c0a90280
btrfs_subpage_assert.isra.9+0x70/0x110
[c000260d77b0] c0a91064
btrfs_subpage_set_uptodate+0x54/0x110
[c000260d7800] c09c6d0c btrfs_dirty_pages+0x1bc/0x2c0


This is very strange.
As in btrfs_dirty_pages(), the pages passed in are already prepared by
prepare_pages(), which means all of them should have Private set.

Can you reproduce the bug reliably?


Yes, almost reliably on my PPC box.



OK, I got it reproduced.

It's not a reliable BUG_ON(), but can be reproduced.
The test gets skipped on all my boards as it requires the fio tool, thus I
didn't get it triggered in any previous runs.

I'll take a look into the case.


This exposed an interesting race window in btrfs_buffered_write():
        Writer                          | fadvise
   -------------------------------------+---------------------------
   btrfs_buffered_write()               |
   |- prepare_pages()                   |
   |  |- Now all pages involved get     |
   |     Private set                    |
   |                                    | btrfs_release_page()
   |                                    | |- Clear page Private
   |- lock_extent()                     |
   |  |- This would prevent             |
   |     btrfs_release_page() from      |
   |     clearing the page Private      |
   |                                    |
   |- btrfs_dirty_pages()               |
      |- Will trigger the BUG_ON()      |


Sorry about the silly query, but help me understand how the above race is
possible. Won't prepare_pages() lock all the pages first? The same
locked-page requirement should apply to btrfs_releasepage() too, no?


A releasepage() call can easily get a page locked and release it.

For call sites like btrfs_invalidatepage(), the page is already locked.

btrfs_releasepage() will not try to release the page if the extent is
locked (any extent range inside the page has the EXTENT_LOCKED bit set).




I see only two paths which could result in btrfs_releasepage():
1. one via try_to_release_pages -> rel

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-15 Thread Qu Wenruo




On 2021/4/16 上午7:34, Qu Wenruo wrote:



On 2021/4/16 上午7:19, Qu Wenruo wrote:



On 2021/4/15 下午10:52, riteshh wrote:

On 21/04/15 09:14AM, riteshh wrote:

On 21/04/12 07:33PM, Qu Wenruo wrote:

Good news, you can fetch the subpage branch for better test results.

Now the branch should pass all generic tests, except defrag and known
failures.
And no more random crashes during the tests.


Thanks, let me test it on PPC64 box.


I do see some failures remaining with the patch series.
However, the one which is blocking my testing is tests/generic/095;
I see a kernel BUG hitting with the below signature.


That's pretty different from my tests.

As I haven't seen such BUG_ON() for a while.




Please let me know if this is a known failure?


#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION   -- btrfs_4k
FSTYP -- btrfs
PLATFORM  -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
SMP Thu Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3


I see you're using -n 4096, not the default -n 16K, let me see if I can
reproduce that.

But from the backtrace, it doesn't look like that's the case,
as it happens in the data path, which means it's only related to the sectorsize.


MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch



[ 6057.560580] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector
size 4096 with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) && page->private,
in fs/btrfs/subpage.c:171
[ 6061.378296] [ cut here ]
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c000260d7490]
 pc: c0a9370c: assertfail.constprop.11+0x34/0x48
 lr: c0a93708: assertfail.constprop.11+0x30/0x48
 sp: c000260d7730
    msr: 8282b033
   current = 0xc000260c0080
   paca    = 0xc0003fff8a00   irqmask: 0x03   irq_happened: 0x01
 pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@) (gcc
(Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
enter ? for help
[c000260d7790] c0a90280
btrfs_subpage_assert.isra.9+0x70/0x110
[c000260d77b0] c0a91064
btrfs_subpage_set_uptodate+0x54/0x110
[c000260d7800] c09c6d0c btrfs_dirty_pages+0x1bc/0x2c0


This is very strange.
As in btrfs_dirty_pages(), the pages passed in are already prepared by
prepare_pages(), which means all of them should have Private set.

Can you reproduce the bug reliably?


OK, I got it reproduced.

It's not a reliable BUG_ON(), but can be reproduced.
The test gets skipped on all my boards as it requires the fio tool, thus I
didn't get it triggered in any previous runs.

I'll take a look into the case.


This exposed an interesting race window in btrfs_buffered_write():
        Writer                          | fadvise
   -------------------------------------+---------------------------
   btrfs_buffered_write()               |
   |- prepare_pages()                   |
   |  |- Now all pages involved get     |
   |     Private set                    |
   |                                    | btrfs_release_page()
   |                                    | |- Clear page Private
   |- lock_extent()                     |
   |  |- This would prevent             |
   |     btrfs_release_page() from      |
   |     clearing the page Private      |
   |                                    |
   |- btrfs_dirty_pages()               |
      |- Will trigger the BUG_ON()      |

This only happens for subpage, because subpage introduces a new ASSERT()
to do extra checks.

Strictly speaking, the regular sector size case should also report this
problem.
But the regular sector size case doesn't really care about page Private,
as it just sets page->private to a constant value, unlike the subpage
case, which stores an important value there.

The fix will just re-set page Private and the needed structures in
btrfs_dirty_pages(), with the extent locked so that btrfs_releasepage()
is no longer able to release it.

The fix is already added to the github branch.
Now it has the fix as the HEAD.

I hope this won't damage your confidence in the patchset.

Thanks for the report!
Qu



Thanks for the report,
Qu


BTW, are u

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-15 Thread Qu Wenruo




On 2021/4/16 上午7:19, Qu Wenruo wrote:



On 2021/4/15 下午10:52, riteshh wrote:

On 21/04/15 09:14AM, riteshh wrote:

On 21/04/12 07:33PM, Qu Wenruo wrote:

Good news, you can fetch the subpage branch for better test results.

Now the branch should pass all generic tests, except defrag and known
failures.
And no more random crashes during the tests.


Thanks, let me test it on PPC64 box.


I do see some failures remaining with the patch series.
However, the one which is blocking my testing is tests/generic/095;
I see a kernel BUG hitting with the below signature.


That's pretty different from my tests.

As I haven't seen such BUG_ON() for a while.




Please let me know if this is a known failure?


#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION   -- btrfs_4k
FSTYP -- btrfs
PLATFORM  -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73
SMP Thu Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3


I see you're using -n 4096, not the default -n 16K, let me see if I can
reproduce that.

But from the backtrace, it doesn't look like that's the case,
as it happens in the data path, which means it's only related to the sectorsize.


MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch



[ 6057.560580] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector
size 4096 with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd
devid 1 transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector
size 4096 with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) && page->private,
in fs/btrfs/subpage.c:171
[ 6061.378296] [ cut here ]
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c000260d7490]
 pc: c0a9370c: assertfail.constprop.11+0x34/0x48
 lr: c0a93708: assertfail.constprop.11+0x30/0x48
 sp: c000260d7730
    msr: 8282b033
   current = 0xc000260c0080
   paca    = 0xc0003fff8a00   irqmask: 0x03   irq_happened: 0x01
 pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@) (gcc
(Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu)
2.30) #73 SMP Thu Apr 15 07:29:23 CDT 2021
enter ? for help
[c000260d7790] c0a90280
btrfs_subpage_assert.isra.9+0x70/0x110
[c000260d77b0] c0a91064 btrfs_subpage_set_uptodate+0x54/0x110
[c000260d7800] c09c6d0c btrfs_dirty_pages+0x1bc/0x2c0


This is very strange.
As in btrfs_dirty_pages(), the pages passed in are already prepared by
prepare_pages(), which means all of them should have Private set.

Can you reproduce the bug reliably?


OK, I got it reproduced.

It's not a reliable BUG_ON(), but can be reproduced.
The test gets skipped on all my boards as it requires the fio tool, thus I
didn't get it triggered in any previous runs.

I'll take a look into the case.

Thanks for the report,
Qu


BTW, are you running the latest branch, with this commit at the top?

commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
Author: Qu Wenruo 
Date:   Mon Feb 22 14:19:38 2021 +0800

    btrfs: allow read-write for 4K sectorsize on 64K page size systems

As I was updating the patchset until the last minute.

Thanks,
Qu


[c000260d7880] c09c7298 btrfs_buffered_write+0x488/0x7f0
[c000260d79d0] c09cbeb4 btrfs_file_write_iter+0x314/0x520
[c000260d7a50] c055fd84 do_iter_readv_writev+0x1b4/0x260
[c000260d7ac0] c056114c do_iter_write+0xdc/0x2c0
[c000260d7b10] c05c2d2c iter_file_splice_write+0x2ec/0x510
[c000260d7c30] c05c1ba0 do_splice_from+0x50/0x70
[c000260d7c50] c05c37e8 do_splice+0x5a8/0x910
[c000260d7cd0] c05c3ce0 sys_splice+0x190/0x300
[c000260d7d60] c0039ba4 system_call_exception+0x384/0x3d0
[c000260d7e10] c000d45c system_call_common+0xec/0x278
--- Exception: c00 (System Call) at 772ef170


-ritesh



Re: [PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page

2021-04-15 Thread Qu Wenruo




On 2021/4/16 上午3:27, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

The new function, submit_eb_subpage(), will submit all the dirty extent
buffers in the page.

The major difference between submit_eb_page() and submit_eb_subpage()
is:
- How to grab extent buffer
   Now we use find_extent_buffer_nospinlock() instead of using
   page::private.

All other different handling is already done in functions like
lock_extent_buffer_for_io() and write_one_eb().

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/extent_io.c | 95 
  1 file changed, 95 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c068c2fcba09..7d1fca9b87f0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4323,6 +4323,98 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
  return ret;
  }
+/*
+ * Submit one subpage btree page.
+ *
+ * The main difference between submit_eb_page() is:
+ * - Page locking
+ *   For subpage, we don't rely on page locking at all.
+ *
+ * - Flush write bio
+ *   We only flush the bio if we may be unable to fit the current extent
+ *   buffers into the current bio.
+ *
+ * Return >=0 for the number of submitted extent buffers.
+ * Return <0 for fatal error.
+ */
+static int submit_eb_subpage(struct page *page,
+ struct writeback_control *wbc,
+ struct extent_page_data *epd)
+{
+    struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+    int submitted = 0;
+    u64 page_start = page_offset(page);
+    int bit_start = 0;
+    int nbits = BTRFS_SUBPAGE_BITMAP_SIZE;
+    int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits;
+    int ret;
+
+    /* Lock and write each dirty extent buffers in the range */
+    while (bit_start < nbits) {
+    struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+    struct extent_buffer *eb;
+    unsigned long flags;
+    u64 start;
+
+    /*
+ * Take private lock to ensure the subpage won't be detached
+ * halfway.
+ */
+    spin_lock(&page->mapping->private_lock);
+    if (!PagePrivate(page)) {
+    spin_unlock(&page->mapping->private_lock);
+    break;
+    }
+    spin_lock_irqsave(&subpage->lock, flags);


writepages doesn't get called in irq context, so you can just do
spin_lock_irq()/spin_unlock_irq().


But this spinlock is used in the endio function.
If we don't use the irqsave variant here, won't an endio interrupt
sneak in and screw up everything?




+    if (!((1 << bit_start) & subpage->dirty_bitmap)) {


Can we make this a helper so it's more clear what's going on here?  Thanks,


That makes sense.

Thanks,
Qu



Josef


Re: [PATCH 02/42] btrfs: introduce write_one_subpage_eb() function

2021-04-15 Thread Qu Wenruo




On 2021/4/16 上午3:03, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

The new function, write_one_subpage_eb(), as a subroutine for subpage
metadata write, will handle the extent buffer bio submission.

The major differences between the new write_one_subpage_eb() and
write_one_eb() is:
- No page locking
   When entering write_one_subpage_eb() the page is no longer locked.
   We only lock the page for its status update, and unlock immediately.
   Now we completely rely on extent io tree locking.

- Extra bitmap update along with page status update
   Now page dirty and writeback is controlled by
   btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
   They both follow the schema that if any sector is dirty/writeback,
   then the full page is dirty/writeback.

- When to update the nr_written number
   Now we take a shortcut: if we have cleared the last dirty bit of the
   page, we update nr_written.
   This is not completely perfect, but should emulate the old behavior
   well enough.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/extent_io.c | 55 
  1 file changed, 55 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 21a14b1cb065..f32163a465ec 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4196,6 +4196,58 @@ static void end_bio_extent_buffer_writepage(struct bio *bio)
  bio_put(bio);
  }
+/*
+ * Unlike the work in write_one_eb(), we rely completely on extent locking.
+ * Page locking is only utilized minimally to keep the VM code happy.
+ *
+ * Callers should still call write_one_eb() rather than this function
+ * directly, as write_one_eb() has extra preparation before submitting the
+ * extent buffer.
+ */
+static int write_one_subpage_eb(struct extent_buffer *eb,
+    struct writeback_control *wbc,
+    struct extent_page_data *epd)
+{
+    struct btrfs_fs_info *fs_info = eb->fs_info;
+    struct page *page = eb->pages[0];
+    unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
+    bool no_dirty_ebs = false;
+    int ret;
+
+    /* clear_page_dirty_for_io() in the subpage helper needs the page locked. */
+    lock_page(page);
+    btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
+
+    /* If we're the last dirty bit to update nr_written */
+    no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
+  eb->start, eb->len);
+    if (no_dirty_ebs)
+    clear_page_dirty_for_io(page);
+
+    ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
+    eb->start, eb->len, eb->start - page_offset(page),
+    &epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
+    false);
+    if (ret) {
+    btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+  eb->len);
+    set_btree_ioerr(page, eb);
+    unlock_page(page);
+
+    if (atomic_dec_and_test(&eb->io_pages))
+    end_extent_buffer_writeback(eb);
+    return -EIO;
+    }
+    unlock_page(page);
+    /*
+ * Submission finishes without problem, if no range of the page is
+ * dirty anymore, we have submitted a page.
+ * Update the nr_written in wbc.
+ */
+    if (no_dirty_ebs)
+    update_nr_written(wbc, 1);
+    return ret;
+}
+
  static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
  struct writeback_control *wbc,
  struct extent_page_data *epd)
@@ -4227,6 +4279,9 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
  memzero_extent_buffer(eb, start, end - start);
  }
+    if (eb->fs_info->sectorsize < PAGE_SIZE)
+    return write_one_subpage_eb(eb, wbc, epd);
+


Same comment here, again you're calling write_one_eb() which expects to
do the eb thing, but then later have an entirely different path for the
subpage stuff, and thus could just call your write_one_subpage_eb()
helper from there instead of stuffing it into write_one_eb().


But there is some common code before calling the subpage routine.

I don't think it's a good idea to have duplicated code between the
subpage and regular routines.



Also, I generally don't care about ordering of patches as long as they
make sense generally.

However in this case if you were to bisect to just this patch you would
be completely screwed, as the normal write path would just fail to write
the other eb's on the page.  You really need to have the patches that do
the write_cache_pages part done first, and then have this patch.


There is no way one can bisect to this patch.
Without the last patch enabling subpage writes, a bisect will never
point to this one.

And how could it be possible to implement data write before metadata?
Without metadata write ability, data write won't even be possible.

But without data write ability, metadata write can still be possible,
just doi

Re: [PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function

2021-04-15 Thread Qu Wenruo



On 2021/4/16 上午2:50, Josef Bacik wrote:

On 4/15/21 1:04 AM, Qu Wenruo wrote:

The new function, end_bio_subpage_eb_writepage(), will handle the
metadata writeback endio.

The major differences involved are:
- How to grab extent buffer
   Now page::private is a pointer to btrfs_subpage, we can no longer grab
   extent buffer directly.
   Thus we need to use the bv_offset to locate the extent buffer manually
   and iterate through the whole range.

- Use btrfs_subpage_end_writeback() caller
   This helper will handle the subpage writeback for us.

Since this function is executed under endio context, when grabbing
extent buffers it can't grab eb->refs_lock as that lock is not designed
to be grabbed under hardirq context.

So here introduce a helper, find_extent_buffer_nospinlock(), for such
situation, and convert find_extent_buffer() to use that helper.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/extent_io.c | 135 +--
  1 file changed, 106 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a50adbd8808d..21a14b1cb065 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4080,13 +4080,97 @@ static void set_btree_ioerr(struct page *page, struct extent_buffer *eb)

  }
  }
+/*
+ * This is the endio specific version which won't touch any unsafe
+ * spinlock in endio context.
+ */
+static struct extent_buffer *find_extent_buffer_nospinlock(
+    struct btrfs_fs_info *fs_info, u64 start)
+{
+    struct extent_buffer *eb;
+
+    rcu_read_lock();
+    eb = radix_tree_lookup(&fs_info->buffer_radix,
+   start >> fs_info->sectorsize_bits);
+    if (eb && atomic_inc_not_zero(&eb->refs)) {
+    rcu_read_unlock();
+    return eb;
+    }
+    rcu_read_unlock();
+    return NULL;
+}
+/*
+ * The endio function for subpage extent buffer write.
+ *
+ * Unlike end_bio_extent_buffer_writepage(), we only call end_page_writeback()
+ * after all extent buffers in the page have finished their writeback.
+ */
+static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
+ struct bio *bio)
+{
+    struct bio_vec *bvec;
+    struct bvec_iter_all iter_all;
+
+    ASSERT(!bio_flagged(bio, BIO_CLONED));
+    bio_for_each_segment_all(bvec, bio, iter_all) {
+    struct page *page = bvec->bv_page;
+    u64 bvec_start = page_offset(page) + bvec->bv_offset;
+    u64 bvec_end = bvec_start + bvec->bv_len - 1;
+    u64 cur_bytenr = bvec_start;
+
+    ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+
+    /* Iterate through all extent buffers in the range */
+    while (cur_bytenr <= bvec_end) {
+    struct extent_buffer *eb;
+    int done;
+
+    /*
+ * Here we can't use find_extent_buffer(), as it may
+ * try to lock eb->refs_lock, which is not safe in endio
+ * context.
+ */
+    eb = find_extent_buffer_nospinlock(fs_info, cur_bytenr);
+    ASSERT(eb);
+
+    cur_bytenr = eb->start + eb->len;
+
+    ASSERT(test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags));
+    done = atomic_dec_and_test(&eb->io_pages);
+    ASSERT(done);
+
+    if (bio->bi_status ||
+    test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+    ClearPageUptodate(page);
+    set_btree_ioerr(page, eb);
+    }
+
+    btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+  eb->len);
+    end_extent_buffer_writeback(eb);
+    /*
+ * free_extent_buffer() will grab spinlock which is not
+ * safe in endio context. Thus here we manually dec
+ * the ref.
+ */
+    atomic_dec(&eb->refs);
+    }
+    }
+    bio_put(bio);
+}
+
  static void end_bio_extent_buffer_writepage(struct bio *bio)
  {
+    struct btrfs_fs_info *fs_info;
  struct bio_vec *bvec;
  struct extent_buffer *eb;
  int done;
  struct bvec_iter_all iter_all;
+    fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
+    if (fs_info->sectorsize < PAGE_SIZE)
+    return end_bio_subpage_eb_writepage(fs_info, bio);
+


You replace the write_one_eb() call with one specifically for subpage, 
why not just use your special endio from there without polluting the 
normal writepage helper?  Thanks,


That makes sense, I'd go that direction.

Thanks,
Qu



Josef


Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-15 Thread Qu Wenruo




On 2021/4/15 下午10:52, riteshh wrote:

On 21/04/15 09:14AM, riteshh wrote:

On 21/04/12 07:33PM, Qu Wenruo wrote:

Good news, you can fetch the subpage branch for better test results.

Now the branch should pass all generic tests, except defrag and known
failures.
And no more random crashes during the tests.


Thanks, let me test it on PPC64 box.


I do see some failures remaining with the patch series.
However, the one which is blocking my testing is tests/generic/095;
I see a kernel BUG hitting with the below signature.


That's pretty different from my tests.

As I haven't seen such BUG_ON() for a while.




Please let me know if this a known failure?


#:~/work-tools/xfstests$ sudo ./check -g auto
SECTION   -- btrfs_4k
FSTYP -- btrfs
PLATFORM  -- Linux/ppc64le qemu 5.12.0-rc7-02316-g3490dae50c0 #73 SMP Thu 
Apr 15 07:29:23 CDT 2021
MKFS_OPTIONS  -- -f -s 4096 -n 4096 /dev/loop3


I see you're using -n 4096, not the default -n 16K, let me see if I can
reproduce that.

But from the backtrace, it doesn't look like that's the case,
as it happens in the data path, which means it's only related to sectorsize.


MOUNT_OPTIONS -- /dev/loop3 /mnt1/scratch



[ 6057.560580] BTRFS warning (device loop3): read-write for sector size 4096 
with page size 65536 is experimental
[ 6057.861383] run fstests generic/095 at 2021-04-15 14:12:10
[ 6058.345127] BTRFS info (device loop2): disk space caching is enabled
[ 6058.348910] BTRFS info (device loop2): has skinny extents
[ 6058.351930] BTRFS warning (device loop2): read-write for sector size 4096 
with page size 65536 is experimental
[ 6059.896382] BTRFS: device fsid 43ec9cdf-c124-4460-ad93-933bfd5ddbbd devid 1 
transid 5 /dev/loop3 scanned by mkfs.btrfs (739641)
[ 6060.225107] BTRFS info (device loop3): disk space caching is enabled
[ 6060.226213] BTRFS info (device loop3): has skinny extents
[ 6060.227084] BTRFS warning (device loop3): read-write for sector size 4096 
with page size 65536 is experimental
[ 6060.234537] BTRFS info (device loop3): checking UUID tree
[ 6061.375902] assertion failed: PagePrivate(page) && page->private, in 
fs/btrfs/subpage.c:171
[ 6061.378296] [ cut here ]
[ 6061.379422] kernel BUG at fs/btrfs/ctree.h:3403!
cpu 0x5: Vector: 700 (Program Check) at [c000260d7490]
 pc: c0a9370c: assertfail.constprop.11+0x34/0x48
 lr: c0a93708: assertfail.constprop.11+0x30/0x48
 sp: c000260d7730
msr: 8282b033
   current = 0xc000260c0080
   paca= 0xc0003fff8a00   irqmask: 0x03   irq_happened: 0x01
 pid   = 739712, comm = fio
kernel BUG at fs/btrfs/ctree.h:3403!
Linux version 5.12.0-rc7-02316-g3490dae50c0 (riteshh@) (gcc (Ubuntu 
8.4.0-1ubuntu1~18.04) 8.4.0, GNU ld (GNU Binutils for Ubuntu) 2.30) #73 SMP Thu 
Apr 15 07:29:23 CDT 2021
enter ? for help
[c000260d7790] c0a90280 btrfs_subpage_assert.isra.9+0x70/0x110
[c000260d77b0] c0a91064 btrfs_subpage_set_uptodate+0x54/0x110
[c000260d7800] c09c6d0c btrfs_dirty_pages+0x1bc/0x2c0


This is very strange.
As in btrfs_dirty_pages(), the pages passed in are already prepared by
prepare_pages(), which means all of them should have Private set.

Can you reproduce the bug reliably?

BTW, are you running the latest branch, with this commit at the top?

commit 3490dae50c01cec04364e5288f43ae9ac9eca2c9
Author: Qu Wenruo 
Date:   Mon Feb 22 14:19:38 2021 +0800

btrfs: allow read-write for 4K sectorsize on 64K page size systems

As I was updating the patchset until the last minute.

Thanks,
Qu


[c000260d7880] c09c7298 btrfs_buffered_write+0x488/0x7f0
[c000260d79d0] c09cbeb4 btrfs_file_write_iter+0x314/0x520
[c000260d7a50] c055fd84 do_iter_readv_writev+0x1b4/0x260
[c000260d7ac0] c056114c do_iter_write+0xdc/0x2c0
[c000260d7b10] c05c2d2c iter_file_splice_write+0x2ec/0x510
[c000260d7c30] c05c1ba0 do_splice_from+0x50/0x70
[c000260d7c50] c05c37e8 do_splice+0x5a8/0x910
[c000260d7cd0] c05c3ce0 sys_splice+0x190/0x300
[c000260d7d60] c0039ba4 system_call_exception+0x384/0x3d0
[c000260d7e10] c000d45c system_call_common+0xec/0x278
--- Exception: c00 (System Call) at 772ef170


-ritesh



[PATCH] btrfs-progs: mkfs: only output the warning if the sectorsize is not supported

2021-04-14 Thread Qu Wenruo
Currently mkfs.btrfs will output a warning message if the sectorsize is
not the same as page size:
  WARNING: the filesystem may not be mountable, sectorsize 4096 doesn't match 
page size 65536

But since btrfs subpage support for 64K page size is coming, this
output is polluting the golden output of fstests, causing tons of false
alerts.

This patch teaches mkfs.btrfs to check
/sys/fs/btrfs/features/supported_sectorsizes, and compare if the sector
size is supported.

Then only output above warning message if the sector size is not
supported.

Signed-off-by: Qu Wenruo 
---
 common/fsfeatures.c | 36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/common/fsfeatures.c b/common/fsfeatures.c
index 569208a9e5b1..13b775da9c72 100644
--- a/common/fsfeatures.c
+++ b/common/fsfeatures.c
@@ -16,6 +16,8 @@
 
 #include "kerncompat.h"
 #include 
+#include 
+#include 
 #include 
 #include 
 #include "common/fsfeatures.h"
@@ -327,8 +329,15 @@ u32 get_running_kernel_version(void)
 
return version;
 }
+
+/*
+ * The buffer size is strlen("4096 8192 16384 32768 65536"),
+ * which is 28, then round up to 32.
+ */
+#define SUPPORTED_SECTORSIZE_BUF_SIZE  32
 int btrfs_check_sectorsize(u32 sectorsize)
 {
+   bool sectorsize_checked = false;
u32 page_size = (u32)sysconf(_SC_PAGESIZE);
 
if (!is_power_of_2(sectorsize)) {
@@ -340,7 +349,32 @@ int btrfs_check_sectorsize(u32 sectorsize)
  sectorsize);
return -EINVAL;
}
-   if (page_size != sectorsize)
+   if (page_size == sectorsize) {
+   sectorsize_checked = true;
+   } else {
+   /*
+* Check if the sector size is supported
+*/
+   char supported_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   char sectorsize_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
+   int fd;
+   int ret;
+
+   fd = open("/sys/fs/btrfs/features/supported_sectorsizes",
+ O_RDONLY);
+   if (fd < 0)
+   goto out;
+   ret = read(fd, supported_buf, sizeof(supported_buf));
+   close(fd);
+   if (ret < 0)
+   goto out;
+   snprintf(sectorsize_buf, SUPPORTED_SECTORSIZE_BUF_SIZE,
+"%u", page_size);
+   if (strstr(supported_buf, sectorsize_buf))
+   sectorsize_checked = true;
+   }
+out:
+   if (!sectorsize_checked)
warning(
 "the filesystem may not be mountable, sectorsize %u doesn't match page size 
%u",
sectorsize, page_size);
-- 
2.31.1



[PATCH 42/42] btrfs: allow read-write for 4K sectorsize on 64K page size systems

2021-04-14 Thread Qu Wenruo
Since we now support data and metadata read-write for subpage, remove
the RO requirement for subpage mount.

There are some extra limits though:
- For now, subpage RW mount is still considered experimental
  Thus that mount warning will still be there.

- No compression support
  There are still quite a few hard coded PAGE_SIZE usages, and quite a few
  call sites use extent_clear_unlock_delalloc() to unlock locked_page.
  This will screw up the subpage helpers.

  Now for subpage RW mount, no matter what mount option or inode
  attr is set, no write will be compressed,
  although reading compressed data has no problem.

- No sectorsize defrag
  The problem here is, defrag is still done in full page size (64K).
  This means, if a page only has 4K data while the remaining 60K is all
  hole, after defrag it will be full 64K.

  This should not cause any kernel warning/hang nor data corruption, but
  it's still a behavior difference.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more writeback than the range specified.
  In fallocate calls, this behavior can make us write back data which
  could be inlined, before we enlarge the isize.

  This is a very special corner case, and even current btrfs check won't
  report error on such inline extent + regular extent.
  But considering how much effort has been put into preventing such
  inline + regular layouts, I'd prefer to cut off inline extents
  completely until we have a good solution.

- Read-time data repair is in bvec size
  This is different from original sector size repair.
  Bvec size is a floating number between 4K to 64K (page size).
  If the extent is only 4K sized then we can do the repair in 4K size.
  But if the extent is larger, our repair unit grows following the
  extent size, until it reaches PAGE_SIZE.

  This is mostly due to the design of the repair code, it can be
  enhanced later.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 13 -
 fs/btrfs/inode.c   |  3 +++
 fs/btrfs/ioctl.c   |  7 +++
 fs/btrfs/super.c   |  7 ---
 fs/btrfs/sysfs.c   |  5 +
 5 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0a1182694f48..6db6c231ecc4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3386,15 +3386,10 @@ int __cold open_ctree(struct super_block *sb, struct 
btrfs_fs_devices *fs_device
goto fail_alloc;
}
 
-   /* For 4K sector size support, it's only read-only */
-   if (PAGE_SIZE == SZ_64K && sectorsize == SZ_4K) {
-   if (!sb_rdonly(sb) || btrfs_super_log_root(disk_super)) {
-   btrfs_err(fs_info,
-   "subpage sectorsize %u only supported read-only for page size %lu",
-   sectorsize, PAGE_SIZE);
-   err = -EINVAL;
-   goto fail_alloc;
-   }
+   if (sectorsize != PAGE_SIZE) {
+   btrfs_warn(fs_info,
+   "read-write for sector size %u with page size %lu is experimental",
+  sectorsize, PAGE_SIZE);
}
 
ret = btrfs_init_workqueues(fs_info, fs_devices);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 077c0aa4f846..cd36182aa653 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -466,6 +466,9 @@ static noinline int add_async_extent(struct async_chunk 
*cow,
  */
 static inline bool inode_can_compress(struct btrfs_inode *inode)
 {
+   /* Subpage doesn't support compress yet */
+   if (inode->root->fs_info->sectorsize < PAGE_SIZE)
+   return false;
if (inode->flags & BTRFS_INODE_NODATACOW ||
inode->flags & BTRFS_INODE_NODATASUM)
return false;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 37c92a9fa2e3..be174dc9bcd0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3149,6 +3149,13 @@ static int btrfs_ioctl_defrag(struct file *file, void 
__user *argp)
struct btrfs_ioctl_defrag_range_args *range;
int ret;
 
+   /*
+* Subpage defrag support is not really sector perfect yet.
+* Disable defrag for the subpage case for now.
+*/
+   if (root->fs_info->sectorsize < PAGE_SIZE)
+   return -ENOTTY;
+
ret = mnt_want_write_file(file);
if (ret)
return ret;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f7a4ad86adee..f892ddf2e9f1 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2027,13 +2027,6 @@ static int btrfs_remount(struct super_block *sb, int 
*flags, char *data)
ret = -EINVAL;
goto restore;
}
-   if (fs_info->sectorsize < PAGE_SIZE) {
-   btrfs_warn(fs_info,
-   "read-write mount is not yet allowed for sectorsize %u page size %lu",
-

[PATCH 41/42] btrfs: allow submit_extent_page() to do bio split for subpage

2021-04-14 Thread Qu Wenruo
Currently submit_extent_page() just checks if the current page range can
fit into the current bio, and if not, submits the bio and then re-adds
the range.

But this behavior has a problem, it can't handle subpage cases.

For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.

This means that if we can't fit a full 64K into a bio due to the stripe
limit, it won't fit into the next bio without crossing a stripe either.

The proper way to handle it is:
- Check how many bytes we can put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

cur = pg_offset;
while (cur < pg_offset + size) {
u32 offset = cur - pg_offset;
int added;
if (!bio_ctrl->bio) {
/* Allocate new bio if needed */
}
/* Add as many bytes into the bio */
if (added < size - offset) {
/* The current bio is full, submit it */
}
cur += added;
}

Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new function, alloc_new_bio().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 183 ---
 1 file changed, 122 insertions(+), 61 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4afc3949e6e6..692cc9e693db 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -172,6 +172,7 @@ int __must_check submit_one_bio(struct bio *bio, int 
mirror_num,
 
bio->bi_private = NULL;
 
+   ASSERT(bio->bi_iter.bi_size);
if (is_data_inode(tree->private_data))
ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
bio_flags);
@@ -3201,13 +3202,13 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, 
int offset, int size)
  * @size:  portion of page that we want to write
  * @prev_bio_flags:  flags of previous bio to see if we can merge the current 
one
  * @bio_flags: flags of the current bio to see if we can merge them
- * @return:true if page was added, false otherwise
  *
  * Attempt to add a page to bio considering stripe alignment etc.
  *
- * Return true if successfully page added. Otherwise, return false.
+ * Return >= 0 for the number of bytes added to the bio.
+ * Return <0 for error.
  */
-static bool btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
+static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
   struct page *page,
   u64 disk_bytenr, unsigned int size,
   unsigned int pg_offset,
@@ -3215,6 +3216,7 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl 
*bio_ctrl,
 {
struct bio *bio = bio_ctrl->bio;
u32 bio_size = bio->bi_iter.bi_size;
+   u32 real_size;
const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
bool contig;
int ret;
@@ -3223,26 +3225,33 @@ static bool btrfs_bio_add_page(struct btrfs_bio_ctrl 
*bio_ctrl,
/* The limit should be calculated when bio_ctrl->bio is allocated */
ASSERT(bio_ctrl->len_to_oe_boundary &&
   bio_ctrl->len_to_stripe_boundary);
+
if (bio_ctrl->bio_flags != bio_flags)
-   return false;
+   return 0;
 
if (bio_ctrl->bio_flags & EXTENT_BIO_COMPRESSED)
contig = bio->bi_iter.bi_sector == sector;
else
contig = bio_end_sector(bio) == sector;
if (!contig)
-   return false;
+   return 0;
 
-   if (bio_size + size > bio_ctrl->len_to_oe_boundary ||
-   bio_size + size > bio_ctrl->len_to_stripe_boundary)
-   return false;
+   real_size = min(bio_ctrl->len_to_oe_boundary,
+   bio_ctrl->len_to_stripe_boundary) - bio_size;
+   real_size = min(real_size, size);
+   /*
+* If real_size is 0, never call bio_add_*_page(), as even when size
+* is 0, the bio will still execute its endio function on the page!
+*/
+   if (real_size == 0)
+   return 0;
 
if (bio_op(bio) == REQ_OP_ZONE_APPEND)
-   ret = bio_add_zone_append_page(bio, page, size, pg_offset);
+   ret = bio_add_zone_append_page(bio, page, real_size, pg_offset);
else
-   ret = bio_add_page(bio, page, size, pg_offset);
+   ret = bio_add_page(bio, page, real_size, pg_offset);
 
-   return ret == size;
+   return ret;
 }
 
 static int calc_bio_boundaries(struct btrfs_bio_ctrl *bio_ctrl,
@@ -3301,6 +331

[PATCH 39/42] btrfs: make free space cache size consistent across different PAGE_SIZE

2021-04-14 Thread Qu Wenruo
Currently free space cache inode size is determined by two factors:
- block group size
- PAGE_SIZE

This means, for the same sized block group, with different PAGE_SIZE, it
will result different inode size.

This will not be a good thing for subpage support, so change the
requirement for PAGE_SIZE to sectorsize.

Now for the same 4K sectorsize btrfs, it should result in the same inode
size no matter what the PAGE_SIZE is.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/block-group.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 293f3169be80..a0591eca270b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2414,7 +2414,7 @@ static int cache_save_setup(struct btrfs_block_group 
*block_group,
struct extent_changeset *data_reserved = NULL;
u64 alloc_hint = 0;
int dcs = BTRFS_DC_ERROR;
-   u64 num_pages = 0;
+   u64 cache_size = 0;
int retries = 0;
int ret = 0;
 
@@ -2526,20 +2526,20 @@ static int cache_save_setup(struct btrfs_block_group 
*block_group,
 * taking up quite a bit since it's not folded into the other space
 * cache.
 */
-   num_pages = div_u64(block_group->length, SZ_256M);
-   if (!num_pages)
-   num_pages = 1;
+   cache_size = div_u64(block_group->length, SZ_256M);
+   if (!cache_size)
+   cache_size = 1;
 
-   num_pages *= 16;
-   num_pages *= PAGE_SIZE;
+   cache_size *= 16;
+   cache_size *= fs_info->sectorsize;
 
ret = btrfs_check_data_free_space(BTRFS_I(inode), &data_reserved, 0,
- num_pages);
+ cache_size);
if (ret)
goto out_put;
 
-   ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages,
- num_pages, num_pages,
+   ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, cache_size,
+ cache_size, cache_size,
  &alloc_hint);
/*
 * Our cache requires contiguous chunks so that we don't modify a bunch
-- 
2.31.1



[PATCH 40/42] btrfs: refactor submit_extent_page() to make bio and its flag tracing easier

2021-04-14 Thread Qu Wenruo
There is a lot of code inside extent_io.c that needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with parameters
like "unsigned long bio_flags".

Such strange parameters are here for bio assembly.

For example, we have such inode page layout:

0   4K  8K  12K
|<-- Extent A-->|<- EB->|

Then what we do is:
- Page [0, 4K)
  *bio_ret = NULL
  So we allocate a new bio to bio_ret,
  Add page [0, 4K) to *bio_ret.

- Page [4K, 8K)
  *bio_ret != NULL
  We found this page is continuous to *bio_ret,
  and if we're not at stripe boundary, we
  add page [4K, 8K) to *bio_ret.

- Page [8K, 12K)
  *bio_ret != NULL
  But we found this page is not continuous, so
  we submit *bio_ret, then allocate a new bio,
  and add page [8K, 12K) to the new bio.

This means we need to record both the bio and its bio_flags, but we
record them manually using those strange parameter lists, rather than
encapsulating them into their own structure.

So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.

Also, in the above case, for all pages added to the bio, we need to check
if the new page crosses the stripe boundary.
This check itself can be time consuming, and we don't really need to do
that for each page.

This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundaries are
also calculated, so no matter how large the bio will be, we only
calculate the boundaries once, to save some CPU time.

The following functions/structures are affected:
- struct extent_page_data
  Replace its bio pointer with structure btrfs_bio_ctrl (embedded
  structure, not pointer)

- end_write_bio()
- flush_write_bio()
  Just change how bio is fetched

- btrfs_bio_add_page()
  Use pre-calculated boundaries instead of re-calculating them.
  And use @bio_ctrl to replace @bio and @prev_bio_flags.

- calc_bio_boundaries()
  New function

- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
  To use @bio_ctrl to replace @bio and @prev_bio_flags, and change how
  the bio is grabbed.

- btrfs_bio_fits_in_ordered_extent()
  Removed, as now the ordered extent size limit is done at bio
  allocation time, no need to check for each page range.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h |   2 -
 fs/btrfs/extent_io.c | 212 +++
 fs/btrfs/extent_io.h |  13 ++-
 fs/btrfs/inode.c |  36 +---
 4 files changed, 152 insertions(+), 111 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f8d1e495deda..deb781a8cf92 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3134,8 +3134,6 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 struct extent_state *orig, u64 split);
 int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio,
 unsigned long bio_flags);
-bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
- unsigned int size);
 void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 81931c02c0e4..4afc3949e6e6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -136,7 +136,7 @@ struct tree_entry {
 };
 
 struct extent_page_data {
-   struct bio *bio;
+   struct btrfs_bio_ctrl bio_ctrl;
/* tells writepage not to lock the state bits for this range
 * it still does the unlocking
 */
@@ -185,10 +185,12 @@ int __must_check submit_one_bio(struct bio *bio, int 
mirror_num,
 /* Cleanup unsubmitted bios */
 static void end_write_bio(struct extent_page_data *epd, int ret)
 {
-   if (epd->bio) {
-   epd->bio->bi_status = errno_to_blk_status(ret);
-   bio_endio(epd->bio);
-   epd->bio = NULL;
+   struct bio *bio = epd->bio_ctrl.bio;
+
+   if (bio) {
+   bio->bi_status = errno_to_blk_status(ret);
+   bio_endio(bio);
+   epd->bio_ctrl.bio = NULL;
}
 }
 
@@ -201,9 +203,10 @@ static void end_write_bio(struct extent_page_data *epd, 
int ret)
 static int __must_check flush_write_bio(struct extent_page_data *epd)
 {
int ret = 0;
+   struct bio *bio = epd->bio_ctrl.bio;
 
-   if (epd->bio) {
-   ret = submit_one_bio(epd->bio, 0, 0);
+   if (bio) {
+   ret = submit_one_bio(bio, 0, 0);
/*
 * Clean up of epd->bio is handled by its endio function.
 * And endio is either triggered by successful bio execution
@@ -211,7 +214,7 @@ static int __must_check flush_write_bio(struct 
extent_page_data *epd)
   

[PATCH 36/42] btrfs: fix wild subpage writeback which does not have ordered extent.

2021-04-14 Thread Qu Wenruo
[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:

 kernel BUG at fs/btrfs/file-item.c:667!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3486 Comm: kworker/u13:2 Tainted: GWC O  
5.11.0-rc4-custom+ #43
 Hardware name: Radxa ROCK Pi 4B (DT)
 Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
 pstate: 6005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
 pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
 lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
 Call trace:
  btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
  btrfs_submit_bio_start+0x20/0x30 [btrfs]
  run_one_async_start+0x28/0x44 [btrfs]
  btrfs_work_helper+0x128/0x1b4 [btrfs]
  process_one_work+0x22c/0x430
  worker_thread+0x70/0x3a0
  kthread+0x13c/0x140
  ret_from_fork+0x10/0x30

[CAUSE]
The above BUG_ON() means there is some bio range which doesn't have an
ordered extent, which is indeed worthy of a BUG_ON().

Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.

This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.

In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.

But there is a loophole: if we hit a range which is beyond isize, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty bits.

This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 17 +
 fs/btrfs/subpage.c   | 16 
 fs/btrfs/subpage.h   |  7 +++
 3 files changed, 40 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ae6357a6749e..152aface4eeb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3852,6 +3852,16 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
if (cur >= i_size) {
btrfs_writepage_endio_finish_ordered(inode, page, cur,
 end, 1);
+   /*
+* This range is beyond isize, thus we don't need to
+* bother writing back.
+* But we still need to clear the dirty subpage bit, or
+* the next time the page gets dirtied, we will try to
+* writeback the sectors with subpage dirty bits,
+* causing writeback without an ordered extent.
+*/
+   btrfs_page_clear_dirty(fs_info, page, cur,
+  end + 1 - cur);
break;
}
 
@@ -3902,6 +3912,7 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
else
btrfs_writepage_endio_finish_ordered(inode,
page, cur, cur + iosize - 1, 1);
+   btrfs_page_clear_dirty(fs_info, page, cur, iosize);
cur += iosize;
continue;
}
@@ -3936,6 +3947,12 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
cur += iosize;
nr++;
}
+   /*
+* If we finish without problem, we should not only have cleared the
+* page dirty flag, but also have emptied the subpage dirty bits
+*/
+   if (!ret)
+   btrfs_page_assert_not_dirty(fs_info, page);
*nr_ret = nr;
return ret;
 }
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 516e0b3f2ed9..696485ab68a2 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -548,3 +548,19 @@ IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, 
end_page_writeback,
 PageWriteback);
 IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
 PageOrdered);
+
+void btrfs_page_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+struct page *page)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+
+   if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
+   return;
+
+   ASSERT(!PageDirty(page));
+   if (fs_info->

[PATCH 37/42] btrfs: disable inline extent creation for subpage

2021-04-14 Thread Qu Wenruo
[BUG]
When running the following fsx command (extracted from generic/127) on
subpage btrfs, it can create inline extent with regular extents:

fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > 
/tmp/fsx

The offending extent would look like:

item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
index 2 namelen 4 name: file
item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
generation 7 type 0 (inline)
inline extent data size 707 ram_bytes 707 compression 0 (none)
item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
generation 7 type 2 (prealloc)
prealloc data disk byte 102346752 nr 4096
prealloc data offset 0 nr 4096

[CAUSE]
For subpage btrfs, writeback is triggered in page units, which means that
even if we just want to writeback range [16K, 20K) on a 64K page system,
we will still try to writeback any dirty sector in the range [0, 64K).

This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
this can cause unexpected problems.

For above test case, the last several operations from fsx are:

 9055 trunc  from 0x4 to 0x2c3
 9057 falloc from 0x164c to 0x19d2 (0x386 bytes)

In operation 9055, we dirtied sector [0, 4096), then in falloc, we call
btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to
writeback any dirty data in [4096, 8192), but nothing else.

Unfortunately, in subpage case, above btrfs_wait_ordered_range() will
trigger writeback of the range [0, 64K), which includes the data at [0,
4096).

And since at the call site we haven't yet increased i_size, which is
still 707, this means cow_file_range() can insert an inline extent.

This results in the above inline + regular extent combination.

[WORKAROUND]
I don't really have any good short-term solution yet, as this means all
operations that would trigger writeback need to be reviewed for any
isize change.

So here I choose to disable inline extent creation for subpage case as a
workaround.
We have done tons of work just to avoid such extents, so I don't want to
create an exception just for subpage.

This only affects inline extent creation, btrfs subpage support has no
problem reading existing inline extents at all.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e31a0521564e..5030bbf3a667 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -663,7 +663,11 @@ static noinline int compress_file_range(struct async_chunk 
*async_chunk)
}
}
 cont:
-   if (start == 0) {
+   /*
+* Check cow_file_range() for why we don't even try to create
+* inline extent for subpage case.
+*/
+   if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
/* lets try to make an inline extent */
if (ret || total_in < actual_end) {
/* we didn't compress the entire range, try
@@ -1061,7 +1065,17 @@ static noinline int cow_file_range(struct btrfs_inode 
*inode,
 
inode_should_defrag(inode, start, end, num_bytes, SZ_64K);
 
-   if (start == 0) {
+   /*
+* Due to the page size limit, for subpage we can only trigger the
+* writeback for the dirty sectors of the page, which means data
+* writeback is doing more writeback than what we want.
+*
+* This is especially unexpected for some call sites like fallocate,
+* where we only increase isize after everything is done.
+* This means we can trigger an inline extent even when we don't want to.
+* So here we skip inline extent creation completely.
+*/
+   if (start == 0 && fs_info->sectorsize == PAGE_SIZE) {
/* lets try to make an inline extent */
ret = cow_file_range_inline(inode, start, end, 0,
BTRFS_COMPRESS_NONE, NULL);
-- 
2.31.1



[PATCH 38/42] btrfs: skip validation for subpage read repair

2021-04-14 Thread Qu Wenruo
Unlike the PAGE_SIZE == sectorsize case, reads in subpage btrfs are always
merged if the range is in the same page:

E.g:
For regular sectorsize case, if we want to read range [0, 16K) of a
file, the bio will look like:

 0   4K  8K  12K 16K
 | bvec 1| bvec 2| bvec 3| bvec 4|

But for subpage case, above 16K can be merged into one bvec:

 0   4K  8K  12K 16K
 |  bvec 1   |

This means our bvec is no longer 1:1 mapped to btrfs sector.

This makes repair much harder to do, if we want to do sector perfect
repair.

For now, just skip validation for subpage read repair. This means:
- We will submit extra range to repair
  Even if we only have a single sector error for the above read, we will
  still submit the full 16K to overwrite the bad copy

- Less chance to get good copy
  Now the repair granularity is much coarser; we need a copy with
  all sectors correct to be able to submit a repair.

Sector perfect repair needs more modification, but for now the new
behavior should be good enough for us to test the basis of subpage
support.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 152aface4eeb..81931c02c0e4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2651,6 +2651,19 @@ static bool btrfs_io_needs_validation(struct inode 
*inode, struct bio *bio)
if (bio->bi_status == BLK_STS_OK)
return false;
 
+   /*
+* For the subpage case, read bios are always submitted as multiple-sector
+* bios if the range is in the same page.
+* For now, let's just skip the validation, and do page sized repair.
+*
+* This reduces the granularity for repair, meaning if we have two
+* copies with csum mismatches at different locations, we're
+* unable to repair in the subpage case.
+*
+* TODO: Make validation code to be fully subpage compatible
+*/
+   if (blocksize < PAGE_SIZE)
+   return false;
/*
 * We need to validate each sector individually if the failed I/O was
 * for multiple sectors.
-- 
2.31.1



[PATCH 34/42] btrfs: extract relocation page read and dirty part into its own function

2021-04-14 Thread Qu Wenruo
In function relocate_file_extent_cluster(), we have a big loop for
marking all involved pages delalloc.

That part is long enough to be contained in one function, so this patch
will move that code chunk into a new function, relocate_one_page().

This also provides enough space for later subpage work.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 199 --
 1 file changed, 94 insertions(+), 105 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b70be2ac2e9e..862fe5247c76 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2885,19 +2885,102 @@ noinline int btrfs_should_cancel_balance(struct 
btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
-static int relocate_file_extent_cluster(struct inode *inode,
-   struct file_extent_cluster *cluster)
+static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
+struct file_extent_cluster *cluster,
+int *cluster_nr, unsigned long page_index)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+   u64 offset = BTRFS_I(inode)->index_cnt;
+   const unsigned long last_index = (cluster->end - offset) >> PAGE_SHIFT;
+   gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
+   struct page *page;
u64 page_start;
u64 page_end;
+   int ret;
+
+   ASSERT(page_index <= last_index);
+   ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
+   if (ret)
+   return ret;
+
+   page = find_lock_page(inode->i_mapping, page_index);
+   if (!page) {
+   page_cache_sync_readahead(inode->i_mapping, ra, NULL,
+   page_index, last_index + 1 - page_index);
+   page = find_or_create_page(inode->i_mapping, page_index, mask);
+   if (!page) {
+   ret = -ENOMEM;
+   goto release_delalloc;
+   }
+   }
+   ret = set_page_extent_mapped(page);
+   if (ret < 0)
+   goto release_page;
+
+   if (PageReadahead(page))
+   page_cache_async_readahead(inode->i_mapping, ra, NULL, page,
+  page_index, last_index + 1 - page_index);
+
+   if (!PageUptodate(page)) {
+   btrfs_readpage(NULL, page);
+   lock_page(page);
+   if (!PageUptodate(page)) {
+   ret = -EIO;
+   goto release_page;
+   }
+   }
+
+   page_start = page_offset(page);
+   page_end = page_start + PAGE_SIZE - 1;
+
+   lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+
+   if (*cluster_nr < cluster->nr &&
+   page_start + offset == cluster->boundary[*cluster_nr]) {
+   set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
+   EXTENT_BOUNDARY);
+   (*cluster_nr)++;
+   }
+
+   ret = btrfs_set_extent_delalloc(BTRFS_I(inode), page_start, page_end,
+   0, NULL);
+   if (ret) {
+   clear_extent_bits(&BTRFS_I(inode)->io_tree, page_start,
+ page_end, EXTENT_LOCKED | EXTENT_BOUNDARY);
+   goto release_page;
+
+   }
+   set_page_dirty(page);
+
+   unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
+   unlock_page(page);
+   put_page(page);
+
+   btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+   balance_dirty_pages_ratelimited(inode->i_mapping);
+   btrfs_throttle(fs_info);
+   if (btrfs_should_cancel_balance(fs_info))
+   ret = -ECANCELED;
+   return ret;
+
+release_page:
+   unlock_page(page);
+   put_page(page);
+release_delalloc:
+   btrfs_delalloc_release_metadata(BTRFS_I(inode), PAGE_SIZE, true);
+   btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+   return ret;
+}
+
+static int relocate_file_extent_cluster(struct inode *inode,
+   struct file_extent_cluster *cluster)
+{
+   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
u64 offset = BTRFS_I(inode)->index_cnt;
unsigned long index;
unsigned long last_index;
-   struct page *page;
struct file_ra_state *ra;
-   gfp_t mask = btrfs_alloc_write_mask(inode->i_mapping);
-   int nr = 0;
+   int cluster_nr = 0;
int ret = 0;
 
if (!cluster->nr)
@@ -2918,109 +3001,15 @@ static int relocate_file_extent_cluster(struct inode 
*inode,
if (ret)
goto out;
 
-   index = (cluster->start - offset) >> PAGE_SHIFT;
last_index = (cluster->end - offset) >> PAG

[PATCH 35/42] btrfs: make relocate_one_page() to handle subpage case

2021-04-14 Thread Qu Wenruo
For the subpage case, one page of the data reloc inode can contain several
file extents, like this:

|<--- File extent A --->| FE B | FE C |<--- File extent D -->|
|<- Page ->|

We can no longer use PAGE_SIZE directly for various operations.

This patch makes relocate_one_page() handle the subpage case by:
- Iterating through all extents of a cluster when marking pages
  When marking pages dirty and delalloc, we need to check the cluster
  extent boundary.
  Now we introduce a loop to go extent by extent of a page, until we
  either finished the last extent, or reach the page end.

  By this, regular sectorsize == PAGE_SIZE can still work as usual, since
  we will do that loop only once.

- Iteration start from max(page_start, extent_start)
  Since we can have the following case:
| FE B | FE C |<--- File extent D -->|
|<- Page ->|
  Thus we can't always start from page_start, but do a
  max(page_start, extent_start)

- Iteration end when the cluster is exhausted
  Similar to previous case, the last file extent can end before the page
  end:
|<--- File extent A --->| FE B | FE C |
|<- Page ->|
  In this case, we need to manually exit the loop after we have finished
  the last extent of the cluster.

- Reserve metadata space for each extent range
  Since now we can hit multiple ranges in one page, we should reserve
  metadata for each range, not simply PAGE_SIZE.
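The clamping described above can be sketched as a small userspace model (hypothetical names, assuming a 64K page and the patch's inclusive-end convention; this is not the kernel code itself):

```c
#include <assert.h>
#include <stdint.h>

#define SUBPAGE_PAGE_SIZE 65536ULL	/* 64K page, the subpage target */

/*
 * Hypothetical sketch of the per-iteration clamping in relocate_one_page():
 * each file extent of the cluster is clamped to the current page before
 * delalloc and boundary bits are set.  Both ends are inclusive, as in the
 * patch.
 */
static void clamp_extent_to_page(uint64_t page_start,
				 uint64_t extent_start, uint64_t extent_end,
				 uint64_t *clamped_start, uint64_t *clamped_end)
{
	uint64_t page_end = page_start + SUBPAGE_PAGE_SIZE - 1;

	/* max(page_start, extent_start): extent may start inside the page */
	*clamped_start = extent_start > page_start ? extent_start : page_start;
	/* min(page_end, extent_end): extent may end inside the page */
	*clamped_end = extent_end < page_end ? extent_end : page_end;
}
```

For sectorsize == PAGE_SIZE, the extent always covers the page and the clamp degenerates to the page range, so the loop runs exactly once.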

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 108 ++
 1 file changed, 79 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 862fe5247c76..cd50559c6d17 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -24,6 +24,7 @@
 #include "block-group.h"
 #include "backref.h"
 #include "misc.h"
+#include "subpage.h"
 
 /*
  * Relocation overview
@@ -2885,6 +2886,17 @@ noinline int btrfs_should_cancel_balance(struct 
btrfs_fs_info *fs_info)
 }
 ALLOW_ERROR_INJECTION(btrfs_should_cancel_balance, TRUE);
 
+static u64 get_cluster_boundary_end(struct file_extent_cluster *cluster,
+   int cluster_nr)
+{
+   /* Last extent, use cluster end directly */
+   if (cluster_nr >= cluster->nr - 1)
+   return cluster->end;
+
+   /* Use the next boundary's start */
+   return cluster->boundary[cluster_nr + 1] - 1;
+}
+
 static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 struct file_extent_cluster *cluster,
 int *cluster_nr, unsigned long page_index)
@@ -2896,22 +2908,17 @@ static int relocate_one_page(struct inode *inode, 
struct file_ra_state *ra,
struct page *page;
u64 page_start;
u64 page_end;
+   u64 cur;
int ret;
 
ASSERT(page_index <= last_index);
-   ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE);
-   if (ret)
-   return ret;
-
page = find_lock_page(inode->i_mapping, page_index);
if (!page) {
page_cache_sync_readahead(inode->i_mapping, ra, NULL,
page_index, last_index + 1 - page_index);
page = find_or_create_page(inode->i_mapping, page_index, mask);
-   if (!page) {
-   ret = -ENOMEM;
-   goto release_delalloc;
-   }
+   if (!page)
+   return -ENOMEM;
}
ret = set_page_extent_mapped(page);
if (ret < 0)
@@ -2933,30 +2940,76 @@ static int relocate_one_page(struct inode *inode, 
struct file_ra_state *ra,
page_start = page_offset(page);
page_end = page_start + PAGE_SIZE - 1;
 
-   lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end);
-
-   if (*cluster_nr < cluster->nr &&
-   page_start + offset == cluster->boundary[*cluster_nr]) {
-   set_extent_bits(&BTRFS_I(inode)->io_tree, page_start, page_end,
-   EXTENT_BOUNDARY);
-   (*cluster_nr)++;
-   }
+   /*
+* Start from the cluster, as for subpage case, the cluster can start
+* inside the page.
+*/
+   cur = max(page_start, cluster->boundary[*cluster_nr] - offset);
+   while (cur <= page_end) {
+   u64 extent_start = cluster->boundary[*cluster_nr] - offset;
+   u64 extent_end = get_cluster_boundary_end(cluster,
+   *cluster_nr) - offset;
+   u64 clamped_start = max(page_start, extent_start);
+   u64 clamped_end = min(page_end, extent_end);
+   u32 clamped_len = cl

[PATCH 30/42] btrfs: make btrfs_page_mkwrite() to be subpage compatible

2021-04-14 Thread Qu Wenruo
Only set_page_dirty() and SetPageUptodate() are not subpage compatible.
Convert them to subpage helpers, so that __extent_writepage_io() can
submit page content correctly.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 38ebb79ee580..67c82de6b96a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8628,8 +8628,9 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
kunmap(page);
}
ClearPageChecked(page);
-   set_page_dirty(page);
-   SetPageUptodate(page);
+   btrfs_page_set_dirty(fs_info, page, page_start, end + 1 - page_start);
+   btrfs_page_set_uptodate(fs_info, page, page_start,
+   end + 1 - page_start);
 
btrfs_set_inode_last_sub_trans(BTRFS_I(inode));
 
-- 
2.31.1



[PATCH 31/42] btrfs: reflink: make copy_inline_to_page() to be subpage compatible

2021-04-14 Thread Qu Wenruo
The modifications are:
- Page copy destination
  For the subpage case, one page can contain multiple sectors, thus we can
  no longer expect memcpy_to_page()/btrfs_decompress() to copy
  data into page offset 0.
  The correct offset is offset_in_page(file_offset) now, which should
  handle both regular sectorsize and subpage cases well.

- Page status update
  Now we need to use subpage helper to handle the page status update.
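As a minimal illustration of the destination offset change, offset_in_page() for a 64K page is just a mask (a userspace sketch under that assumption; the kernel macro works for any page size):

```c
#include <assert.h>
#include <stdint.h>

#define SUBPAGE_PAGE_SIZE 65536ULL	/* 64K page assumed for this sketch */

/* Userspace equivalent of the kernel's offset_in_page() for a 64K page */
static uint64_t offset_in_page64k(uint64_t file_offset)
{
	return file_offset & (SUBPAGE_PAGE_SIZE - 1);
}
```

With a 4K sectorsize, an inline extent destined for file offset 64K+4K must be copied to byte 4096 of its page, not byte 0; for regular sectorsize == PAGE_SIZE the result is always 0, so both cases are handled by the same expression.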

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/reflink.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index f4ec06b53aa0..e5680c03ead4 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -7,6 +7,7 @@
 #include "delalloc-space.h"
 #include "reflink.h"
 #include "transaction.h"
+#include "subpage.h"
 
 #define BTRFS_MAX_DEDUPE_LEN   SZ_16M
 
@@ -52,7 +53,8 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
   const u64 datal,
   const u8 comp_type)
 {
-   const u64 block_size = btrfs_inode_sectorsize(inode);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
+   const u32 block_size = fs_info->sectorsize;
const u64 range_end = file_offset + block_size - 1;
const size_t inline_size = size - btrfs_file_extent_calc_inline_size(0);
char *data_start = inline_data + btrfs_file_extent_calc_inline_size(0);
@@ -106,10 +108,12 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
set_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags);
 
if (comp_type == BTRFS_COMPRESS_NONE) {
-   memcpy_to_page(page, 0, data_start, datal);
+   memcpy_to_page(page, offset_in_page(file_offset), data_start,
+  datal);
flush_dcache_page(page);
} else {
-   ret = btrfs_decompress(comp_type, data_start, page, 0,
+   ret = btrfs_decompress(comp_type, data_start, page,
+  offset_in_page(file_offset),
   inline_size, datal);
if (ret)
goto out_unlock;
@@ -137,9 +141,9 @@ static int copy_inline_to_page(struct btrfs_inode *inode,
kunmap(page);
}
 
-   SetPageUptodate(page);
+   btrfs_page_set_uptodate(fs_info, page, file_offset, block_size);
ClearPageChecked(page);
-   set_page_dirty(page);
+   btrfs_page_set_dirty(fs_info, page, file_offset, block_size);
 out_unlock:
if (page) {
unlock_page(page);
-- 
2.31.1



[PATCH 32/42] btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()

2021-04-14 Thread Qu Wenruo
[BUG]
With the current subpage RW support, the following script can hang the fs
with a 64K page size.

 # mkfs.btrfs -f -s 4k $dev
 # mount $dev -o nospace_cache $mnt
 # fsstress -w -n 50 -p 1 -s 1607749395 -d $mnt

The kernel will do an infinite loop in btrfs_punch_hole_lock_range().

[CAUSE]
In btrfs_punch_hole_lock_range() we:
- Truncate page cache range
- Lock extent io tree
- Wait any ordered extents in the range.

We only exit the loop when we meet all the following conditions:
- No ordered extent in the lock range
- No page is in the lock range

The latter condition has a pitfall: it only works for the sector size ==
PAGE_SIZE case.

It can't handle the following subpage case:

  0   32K 64K 96K 128K
  |   |///||//|   ||

lockstart=32K
lockend=96K - 1

In this case, although the range crosses 2 pages,
truncate_pagecache_range() will invalidate no page at all, but only zero
the [32K, 96K) range of the two pages.

Thus filemap_range_has_page(32K, 96K-1) will always return true, thus we
will never meet the loop exit condition.

[FIX]
Fix the problem by doing page alignment for the lock range.

Function filemap_range_has_page() has already handled lend < lstart
case, we only need to round up @lockstart, and round_down @lockend for
truncate_pagecache_range().

This modification should not change anything for the sector size ==
PAGE_SIZE case, as in that case our range is already page aligned.
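Under the same 64K-page assumption, the alignment can be modeled in userspace (round_up()/round_down() written out explicitly as power-of-two operations; a sketch, not the kernel macros):

```c
#include <assert.h>
#include <stdint.h>

#define SUBPAGE_PAGE_SIZE 65536ULL	/* 64K page */

/* Power-of-two rounding, equivalent to the kernel's round_up()/round_down() */
static uint64_t round_up_pow2(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

static uint64_t round_down_pow2(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/*
 * Page-align an inclusive lock range the way this patch does.  If the
 * aligned start ends up past the aligned end, the range fully covers no
 * page, and filemap_range_has_page() (which already handles lend < lstart)
 * correctly reports no pages, letting the loop exit.
 */
static void page_align_lock_range(uint64_t lockstart, uint64_t lockend,
				  uint64_t *page_lockstart,
				  uint64_t *page_lockend)
{
	*page_lockstart = round_up_pow2(lockstart, SUBPAGE_PAGE_SIZE);
	*page_lockend = round_down_pow2(lockend + 1, SUBPAGE_PAGE_SIZE) - 1;
}
```

For the commit's example range [32K, 96K-1] the aligned range becomes empty, which is exactly why the infinite loop disappears.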

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/file.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 8f71699fdd18..45ec3f5ef839 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2471,6 +2471,16 @@ static int btrfs_punch_hole_lock_range(struct inode 
*inode,
   const u64 lockend,
   struct extent_state **cached_state)
 {
+   /*
+* For subpage case, if the range is not at page boundary, we could
+* have pages at the leading/tailing part of the range.
+* This could lead to dead loop since filemap_range_has_page()
+* will always return true.
+* So here we need to do extra page alignment for
+* filemap_range_has_page().
+*/
+   u64 page_lockstart = round_up(lockstart, PAGE_SIZE);
+   u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1;
while (1) {
struct btrfs_ordered_extent *ordered;
int ret;
@@ -2491,7 +2501,7 @@ static int btrfs_punch_hole_lock_range(struct inode 
*inode,
(ordered->file_offset + ordered->num_bytes <= lockstart ||
 ordered->file_offset > lockend)) &&
 !filemap_range_has_page(inode->i_mapping,
-lockstart, lockend)) {
+page_lockstart, page_lockend)) {
if (ordered)
btrfs_put_ordered_extent(ordered);
break;
-- 
2.31.1



[PATCH 33/42] btrfs: don't clear page extent mapped if we're not invalidating the full page

2021-04-14 Thread Qu Wenruo
[BUG]
With current btrfs subpage rw support, the following script can lead to
fs hang:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt

  fsstress -w -n 100 -p 1 -s 1608140256 -v -d $mnt

The fs will hang at btrfs_start_ordered_extent().

[CAUSE]
In the above test case, btrfs_invalidatepage() will be called with the following
parameters:
  offset = 0 length = 53248 page dirty = 1 subpage dirty bitmap = 0x2000

Since @offset is 0, btrfs_invalidatepage() will try to invalidate the full
page, and finally call clear_page_extent_mapped() which will detach the
btrfs subpage structure from the page.

And since the page no longer has the btrfs subpage structure, the subpage
dirty bitmap will be cleared, preventing the dirty range from being
written back, thus there is no way to wake up the ordered extent.

[FIX]
Follow other filesystems and only invalidate the page if the range covers
the full page.

There are cases like truncate_setsize() which can call
btrfs_invalidatepage() with offset == 0 and length != 0 for the last
page of an inode.

Although the old code will still try to invalidate the full page, we are
still safe to just wait for ordered extent to finish.
So it shouldn't cause extra problems.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 67c82de6b96a..e31a0521564e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8361,7 +8361,19 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
 */
wait_on_page_writeback(page);
 
-   if (offset) {
+   /*
+* For subpage case, we have call sites like
+* btrfs_punch_hole_lock_range() which passes range not aligned to
+* sectorsize.
+* If the range doesn't cover the full page, we don't need to and
+* shouldn't clear page extent mapped, as page->private can still
+* record subpage dirty bits for other part of the range.
+*
+* For cases that can invalidate the full page even though the range
+* doesn't cover the full page, like invalidating the last page, we're
+* still safe to wait for the ordered extent to finish.
+*/
+   if (!(offset == 0 && length == PAGE_SIZE)) {
btrfs_releasepage(page, GFP_NOFS);
return;
}
-- 
2.31.1



[PATCH 29/42] btrfs: make btrfs_truncate_block() to be subpage compatible

2021-04-14 Thread Qu Wenruo
btrfs_truncate_block() itself is already mostly subpage compatible; the
only missing part is the page dirtying code.

Currently if we have a sector that needs to be truncated, we set the
sector aligned range delalloc, then set the full page dirty.

The problem is, current subpage code requires subpage dirty bit to be
set, or __extent_writepage_io() won't submit bio, thus leads to ordered
extent never to finish.

So this patch makes btrfs_truncate_block() call the
btrfs_page_set_dirty() helper instead of set_page_dirty() to fix the
problem.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index da73fd51d232..38ebb79ee580 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4937,7 +4937,8 @@ int btrfs_truncate_block(struct btrfs_inode *inode, 
loff_t from, loff_t len,
kunmap(page);
}
ClearPageChecked(page);
-   set_page_dirty(page);
+   btrfs_page_set_dirty(fs_info, page, block_start,
+block_end + 1 - block_start);
unlock_extent_cached(io_tree, block_start, block_end, &cached_state);
 
if (only_release_metadata)
-- 
2.31.1



[PATCH 28/42] btrfs: add extra assert for submit_extent_page()

2021-04-14 Thread Qu Wenruo
There are already bugs exposed in __extent_writepage_io() where, due to
wrong alignment and lack of subpage support, we can pass an insane
pg_offset into submit_extent_page().

Add basic size check to ensure the combination of @size and @pg_offset
is sane.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index be825b73ee43..ae6357a6749e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3261,6 +3261,8 @@ static int submit_extent_page(unsigned int opf,
 
ASSERT(bio_ret);
 
+   ASSERT(pg_offset < PAGE_SIZE && size <= PAGE_SIZE &&
+  pg_offset + size <= PAGE_SIZE);
if (*bio_ret) {
bio = *bio_ret;
if (force_bio_submit ||
-- 
2.31.1



[PATCH 27/42] btrfs: make __extent_writepage_io() only submit dirty range for subpage

2021-04-14 Thread Qu Wenruo
The __extent_writepage_io() function originally just iterates through all
the extent maps of a page and submits any regular extents.

This is fine for the sectorsize == PAGE_SIZE case: if a page is dirty, we
need to submit the only sector contained in the page.

But for subpage case, one dirty page can contain several clean sectors
with at least one dirty sector.

If __extent_writepage_io() still submits all regular extent maps, it can
submit data which is already written to disk.
And since such already written data won't have corresponding ordered
extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().

Change the behavior of __extent_writepage_io() by finding the first
dirty byte in the page, and only submitting the dirty range rather than
the full extent.

While we're at it, also modify the following calls to be subpage
compatible:
- SetPageError()
- end_page_writeback()
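The bitmap scan described above can be modeled in userspace, assuming 16 4K sectors per 64K page and using __builtin_ffs() to stand in for the kernel's ffs()/ffz() (a sketch with hypothetical helper names, not the kernel implementation):

```c
#include <assert.h>
#include <stdint.h>

#define SECTORS_PER_PAGE 16	/* 64K page / 4K sector */

/*
 * Hypothetical sketch of find_next_dirty_byte()'s bitmap logic: given the
 * subpage dirty bitmap and the first sector of interest, return the next
 * dirty sector range [*first_dirty, *first_clean).  *first_dirty is set
 * to -1 when no dirty sector remains in the page.
 */
static void next_dirty_sector_range(uint16_t dirty_bitmap, int start_sector,
				    int *first_dirty, int *first_clean)
{
	uint16_t tmp;
	int set, zero;

	/* Ignore sectors below @start_sector */
	dirty_bitmap &= ~((1u << start_sector) - 1);

	set = __builtin_ffs(dirty_bitmap);	/* 1-based, 0 if none */
	if (set == 0) {
		*first_dirty = -1;
		*first_clean = -1;
		return;
	}
	*first_dirty = set - 1;

	/* Fill all bits below the first dirty one so the ffz() skips them */
	tmp = dirty_bitmap | ((1u << *first_dirty) - 1);
	zero = __builtin_ffs((uint16_t)~tmp);	/* ffz(tmp), 1-based */
	*first_clean = zero ? zero - 1 : SECTORS_PER_PAGE;
}
```

Multiplying the returned sector indices by the sectorsize and adding page_offset() gives the byte range that actually needs submission, which is the whole point of the patch: already-clean sectors are never resubmitted.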

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 100 ---
 1 file changed, 95 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c593071fa8c1..be825b73ee43 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3728,6 +3728,74 @@ static noinline_for_stack int writepage_delalloc(struct 
btrfs_inode *inode,
return 0;
 }
 
+/*
+ * To find the first byte we need to write.
+ *
+ * For subpage, one page can contain several sectors, and
+ * __extent_writepage_io() will just grab all extent maps in the page
+ * range and try to submit all non-inline/non-compressed extents.
+ *
+ * This is a big problem for subpage, we shouldn't re-submit already written
+ * data at all.
+ * This function will look up the subpage dirty bitmap to find which range
+ * we really need to submit.
+ *
+ * Return the next dirty range in [@start, @end).
+ * If no dirty range is found, @start will be page_offset(page) + PAGE_SIZE.
+ */
+static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
+struct page *page, u64 *start, u64 *end)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   u64 orig_start = *start;
+   u16 dirty_bitmap;
+   unsigned long flags;
+   int nbits = (orig_start - page_offset(page)) >> fs_info->sectorsize_bits;
+   int first_bit_set;
+   int first_bit_zero;
+
+   /*
+* For regular sector size == page size case, since one page only
+* contains one sector, we return the page offset directly.
+*/
+   if (fs_info->sectorsize == PAGE_SIZE) {
+   *start = page_offset(page);
+   *end = page_offset(page) + PAGE_SIZE;
+   return;
+   }
+
+   /* We should have the page locked, but just in case */
+   spin_lock_irqsave(&subpage->lock, flags);
+   dirty_bitmap = subpage->dirty_bitmap;
+   spin_unlock_irqrestore(&subpage->lock, flags);
+
+   /* Set bits lower than @nbits with 0 */
+   dirty_bitmap &= ~((1 << nbits) - 1);
+
+   first_bit_set = ffs(dirty_bitmap);
+   /* No dirty range found */
+   if (first_bit_set == 0) {
+   *start = page_offset(page) + PAGE_SIZE;
+   return;
+   }
+
+   ASSERT(first_bit_set > 0 && first_bit_set <= BTRFS_SUBPAGE_BITMAP_SIZE);
+   *start = page_offset(page) + (first_bit_set - 1) * fs_info->sectorsize;
+
+   /* Set all bits lower than @nbits to 1 for ffz() */
+   dirty_bitmap |= ((1 << nbits) - 1);
+
+   first_bit_zero = ffz(dirty_bitmap);
+   if (first_bit_zero == 0 || first_bit_zero > BTRFS_SUBPAGE_BITMAP_SIZE) {
+   *end = page_offset(page) + PAGE_SIZE;
+   return;
+   }
+   ASSERT(first_bit_zero > 0 &&
+  first_bit_zero <= BTRFS_SUBPAGE_BITMAP_SIZE);
+   *end = page_offset(page) + first_bit_zero * fs_info->sectorsize;
+   ASSERT(*end > *start);
+}
+
 /*
  * helper for __extent_writepage.  This calls the writepage start hooks,
  * and does the loop to map the page into extents and bios.
@@ -3775,6 +3843,8 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
while (cur <= end) {
u64 disk_bytenr;
u64 em_end;
+   u64 dirty_range_start = cur;
+   u64 dirty_range_end;
u32 iosize;
 
if (cur >= i_size) {
@@ -3782,9 +3852,17 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
 end, 1);
break;
}
+
+   find_next_dirty_byte(fs_info, page, &dirty_range_start,
+&dirty_range_end);
+   if (cur < dirty_range_start) {
+   cur = dirty_range_start;
+   continue;
+   

[PATCH 26/42] btrfs: make btrfs_set_range_writeback() subpage compatible

2021-04-14 Thread Qu Wenruo
Function btrfs_set_range_writeback() currently just sets page writeback
unconditionally.

Change it to call the subpage helper so that we can handle both cases
well.

Since the subpage helpers need btrfs_fs_info, also change the parameter
to accept btrfs_inode.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/extent_io.c |  3 +--
 fs/btrfs/inode.c | 12 
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 903fdcb6ecd0..f8d1e495deda 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3136,7 +3136,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t 
size, struct bio *bio,
 unsigned long bio_flags);
 bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio,
  unsigned int size);
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 
end);
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end);
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf);
 int btrfs_readpage(struct file *file, struct page *page);
 void btrfs_evict_inode(struct inode *inode);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7dc1b367bf35..c593071fa8c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3745,7 +3745,6 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
 int *nr_ret)
 {
struct btrfs_fs_info *fs_info = inode->root->fs_info;
-   struct extent_io_tree *tree = &inode->io_tree;
u64 start = page_offset(page);
u64 end = start + PAGE_SIZE - 1;
u64 cur = start;
@@ -3824,7 +3823,7 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
continue;
}
 
-   btrfs_set_range_writeback(tree, cur, cur + iosize - 1);
+   btrfs_set_range_writeback(inode, cur, cur + iosize - 1);
if (!PageWriteback(page)) {
btrfs_err(inode->root->fs_info,
   "page %lu not writeback, cur %llu end %llu",
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 566431d7b257..da73fd51d232 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10195,17 +10195,21 @@ static int btrfs_tmpfile(struct user_namespace 
*mnt_userns, struct inode *dir,
return ret;
 }
 
-void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
+void btrfs_set_range_writeback(struct btrfs_inode *inode, u64 start, u64 end)
 {
-   struct inode *inode = tree->private_data;
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
unsigned long index = start >> PAGE_SHIFT;
unsigned long end_index = end >> PAGE_SHIFT;
struct page *page;
+   u32 len;
 
+   ASSERT(end + 1 - start <= U32_MAX);
+   len = end + 1 - start;
while (index <= end_index) {
-   page = find_get_page(inode->i_mapping, index);
+   page = find_get_page(inode->vfs_inode.i_mapping, index);
ASSERT(page); /* Pages should be in the extent_io_tree */
-   set_page_writeback(page);
+
+   btrfs_page_set_writeback(fs_info, page, start, len);
put_page(page);
index++;
}
-- 
2.31.1



[PATCH 23/42] btrfs: make page Ordered bit to be subpage compatible

2021-04-14 Thread Qu Wenruo
This involves the following modifications:
- Ordered extent creation
  This is done in process_one_page(); now PAGE_SET_ORDERED will call the
  subpage helper to do the work.

- endio functions
  This is done in btrfs_mark_ordered_io_finished().

- btrfs_invalidatepage()

Now the usage of the page Ordered flag for ordered extent accounting is
fully subpage compatible.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c|  2 +-
 fs/btrfs/inode.c| 14 ++
 fs/btrfs/ordered-data.c |  5 +++--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 876b7f655df7..cc73fd3c840c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1824,7 +1824,7 @@ static int process_one_page(struct btrfs_fs_info *fs_info,
len = end + 1 - start;
 
if (page_ops & PAGE_SET_ORDERED)
-   SetPageOrdered(page);
+   btrfs_page_clamp_set_ordered(fs_info, page, start, len);
 
if (page == locked_page)
return 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 03f9139b391a..f366dc2fb1ff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -51,6 +51,7 @@
 #include "block-group.h"
 #include "space-info.h"
 #include "zoned.h"
+#include "subpage.h"
 
 struct btrfs_iget_args {
u64 ino;
@@ -170,7 +171,8 @@ static inline void btrfs_cleanup_ordered_extents(struct 
btrfs_inode *inode,
index++;
if (!page)
continue;
-   ClearPageOrdered(page);
+   btrfs_page_clear_ordered(inode->root->fs_info, page,
+page_offset(page), PAGE_SIZE);
put_page(page);
}
 
@@ -8320,12 +8322,13 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
 unsigned int length)
 {
struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
+   struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_io_tree *tree = &inode->io_tree;
struct extent_state *cached_state = NULL;
u64 page_start = page_offset(page);
u64 page_end = page_start + PAGE_SIZE - 1;
u64 cur;
-   u32 sectorsize = inode->root->fs_info->sectorsize;
+   u32 sectorsize = fs_info->sectorsize;
int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
 
/*
@@ -8356,6 +8359,7 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
struct btrfs_ordered_extent *ordered;
bool delete_states = false;
u64 range_end;
+   u32 range_len;
 
/*
 * Here we can't pass "file_offset = cur" and
@@ -8382,7 +8386,9 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
 
range_end = min(ordered->file_offset + ordered->num_bytes - 1,
page_end);
-   if (!PageOrdered(page)) {
+   ASSERT(range_end + 1 - cur < U32_MAX);
+   range_len = range_end + 1 - cur;
+   if (!btrfs_page_test_ordered(fs_info, page, cur, range_len)) {
/*
 * If Ordered (Private2) is cleared, it means endio has
 * already been executed for the range.
@@ -8392,7 +8398,7 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
delete_states = false;
goto next;
}
-   ClearPageOrdered(page);
+   btrfs_page_clear_ordered(fs_info, page, cur, range_len);
 
/*
 * IO on this page will never be started, so we need to account
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 3e782145247e..03853e7494f7 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -16,6 +16,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "qgroup.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -402,11 +403,11 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode 
*inode,
 *
 * If no such bit, we need to skip to next range.
 */
-   if (!PageOrdered(page)) {
+   if (!btrfs_page_test_ordered(fs_info, page, cur, len)) {
cur += len;
continue;
}
-   ClearPageOrdered(page);
+   btrfs_page_clear_ordered(fs_info, page, cur, len);
}
 
/* Now we're fine to update the accounting */
-- 
2.31.1



[PATCH 22/42] btrfs: introduce helpers for subpage ordered status

2021-04-14 Thread Qu Wenruo
This patch introduces the following functions to handle btrfs subpage
ordered (private2) status:
- btrfs_subpage_set_ordered()
- btrfs_subpage_clear_ordered()
- btrfs_subpage_test_ordered()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_ordered()
- btrfs_page_clear_ordered()
- btrfs_page_test_ordered()
  Those helpers can handle both regular sector size and subpage without
  problem.

Those functions are here to co-ordinate btrfs_invalidatepage() with
btrfs_writepage_endio_finish_ordered(), to make sure only one of those
functions can finish the ordered extent.
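The coordination described above can be modeled as a small userspace sketch (hypothetical struct and helper names; it only mirrors how the per-sector bitmap drives the page-level Ordered flag, under the 4K-sector assumption):

```c
#include <assert.h>
#include <stdint.h>

#define SECTORSIZE 4096u

/*
 * Hypothetical model of the subpage ordered bitmap: the page-level Ordered
 * flag stays set as long as any sector is still ordered, and is cleared
 * only when the last ordered sector is cleared.  This is what keeps
 * btrfs_invalidatepage() and btrfs_writepage_endio_finish_ordered() from
 * both finishing the same ordered extent.
 */
struct subpage_model {
	uint16_t ordered_bitmap;
	int page_ordered;	/* models PageOrdered() */
};

/* Build the bitmap mask for a sector-aligned range inside the page */
static uint16_t calc_bitmap(uint32_t start_in_page, uint32_t len)
{
	int first = start_in_page / SECTORSIZE;
	int nbits = len / SECTORSIZE;

	return (uint16_t)(((1u << nbits) - 1) << first);
}

static void subpage_set_ordered(struct subpage_model *sp,
				uint32_t start_in_page, uint32_t len)
{
	sp->ordered_bitmap |= calc_bitmap(start_in_page, len);
	sp->page_ordered = 1;
}

static void subpage_clear_ordered(struct subpage_model *sp,
				  uint32_t start_in_page, uint32_t len)
{
	sp->ordered_bitmap &= ~calc_bitmap(start_in_page, len);
	if (sp->ordered_bitmap == 0)
		sp->page_ordered = 0;
}
```

Clearing one sector leaves the page Ordered until the last ordered sector is cleared, so only one caller observes the final transition.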

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/subpage.c | 29 +
 fs/btrfs/subpage.h |  4 
 2 files changed, 33 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index f728e5009487..516e0b3f2ed9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -429,6 +429,32 @@ void btrfs_subpage_clear_writeback(const struct 
btrfs_fs_info *fs_info,
spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+   unsigned long flags;
+
+   spin_lock_irqsave(&subpage->lock, flags);
+   subpage->ordered_bitmap |= tmp;
+   SetPageOrdered(page);
+   spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+void btrfs_subpage_clear_ordered(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+   unsigned long flags;
+
+   spin_lock_irqsave(&subpage->lock, flags);
+   subpage->ordered_bitmap &= ~tmp;
+   if (subpage->ordered_bitmap == 0)
+   ClearPageOrdered(page);
+   spin_unlock_irqrestore(&subpage->lock, flags);
+}
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -451,6 +477,7 @@ IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(ordered);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -519,3 +546,5 @@ IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, 
clear_page_dirty_for_io,
 PageDirty);
 IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
 PageWriteback);
+IMPLEMENT_BTRFS_PAGE_OPS(ordered, SetPageOrdered, ClearPageOrdered,
+PageOrdered);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 9d087ab3244e..3419b152c00f 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -34,6 +34,9 @@ struct btrfs_subpage {
struct {
atomic_t readers;
atomic_t writers;
+
+   /* If a sector has pending ordered extent */
+   u16 ordered_bitmap;
};
};
 };
@@ -111,6 +114,7 @@ DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
 DECLARE_BTRFS_SUBPAGE_OPS(dirty);
 DECLARE_BTRFS_SUBPAGE_OPS(writeback);
+DECLARE_BTRFS_SUBPAGE_OPS(ordered);
 
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
struct page *page, u64 start, u32 len);
-- 
2.31.1



[PATCH 25/42] btrfs: prevent extent_clear_unlock_delalloc() to unlock page not locked by __process_pages_contig()

2021-04-14 Thread Qu Wenruo
In cow_file_range(), after we have succeeded creating an inline extent,
we unlock the page with extent_clear_unlock_delalloc() by passing
locked_page == NULL.

For the sectorsize == PAGE_SIZE case, this just makes the page lock/unlock
pairing harder to grasp.

But for incoming subpage case, it can be a big problem.

For the incoming subpage case, page locking has two entrances:
- __process_pages_contig()
  In that case, we know exactly the range we want to lock (which only
  requires sector alignment).
  To handle the subpage requirement, we introduce btrfs_subpage::writers
  to page::private, and will update it in __process_pages_contig().

- Other directly lock/unlock_page() call sites
  Those won't touch btrfs_subpage::writers at all.

This means a page locked by __process_pages_contig() can only be unlocked
by __process_pages_contig().
Thankfully we already have the existing infrastructure in the form of
@locked_page in various call sites.

Unfortunately, the extent_clear_unlock_delalloc() call in cow_file_range()
after creating an inline extent is the exception.
It intentionally passes locked_page == NULL, to also unlock the current
page (and clear its dirty/writeback bits).

To cooperate with the incoming subpage modifications, and to make the
page lock/unlock pair easier to understand, this patch will still call
extent_clear_unlock_delalloc() with locked_page, and only unlock the
page in __extent_writepage().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f366dc2fb1ff..566431d7b257 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1072,7 +1072,8 @@ static noinline int cow_file_range(struct btrfs_inode 
*inode,
 * our outstanding extent for clearing delalloc for this
 * range.
 */
-   extent_clear_unlock_delalloc(inode, start, end, NULL,
+   extent_clear_unlock_delalloc(inode, start, end,
+locked_page,
 EXTENT_LOCKED | EXTENT_DELALLOC |
 EXTENT_DELALLOC_NEW | EXTENT_DEFRAG |
 EXTENT_DO_ACCOUNTING, PAGE_UNLOCK |
@@ -1080,6 +1081,19 @@ static noinline int cow_file_range(struct btrfs_inode 
*inode,
*nr_written = *nr_written +
 (end - start + PAGE_SIZE) / PAGE_SIZE;
*page_started = 1;
+   /*
+* locked_page is locked by the caller of
+* writepage_delalloc(), not locked by
+* __process_pages_contig().
+*
+* We can't let __process_pages_contig() to unlock it,
+* as it doesn't have any subpage::writers recorded.
+*
+* Here we manually unlock the page, since the caller
+* can't use page_started to determine if it's an
+* inline extent or a compressed extent.
+*/
+   unlock_page(locked_page);
goto out;
} else if (ret < 0) {
goto out_unlock;
-- 
2.31.1



[PATCH 24/42] btrfs: update locked page dirty/writeback/error bits in __process_pages_contig

2021-04-14 Thread Qu Wenruo
When __process_pages_contig() gets called for
extent_clear_unlock_delalloc(), if we hit the locked page, only Private2
bit is updated, but dirty/writeback/error bits are all skipped.

There are several call sites that call extent_clear_unlock_delalloc()
with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:

- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
  All for their error handling branches.

For those call sites, since we skip the locked page for
dirty/error/writeback bit update, the locked page will still have its
subpage dirty bit remaining.

Normally it's the responsibility of the call site which locked the page
to handle it, but it won't hurt if we also do the update.

Especially there are already other call sites doing the same thing by
manually passing NULL as locked_page.

Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cc73fd3c840c..7dc1b367bf35 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1825,10 +1825,6 @@ static int process_one_page(struct btrfs_fs_info 
*fs_info,
 
if (page_ops & PAGE_SET_ORDERED)
btrfs_page_clamp_set_ordered(fs_info, page, start, len);
-
-   if (page == locked_page)
-   return 1;
-
if (page_ops & PAGE_SET_ERROR)
btrfs_page_clamp_set_error(fs_info, page, start, len);
if (page_ops & PAGE_START_WRITEBACK) {
@@ -1837,6 +1833,10 @@ static int process_one_page(struct btrfs_fs_info 
*fs_info,
}
if (page_ops & PAGE_END_WRITEBACK)
btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
+
+   if (page == locked_page)
+   return 1;
+
if (page_ops & PAGE_LOCK) {
int ret;
 
-- 
2.31.1



[PATCH 21/42] btrfs: make process_one_page() to handle subpage locking

2021-04-14 Thread Qu Wenruo
Introduce a new subpage member specific to data inodes, writers, to
record how many sectors are under page lock for delalloc writing.

This member acts pretty much the same as readers, except it's only for
delalloc writes.

This is important for the delalloc code to trace which pages can really
be freed, as we have cases like run_delalloc_nocow() where we may stop
processing a nocow range inside a page and need to fall back to COW
halfway through.
In that case, we need a way to determine if we can really unlock a full
page.

With the new btrfs_subpage::writers, there is a new requirement:
- Page locked by process_one_page() must be unlocked by
  process_one_page()
  There are still tons of call sites that manually lock and unlock a page,
  without updating btrfs_subpage::writers.
  So if we lock a page through process_one_page() then it must be
  unlocked by process_one_page() to keep btrfs_subpage::writers
  consistent.

  This will be handled in next patch.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 10 +++--
 fs/btrfs/subpage.c   | 89 ++--
 fs/btrfs/subpage.h   | 10 +
 3 files changed, 94 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index da2d4494c5c1..876b7f655df7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1838,14 +1838,18 @@ static int process_one_page(struct btrfs_fs_info 
*fs_info,
if (page_ops & PAGE_END_WRITEBACK)
btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
if (page_ops & PAGE_LOCK) {
-   lock_page(page);
+   int ret;
+
+   ret = btrfs_page_start_writer_lock(fs_info, page, start, len);
+   if (ret)
+   return ret;
if (!PageDirty(page) || page->mapping != mapping) {
-   unlock_page(page);
+   btrfs_page_end_writer_lock(fs_info, page, start, len);
return -EAGAIN;
}
}
if (page_ops & PAGE_UNLOCK)
-   unlock_page(page);
+   btrfs_page_end_writer_lock(fs_info, page, start, len);
return 0;
 }
 
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index a6cf1776f3f9..f728e5009487 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -110,10 +110,12 @@ int btrfs_alloc_subpage(const struct btrfs_fs_info 
*fs_info,
if (!*ret)
return -ENOMEM;
spin_lock_init(&(*ret)->lock);
-   if (type == BTRFS_SUBPAGE_METADATA)
+   if (type == BTRFS_SUBPAGE_METADATA) {
atomic_set(&(*ret)->eb_refs, 0);
-   else
+   } else {
atomic_set(&(*ret)->readers, 0);
+   atomic_set(&(*ret)->writers, 0);
+   }
return 0;
 }
 
@@ -203,6 +205,79 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info 
*fs_info,
unlock_page(page);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+   u64 orig_start = *start;
+   u32 orig_len = *len;
+
+   *start = max_t(u64, page_offset(page), orig_start);
+   *len = min_t(u64, page_offset(page) + PAGE_SIZE,
+orig_start + orig_len) - *start;
+}
+
+void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   int nbits = len >> fs_info->sectorsize_bits;
+   int ret;
+
+   btrfs_subpage_assert(fs_info, page, start, len);
+
+   ASSERT(atomic_read(&subpage->readers) == 0);
+   ret = atomic_add_return(nbits, &subpage->writers);
+   ASSERT(ret == nbits);
+}
+
+bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   int nbits = len >> fs_info->sectorsize_bits;
+
+   btrfs_subpage_assert(fs_info, page, start, len);
+
+   ASSERT(atomic_read(&subpage->writers) >= nbits);
+   return atomic_sub_and_test(nbits, &subpage->writers);
+}
+
+/*
+ * To lock a page for delalloc page writeback.
+ *
+ * Return -EAGAIN if the page is not properly initialized.
+ * Return 0 with the page locked, and writer counter updated.
+ *
+ * Even with 0 returned, the page still need extra check to make sure
+ * it's really the correct page, as the caller is using
+ * find_get_pages_contig(), which can race with page invalidating.
+ */
+int btrfs_page_start_writer_lock(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {
+   lock_page(page);
+   return 0;
+   }
+   lock_page(page);
+ 

[PATCH 20/42] btrfs: make end_bio_extent_writepage() to be subpage compatible

2021-04-14 Thread Qu Wenruo
Now in end_bio_extent_writepage(), the only subpage-incompatible code is
the end_page_writeback() call.

Just call the subpage helper instead.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f40e229960d7..da2d4494c5c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2808,7 +2808,8 @@ static void end_bio_extent_writepage(struct bio *bio)
}
 
end_extent_writepage(page, error, start, end);
-   end_page_writeback(page);
+
+   btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
}
 
bio_put(bio);
-- 
2.31.1



[PATCH 19/42] btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status

2021-04-14 Thread Qu Wenruo
To make __process_pages_contig() and process_one_page() handle subpage,
we only need to pass the bytenr in and call the subpage helpers for the
dirty/error/writeback status updates.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 94f8b3ffe6a7..f40e229960d7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1813,10 +1813,16 @@ bool btrfs_find_delalloc_range(struct extent_io_tree 
*tree, u64 *start,
  * Return -EGAIN if the we need to try again.
  * (For PAGE_LOCK case but got dirty page or page not belong to mapping)
  */
-static int process_one_page(struct address_space *mapping,
+static int process_one_page(struct btrfs_fs_info *fs_info,
+   struct address_space *mapping,
struct page *page, struct page *locked_page,
-   unsigned long page_ops)
+   unsigned long page_ops, u64 start, u64 end)
 {
+   u32 len;
+
+   ASSERT(end + 1 - start != 0 && end + 1 - start < U32_MAX);
+   len = end + 1 - start;
+
if (page_ops & PAGE_SET_ORDERED)
SetPageOrdered(page);
 
@@ -1824,13 +1830,13 @@ static int process_one_page(struct address_space 
*mapping,
return 1;
 
if (page_ops & PAGE_SET_ERROR)
-   SetPageError(page);
+   btrfs_page_clamp_set_error(fs_info, page, start, len);
if (page_ops & PAGE_START_WRITEBACK) {
-   clear_page_dirty_for_io(page);
-   set_page_writeback(page);
+   btrfs_page_clamp_clear_dirty(fs_info, page, start, len);
+   btrfs_page_clamp_set_writeback(fs_info, page, start, len);
}
if (page_ops & PAGE_END_WRITEBACK)
-   end_page_writeback(page);
+   btrfs_page_clamp_clear_writeback(fs_info, page, start, len);
if (page_ops & PAGE_LOCK) {
lock_page(page);
if (!PageDirty(page) || page->mapping != mapping) {
@@ -1848,6 +1854,7 @@ static int __process_pages_contig(struct address_space 
*mapping,
  u64 start, u64 end, unsigned long page_ops,
  u64 *processed_end)
 {
+   struct btrfs_fs_info *fs_info = btrfs_sb(mapping->host->i_sb);
pgoff_t start_index = start >> PAGE_SHIFT;
pgoff_t end_index = end >> PAGE_SHIFT;
pgoff_t index = start_index;
@@ -1884,8 +1891,9 @@ static int __process_pages_contig(struct address_space 
*mapping,
for (i = 0; i < found_pages; i++) {
int process_ret;
 
-   process_ret = process_one_page(mapping, pages[i],
-   locked_page, page_ops);
+   process_ret = process_one_page(fs_info, mapping,
+   pages[i], locked_page, page_ops,
+   start, end);
if (process_ret < 0) {
for (; i < found_pages; i++)
put_page(pages[i]);
-- 
2.31.1



[PATCH 18/42] btrfs: make btrfs_dirty_pages() to be subpage compatible

2021-04-14 Thread Qu Wenruo
Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status updates use the
subpage helpers.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/file.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 864c08d08a35..8f71699fdd18 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -28,6 +28,7 @@
 #include "compression.h"
 #include "delalloc-space.h"
 #include "reflink.h"
+#include "subpage.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -482,6 +483,7 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct 
page **pages,
start_pos = round_down(pos, fs_info->sectorsize);
num_bytes = round_up(write_bytes + pos - start_pos,
 fs_info->sectorsize);
+   ASSERT(num_bytes <= U32_MAX);
 
end_of_last_block = start_pos + num_bytes - 1;
 
@@ -500,9 +502,10 @@ int btrfs_dirty_pages(struct btrfs_inode *inode, struct 
page **pages,
 
for (i = 0; i < num_pages; i++) {
struct page *p = pages[i];
-   SetPageUptodate(p);
+
+   btrfs_page_clamp_set_uptodate(fs_info, p, start_pos, num_bytes);
ClearPageChecked(p);
-   set_page_dirty(p);
+   btrfs_page_clamp_set_dirty(fs_info, p, start_pos, num_bytes);
}
 
/*
-- 
2.31.1



[PATCH 17/42] btrfs: only require sector size alignment for end_bio_extent_writepage()

2021-04-14 Thread Qu Wenruo
Just like the read path, for subpage support we only require sector size
alignment.

So change the error message condition to only require sector alignment.

This should not affect existing code, as for the regular sectorsize ==
PAGE_SIZE case we are still requiring page alignment.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 29 -
 1 file changed, 12 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 53ac22e3560f..94f8b3ffe6a7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2779,25 +2779,20 @@ static void end_bio_extent_writepage(struct bio *bio)
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+   const u32 sectorsize = fs_info->sectorsize;
 
-   /* We always issue full-page reads, but if some block
-* in a page fails to read, blk_update_request() will
-* advance bv_offset and adjust bv_len to compensate.
-* Print a warning for nonzero offsets, and an error
-* if they don't add up to a full page.  */
-   if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
-   if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
-   btrfs_err(fs_info,
-  "partial page write in btrfs with offset %u 
and length %u",
-   bvec->bv_offset, bvec->bv_len);
-   else
-   btrfs_info(fs_info,
-  "incomplete page write in btrfs with offset 
%u and length %u",
-   bvec->bv_offset, bvec->bv_len);
-   }
+   /* Btrfs read write should always be sector aligned. */
+   if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+   btrfs_err(fs_info,
+   "partial page write in btrfs with offset %u and length %u",
+ bvec->bv_offset, bvec->bv_len);
+   else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+   btrfs_info(fs_info,
+   "incomplete page write with offset %u and length %u",
+  bvec->bv_offset, bvec->bv_len);
 
-   start = page_offset(page);
-   end = start + bvec->bv_offset + bvec->bv_len - 1;
+   start = page_offset(page) + bvec->bv_offset;
+   end = start + bvec->bv_len - 1;
 
if (first_bvec) {
btrfs_record_physical_zoned(inode, start, bio);
-- 
2.31.1



[PATCH 14/42] btrfs: pass bytenr directly to __process_pages_contig()

2021-04-14 Thread Qu Wenruo
As a preparation for incoming subpage support, we need bytenr passed to
__process_pages_contig() directly, not the current page index.

So change the parameter and all callers to pass bytenr in.

With this modification, we need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.

Normally we follow the inclusive return value, meaning @processed_end
should be the last byte we processed.

If parameter @start is 0, and we failed to lock any page, then we would
return @processed_end as -1, causing more problems for
__unlock_for_delalloc().

So here for @processed_end, we use two different return value patterns.
If we have locked any page, @processed_end will be the last byte of
locked page.
Or it will be @start otherwise.

This change will impact lock_delalloc_pages(), so it needs to check
@processed_end to only unlock the range if we have locked any.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 57 
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ac01f29b00c9..ff24db8513b4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1807,8 +1807,8 @@ bool btrfs_find_delalloc_range(struct extent_io_tree 
*tree, u64 *start,
 
 static int __process_pages_contig(struct address_space *mapping,
  struct page *locked_page,
- pgoff_t start_index, pgoff_t end_index,
- unsigned long page_ops, pgoff_t *index_ret);
+ u64 start, u64 end, unsigned long page_ops,
+ u64 *processed_end);
 
 static noinline void __unlock_for_delalloc(struct inode *inode,
   struct page *locked_page,
@@ -1821,7 +1821,7 @@ static noinline void __unlock_for_delalloc(struct inode 
*inode,
if (index == locked_page->index && end_index == index)
return;
 
-   __process_pages_contig(inode->i_mapping, locked_page, index, end_index,
+   __process_pages_contig(inode->i_mapping, locked_page, start, end,
   PAGE_UNLOCK, NULL);
 }
 
@@ -1831,19 +1831,19 @@ static noinline int lock_delalloc_pages(struct inode 
*inode,
u64 delalloc_end)
 {
unsigned long index = delalloc_start >> PAGE_SHIFT;
-   unsigned long index_ret = index;
unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+   u64 processed_end = delalloc_start;
int ret;
 
ASSERT(locked_page);
if (index == locked_page->index && index == end_index)
return 0;
 
-   ret = __process_pages_contig(inode->i_mapping, locked_page, index,
-end_index, PAGE_LOCK, &index_ret);
-   if (ret == -EAGAIN)
+   ret = __process_pages_contig(inode->i_mapping, locked_page, 
delalloc_start,
+delalloc_end, PAGE_LOCK, &processed_end);
+   if (ret == -EAGAIN && processed_end > delalloc_start)
__unlock_for_delalloc(inode, locked_page, delalloc_start,
- (u64)index_ret << PAGE_SHIFT);
+ processed_end);
return ret;
 }
 
@@ -1938,12 +1938,14 @@ noinline_for_stack bool find_lock_delalloc_range(struct 
inode *inode,
 
 static int __process_pages_contig(struct address_space *mapping,
  struct page *locked_page,
- pgoff_t start_index, pgoff_t end_index,
- unsigned long page_ops, pgoff_t *index_ret)
+ u64 start, u64 end, unsigned long page_ops,
+ u64 *processed_end)
 {
+   pgoff_t start_index = start >> PAGE_SHIFT;
+   pgoff_t end_index = end >> PAGE_SHIFT;
+   pgoff_t index = start_index;
unsigned long nr_pages = end_index - start_index + 1;
unsigned long pages_processed = 0;
-   pgoff_t index = start_index;
struct page *pages[16];
unsigned ret;
int err = 0;
@@ -1951,17 +1953,19 @@ static int __process_pages_contig(struct address_space 
*mapping,
 
if (page_ops & PAGE_LOCK) {
ASSERT(page_ops == PAGE_LOCK);
-   ASSERT(index_ret && *index_ret == start_index);
+   ASSERT(processed_end && *processed_end == start);
}
 
if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
mapping_set_error(mapping, -EIO);
 
while (nr_pages > 0) {
-   ret = find_get_pages_contig(mapping, index,
+   int found_pages;
+
+   found_pages = find_get_pages_contig(mapping, index,
 

[PATCH 16/42] btrfs: provide btrfs_page_clamp_*() helpers

2021-04-14 Thread Qu Wenruo
In the coming subpage RW support, there are a lot of page status update
calls which need to be converted to a subpage-compatible version, which
requires @start and @len.

Some call sites already have such @start/@len and are already in
page range, like various endio functions.

But there are also call sites which need to clamp the range for the
subpage case, like btrfs_dirty_pages() and __process_pages_contig().

Here we introduce new helpers, btrfs_page_clamp_*(), to do, and only do,
the clamp for the subpage version.

Although in theory all existing btrfs_page_*() calls could be converted
to use btrfs_page_clamp_*() directly, that would make us do unnecessary
clamp operations.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/subpage.c | 38 ++
 fs/btrfs/subpage.h | 10 ++
 2 files changed, 48 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 2d19089ab625..a6cf1776f3f9 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -354,6 +354,16 @@ void btrfs_subpage_clear_writeback(const struct 
btrfs_fs_info *fs_info,
spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
+{
+   u64 orig_start = *start;
+   u32 orig_len = *len;
+
+   *start = max_t(u64, page_offset(page), orig_start);
+   *len = min_t(u64, page_offset(page) + PAGE_SIZE,
+orig_start + orig_len) - *start;
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -408,6 +418,34 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info 
*fs_info,   \
if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) \
return test_page_func(page);\
return btrfs_subpage_test_##name(fs_info, page, start, len);\
+}  \
+void btrfs_page_clamp_set_##name(const struct btrfs_fs_info *fs_info,  \
+   struct page *page, u64 start, u32 len)  \
+{  \
+   if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {   \
+   set_page_func(page);\
+   return; \
+   }   \
+   btrfs_subpage_clamp_range(page, &start, &len);  \
+   btrfs_subpage_set_##name(fs_info, page, start, len);\
+}  \
+void btrfs_page_clamp_clear_##name(const struct btrfs_fs_info *fs_info, \
+   struct page *page, u64 start, u32 len)  \
+{  \
+   if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) {   \
+   clear_page_func(page);  \
+   return; \
+   }   \
+   btrfs_subpage_clamp_range(page, &start, &len);  \
+   btrfs_subpage_clear_##name(fs_info, page, start, len);  \
+}  \
+bool btrfs_page_clamp_test_##name(const struct btrfs_fs_info *fs_info, \
+   struct page *page, u64 start, u32 len)  \
+{  \
+   if (unlikely(!fs_info) || fs_info->sectorsize == PAGE_SIZE) \
+   return test_page_func(page);\
+   btrfs_subpage_clamp_range(page, &start, &len);  \
+   return btrfs_subpage_test_##name(fs_info, page, start, len);\
 }
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 PageUptodate);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index bfd626e955be..291cb1932f27 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -72,6 +72,10 @@ void btrfs_subpage_end_reader(const struct btrfs_fs_info 
*fs_info,
  * btrfs_page_*() are for call sites where the page can either be subpage
  * specific or regular page. The function will handle both cases.
  * But the range still needs to be inside the page.
+ *
+ * btrfs_page_clamp_*() are similar to btrfs_page_*(), except the range doesn't
+ * need to be inside the page. Those functions will truncate the range
+ * automatically.
  */
 #define DECLARE_BTRFS_SUBPAGE_OPS(name)
\
 void btrfs_subpage_set_##name(const struct btrfs_fs_info *fs_info, \
@@ -85,6 +89,12 @@ void btrfs_page_set_##name(const struct btrfs_fs_info 
*fs_info,  

[PATCH 15/42] btrfs: refactor the page status update into process_one_page()

2021-04-14 Thread Qu Wenruo
In __process_pages_contig() we update page status according to page_ops.

That update process is a bunch of if () {} branches, which lies inside
two loops, this makes it pretty hard to expand for later subpage
operations.

So this patch will extract these operations into their own function,
process_one_page().

Also, since we're refactoring __process_pages_contig(), move the new
helper and __process_pages_contig() before their first caller, to remove
the forward declaration.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 206 +++
 1 file changed, 109 insertions(+), 97 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ff24db8513b4..53ac22e3560f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1805,10 +1805,118 @@ bool btrfs_find_delalloc_range(struct extent_io_tree 
*tree, u64 *start,
return found;
 }
 
+/*
+ * Process one page for __process_pages_contig().
+ *
+ * Return >0 if we hit @page == @locked_page.
+ * Return 0 if we updated the page status.
+ * Return -EGAIN if the we need to try again.
+ * (For PAGE_LOCK case but got dirty page or page not belong to mapping)
+ */
+static int process_one_page(struct address_space *mapping,
+   struct page *page, struct page *locked_page,
+   unsigned long page_ops)
+{
+   if (page_ops & PAGE_SET_ORDERED)
+   SetPageOrdered(page);
+
+   if (page == locked_page)
+   return 1;
+
+   if (page_ops & PAGE_SET_ERROR)
+   SetPageError(page);
+   if (page_ops & PAGE_START_WRITEBACK) {
+   clear_page_dirty_for_io(page);
+   set_page_writeback(page);
+   }
+   if (page_ops & PAGE_END_WRITEBACK)
+   end_page_writeback(page);
+   if (page_ops & PAGE_LOCK) {
+   lock_page(page);
+   if (!PageDirty(page) || page->mapping != mapping) {
+   unlock_page(page);
+   return -EAGAIN;
+   }
+   }
+   if (page_ops & PAGE_UNLOCK)
+   unlock_page(page);
+   return 0;
+}
+
 static int __process_pages_contig(struct address_space *mapping,
  struct page *locked_page,
  u64 start, u64 end, unsigned long page_ops,
- u64 *processed_end);
+ u64 *processed_end)
+{
+   pgoff_t start_index = start >> PAGE_SHIFT;
+   pgoff_t end_index = end >> PAGE_SHIFT;
+   pgoff_t index = start_index;
+   unsigned long nr_pages = end_index - start_index + 1;
+   unsigned long pages_processed = 0;
+   struct page *pages[16];
+   int err = 0;
+   int i;
+
+   if (page_ops & PAGE_LOCK) {
+   ASSERT(page_ops == PAGE_LOCK);
+   ASSERT(processed_end && *processed_end == start);
+   }
+
+   if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
+   mapping_set_error(mapping, -EIO);
+
+   while (nr_pages > 0) {
+   int found_pages;
+
+   found_pages = find_get_pages_contig(mapping, index,
+min_t(unsigned long,
+nr_pages, ARRAY_SIZE(pages)), pages);
+   if (found_pages == 0) {
+   /*
+* Only if we're going to lock these pages,
+* can we find nothing at @index.
+*/
+   ASSERT(page_ops & PAGE_LOCK);
+   err = -EAGAIN;
+   goto out;
+   }
+
+   for (i = 0; i < found_pages; i++) {
+   int process_ret;
+
+   process_ret = process_one_page(mapping, pages[i],
+   locked_page, page_ops);
+   if (process_ret < 0) {
+   for (; i < found_pages; i++)
+   put_page(pages[i]);
+   err = -EAGAIN;
+   goto out;
+   }
+   put_page(pages[i]);
+   pages_processed++;
+   }
+   nr_pages -= found_pages;
+   index += found_pages;
+   cond_resched();
+   }
+out:
+   if (err && processed_end) {
+   /*
+* Update @processed_end. I know this is awful since it has
+* two different return value patterns (inclusive vs exclusive).
+*
+* But the exclusive pattern is necessary if @start is 0, or we
+* underflow and check against processed_end won't work as
+* expecte

[PATCH 10/42] btrfs: update the comments in btrfs_invalidatepage()

2021-04-14 Thread Qu Wenruo
The existing comments in btrfs_invalidatepage() don't really get to the
point, especially regarding what Private2 really represents and how the
race avoidance is done.

The truth is, there are only three entrances to do ordered extent
accounting:
- btrfs_writepage_endio_finish_ordered()
- __endio_write_update_ordered()
  Those two entrances are just the endio functions for dio and buffered
  writes.

- btrfs_invalidatepage()

But there is a pitfall: in the endio functions there is no check on
whether the ordered extent has already been accounted.
They just blindly clear the Private2 bit and do the accounting.

So it's all btrfs_invalidatepage()'s responsibility to make sure we
won't do double account on the same sector.

That's why in btrfs_invalidatepage() we have to wait for page writeback;
this ensures all submitted bios have finished, and thus their endio
functions have finished the accounting on the ordered extent.

Then we also check page Private2 to ensure that we only run ordered
extent accounting on pages that have no bio submitted.

This patch reworks the related comments to make the race clearer, and to
explain how we use wait_on_page_writeback() and Private2 to prevent
double accounting on the ordered extent.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 645097bff5a0..4c894de2e813 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8331,11 +8331,16 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
bool completed_ordered = false;
 
/*
-* we have the page locked, so new writeback can't start,
-* and the dirty bit won't be cleared while we are here.
+* We have page locked so no new ordered extent can be created on
+* this page, nor bio can be submitted for this page.
 *
-* Wait for IO on this page so that we can safely clear
-* the PagePrivate2 bit and do ordered accounting
+* But already submitted bio can still be finished on this page.
+* Furthermore, endio function won't skip page which has Private2
+* already cleared, so it's possible for endio and invalidatepage
+* to do the same ordered extent accounting twice on one page.
+*
+* So here we wait any submitted bios to finish, so that we won't
+* do double ordered extent accounting on the same page.
 */
wait_on_page_writeback(page);
 
@@ -8365,8 +8370,12 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
 EXTENT_LOCKED | EXTENT_DO_ACCOUNTING |
 EXTENT_DEFRAG, 1, 0, &cached_state);
/*
-* whoever cleared the private bit is responsible
-* for the finish_ordered_io
+* A page with Private2 bit means no bio has submitted covering
+* the page, thus we have to manually do the ordered extent
+* accounting.
+*
+* For page without Private2, the ordered extent accounting is
+* done in its endio function of the submitted bio.
 */
if (TestClearPagePrivate2(page)) {
spin_lock_irq(&inode->ordered_tree.lock);
-- 
2.31.1



[PATCH 13/42] btrfs: rename PagePrivate2 to PageOrdered inside btrfs

2021-04-14 Thread Qu Wenruo
Inside btrfs, we use the Private2 page status to indicate we have an
ordered extent with pending IO for the sector.

But the page status name, Private2, tells us nothing about the bit
itself, so this patch will rename it to Ordered.
An extra comment about the bit is added, so a reader who is still
uncertain about the page Ordered status will find the comment easily.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h| 11 +++
 fs/btrfs/extent_io.c|  4 ++--
 fs/btrfs/extent_io.h|  2 +-
 fs/btrfs/inode.c| 40 +---
 fs/btrfs/ordered-data.c |  8 
 5 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 505bc6674bcc..903fdcb6ecd0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3759,4 +3759,15 @@ static inline bool btrfs_is_zoned(const struct 
btrfs_fs_info *fs_info)
return fs_info->zoned != 0;
 }
 
+/*
+ * Btrfs uses page status Private2 to indicate there is an ordered extent with
+ * unfinished IO.
+ *
+ * Rename the Private2 accessors to Ordered inside btrfs, to slightly improve
+ * the readability.
+ */
+#define PageOrdered(page)  PagePrivate2(page)
+#define SetPageOrdered(page)   SetPagePrivate2(page)
+#define ClearPageOrdered(page) ClearPagePrivate2(page)
+
 #endif
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6d712418b67b..ac01f29b00c9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1972,8 +1972,8 @@ static int __process_pages_contig(struct address_space 
*mapping,
}
 
for (i = 0; i < ret; i++) {
-   if (page_ops & PAGE_SET_PRIVATE2)
-   SetPagePrivate2(pages[i]);
+   if (page_ops & PAGE_SET_ORDERED)
+   SetPageOrdered(pages[i]);
 
if (locked_page && pages[i] == locked_page) {
put_page(pages[i]);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 227215a5722c..32a0d541144e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -39,7 +39,7 @@ enum {
 /* Page starts writeback, clear dirty bit and set writeback bit */
 #define PAGE_START_WRITEBACK   (1 << 1)
 #define PAGE_END_WRITEBACK (1 << 2)
-#define PAGE_SET_PRIVATE2  (1 << 3)
+#define PAGE_SET_ORDERED   (1 << 3)
 #define PAGE_SET_ERROR (1 << 4)
 #define PAGE_LOCK  (1 << 5)
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e237b6ed27c0..03f9139b391a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -170,7 +170,7 @@ static inline void btrfs_cleanup_ordered_extents(struct 
btrfs_inode *inode,
index++;
if (!page)
continue;
-   ClearPagePrivate2(page);
+   ClearPageOrdered(page);
put_page(page);
}
 
@@ -1156,15 +1156,16 @@ static noinline int cow_file_range(struct btrfs_inode 
*inode,
 
btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 
-   /* we're not doing compressed IO, don't unlock the first
+   /*
+* We're not doing compressed IO, don't unlock the first
 * page (which the caller expects to stay locked), don't
 * clear any dirty bits and don't set any writeback bits
 *
-* Do set the Private2 bit so we know this page was properly
-* setup for writepage
+* Do set the Ordered (Private2) bit so we know this page was
+* properly setup for writepage.
 */
page_ops = unlock ? PAGE_UNLOCK : 0;
-   page_ops |= PAGE_SET_PRIVATE2;
+   page_ops |= PAGE_SET_ORDERED;
 
extent_clear_unlock_delalloc(inode, start, start + ram_size - 1,
 locked_page,
@@ -1828,7 +1829,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode 
*inode,
 locked_page, EXTENT_LOCKED |
 EXTENT_DELALLOC |
 EXTENT_CLEAR_DATA_RESV,
-PAGE_UNLOCK | PAGE_SET_PRIVATE2);
+PAGE_UNLOCK | PAGE_SET_ORDERED);
 
cur_offset = extent_end;
 
@@ -2603,7 +2604,7 @@ static void btrfs_writepage_fixup_worker(struct 
btrfs_work *work)
lock_extent_bits(&inode->io_tree, page_start, page_end, &cached_state);
 
/* already ordered? We're done */
-   if (PagePrivate2(page))
+   if (PageOrdered(page))
goto out_reserved;
 
ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_SIZE);
@@

[PATCH 11/42] btrfs: refactor btrfs_invalidatepage()

2021-04-14 Thread Qu Wenruo
This patch will refactor btrfs_invalidatepage() for the incoming subpage
support.

The involved modifications are:
- Use while() loop instead of "goto again;"
- Use single variable to determine whether to delete extent states
  Each branch will also have comments why we can or cannot delete the
  extent states
- Do qgroup free and extent states deletion per-loop
  Current code can only work for PAGE_SIZE == sectorsize case.

This refactor also makes it clear what we do for different sectors:
- Sectors without ordered extent
  We're completely safe to remove all extent states for the sector(s)

- Sectors with ordered extent, but no Private2 bit
  This means the endio has already been executed, so we can't remove all
  extent states for the sector(s).

- Sectors with ordered extent, still with Private2 bit
  This means we need to decrease the ordered extent accounting.
  And then it comes to two different variants:
  * We have finished and removed the ordered extent
Then it's the same as "sectors without ordered extent"
  * We didn't finish the ordered extent
We can remove some extent states, but not all.
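The per-sector decision above can be sketched as a toy model. The struct and helper below are illustrative stand-ins (not kernel code or the kernel's page flag API); they only encode when extent states may be deleted for one sector:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-sector state, not the kernel's representation. */
struct sector_state {
	bool has_ordered;  /* an ordered extent covers this sector */
	bool ordered_bit;  /* the Ordered (Private2) bit is still set */
};

/* Return true if all extent states for this sector may be deleted. */
bool may_delete_states(struct sector_state *s, bool finished_ordered)
{
	if (!s->has_ordered)
		return true;   /* nothing else references the states */
	if (!s->ordered_bit)
		return false;  /* endio ran; finish_ordered_io may still use them */
	s->ordered_bit = false;  /* we do the accounting ourselves */
	return finished_ordered; /* safe only once the ordered extent is gone */
}
```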

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 173 +--
 1 file changed, 94 insertions(+), 79 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4c894de2e813..93bb7c0482ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8320,15 +8320,12 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
 {
struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
struct extent_io_tree *tree = &inode->io_tree;
-   struct btrfs_ordered_extent *ordered;
struct extent_state *cached_state = NULL;
u64 page_start = page_offset(page);
u64 page_end = page_start + PAGE_SIZE - 1;
-   u64 start;
-   u64 end;
+   u64 cur;
+   u32 sectorsize = inode->root->fs_info->sectorsize;
int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
-   bool found_ordered = false;
-   bool completed_ordered = false;
 
/*
 * We have page locked so no new ordered extent can be created on
@@ -8352,96 +8349,114 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
if (!inode_evicting)
lock_extent_bits(tree, page_start, page_end, &cached_state);
 
-   start = page_start;
-again:
-   ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 
1);
-   if (ordered) {
-   found_ordered = true;
-   end = min(page_end,
- ordered->file_offset + ordered->num_bytes - 1);
+   cur = page_start;
+   while (cur < page_end) {
+   struct btrfs_ordered_extent *ordered;
+   bool delete_states = false;
+   u64 range_end;
+
+   /*
+* Here we can't pass "file_offset = cur" and
+* "len = page_end + 1 - cur", as btrfs_lookup_ordered_range()
+* may not return the first ordered extent after @file_offset.
+*
+* Here we want to iterate through the range in byte order.
+* This is slower but definitely correct.
+*
+   * TODO: Make btrfs_lookup_ordered_range() return the
+* first ordered extent in the range to reduce the number
+* of loops.
+*/
+   ordered = btrfs_lookup_ordered_range(inode, cur, sectorsize);
+   if (!ordered) {
+   range_end = cur + sectorsize - 1;
+   /*
+* No ordered extent covering this sector, we are safe
+* to delete all extent states in the range.
+*/
+   delete_states = true;
+   goto next;
+   }
+
+   range_end = min(ordered->file_offset + ordered->num_bytes - 1,
+   page_end);
+   if (!PagePrivate2(page)) {
+   /*
+* If Private2 is cleared, it means endio has already
+* been executed for the range.
+* We can't delete the extent states as
+* btrfs_finish_ordered_io() may still use some of them.
+*/
+   delete_states = false;
+   goto next;
+   }
+   ClearPagePrivate2(page);
+
/*
 * IO on this page will never be started, so we need to account
 * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
 * here, must leave that up for the ordered extent completion.
+ 

[PATCH 12/42] btrfs: make Private2 lifespan more consistent

2021-04-14 Thread Qu Wenruo
Currently btrfs uses the page Private2 bit to indicate if we have an ordered
extent for the page range.

But the lifespan of it is not consistent, during regular writeback path,
there are two locations to clear the same PagePrivate2:

T - Page marked Dirty
|
+ - Page marked Private2, through btrfs_run_dealloc_range()
|
+ - Page cleared Private2, through btrfs_writepage_cow_fixup()
|   in __extent_writepage_io()
|   ^^^ Private2 cleared for the first time
|
+ - Page marked Writeback, through btrfs_set_range_writeback()
|   in __extent_writepage_io().
|
+ - Page cleared Private2, through
|   btrfs_writepage_endio_finish_ordered()
|   ^^^ Private2 cleared for the second time.
|
+ - Page cleared Writeback, through
btrfs_writepage_endio_finish_ordered()

Currently PagePrivate2 mostly exists to prevent ordered extent accounting
from being executed by both endio and invalidatepage.
Thus only the one who cleared page Private2 is responsible for ordered
extent accounting.

But the fact is, in btrfs_writepage_endio_finish_ordered(), page
Private2 is cleared and ordered extent accounting is executed
unconditionally.

The race prevention only happens through btrfs_invalidatepage(), where
we wait for page writeback to finish before checking the Private2 bit.

This means Private2 is also protected by the Writeback bit, and there is no
need for btrfs_writepage_cow_fixup() to clear Private2.

This patch will change btrfs_writepage_cow_fixup() to just
check PagePrivate2, not to clear it.
The clear will happen either in btrfs_invalidatepage() or
btrfs_writepage_endio_finish_ordered().

This makes the Private2 bit easier to understand, just meaning the page
has unfinished ordered extent attached to it.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 93bb7c0482ba..e237b6ed27c0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2679,7 +2679,7 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 
start, u64 end)
struct btrfs_writepage_fixup *fixup;
 
/* this page is properly in the ordered list */
-   if (TestClearPagePrivate2(page))
+   if (PagePrivate2(page))
return 0;
 
/*
-- 
2.31.1



[PATCH 09/42] btrfs: refactor how we finish ordered extent io for endio functions

2021-04-14 Thread Qu Wenruo
Btrfs has two endio functions to mark a certain IO range finished for
ordered extents:
- __endio_write_update_ordered()
  This is for direct IO

- btrfs_writepage_endio_finish_ordered()
  This is for buffered IO.

However they go through different routines to handle ordered extent IO:
- Whether to iterate through all ordered extents
  __endio_write_update_ordered() will, but
  btrfs_writepage_endio_finish_ordered() will not.

  In fact, iterating through all ordered extents will benefit later
  subpage support, while for the current PAGE_SIZE == sectorsize
  requirement this behavior makes no difference.

- Whether to update page Private2 flag
  __endio_write_update_ordered() will not update the page Private2 flag, as
  for iomap direct IO the page may not even be mapped.
  While btrfs_writepage_endio_finish_ordered() will clear Private2 to
  prevent double accounting against btrfs_invalidatepage().

Those differences are pretty small, and the ordered extent iteration
code in the callers makes things much harder to read.

So this patch will introduce a new function,
btrfs_mark_ordered_io_finished(), to do the heavy lifting work:
- Iterate through all ordered extents in the range
- Do the ordered extent accounting
- Queue the work for finished ordered extent

This function has two new features:
- Proper underflow detection and recovery
  The old underflow detection would only detect the problem, then
  continue.
  It printed no proper info like root/inode/ordered extent details,
  nor was it noisy enough to be caught by fstests.

  Furthermore, when underflow happens, the ordered extent will never
  finish.

  The new error detection will reset bytes_left to 0, emit a proper
  kernel warning, and output extra info including the root, ino, ordered
  extent range, and the underflow value.

- Prevent double accounting based on Private2 flag
  Now if we find a range without the Private2 flag, we will skip to the
  next range, as that means someone else has already finished the
  accounting of the ordered extent.
  This makes no difference for current code, but will be a critical part
  for incoming subpage support.

Now both endio functions only need to call that new function.

And since the only caller of btrfs_dec_test_first_ordered_pending() is
removed, also remove btrfs_dec_test_first_ordered_pending() completely.
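The underflow handling described above can be sketched with a hypothetical helper (not the kernel implementation): instead of carrying a wrapped value forward, clamp bytes_left to 0 and warn, so the ordered extent can still finish:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t u32;

/*
 * Hypothetical sketch of the recovery behavior: a decrement larger
 * than the remaining bytes is clamped to 0 with a loud warning,
 * rather than wrapping around and leaving the extent unfinishable.
 */
u32 dec_bytes_left(u32 bytes_left, u32 len)
{
	if (len > bytes_left) {
		fprintf(stderr,
			"underflow: bytes_left=%u len=%u, clamping to 0\n",
			bytes_left, len);
		return 0;
	}
	return bytes_left - len;
}
```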

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c|  55 +---
 fs/btrfs/ordered-data.c | 179 +++-
 fs/btrfs/ordered-data.h |   8 +-
 3 files changed, 129 insertions(+), 113 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 752f0c78e1df..645097bff5a0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3063,25 +3063,11 @@ void btrfs_writepage_endio_finish_ordered(struct 
btrfs_inode *inode,
  struct page *page, u64 start,
  u64 end, int uptodate)
 {
-   struct btrfs_fs_info *fs_info = inode->root->fs_info;
-   struct btrfs_ordered_extent *ordered_extent = NULL;
-   struct btrfs_workqueue *wq;
-
ASSERT(end + 1 - start < U32_MAX);
trace_btrfs_writepage_end_io_hook(inode, start, end, uptodate);
 
-   ClearPagePrivate2(page);
-   if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
-   end - start + 1, uptodate))
-   return;
-
-   if (btrfs_is_free_space_inode(inode))
-   wq = fs_info->endio_freespace_worker;
-   else
-   wq = fs_info->endio_write_workers;
-
-   btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
-   btrfs_queue_work(wq, &ordered_extent->work);
+   btrfs_mark_ordered_io_finished(inode, page, start, end + 1 - start,
+  finish_ordered_fn, uptodate);
 }
 
 /*
@@ -7959,42 +7945,9 @@ static void __endio_write_update_ordered(struct 
btrfs_inode *inode,
 const u64 offset, const u64 bytes,
 const bool uptodate)
 {
-   struct btrfs_fs_info *fs_info = inode->root->fs_info;
-   struct btrfs_ordered_extent *ordered = NULL;
-   struct btrfs_workqueue *wq;
-   u64 ordered_offset = offset;
-   u64 ordered_bytes = bytes;
-   u64 last_offset;
-
-   if (btrfs_is_free_space_inode(inode))
-   wq = fs_info->endio_freespace_worker;
-   else
-   wq = fs_info->endio_write_workers;
-
ASSERT(bytes < U32_MAX);
-   while (ordered_offset < offset + bytes) {
-   last_offset = ordered_offset;
-   if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
-&ordered_offset,
-ordered_bytes,
-uptodate)) {
-   

[PATCH 08/42] btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()

2021-04-14 Thread Qu Wenruo
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().

It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.

Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
make @page parameter optional.

By this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.

Also, to cooperate with such modification, replace @page parameter for
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes page_index info, the existing start/len should be
enough for most usage.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/compression.c   |  4 +---
 fs/btrfs/ctree.h |  3 ++-
 fs/btrfs/extent_io.c | 16 ++--
 fs/btrfs/inode.c |  9 +
 include/trace/events/btrfs.h | 19 ---
 5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2600703fab83..4fbe3e12be71 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -343,11 +343,9 @@ static void end_compressed_bio_write(struct bio *bio)
 * call back into the FS and do all the end_io operations
 */
inode = cb->inode;
-   cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
-   btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0],
+   btrfs_writepage_endio_finish_ordered(BTRFS_I(inode), NULL,
cb->start, cb->start + cb->len - 1,
bio->bi_status == BLK_STS_OK);
-   cb->compressed_pages[0]->mapping = NULL;
 
end_compressed_writeback(inode, cb);
/* note, our inode could be gone now */
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c858d5349c8..505bc6674bcc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3175,7 +3175,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, 
struct page *locked_page
u64 start, u64 end, int *page_started, unsigned long 
*nr_written,
struct writeback_control *wbc);
 int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
-void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
+void btrfs_writepage_endio_finish_ordered(struct btrfs_inode *inode,
+ struct page *page, u64 start,
  u64 end, int uptodate);
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7d1fca9b87f0..6d712418b67b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2711,10 +2711,13 @@ blk_status_t btrfs_submit_read_repair(struct inode 
*inode,
 
 void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
 {
+   struct btrfs_inode *inode;
int uptodate = (err == 0);
int ret = 0;
 
-   btrfs_writepage_endio_finish_ordered(page, start, end, uptodate);
+   ASSERT(page && page->mapping);
+   inode = BTRFS_I(page->mapping->host);
+   btrfs_writepage_endio_finish_ordered(inode, page, start, end, uptodate);
 
if (!uptodate) {
ClearPageUptodate(page);
@@ -3739,7 +3742,8 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
u32 iosize;
 
if (cur >= i_size) {
-   btrfs_writepage_endio_finish_ordered(page, cur, end, 1);
+   btrfs_writepage_endio_finish_ordered(inode, page, cur,
+end, 1);
break;
}
em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
@@ -3777,8 +3781,8 @@ static noinline_for_stack int 
__extent_writepage_io(struct btrfs_inode *inode,
if (compressed)
nr++;
else
-   btrfs_writepage_endio_finish_ordered(page, cur,
-   cur + iosize - 1, 1);
+   btrfs_writepage_endio_finish_ordered(inode,
+   page, cur, cur + iosize - 1, 1);
cur += iosize;
continue;
}
@@ -4842,8 +4846,8 @@ int extent_write_locked_range(struct inode *inode, u64 
start, u64 end,
if (clear_page_dirty_for_io(page))
ret = __extent_writepage(page, &wbc_writepages, &epd);
else {
-   btrfs_writepage_endio_finish_ordered(page, start,
-   start + PAGE_SIZE - 1, 1);
+

[PATCH 06/42] btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page

2021-04-14 Thread Qu Wenruo
Function btrfs_bio_fits_in_stripe() now requires a bio with at least one
page added, or btrfs_get_chunk_map() will fail with -ENOENT.

But in fact this requirement is not needed at all, as we can just pass
sectorsize for btrfs_get_chunk_map().

This tiny behavior change is important for later subpage refactor on
submit_extent_page().

As for 64K page size, we can have a page range with pgoff=0 and
size=64K.
If the logical bytenr is just 16K before the stripe boundary, we have to
split the page range into two bios.

This means we must check the page range against the stripe boundary, even
when adding the range to an empty bio.

This tiny refactor is for the incoming change, but on its own, regular
sectorsize == PAGE_SIZE is not affected anyway.
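The boundary case above can be illustrated with a toy check (a hypothetical helper, not the kernel function): a range fits the current stripe only if the stripe length remaining past the bio's logical start covers both the bytes already in the bio and the bytes to be added. With only 16K left before the stripe boundary, even an empty bio cannot take a 64K page range:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

/*
 * Illustrative stand-in for the geom.len < bio_len + size test:
 * return nonzero if the whole range would stay inside the stripe.
 */
int range_fits_in_stripe(u64 stripe_len, u32 bio_len, u32 add_len)
{
	return stripe_len >= (u64)bio_len + add_len;
}
```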

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4c1a06736371..74ee34fc820d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2198,25 +2198,22 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t 
size, struct bio *bio,
struct inode *inode = page->mapping->host;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
u64 logical = bio->bi_iter.bi_sector << 9;
+   u32 bio_len = bio->bi_iter.bi_size;
struct extent_map *em;
-   u64 length = 0;
-   u64 map_length;
int ret = 0;
struct btrfs_io_geometry geom;
 
if (bio_flags & EXTENT_BIO_COMPRESSED)
return 0;
 
-   length = bio->bi_iter.bi_size;
-   map_length = length;
-   em = btrfs_get_chunk_map(fs_info, logical, map_length);
+   em = btrfs_get_chunk_map(fs_info, logical, fs_info->sectorsize);
if (IS_ERR(em))
return PTR_ERR(em);
ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, &geom);
if (ret < 0)
goto out;
 
-   if (geom.len < length + size)
+   if (geom.len < bio_len + size)
ret = 1;
 out:
free_extent_map(em);
-- 
2.31.1



[PATCH 07/42] btrfs: use u32 for length related members of btrfs_ordered_extent

2021-04-14 Thread Qu Wenruo
Unlike btrfs_file_extent_item, btrfs_ordered_extent has its length
limit (BTRFS_MAX_EXTENT_SIZE), which is far smaller than U32_MAX.

Using u64 for those length related members is just a waste of memory.

This patch will make the following members u32:
- num_bytes
- disk_num_bytes
- bytes_left
- truncated_len

This will save 16 bytes for btrfs_ordered_extent structure.

For btrfs_add_ordered_extent*() call sites, they are mostly deep
inside other functions passing u64.
Thus this patch will keep those u64, but do internal ASSERT() to ensure
the correct length values are passed in.

For btrfs_dec_test_.*_ordered_extent() call sites, length related
parameters are converted to u32, with extra ASSERT() added to ensure we
get correct values passed in.

A special conversion is needed in btrfs_remove_ordered_extent(), which
needs s64; using "-entry->num_bytes" from u32 directly would cause
underflow.
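The negation pitfall can be demonstrated in a small standalone program. The u32/s64 typedefs below are local stand-ins, and the wraparound assumes a 32-bit int (as on common platforms), so unary minus on a u32 is evaluated in 32-bit unsigned arithmetic before the value is widened to s64:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;
typedef int64_t s64;

/* Wrong: -num_bytes wraps to a huge positive value before widening. */
s64 negate_wrong(u32 num_bytes)
{
	return -num_bytes;
}

/* Right: widen to a signed 64-bit type first, then negate. */
s64 negate_right(u32 num_bytes)
{
	return -(s64)num_bytes;
}
```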

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c| 11 ---
 fs/btrfs/ordered-data.c | 21 ++---
 fs/btrfs/ordered-data.h | 25 ++---
 3 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 74ee34fc820d..554effbf307e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3066,6 +3066,7 @@ void btrfs_writepage_endio_finish_ordered(struct page 
*page, u64 start,
struct btrfs_ordered_extent *ordered_extent = NULL;
struct btrfs_workqueue *wq;
 
+   ASSERT(end + 1 - start < U32_MAX);
trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
ClearPagePrivate2(page);
@@ -7969,6 +7970,7 @@ static void __endio_write_update_ordered(struct 
btrfs_inode *inode,
else
wq = fs_info->endio_write_workers;
 
+   ASSERT(bytes < U32_MAX);
while (ordered_offset < offset + bytes) {
last_offset = ordered_offset;
if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
@@ -8415,10 +8417,13 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
if (TestClearPagePrivate2(page)) {
spin_lock_irq(&inode->ordered_tree.lock);
set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-   ordered->truncated_len = min(ordered->truncated_len,
-start - 
ordered->file_offset);
+   ASSERT(start - ordered->file_offset < U32_MAX);
+   ordered->truncated_len = min_t(u32,
+   ordered->truncated_len,
+   start - ordered->file_offset);
spin_unlock_irq(&inode->ordered_tree.lock);
 
+   ASSERT(end - start + 1 < U32_MAX);
if (btrfs_dec_test_ordered_pending(inode, &ordered,
   start,
   end - start + 1, 1)) 
{
@@ -8937,7 +8942,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
break;
else {
btrfs_err(root->fs_info,
- "found ordered extent %llu %llu on inode 
cleanup",
+ "found ordered extent %llu %u on inode 
cleanup",
  ordered->file_offset, ordered->num_bytes);
btrfs_remove_ordered_extent(inode, ordered);
btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 07b0b4218791..8e6d9d906bdd 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -160,6 +160,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode 
*inode, u64 file_offset
struct btrfs_ordered_extent *entry;
int ret;
 
+   /*
+* Basic size check, all length related members should be smaller
+* than U32_MAX.
+*/
+   ASSERT(num_bytes < U32_MAX && disk_num_bytes < U32_MAX);
+
if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
/* For nocow write, we can release the qgroup rsv right now */
ret = btrfs_qgroup_free_data(inode, NULL, file_offset, 
num_bytes);
@@ -186,7 +192,7 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode 
*inode, u64 file_offset
entry->bytes_left = num_bytes;
entry->inode = igrab(&inode->vfs_inode);
entry->compress_type = compress_type;
-   entry->truncated_len = (u64)-1;
+   entry->truncated_len = (u32)-1;
entry->qgroup_rsv = ret;
entry->physical = (u64)-1;
entry->disk = NULL;
@@ -320,7 +326,7 @@ void btrfs_add_ordered_su

[PATCH 05/42] btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()

2021-04-14 Thread Qu Wenruo
The parameter @len of btrfs_get_io_geometry() is not really needed, as
the length can be computed internally from the chunk map; just remove it.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c   | 5 ++---
 fs/btrfs/volumes.c | 5 +++--
 fs/btrfs/volumes.h | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1a349759efae..4c1a06736371 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2212,8 +2212,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t 
size, struct bio *bio,
em = btrfs_get_chunk_map(fs_info, logical, map_length);
if (IS_ERR(em))
return PTR_ERR(em);
-   ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
-   map_length, &geom);
+   ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical, &geom);
if (ret < 0)
goto out;
 
@@ -8169,7 +8168,7 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, 
struct iomap *iomap,
goto out_err_em;
}
ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
-   logical, submit_len, &geom);
+   logical, &geom);
if (ret) {
status = errno_to_blk_status(ret);
goto out_err_em;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6d9b2369f17a..c33830efe460 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6117,10 +6117,11 @@ static bool need_full_stripe(enum btrfs_map_op op)
  * usually shouldn't happen unless @logical is corrupted, 0 otherwise.
  */
 int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map *em,
- enum btrfs_map_op op, u64 logical, u64 len,
+ enum btrfs_map_op op, u64 logical,
  struct btrfs_io_geometry *io_geom)
 {
struct map_lookup *map;
+   u64 len;
u64 offset;
u64 stripe_offset;
u64 stripe_nr;
@@ -6226,7 +6227,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
em = btrfs_get_chunk_map(fs_info, logical, *length);
ASSERT(!IS_ERR(em));
 
-   ret = btrfs_get_io_geometry(fs_info, em, op, logical, *length, &geom);
+   ret = btrfs_get_io_geometry(fs_info, em, op, logical, &geom);
if (ret < 0)
return ret;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d4c3e0dd32b8..0abe00402f21 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -443,7 +443,7 @@ int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum 
btrfs_map_op op,
 u64 logical, u64 *length,
 struct btrfs_bio **bbio_ret);
 int btrfs_get_io_geometry(struct btrfs_fs_info *fs_info, struct extent_map 
*map,
- enum btrfs_map_op op, u64 logical, u64 len,
+ enum btrfs_map_op op, u64 logical,
  struct btrfs_io_geometry *io_geom);
 int btrfs_read_sys_array(struct btrfs_fs_info *fs_info);
 int btrfs_read_chunk_tree(struct btrfs_fs_info *fs_info);
-- 
2.31.1



[PATCH 04/42] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page

2021-04-14 Thread Qu Wenruo
The new function, submit_eb_subpage(), will submit all the dirty extent
buffers in the page.

The major difference between submit_eb_page() and submit_eb_subpage()
is:
- How to grab extent buffer
  Now we use find_extent_buffer_nospinlock() instead of using
  page::private.

All other differences are already handled in functions like
lock_extent_buffer_for_io() and write_one_eb().
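The bitmap walk inside submit_eb_subpage() can be modeled with a toy function (illustrative sizes: 16 sectors of 4K in a 64K page, 16K nodesize, so 4 sectors per tree block; not kernel code): find a dirty bit, handle that extent buffer, then skip ahead by sectors_per_node:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count how many dirty extent buffers a subpage dirty bitmap holds,
 * mirroring the loop shape of submit_eb_subpage(): each set bit marks
 * the first sector of a dirty tree block.
 */
int count_dirty_ebs(uint16_t dirty_bitmap, int nbits, int sectors_per_node)
{
	int bit = 0, found = 0;

	while (bit < nbits) {
		if (!((1u << bit) & dirty_bitmap)) {
			bit++;
			continue;
		}
		found++;                 /* one dirty extent buffer starts here */
		bit += sectors_per_node; /* jump past its sectors */
	}
	return found;
}
```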

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 95 
 1 file changed, 95 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c068c2fcba09..7d1fca9b87f0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4323,6 +4323,98 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,
return ret;
 }
 
+/*
+ * Submit one subpage btree page.
+ *
+ * The main differences from submit_eb_page() are:
+ * - Page locking
+ *   For subpage, we don't rely on page locking at all.
+ *
+ * - Flush write bio
+ *   We only flush bio if we may be unable to fit current extent buffers into
+ *   current bio.
+ *
+ * Return >=0 for the number of submitted extent buffers.
+ * Return <0 for fatal error.
+ */
+static int submit_eb_subpage(struct page *page,
+struct writeback_control *wbc,
+struct extent_page_data *epd)
+{
+   struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+   int submitted = 0;
+   u64 page_start = page_offset(page);
+   int bit_start = 0;
+   int nbits = BTRFS_SUBPAGE_BITMAP_SIZE;
+   int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits;
+   int ret;
+
+   /* Lock and write each dirty extent buffer in the range */
+   while (bit_start < nbits) {
+   struct btrfs_subpage *subpage = (struct btrfs_subpage 
*)page->private;
+   struct extent_buffer *eb;
+   unsigned long flags;
+   u64 start;
+
+   /*
+* Take private lock to ensure the subpage won't be detached
+* halfway.
+*/
+   spin_lock(&page->mapping->private_lock);
+   if (!PagePrivate(page)) {
+   spin_unlock(&page->mapping->private_lock);
+   break;
+   }
+   spin_lock_irqsave(&subpage->lock, flags);
+   if (!((1 << bit_start) & subpage->dirty_bitmap)) {
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   spin_unlock(&page->mapping->private_lock);
+   bit_start++;
+   continue;
+   }
+
+   start = page_start + bit_start * fs_info->sectorsize;
+   bit_start += sectors_per_node;
+
+   /*
+* Here we just want to grab the eb without touching extra
+* spin locks. So here we call find_extent_buffer_nospinlock().
+*/
+   eb = find_extent_buffer_nospinlock(fs_info, start);
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   spin_unlock(&page->mapping->private_lock);
+
+   /*
+* The eb has already reached 0 refs thus find_extent_buffer()
+* doesn't return it. We don't need to write back such eb
+* anyway.
+*/
+   if (!eb)
+   continue;
+
+   ret = lock_extent_buffer_for_io(eb, epd);
+   if (ret == 0) {
+   free_extent_buffer(eb);
+   continue;
+   }
+   if (ret < 0) {
+   free_extent_buffer(eb);
+   goto cleanup;
+   }
+   ret = write_one_eb(eb, wbc, epd);
+   free_extent_buffer(eb);
+   if (ret < 0)
+   goto cleanup;
+   submitted++;
+   }
+   return submitted;
+
+cleanup:
+   /* We hit error, end bio for the submitted extent buffers */
+   end_write_bio(epd, ret);
+   return ret;
+}
+
 /*
  * Submit all page(s) of one extent buffer.
  *
@@ -4355,6 +4447,9 @@ static int submit_eb_page(struct page *page, struct 
writeback_control *wbc,
if (!PagePrivate(page))
return 0;
 
+   if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
+   return submit_eb_subpage(page, wbc, epd);
+
spin_lock(&mapping->private_lock);
if (!PagePrivate(page)) {
spin_unlock(&mapping->private_lock);
-- 
2.31.1



[PATCH 03/42] btrfs: make lock_extent_buffer_for_io() to be subpage compatible

2021-04-14 Thread Qu Wenruo
For subpage metadata, we don't use page locking at all.
So just skip the page locking part for subpage.

All the remaining routine can be reused.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f32163a465ec..c068c2fcba09 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3967,7 +3967,13 @@ static noinline_for_stack int 
lock_extent_buffer_for_io(struct extent_buffer *eb
 
btrfs_tree_unlock(eb);
 
-   if (!ret)
+   /*
+* Either we don't need to submit any tree block, or we're submitting
+* subpage.
+* Subpage metadata doesn't use page locking at all, so we can skip
+* the page locking.
+*/
+   if (!ret || fs_info->sectorsize < PAGE_SIZE)
return ret;
 
num_pages = num_extent_pages(eb);
-- 
2.31.1



[PATCH 02/42] btrfs: introduce write_one_subpage_eb() function

2021-04-14 Thread Qu Wenruo
The new function, write_one_subpage_eb(), as a subroutine for subpage
metadata write, will handle the extent buffer bio submission.

The major differences between the new write_one_subpage_eb() and
write_one_eb() are:
- No page locking
  When entering write_one_subpage_eb() the page is no longer locked.
  We only lock the page for its status update, and unlock immediately.
  Now we completely rely on extent io tree locking.

- Extra bitmap update along with page status update
  Now page dirty and writeback is controlled by
  btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
  They both follow the schema that if any sector is dirty/writeback, then
  the full page gets dirty/writeback.

- When to update the nr_written number
  Now we take a shortcut: if we have cleared the last dirty bit of the
  page, we update nr_written.
  This is not completely perfect, but should emulate the old behavior
  well enough.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 55 
 1 file changed, 55 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 21a14b1cb065..f32163a465ec 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4196,6 +4196,58 @@ static void end_bio_extent_buffer_writepage(struct bio 
*bio)
bio_put(bio);
 }
 
+/*
+ * Unlike the work in write_one_eb(), we rely completely on extent locking.
+ * Page locking is only utilized at a minimum to keep the VM code happy.
+ *
+ * Callers should still call write_one_eb() rather than this function directly,
+ * as write_one_eb() has extra preparation before submitting the extent buffer.
+ */
+static int write_one_subpage_eb(struct extent_buffer *eb,
+   struct writeback_control *wbc,
+   struct extent_page_data *epd)
+{
+   struct btrfs_fs_info *fs_info = eb->fs_info;
+   struct page *page = eb->pages[0];
+   unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
+   bool no_dirty_ebs = false;
+   int ret;
+
+   /* clear_page_dirty_for_io() in the subpage helper needs the page locked */
+   lock_page(page);
+   btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
+
+   /* If we clear the last dirty bit of the page, update nr_written */
+   no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
+ eb->start, eb->len);
+   if (no_dirty_ebs)
+   clear_page_dirty_for_io(page);
+
+   ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
+   eb->start, eb->len, eb->start - page_offset(page),
+   &epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
+   false);
+   if (ret) {
+   btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+ eb->len);
+   set_btree_ioerr(page, eb);
+   unlock_page(page);
+
+   if (atomic_dec_and_test(&eb->io_pages))
+   end_extent_buffer_writeback(eb);
+   return -EIO;
+   }
+   unlock_page(page);
+   /*
+* Submission finishes without problem, if no range of the page is
+* dirty anymore, we have submitted a page.
+* Update the nr_written in wbc.
+*/
+   if (no_dirty_ebs)
+   update_nr_written(wbc, 1);
+   return ret;
+}
+
 static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
struct writeback_control *wbc,
struct extent_page_data *epd)
@@ -4227,6 +4279,9 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,
memzero_extent_buffer(eb, start, end - start);
}
 
+   if (eb->fs_info->sectorsize < PAGE_SIZE)
+   return write_one_subpage_eb(eb, wbc, epd);
+
for (i = 0; i < num_pages; i++) {
struct page *p = eb->pages[i];
 
-- 
2.31.1



[PATCH 01/42] btrfs: introduce end_bio_subpage_eb_writepage() function

2021-04-14 Thread Qu Wenruo
The new function, end_bio_subpage_eb_writepage(), will handle the
metadata writeback endio.

The major differences involved are:
- How to grab extent buffer
  Now page::private is a pointer to btrfs_subpage, we can no longer grab
  extent buffer directly.
  Thus we need to use the bv_offset to locate the extent buffer manually
  and iterate through the whole range.

- Use btrfs_subpage_end_writeback() caller
  This helper will handle the subpage writeback for us.

Since this function is executed under endio context, when grabbing
extent buffers it can't grab eb->refs_lock as that lock is not designed
to be grabbed under hardirq context.

So here introduce a helper, find_extent_buffer_nospinlock(), for such
situation, and convert find_extent_buffer() to use that helper.
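The per-bvec iteration described above boils down to stepping cur_bytenr by nodesize across the bvec range. A minimal user-space sketch of that walk (hypothetical helper; the real loop looks each bytenr up via find_extent_buffer_nospinlock() and advances by eb->len):

```c
#include <assert.h>
#include <stdint.h>

/* Count how many nodesize-aligned extent buffers a bio_vec covers,
 * walking cur_bytenr from the bvec start to its end, the same way
 * end_bio_subpage_eb_writepage() iterates the range. */
static unsigned int count_ebs_in_bvec(uint64_t bvec_start,
				      uint32_t bvec_len,
				      uint32_t nodesize)
{
	uint64_t bvec_end = bvec_start + bvec_len - 1;
	uint64_t cur_bytenr = bvec_start;
	unsigned int nr = 0;

	while (cur_bytenr <= bvec_end) {
		/* In the kernel: eb = find_extent_buffer_nospinlock(...);
		 * then end its writeback and drop the ref. */
		cur_bytenr += nodesize;	/* advances to eb->start + eb->len */
		nr++;
	}
	return nr;
}
```

This is why the bvec length is asserted to be nodesize-aligned: the walk relies on extent buffers tiling the range exactly.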

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 135 +--
 1 file changed, 106 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a50adbd8808d..21a14b1cb065 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4080,13 +4080,97 @@ static void set_btree_ioerr(struct page *page, struct 
extent_buffer *eb)
}
 }
 
+/*
+ * This is the endio specific version which won't touch any unsafe spinlock
+ * in endio context.
+ */
+static struct extent_buffer *find_extent_buffer_nospinlock(
+   struct btrfs_fs_info *fs_info, u64 start)
+{
+   struct extent_buffer *eb;
+
+   rcu_read_lock();
+   eb = radix_tree_lookup(&fs_info->buffer_radix,
+  start >> fs_info->sectorsize_bits);
+   if (eb && atomic_inc_not_zero(&eb->refs)) {
+   rcu_read_unlock();
+   return eb;
+   }
+   rcu_read_unlock();
+   return NULL;
+}
+/*
+ * The endio function for subpage extent buffer write.
+ *
+ * Unlike end_bio_extent_buffer_writepage(), we only call end_page_writeback()
+ * after all extent buffers in the page have finished their writeback.
+ */
+static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
+struct bio *bio)
+{
+   struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
+
+   ASSERT(!bio_flagged(bio, BIO_CLONED));
+   bio_for_each_segment_all(bvec, bio, iter_all) {
+   struct page *page = bvec->bv_page;
+   u64 bvec_start = page_offset(page) + bvec->bv_offset;
+   u64 bvec_end = bvec_start + bvec->bv_len - 1;
+   u64 cur_bytenr = bvec_start;
+
+   ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+
+   /* Iterate through all extent buffers in the range */
+   while (cur_bytenr <= bvec_end) {
+   struct extent_buffer *eb;
+   int done;
+
+   /*
+* Here we can't use find_extent_buffer(), as it may
+* try to lock eb->refs_lock, which is not safe in endio
+* context.
+*/
+   eb = find_extent_buffer_nospinlock(fs_info, cur_bytenr);
+   ASSERT(eb);
+
+   cur_bytenr = eb->start + eb->len;
+
+   ASSERT(test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags));
+   done = atomic_dec_and_test(&eb->io_pages);
+   ASSERT(done);
+
+   if (bio->bi_status ||
+   test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+   ClearPageUptodate(page);
+   set_btree_ioerr(page, eb);
+   }
+
+   btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+ eb->len);
+   end_extent_buffer_writeback(eb);
+   /*
+* free_extent_buffer() will grab spinlock which is not
+* safe in endio context. Thus here we manually dec
+* the ref.
+*/
+   atomic_dec(&eb->refs);
+   }
+   }
+   bio_put(bio);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio)
 {
+   struct btrfs_fs_info *fs_info;
struct bio_vec *bvec;
struct extent_buffer *eb;
int done;
struct bvec_iter_all iter_all;
 
+   fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
+   if (fs_info->sectorsize < PAGE_SIZE)
+   return end_bio_subpage_eb_writepage(fs_info, bio);
+
ASSERT(!bio_flagged(bio, BIO_CLONED));
bio_for_each_segment_all(bvec, bio, iter_all) {
struct page *page = bvec->bv_page;
@@ -5465,36 +5549,29 @@ struct extent_bu

[PATCH 00/42] btrfs: add full read-write support for subpage

2021-04-14 Thread Qu Wenruo
This huge patchset can be fetched from github:
https://github.com/adam900710/linux/tree/subpage

=== Current stage ===
The tests on x86 pass without new failures, and the generic test group on
arm64 with 64K page size passes except for known failures and the defrag group.

A full fstests run still needs the warning message in mkfs.btrfs
disabled, or it will cause too many false alerts.
A patch for mkfs.btrfs to use the new sysfs interface to avoid that
behavior is under way.

But considering how slow my ARM boards are, I haven't run that many
loops.
So extra testing will always help.

=== Limitation ===
There are several limitations introduced just for subpage:
- No compressed write support
  Read is no problem, but the compression write path has more things left
  to be modified.
  Thus for current patchset, no matter what inode attribute or mount
  option is, no new compressed extent can be created for subpage case.

- No sector-sized defrag support
  Currently defrag is still done in PAGE_SIZE units, meaning if there is
  a hole in a 64K page, we still write a full 64K back to disk.
  This causes more disk space usage.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more writeback than the range specified.
  In fallocate calls, this behavior can make us write back data which
  could be inlined before we enlarge the isize, causing an inline extent
  to be created along with regular extents.

- No sector-sized repair for read-time data repair
  Btrfs supports repair of corrupted data at read time.
  But for the current subpage repair, the unit is the bvec, which can
  vary from 4K to 64K.
  If one data extent is only 4K sized, then we can do the repair in 4K size.
  But as the extent size grows, the repair size grows with it until it
  reaches 64K.
  This behavior can be later enhanced by introducing a bitmap for
  corrupted blocks.

=== Patchset structure ===

Patch 01~04: The missing patches for the metadata write path.
My bad, during the previous submission I forgot them.
No code change, just a re-send.
Patch 05~08: Cleanups and small refactors.
Patch 09~13: Code refactors around btrfs_invalidatepage() and endio.
This is one critical part for subpage.
Although this part has no subpage related code yet, it is
pure refactoring.
Patch 14~15: Refactors around __process_pages_contig() for the incoming
subpage support.
--- Above are all refactors/cleanups ---
Patch 16~31: The main part of subpage support.
Patch 32~39: Subpage code corner case fixes.
--- Above is the main part of the subpage support ---
Patch 40: Refactor submit_extent_page() for the incoming subpage
support.
This refactor also reduces the overhead on x86, as it
removes the per-page boundary check, making the check
execute only once per bio.
Patch 41: Make submit_extent_page() able to split a large page into
two bios, a subpage specific requirement.
Patch 42: Enable the subpage data write path.


Qu Wenruo (42):
  btrfs: introduce end_bio_subpage_eb_writepage() function
  btrfs: introduce write_one_subpage_eb() function
  btrfs: make lock_extent_buffer_for_io() to be subpage compatible
  btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
  btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
  btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page
  btrfs: use u32 for length related members of btrfs_ordered_extent
  btrfs: pass btrfs_inode into btrfs_writepage_endio_finish_ordered()
  btrfs: refactor how we finish ordered extent io for endio functions
  btrfs: update the comments in btrfs_invalidatepage()
  btrfs: refactor btrfs_invalidatepage()
  btrfs: make Private2 lifespan more consistent
  btrfs: rename PagePrivate2 to PageOrdered inside btrfs
  btrfs: pass bytenr directly to __process_pages_contig()
  btrfs: refactor the page status update into process_one_page()
  btrfs: provide btrfs_page_clamp_*() helpers
  btrfs: only require sector size alignment for
end_bio_extent_writepage()
  btrfs: make btrfs_dirty_pages() to be subpage compatible
  btrfs: make __process_pages_contig() to handle subpage
dirty/error/writeback status
  btrfs: make end_bio_extent_writepage() to be subpage compatible
  btrfs: make process_one_page() to handle subpage locking
  btrfs: introduce helpers for subpage ordered status
  btrfs: make page Ordered bit to be subpage compatible
  btrfs: update locked page dirty/writeback/error bits in
__process_pages_contig
  btrfs: prevent extent_clear_unlock_delalloc() to unlock page not
locked by __process_pages_contig()
  btrfs: make btrfs_set_range_writeback() subpage compatible
  btrfs: make __extent_writepage_io() only submit dirty range for
subpage
  btrfs: add extra assert for submit_extent_page()
  btrfs: make btrfs_truncate_block() 

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-12 Thread Qu Wenruo




On 2021/4/2 下午4:52, Qu Wenruo wrote:



On 2021/4/2 下午4:46, Ritesh Harjani wrote:

On 21/04/02 04:36PM, Qu Wenruo wrote:



On 2021/4/2 下午4:33, Ritesh Harjani wrote:

On 21/03/29 10:01AM, Qu Wenruo wrote:



On 2021/3/29 上午4:02, Ritesh Harjani wrote:

On 21/03/25 09:16PM, Qu Wenruo wrote:



On 2021/3/25 下午8:20, Neal Gompa wrote:

On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:


This patchset can be fetched from the following github repo,
along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we
can't
really test the patchset itself, but anyone can grab the patch
from
github repo and do fstests/generic tests.

But at least the full RW patchset can pass -g generic/quick -x
defrag
for now.

There are some known issues:

- Defrag behavior change
  Since current defrag is doing per-page defrag, to support
subpage
  defrag, we need some change in the loop.
  E.g. if a page has both hole and regular extents in it,
then defrag
  will rewrite the full 64K page.

  Thus for now, defrag related failure is expected.
  But this should only cause behavior difference, no crash
nor hang is
  expected.

- No compression support yet
  There are at least 2 known bugs if forcing compression
for subpage
  * Some hard coded PAGE_SIZE screwing up space rsv
  * Subpage ASSERT() triggered
    This is because some compression code is unlocking
locked_page by
    calling extent_clear_unlock_delalloc() with locked_page
== NULL.
  So for now compression is also disabled.

- Inode nbytes mismatch
  Still debugging.
  The fastest way to trigger is fsx using the following
parameters:

    fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file
> /tmp/fsx

  Which would cause inode nbytes differs from expected
value and
  triggers btrfs check error.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than data part, as
ithas
some different behaviors compared to the regular sector size
handling:

- No more page locking
  Now metadata read/write relies on extent io tree locking,
other than
  page locking.
  This is to allow behaviors like read locking one eb while also trying
  to read lock another eb in the same page.
  read lock another eb in the same page.
  We can't rely on page lock as now we have multiple extent
buffers in
  the same page.

- Page status update
  Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
  Instead of just grabbing extent buffer from
page::private, we need to
  iterate all dirty extent buffers in the page and submit
them.

[CHANGELOG]
v2:
- Rebased to latest misc-next
  No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
  This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming schema for each patch
  No more "extent_io:" or "inode:" schema anymore.

- Move two pure cleanups to the series
  Patch 2~3, originally in RW part.

- Fix one uninitialized variable
  Patch 6.

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
  This removes 2 cleanup patches.

- Add new overview comment for subpage metadata

Qu Wenruo (13):
  btrfs: add sysfs interface for supported sectorsize
  btrfs: use min() to replace open-code in
btrfs_invalidatepage()
  btrfs: remove unnecessary variable shadowing in
btrfs_invalidatepage()
  btrfs: refactor how we iterate ordered extent in
    btrfs_invalidatepage()
  btrfs: introduce helpers for subpage dirty status
  btrfs: introduce helpers for subpage writeback status
  btrfs: allow btree_set_page_dirty() to do more sanity check on
    subpage metadata
  btrfs: support subpage metadata csum calculation at write
time
  btrfs: make alloc_extent_buffer() check subpage dirty bitmap
  btrfs: make the page uptodate assert to be subpage
compatible
  btrfs: make set/clear_extent_buffer_dirty() to be subpage
compatible
  btrfs: make set_btree_ioerr() accept extent buffer and to
be subpage
    compatible
  btrfs: add subpage overview comments

 fs/btrfs/disk-io.c   | 143
++-
 fs/btrfs/extent_io.c | 127
--
 fs/btrfs/inode.c | 128
++
 fs/btrfs/subpage.c   | 127
++
 fs/btrfs/subpage.h   |  17 +
 fs/btrfs/sysfs.c |  15 +
 6 files changed, 441 insertions(+), 116 deletions(-)

--
2.30.1



Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable
to do
that, and this patch set is essentially unusable without the
rest of
it that does

[PATCH v2] btrfs: do more graceful error/warning for 32bit kernel

2021-04-08 Thread Qu Wenruo
Due to the pagecache limit of 32bit systems, btrfs can't access metadata
at or beyond (ULONG_MAX + 1) << PAGE_SHIFT.
This is 16T for 4K page size and 256T for 64K page size.

Unlike other filesystems, btrfs uses an internally mapped u64 address
space for all of its metadata, which makes this trickier than for other
filesystems.

Users can have a fs which doesn't have metadata beyond the boundary at
mount time, but later balance can cause btrfs to create metadata beyond
the boundary.

And modification to MM layer is unrealistic just for such minor use
case.

To address such problem, this patch will introduce the following checks,
much like how XFS handles such problem:

- Mount time rejection
  This will reject any fs which has metadata chunk at or beyond the
  boundary.

- Mount time early warning
  If there is any metadata chunk beyond 5/8 of the boundary, we do an
  early warning and hope the end user will see it.

- Runtime extent buffer rejection
  If we're going to allocate an extent buffer at or beyond the boundary,
  reject such request with -EOVERFLOW.
  This is definitely going to cause problems like transaction abort, but
  we have no better ways.

- Runtime extent buffer early warning
  If an extent buffer beyond 5/8 of the max file size is allocated, do
  an early warning.

Above error/warning message will only be outputted once for each fs to
reduce dmesg flood.
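The limit and the warning threshold follow directly from the arithmetic in the patch. A small user-space sketch (assuming the usual PAGE_SHIFT values of 12 for 4K pages and 16 for 64K pages):

```c
#include <assert.h>
#include <stdint.h>

/* On a 32bit kernel, page->index is an unsigned long, so the page
 * cache can address at most 2^32 pages.  The hard limit is therefore
 * (ULONG_MAX + 1) << PAGE_SHIFT. */
static uint64_t max_file_size(unsigned int page_shift)
{
	return ((uint64_t)UINT32_MAX + 1) << page_shift;
}

/* Early-warning threshold: 5/8 of the hard limit. */
static uint64_t warn_threshold(unsigned int page_shift)
{
	return max_file_size(page_shift) * 5 / 8;
}
```

So 16T/10T for 4K pages and 256T/160T for 64K pages, matching the numbers in the commit message and the comment above BTRFS_32BIT_EARLY_WARN_THRESHOLD.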

Reported-by: Erik Jensen 
Signed-off-by: Qu Wenruo 
Reviewed-by: Josef Bacik 
---
Changelog:
v2:
- Rebased to latest misc-next and add reviewed-by tag
---
 fs/btrfs/ctree.h | 18 +++
 fs/btrfs/extent_io.c | 12 ++
 fs/btrfs/super.c | 26 ++
 fs/btrfs/volumes.c   | 53 ++--
 4 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f2fd73e58ee6..f679e02f65a9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -565,6 +565,12 @@ enum {
 
/* Indicate whether there are any tree modification log users */
BTRFS_FS_TREE_MOD_LOG_USERS,
+
+#if BITS_PER_LONG == 32
+   /* Indicate if we have error/warn message outputted for 32bit system */
+   BTRFS_FS_32BIT_ERROR,
+   BTRFS_FS_32BIT_WARN,
+#endif
 };
 
 /*
@@ -3392,6 +3398,18 @@ static inline void assertfail(const char *expr, const 
char* file, int line) { }
 #define ASSERT(expr)   (void)(expr)
 #endif
 
+#if BITS_PER_LONG == 32
+#define BTRFS_32BIT_MAX_FILE_SIZE (((u64)ULONG_MAX + 1) << PAGE_SHIFT)
+/*
+ * The warning threshold is 5/8 of the max file size.
+ *
+ * For 4K page size it should be 10T, for 64K it would be 160T.
+ */
+#define BTRFS_32BIT_EARLY_WARN_THRESHOLD (BTRFS_32BIT_MAX_FILE_SIZE * 5 / 8)
+void btrfs_warn_32bit_limit(struct btrfs_fs_info *fs_info);
+void btrfs_err_32bit_limit(struct btrfs_fs_info *fs_info);
+#endif
+
 /*
  * Get the correct offset inside the page of extent buffer.
  *
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7ad2169e7487..a5f5c092c90f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5594,6 +5594,18 @@ struct extent_buffer *alloc_extent_buffer(struct 
btrfs_fs_info *fs_info,
return ERR_PTR(-EINVAL);
}
 
+#if BITS_PER_LONG == 32
+   if (start >= MAX_LFS_FILESIZE) {
+   btrfs_err(fs_info,
+   "extent buffer %llu is beyond 32bit page cache limit",
+ start);
+   btrfs_err_32bit_limit(fs_info);
+   return ERR_PTR(-EOVERFLOW);
+   }
+   if (start >= BTRFS_32BIT_EARLY_WARN_THRESHOLD)
+   btrfs_warn_32bit_limit(fs_info);
+#endif
+
if (fs_info->sectorsize < PAGE_SIZE &&
offset_in_page(start) + len > PAGE_SIZE) {
btrfs_err(fs_info,
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f7a4ad86adee..1a36be6bced2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -252,6 +252,32 @@ void __cold btrfs_printk(const struct btrfs_fs_info 
*fs_info, const char *fmt, .
 }
 #endif
 
+#if BITS_PER_LONG == 32
+void __cold btrfs_warn_32bit_limit(struct btrfs_fs_info *fs_info)
+{
+   if (!test_and_set_bit(BTRFS_FS_32BIT_WARN, &fs_info->flags)) {
+   btrfs_warn(fs_info, "btrfs is reaching 32bit kernel limit.");
+   btrfs_warn(fs_info,
+"due to 32bit page cache limit, btrfs can't access metadata at or beyond 
%lluT.",
+  BTRFS_32BIT_MAX_FILE_SIZE >> 40);
+   btrfs_warn(fs_info,
+  "please consider upgrade to 64bit kernel/hardware.");
+   }
+}
+
+void __cold btrfs_err_32bit_limit(struct btrfs_fs_info *fs_info)
+{
+   if (!test_and_set_bit(BTRFS_FS_32BIT_ERROR, &fs_info->flags)) {
+   btrfs_err(fs_info, "btrfs reached 32bit kernel limit.");
+   btrfs_err(fs_info,
+"due to 32bit page cache limit, btrfs can'

Re: [PATCH] btrfs: Correct try_lock_extent() usage in read_extent_buffer_subpage()

2021-04-08 Thread Qu Wenruo




On 2021/4/8 下午8:40, Goldwyn Rodrigues wrote:

try_lock_extent() returns 1 on success or 0 for failure and not an error
code. If try_lock_extent() fails, read_extent_buffer_subpage() returns
zero indicating subpage extent read success.

Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the
extent.

Signed-off-by: Goldwyn Rodrigues 


Reviewed-by: Qu Wenruo 

Thankfully the only metadata reader that will pass wait == WAIT_NONE is
readahead, so no real damage.

But still a nice fix!

Thanks,
Qu

---
  fs/btrfs/extent_io.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7ad2169e7487..3536feedd6c5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5914,10 +5914,8 @@ static int read_extent_buffer_subpage(struct 
extent_buffer *eb, int wait,
io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;

if (wait == WAIT_NONE) {
-   ret = try_lock_extent(io_tree, eb->start,
- eb->start + eb->len - 1);
-   if (ret <= 0)
-   return ret;
+   if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 
1))
+   return -EAGAIN;
} else {
ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1);
if (ret < 0)



Re: btrfs crash on armv7

2021-04-08 Thread Qu Wenruo




On 2021/4/8 下午7:15, riteshh wrote:


Please excuse my silly queries here.

On 21/04/08 04:38PM, Qu Wenruo wrote:



On 2021/4/8 下午4:16, Joe Hermaszewski wrote:

It took a while but I managed to get hold of another one of these
arm32 boards. Very disappointingly this exact "bitflip" is still
present (log enclosed).


Yeah, we came to the conclusion that it's not a bitflip, but purely the
32bit limit on armv7.

For ARMv7, it's a 32bit system, where unsigned long is only 32bit.

This means that things like page->index are only 32 bits, and for 4K
page size it also means all filesystems (not only btrfs) can utilize at
most 16T bytes.

But there is a pitfall for btrfs: btrfs uses its internal address space
for its metadata, and that address space is u64.


Can you pls point me to the code you are referring here?
So IIUC, you mean that since page->index can hold a value which can be
up to 32 bits in size, the maximum FS address range which can be
accessed is 16T.
This should be true in general for any FS, no?



The code is in the definition of "struct page", from "include/linux/mm_types.h".

Yes, for all fs.

But no other fs has another internal address space, unlike btrfs.

Btrfs uses its internal space to implement multi-device support.





And furthermore, for btrfs it can have metadata at bytenr way larger
than the total device size.


Is this because of multi-device support?


Yes.




This is possible because btrfs maps part of its address space to real
disks, thus it can have bytenr way larger than device size.


Please, a code pointer to that would help me understand this better.
Thanks.


You need to understand btrfs chunk tree first.

Each btrfs chunk item is a mapping from btrfs logical address to each
real device.

The easiest way to understand it is not the code, but running "btrfs ins
dump-tree -t chunk " to experience it yourself.
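As a rough illustration of what a chunk item provides (a toy sketch with hypothetical numbers; real chunk items also carry stripe count, RAID type, and per-stripe device entries), a lookup over such a map shows how a logical bytenr far beyond the device size still resolves to a small on-device offset:

```c
#include <assert.h>
#include <stdint.h>

/* A toy chunk item: maps [logical, logical + length) in the btrfs
 * logical address space to an offset on one device. */
struct chunk {
	uint64_t logical;
	uint64_t length;
	uint64_t dev_offset;
};

/* Hypothetical map: a 1G chunk whose logical address sits at 20T,
 * even though it lives at the 1M mark of a small device -- the
 * logical bytenr is unrelated to the device size. */
static const struct chunk chunk_map[] = {
	{ .logical = 20ULL << 40, .length = 1 << 30, .dev_offset = 1 << 20 },
};

/* Translate a logical bytenr to a device offset, or return UINT64_MAX
 * when no chunk covers it (the "unable to find logical" case). */
static uint64_t map_logical(uint64_t logical)
{
	for (unsigned int i = 0;
	     i < sizeof(chunk_map) / sizeof(chunk_map[0]); i++) {
		const struct chunk *c = &chunk_map[i];

		if (logical >= c->logical &&
		    logical < c->logical + c->length)
			return c->dev_offset + (logical - c->logical);
	}
	return UINT64_MAX;
}
```

On a 32bit kernel the problem is exactly that a logical bytenr like 20T cannot be represented as a page index, even though the device itself is tiny.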





But this brings to a problem, 32bit Linux can only handle 16T, but in
your case, some of your metadata is already beyond 16T in btrfs address
space.


Sorry, I am not much aware of the history. Was this disk mkfs'ed on a
64-bit system and then connected to a 32bit board?


Possible.

But there are other cases to go beyond that limit, especially with balance.

So I'm not confident enough to say what's the exact event to make the fs
cross the line.



This also brings me to check with you about other filesystems.
See the capacity section from the wiki below[1]. Depending upon the host
OS, the limitation on the max size of the filesystem may vary, right?

[1] https://en.wikipedia.org/wiki/XFS


The last time, Dave Chinner said that for xfs larger than 16T, a 32bit
kernel will just refuse to mount.

Thanks,
Qu



-ritesh



Then a lot of things are going to be wrong.

I have submitted a patch to do extra check, at least allowing user to
know this is the limit of 32bit:
https://patchwork.kernel.org/project/linux-btrfs/patch/20210225011814.24009-1-...@suse.com/

Unfortunately, this will not help existing fs though.



Re: btrfs crash on armv7

2021-04-08 Thread Qu Wenruo




On 2021/4/8 下午6:11, Joe Hermaszewski wrote:

Thanks for explaining so patiently, I have a couple more questions if
you have the time:

With the patch I assume that this FS will just refuse to mount on
arm32,


Yes.

If the fs has a chunk beyond 16T, it will definitely be rejected.
For your case, since you already have such metadata, there is definitely
at least one chunk at or beyond that boundary, thus immediate rejection
will be triggered.


and in general such a large FS can't be used reliably there.


Not "reliably", but completely "unusable".

As btrfs can't even read some metadata, it would be a miracle to mount
the fs while avoiding all metadata reads beyond 16T.


(If I'm wrong about this, is it possible to use a 64 bit machine to
move the offending metadata back below 16TB? (I feel that this may be
a gross misunderstanding!))


Moving is possible (balance is here for the work).

But balance can only move data to larger bytenr, meaning it will just
make the case worse.

The requirement for balance to move data/metadata to larger bytenr is
that btrfs assumes chunks with larger bytenr were created more recently.
Things like balance itself rely on that property to make sure it has
balanced the full fs.

Thus in btrfs the bytenr of new chunks is incremental; there is no way
to create a new chunk with a lower bytenr.
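The incremental-bytenr rule can be sketched as: the start of a new chunk is simply the end of the highest existing chunk (a simplified user-space sketch; the kernel's chunk allocator does the equivalent scan over the chunk mapping tree):

```c
#include <assert.h>
#include <stdint.h>

/* New chunks are always placed past every existing chunk, so the
 * logical address space only grows upward -- which is why balance
 * can never move metadata back to a lower bytenr. */
static uint64_t next_chunk_start(const uint64_t *starts,
				 const uint64_t *lengths,
				 unsigned int nr)
{
	uint64_t next = 0;

	for (unsigned int i = 0; i < nr; i++) {
		uint64_t end = starts[i] + lengths[i];

		if (end > next)
			next = end;
	}
	return next;
}
```

Once a fs has balanced past the 16T mark, every further balance only pushes metadata higher, which is what makes the 32bit situation unrecoverable by balancing.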



I'm not sure I understood this part of your reply:


Btrfs balance will also move forward to larger bytenr, thus it's
unrelated to size at all.


I just realised that the last couple of messages didn't go to the
list, I'm happy for you to reply to the list if you feel that this
conversation would be beneficial there!


Sorry my bad...

Thanks,
Qu



On Thu, Apr 8, 2021 at 5:41 PM Qu Wenruo  wrote:




On 2021/4/8 下午5:36, Joe Hermaszewski wrote:

Thanks for the quick reply! In this case what's the best course of
action for me?

- Can this array be mounted and recovered without problems on a 64 bit
machine?


There should be no problem as long as btrfs check reports no error.

It's purely a runtime assumption that failed to be met, thus no real
damage to your fs.


- Does this 16TB limit refer to the total size of the devices, or the
apparent size after RAID? (I assume the former as this array is only
14TB after raid I think).


The limit applies to any address space, not only some file/device size.

For btrfs, even if you only have 1 disk of 1T size, you can still get
metadata beyond 16T.

Btrfs balance will also move forward to larger bytenr, thus it's
unrelated to size at all.

Thanks,
Qu



Thanks for the patch, I'll take a look at building my system with it in
future.

Best wishes,
Joe


On Thu, Apr 8, 2021, 4:38 PM Qu Wenruo  wrote:



 On 2021/4/8 下午4:16, Joe Hermaszewski wrote:
  > It took a while but I managed to get hold of another one of these
  > arm32 boards. Very disappointingly this exact "bitflip" is still
  > present (log enclosed).

 Yeah, we came to the conclusion that it's not a bitflip, but purely the
 32bit limit on armv7.

 For ARMv7, it's a 32bit system, where unsigned long is only 32bit.

 This means, things like page->index is only 32bit long, and for 4K page
 size, it also means all filesystems (not only btrfs) can only utilize at
 most 16T bytes.

 But there is a pitfall for btrfs: btrfs uses its internal address space
 for its metadata, and that address space is u64.

 And furthermore, for btrfs it can have metadata at bytenr way larger
 than the total device size.
 This is possible because btrfs maps part of its address space to real
 disks, thus it can have bytenr way larger than device size.

 But this brings to a problem, 32bit Linux can only handle 16T, but in
 your case, some of your metadata is already beyond 16T in btrfs address
 space.

 Then a lot of things are going to be wrong.

 I have submitted a patch to do extra check, at least allowing user to
 know this is the limit of 32bit:
 
https://patchwork.kernel.org/project/linux-btrfs/patch/20210225011814.24009-1-...@suse.com/

 Unfortunately, this will not help existing fs though.

 Thanks,
 Qu
  >
  > To summarise, as it's been a while:
  >
  > - When running scrub, a "page_start" and "eb_start" mismatch is
  > detected (off by a single bit).
  > - `btrfs check` reports no significant errors on aarch64 or arm32.
  > - `btrfs scrub` completes successfully on aarch64!
  > - Now, I can confirm that `btrfs scrub` fails in the same manner on
  > two arm32 machines.
  >
  > Not really sure where to go from here. The only things I can
 think of are:
  >
  > - Bug in `btr

Re: btrfs crash on armv7

2021-04-08 Thread Qu Wenruo
096
[  578.496793] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  578.505280] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.539117] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.547207] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.556371] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.564922] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.573356] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  580.581573] BTRFS critical (device sda1): unable to find logical
2412789760 length 4096
[  581.427776] verify_parent_transid: 28 callbacks suppressed
... many more of these
```

On Sun, Dec 20, 2020 at 8:30 AM Qu Wenruo  wrote:




On 2020/12/19 下午6:35, Joe Hermaszewski wrote:

Ok, so I managed to get hold of a 64bit machine on which to run btrfs
check. `btrfs check` returns exactly the same output as the armv7 box
(no serious problems) which I suppose is good. `btrfs scrub` also
finds no problems. Boring Logs below.

What I don't quite understand is how the scrub problem on armv7l is so
reliable when it's not persisted on the disks. Is the same physical
memory location being used for this breaking value, or is it perhaps a
specific pattern of data on the bus which causes this?


If it's so reliably reproducible, then I guess that would be the case:
either a specific memory range has the problem, or a specific pattern on
the bus is causing it.



If it's the former, how easy would it be to find this broken location
and blacklist it? If it's the latter then I guess there's no hope but
to try replacing the psu/machine.

The machine survived a couple of days of memtester (on about 95% of
the RAM) and `7z b` with no problems, *shrug*


The memtester run should rule out the former case; the latter may be
resolved by a newer kernel if it's a software problem rather than a hardware one.

Thanks,
Qu



Best wishes and thanks for the generous help so far.
Joe

btrfs scrub aarch64:
```
[j@nixos:~]$ sudo btrfs scrub status -d /mnt
UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
scrub device /dev/sda1 (id 1) history
Scrub started:Fri Dec 18 14:24:30 2020
Status:   finished
Duration: 7:36:31
Total to scrub:   2.40TiB
Rate: 91.95MiB/s
Error summary:no errors found
scrub device /dev/sdb1 (id 2) history
Scrub started:Fri Dec 18 14:24:30 2020
Status:   finished
Duration: 7:12:51
Total to scrub:   2.40TiB
Rate: 96.90MiB/s
Error summary:no errors found
scrub device /dev/sdd1 (id 3) history
Scrub started:Fri Dec 18 14:24:30 2020
Status:   finished
Duration: 19:47:01
Total to scrub:   7.86TiB
Rate: 115.70MiB/s
Error summary:no errors found
scrub device /dev/sdc1 (id 4) history
Scrub started:Fri Dec 18 14:24:30 2020
Status:   finished
Duration: 19:46:38
Total to scrub:   7.86TiB
Rate: 115.74MiB/s
Error summary:no errors found
```

btrfs check aarch64:
```
[nixos@nixos:/]$ sudo btrfs check --readonly /dev/sda1
Opening filesystem to check...
Checking filesystem on /dev/sda1
UUID: b8f4ad49-29c8-4d19-a886-cef9c487f124
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 294 inode 24665 errors 100, file extent discount
Found file extent holes:
  start: 3709534208, len: 163840
root 294 inode 406548 errors 100, file extent discount
Found file extent holes:
  start: 98729984, len: 720896
ERROR: errors found in fs roots
found 11280701063168 bytes used, error(s) found
total csum bytes: 10937464120
total tree bytes: 18538053632
total fs tree bytes: 5877579776
total extent tree bytes: 534052864
btree space waste bytes: 2316660292
file data blocks allocated: 17244587220992
   referenced 14211684794368
```


On Sat, Nov 28, 2020 at 8:46 AM Qu Wenruo  wrote:




On 2020/11/27 11:15 PM, Joe Hermaszewski wrote:

Hi Qu,

Thanks for the patch. I recompiled the kernel ran the scrub and your
patch worked as expected, here is the log:

```
[  337.365239] BTRFS info (device sda1): scrub: started on devid 2
[  337.366283] BTRFS info (device sda1): scrub: started on devid 1
[  337.402822] BTRFS info (device sda1): scrub: started on devid 3
[  337.411944] BTRFS info (device sda1): scrub: started on devid 4
[  471.997496] [ cut here ]
[  471.997614] WARNING: CPU: 0 PID: 218 at fs/btrfs/disk-io.c:531
btree_csum_one_bio+0x22c/0x278 [btrfs]
[  471.997616] Modules linked in: cfg80211 rfkill 8021q ip6table_nat
iptable_nat nf_nat ftdi_sio phy_generic usbserial xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter
ip6table_raw uio_pdrv_genirq iptable_raw uio xt_pkttype nf_log_ipv6
nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp ip

Re: [PATCH v2] btrfs: do more graceful error/warning for 32bit kernel

2021-04-08 Thread Qu Wenruo

Gentle ping?

Any update? I didn't see it merged into misc-next.

Thanks,
Qu

On 2021/2/25 9:18 AM, Qu Wenruo wrote:

Due to the pagecache limit of 32bit systems, btrfs can't access metadata
at or beyond (ULONG_MAX + 1) << PAGE_SHIFT.
This is 16T for a 4K page size and 256T for a 64K page size.

And unlike other filesystems, btrfs uses an internally mapped u64 address
space for all of its metadata, which makes this trickier to handle.

Users can have a fs which doesn't have metadata beyond the boundary at
mount time, but later balance can cause btrfs to create metadata beyond
the boundary.

And modifying the MM layer is unrealistic just for such a minor use
case.

To address this problem, this patch introduces the following checks,
much like how XFS handles it:

- Mount time rejection
   This will reject any fs which has metadata chunk at or beyond the
   boundary.

- Mount time early warning
   If there is any metadata chunk beyond 5/8 of the boundary, we do an
   early warning and hope the end user will see it.

- Runtime extent buffer rejection
   If we're going to allocate an extent buffer at or beyond the boundary,
   reject such request with -EOVERFLOW.
   This is definitely going to cause problems like transaction aborts, but
   we have no better way.

- Runtime extent buffer early warning
   If an extent buffer beyond 5/8 of the max file size is allocated, do
   an early warning.

The above error/warning messages will only be output once for each fs to
reduce dmesg flooding.

Reported-by: Erik Jensen 
Signed-off-by: Qu Wenruo 
---
Since we're here, there are some alternative methods to support 32bit
better:

- Multiple inodes/address spaces for metadata inodes
   This means we would have multiple metadata inodes.
   Inode 1 for 0~16TB, inodes 2 for 16~32TB, etc.

   The problem is we need to have extra wrapper to read/write metadata
   ranges.

- Remap metadata into 0~16TB range at runtime
   This doesn't really solve the problem, as for a fs with metadata usage
   larger than 16T we're busted again.
   And the remap mechanism can be pretty complex.

- Use a btrfs-internal page cache mechanism
   This can be the most complex way, but it would definitely solve the
   problem.

For now, I prefer method 1, but I still have doubts about the test
coverage for 32bit systems, and I'm not sure if it's really worth it.

Changelog:
v2:
- Calculate the boundary using PAGE_SHIFT
- Output the calculated boundary other than hardcoded value
---
  fs/btrfs/ctree.h | 18 +++
  fs/btrfs/extent_io.c | 12 ++
  fs/btrfs/super.c | 26 ++
  fs/btrfs/volumes.c   | 53 ++--
  4 files changed, 107 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 40ec3393d2a1..1373cae2db4f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -572,6 +572,12 @@ enum {
  
  	/* Indicate that we can't trust the free space tree for caching yet */

BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
+
+#if BITS_PER_LONG == 32
+   /* Indicate if we have error/warn message outputted for 32bit system */
+   BTRFS_FS_32BIT_ERROR,
+   BTRFS_FS_32BIT_WARN,
+#endif
  };
  
  /*

@@ -3405,6 +3411,18 @@ static inline void assertfail(const char *expr, const 
char* file, int line) { }
  #define ASSERT(expr)  (void)(expr)
  #endif
  
+#if BITS_PER_LONG == 32

+#define BTRFS_32BIT_MAX_FILE_SIZE (((u64)ULONG_MAX + 1) << PAGE_SHIFT)
+/*
+ * The warning threshold is 5/8 of the max file size.
+ *
+ * For 4K page size it should be 10T, for 64K it would be 160T.
+ */
+#define BTRFS_32BIT_EARLY_WARN_THRESHOLD (BTRFS_32BIT_MAX_FILE_SIZE * 5 / 8)
+void btrfs_warn_32bit_limit(struct btrfs_fs_info *fs_info);
+void btrfs_err_32bit_limit(struct btrfs_fs_info *fs_info);
+#endif
+
  /*
   * Get the correct offset inside the page of extent buffer.
   *
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4dfb3ead1175..6af6714d49c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5554,6 +5554,18 @@ struct extent_buffer *alloc_extent_buffer(struct 
btrfs_fs_info *fs_info,
return ERR_PTR(-EINVAL);
}
  
+#if BITS_PER_LONG == 32

+   if (start >= MAX_LFS_FILESIZE) {
+   btrfs_err(fs_info,
+   "extent buffer %llu is beyond 32bit page cache limit",
+ start);
+   btrfs_err_32bit_limit(fs_info);
+   return ERR_PTR(-EOVERFLOW);
+   }
+   if (start >= BTRFS_32BIT_EARLY_WARN_THRESHOLD)
+   btrfs_warn_32bit_limit(fs_info);
+#endif
+
if (fs_info->sectorsize < PAGE_SIZE &&
offset_in_page(start) + len > PAGE_SIZE) {
btrfs_err(fs_info,
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f8435641b912..d3f0e5294f50 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -252,6 +252,32 @@ void __cold 

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-06 Thread Qu Wenruo




On 2021/4/6 10:31 AM, Anand Jain wrote:

On 05/04/2021 14:14, Qu Wenruo wrote:



On 2021/4/3 7:08 PM, David Sterba wrote:

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:

This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.



Qu Wenruo (13):
   btrfs: add sysfs interface for supported sectorsize
   btrfs: use min() to replace open-code in btrfs_invalidatepage()
   btrfs: remove unnecessary variable shadowing in
btrfs_invalidatepage()
   btrfs: refactor how we iterate ordered extent in
 btrfs_invalidatepage()
   btrfs: introduce helpers for subpage dirty status
   btrfs: introduce helpers for subpage writeback status
   btrfs: allow btree_set_page_dirty() to do more sanity check on
subpage
 metadata
   btrfs: support subpage metadata csum calculation at write time
   btrfs: make alloc_extent_buffer() check subpage dirty bitmap
   btrfs: make the page uptodate assert to be subpage compatible
   btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
   btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
 compatible
   btrfs: add subpage overview comments


Moved from topic branch to misc-next.



Not sure if it's too late, but I inserted the last comment patch into
the wrong location.

In fact, there are 4 more patches to make




subpage metadata RW really work:


I took some time to go through these patches, which are lined up for
integration.

With this set of patches being integrated, we don't yet support RW
mounts of filesystems with PAGESIZE > sectorsize as a whole.
Subpage metadata RW support, how is it to be used in production?


I'd say, without the ability to write subpage metadata, how would
subpage even be utilized in a production environment?


OR How is this supposed to be tested?


There are two ways:
- Craft some scripts to only do metadata operations without any data
  writes

- Wait for my data write support then run regular full test suites

I used to go with method 1, but since my local branch already has full
subpage RW support, I'm doing method 2.

Although it exposes quite a few bugs in the data write path, it has been
quite a long time since the last metadata-related bug.



OR should you just clean up the title to mark these as preparatory
patches to support subpage RW? It is confusing.


Well, considering this is the last patchset before full subpage RW, such
a "preparatory" mention would be saved for the next big feature addition.
(Thankfully, there is no such plan yet.)

Thanks,
Qu



Thanks, Anand



btrfs: make lock_extent_buffer_for_io() to be subpage compatible
btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
btrfs: introduce end_bio_subpage_eb_writepage() function
btrfs: introduce write_one_subpage_eb() function

Those 4 patches should be before the final comment patch.

Should I just send the 4 patches in a separate series?

Sorry for the bad split, it looks like multi-series patchsets indeed have
such problems...

Thanks,
Qu




[PATCH 4/4] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page

2021-04-05 Thread Qu Wenruo
The new function, submit_eb_subpage(), will submit all the dirty extent
buffers in the page.

The major difference between submit_eb_page() and submit_eb_subpage()
is:
- How to grab extent buffer
  Now we use find_extent_buffer_nospinlock() instead of using
  page::private.

All other different handling is already done in functions like
lock_extent_buffer_for_io() and write_one_eb().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 95 
 1 file changed, 95 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index db40bc701a03..12321c06e212 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4323,6 +4323,98 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,
return ret;
 }
 
+/*
+ * Submit one subpage btree page.
+ *
+ * The main difference from submit_eb_page() is:
+ * - Page locking
+ *   For subpage, we don't rely on page locking at all.
+ *
+ * - Flush write bio
+ *   We only flush bio if we may be unable to fit current extent buffers into
+ *   current bio.
+ *
+ * Return >=0 for the number of submitted extent buffers.
+ * Return <0 for fatal error.
+ */
+static int submit_eb_subpage(struct page *page,
+struct writeback_control *wbc,
+struct extent_page_data *epd)
+{
+   struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+   int submitted = 0;
+   u64 page_start = page_offset(page);
+   int bit_start = 0;
+   int nbits = BTRFS_SUBPAGE_BITMAP_SIZE;
+   int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits;
+   int ret;
+
+   /* Lock and write each dirty extent buffers in the range */
+   while (bit_start < nbits) {
+   struct btrfs_subpage *subpage = (struct btrfs_subpage 
*)page->private;
+   struct extent_buffer *eb;
+   unsigned long flags;
+   u64 start;
+
+   /*
+* Take private lock to ensure the subpage won't be detached
+* halfway.
+*/
+   spin_lock(&page->mapping->private_lock);
+   if (!PagePrivate(page)) {
+   spin_unlock(&page->mapping->private_lock);
+   break;
+   }
+   spin_lock_irqsave(&subpage->lock, flags);
+   if (!((1 << bit_start) & subpage->dirty_bitmap)) {
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   spin_unlock(&page->mapping->private_lock);
+   bit_start++;
+   continue;
+   }
+
+   start = page_start + bit_start * fs_info->sectorsize;
+   bit_start += sectors_per_node;
+
+   /*
+* Here we just want to grab the eb without touching extra
+* spin locks. So here we call find_extent_buffer_nospinlock().
+*/
+   eb = find_extent_buffer_nospinlock(fs_info, start);
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   spin_unlock(&page->mapping->private_lock);
+
+   /*
+* The eb has already reached 0 refs thus find_extent_buffer()
+* doesn't return it. We don't need to write back such eb
+* anyway.
+*/
+   if (!eb)
+   continue;
+
+   ret = lock_extent_buffer_for_io(eb, epd);
+   if (ret == 0) {
+   free_extent_buffer(eb);
+   continue;
+   }
+   if (ret < 0) {
+   free_extent_buffer(eb);
+   goto cleanup;
+   }
+   ret = write_one_eb(eb, wbc, epd);
+   free_extent_buffer(eb);
+   if (ret < 0)
+   goto cleanup;
+   submitted++;
+   }
+   return submitted;
+
+cleanup:
+   /* We hit error, end bio for the submitted extent buffers */
+   end_write_bio(epd, ret);
+   return ret;
+}
+
 /*
  * Submit all page(s) of one extent buffer.
  *
@@ -4355,6 +4447,9 @@ static int submit_eb_page(struct page *page, struct 
writeback_control *wbc,
if (!PagePrivate(page))
return 0;
 
+   if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
+   return submit_eb_subpage(page, wbc, epd);
+
spin_lock(&mapping->private_lock);
if (!PagePrivate(page)) {
spin_unlock(&mapping->private_lock);
-- 
2.31.1



[PATCH 3/4] btrfs: make lock_extent_buffer_for_io() to be subpage compatible

2021-04-05 Thread Qu Wenruo
For subpage metadata, we don't use page locking at all.
So just skip the page locking part for subpage.

All the remaining routine can be reused.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9567e7b2b6cf..db40bc701a03 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3967,7 +3967,13 @@ static noinline_for_stack int 
lock_extent_buffer_for_io(struct extent_buffer *eb
 
btrfs_tree_unlock(eb);
 
-   if (!ret)
+   /*
+* Either we don't need to submit any tree block, or we're submitting
+* subpage.
+* Subpage metadata doesn't use page locking at all, so we can skip
+* the page locking.
+*/
+   if (!ret || fs_info->sectorsize < PAGE_SIZE)
return ret;
 
num_pages = num_extent_pages(eb);
-- 
2.31.1



[PATCH 2/4] btrfs: introduce write_one_subpage_eb() function

2021-04-05 Thread Qu Wenruo
The new function, write_one_subpage_eb(), as a subroutine for subpage
metadata write, will handle the extent buffer bio submission.

The major differences between the new write_one_subpage_eb() and
write_one_eb() is:
- No page locking
  When entering write_one_subpage_eb() the page is no longer locked.
  We only lock the page for its status update, and unlock immediately.
  Now we completely rely on extent io tree locking.

- Extra bitmap update along with page status update
  Now page dirty and writeback is controlled by
  btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
  They both follow the schema that any sector is dirty/writeback, then
  the full page get dirty/writeback.

- When to update the nr_written number
  Now we take a shortcut: if we have cleared the last dirty bit of the
  page, we update nr_written.
  This is not completely perfect, but should emulate the old behavior
  well enough.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 55 
 1 file changed, 55 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ea8ee925738a..9567e7b2b6cf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4196,6 +4196,58 @@ static void end_bio_extent_buffer_writepage(struct bio 
*bio)
bio_put(bio);
 }
 
+/*
+ * Unlike the work in write_one_eb(), we rely completely on extent locking.
+ * Page locking is only minimally utilized to keep the VM code happy.
+ *
+ * Callers should still call write_one_eb() rather than this function directly,
+ * as write_one_eb() does extra preparation before submitting the extent buffer.
+ */
+static int write_one_subpage_eb(struct extent_buffer *eb,
+   struct writeback_control *wbc,
+   struct extent_page_data *epd)
+{
+   struct btrfs_fs_info *fs_info = eb->fs_info;
+   struct page *page = eb->pages[0];
+   unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
+   bool no_dirty_ebs = false;
+   int ret;
+
+   /* clear_page_dirty_for_io() in subpage helper need page locked. */
+   lock_page(page);
+   btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
+
+   /* If we're the last dirty bit to update nr_written */
+   no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
+ eb->start, eb->len);
+   if (no_dirty_ebs)
+   clear_page_dirty_for_io(page);
+
+   ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
+   eb->start, eb->len, eb->start - page_offset(page),
+   &epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
+   false);
+   if (ret) {
+   btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+ eb->len);
+   set_btree_ioerr(page, eb);
+   unlock_page(page);
+
+   if (atomic_dec_and_test(&eb->io_pages))
+   end_extent_buffer_writeback(eb);
+   return -EIO;
+   }
+   unlock_page(page);
+   /*
+* Submission finishes without problem, if no range of the page is
+* dirty anymore, we have submitted a page.
+* Update the nr_written in wbc.
+*/
+   if (no_dirty_ebs)
+   update_nr_written(wbc, 1);
+   return ret;
+}
+
 static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
struct writeback_control *wbc,
struct extent_page_data *epd)
@@ -4227,6 +4279,9 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,
memzero_extent_buffer(eb, start, end - start);
}
 
+   if (eb->fs_info->sectorsize < PAGE_SIZE)
+   return write_one_subpage_eb(eb, wbc, epd);
+
for (i = 0; i < num_pages; i++) {
struct page *p = eb->pages[i];
 
-- 
2.31.1



[PATCH 1/4] btrfs: introduce end_bio_subpage_eb_writepage() function

2021-04-05 Thread Qu Wenruo
The new function, end_bio_subpage_eb_writepage(), will handle the
metadata writeback endio.

The major differences involved are:
- How to grab extent buffer
  Now that page::private is a pointer to btrfs_subpage, we can no longer
  grab the extent buffer directly.
  Thus we need to use the bv_offset to locate the extent buffer manually
  and iterate through the whole range.

- Use btrfs_subpage_end_writeback() caller
  This helper will handle the subpage writeback for us.

Since this function is executed under endio context, when grabbing
extent buffers it can't grab eb->refs_lock as that lock is not designed
to be grabbed under hardirq context.

So here introduce a helper, find_extent_buffer_nospinlock(), for such
situation, and convert find_extent_buffer() to use that helper.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 135 +--
 1 file changed, 106 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9bebc6786b15..ea8ee925738a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4080,13 +4080,97 @@ static void set_btree_ioerr(struct page *page, struct 
extent_buffer *eb)
}
 }
 
+/*
+ * This is the endio specific version which won't touch any unsafe spinlock
+ * in endio context.
+ */
+static struct extent_buffer *find_extent_buffer_nospinlock(
+   struct btrfs_fs_info *fs_info, u64 start)
+{
+   struct extent_buffer *eb;
+
+   rcu_read_lock();
+   eb = radix_tree_lookup(&fs_info->buffer_radix,
+  start >> fs_info->sectorsize_bits);
+   if (eb && atomic_inc_not_zero(&eb->refs)) {
+   rcu_read_unlock();
+   return eb;
+   }
+   rcu_read_unlock();
+   return NULL;
+}
+/*
+ * The endio function for subpage extent buffer write.
+ *
+ * Unlike end_bio_extent_buffer_writepage(), we only call end_page_writeback()
+ * after all extent buffers in the page have finished their writeback.
+ */
+static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
+struct bio *bio)
+{
+   struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
+
+   ASSERT(!bio_flagged(bio, BIO_CLONED));
+   bio_for_each_segment_all(bvec, bio, iter_all) {
+   struct page *page = bvec->bv_page;
+   u64 bvec_start = page_offset(page) + bvec->bv_offset;
+   u64 bvec_end = bvec_start + bvec->bv_len - 1;
+   u64 cur_bytenr = bvec_start;
+
+   ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+
+   /* Iterate through all extent buffers in the range */
+   while (cur_bytenr <= bvec_end) {
+   struct extent_buffer *eb;
+   int done;
+
+   /*
+* Here we can't use find_extent_buffer(), as it may
+* try to lock eb->refs_lock, which is not safe in endio
+* context.
+*/
+   eb = find_extent_buffer_nospinlock(fs_info, cur_bytenr);
+   ASSERT(eb);
+
+   cur_bytenr = eb->start + eb->len;
+
+   ASSERT(test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags));
+   done = atomic_dec_and_test(&eb->io_pages);
+   ASSERT(done);
+
+   if (bio->bi_status ||
+   test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+   ClearPageUptodate(page);
+   set_btree_ioerr(page, eb);
+   }
+
+   btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+ eb->len);
+   end_extent_buffer_writeback(eb);
+   /*
+* free_extent_buffer() will grab spinlock which is not
+* safe in endio context. Thus here we manually dec
+* the ref.
+*/
+   atomic_dec(&eb->refs);
+   }
+   }
+   bio_put(bio);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio)
 {
+   struct btrfs_fs_info *fs_info;
struct bio_vec *bvec;
struct extent_buffer *eb;
int done;
struct bvec_iter_all iter_all;
 
+   fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
+   if (fs_info->sectorsize < PAGE_SIZE)
+   return end_bio_subpage_eb_writepage(fs_info, bio);
+
ASSERT(!bio_flagged(bio, BIO_CLONED));
bio_for_each_segment_all(bvec, bio, iter_all) {
struct page *page = bvec->bv_page;
@@ -5467,36 +5551,29 @@ struct extent_bu

[PATCH 0/4] btrfs: the missing 4 patches to implement metadata write path

2021-04-05 Thread Qu Wenruo
When adding the comments for subpage metadata code, I inserted the
comment patch into the wrong position, and then use that patch as a
separator between data and metadata write path.

Thus the submitted metadata write path patchset lacks the real functions
to submit subpage metadata write bio.

Qu Wenruo (4):
  btrfs: introduce end_bio_subpage_eb_writepage() function
  btrfs: introduce write_one_subpage_eb() function
  btrfs: make lock_extent_buffer_for_io() to be subpage compatible
  btrfs: introduce submit_eb_subpage() to submit a subpage metadata page

 fs/btrfs/extent_io.c | 293 ++-
 1 file changed, 263 insertions(+), 30 deletions(-)

-- 
2.31.1



Re: Device missing with RAID1 on boot - observations

2021-04-05 Thread Qu Wenruo




On 2021/4/5 11:18 PM, Steven Davies wrote:

Kernel: 5.11.8 vanilla, btrfs-progs 5.11.1

I booted a box with a root btrfs raid1 across two devices,
/dev/nvme0n1p2 (devid 2) and /dev/sda2 (devid 3). For whatever reason
during the initrd stage, btrfs device scan was unable to see the NVMe
device and mounted the rootfs degraded after multiple retries as I had
designed in the init script.


It looks more like a problem in your initramfs environment.

The more likely cause is that your initramfs only has drivers for SATA
disks, but no NVMe modules.

You may try to include the nvme module in your initramfs to see if that
solves the problem.

Thanks,
Qu



Once booted apparently the kernel was able to see nvme0n1p2 again (with
no intervention from me) and btrfs device usage / btrfs filesystem show
did not report any missing devices. btrfs scrub reported that devid 2
was unwriteable but the scrub completed successfully on devid 3 with no
errors. New block groups for data and metadata were being created as
single on devid 3.

I balanced with -dconvert=single -mconvert=dup which moved all block
groups to devid 3 and completed successfully; there was nothing
remaining on devid 2 so I removed the device from the filesystem and
re-added it as devid 4. Once I'd balanced the filesystem back to
-dconvert=raid1 -mconvert=raid1 everything was back to normal.

My main observation was that it was very hard to notice that there was
an issue. Yes, I'd purposefully mounted as degraded, but there was no
indication from the btrfs tools as to why new block groups were only
being created as single on one device: nothing was marked as missing or
unwriteable. Is this behavour expected? How can a device be unwriteable
but not marked as missing?

Was my course of action to correct the issue correct - is there a better
way to re-sync a raid1 device which has temporarily been removed?

(Afterwards I realised what caused the issue - missing libraries in the
initrd - and I can reproduce it if necessary.)



Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-04 Thread Qu Wenruo




On 2021/4/3 7:08 PM, David Sterba wrote:

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:

This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.



Qu Wenruo (13):
   btrfs: add sysfs interface for supported sectorsize
   btrfs: use min() to replace open-code in btrfs_invalidatepage()
   btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
   btrfs: refactor how we iterate ordered extent in
 btrfs_invalidatepage()
   btrfs: introduce helpers for subpage dirty status
   btrfs: introduce helpers for subpage writeback status
   btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
 metadata
   btrfs: support subpage metadata csum calculation at write time
   btrfs: make alloc_extent_buffer() check subpage dirty bitmap
   btrfs: make the page uptodate assert to be subpage compatible
   btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
   btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
 compatible
   btrfs: add subpage overview comments


Moved from topic branch to misc-next.



Not sure if it's too late, but I inserted the last comment patch into
the wrong location.

In fact, there are 4 more patches to make subpage metadata RW really work:
 btrfs: make lock_extent_buffer_for_io() to be subpage compatible
 btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
 btrfs: introduce end_bio_subpage_eb_writepage() function
 btrfs: introduce write_one_subpage_eb() function

Those 4 patches should be before the final comment patch.

Should I just send the 4 patches in a separate series?

Sorry for the bad split, it looks like multi-series patchsets indeed have
such problems...

Thanks,
Qu


Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-02 Thread Qu Wenruo




On 2021/4/2 4:46 PM, Ritesh Harjani wrote:

On 21/04/02 04:36PM, Qu Wenruo wrote:



On 2021/4/2 4:33 PM, Ritesh Harjani wrote:

On 21/03/29 10:01AM, Qu Wenruo wrote:



On 2021/3/29 4:02 AM, Ritesh Harjani wrote:

On 21/03/25 09:16PM, Qu Wenruo wrote:



On 2021/3/25 8:20 PM, Neal Gompa wrote:

On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:


This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patch from
github repo and do fstests/generic tests.

But at least the full RW patchset can pass -g generic/quick -x defrag
for now.

There are some known issues:

- Defrag behavior change
  Since current defrag is doing per-page defrag, to support subpage
  defrag, we need some change in the loop.
  E.g. if a page has both hole and regular extents in it, then defrag
  will rewrite the full 64K page.

  Thus for now, defrag-related failures are expected.
  But this should only cause a behavior difference; no crash nor hang is
  expected.

- No compression support yet
  There are at least 2 known bugs if forcing compression for subpage
  * Some hard coded PAGE_SIZE screwing up space rsv
  * Subpage ASSERT() triggered
This is because some compression code is unlocking locked_page by
calling extent_clear_unlock_delalloc() with locked_page == NULL.
  So for now compression is also disabled.

- Inode nbytes mismatch
  Still debugging.
  The fastest way to trigger is fsx using the following parameters:

fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx

  Which would cause inode nbytes to differ from the expected value and
  trigger a btrfs check error.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than data part, as it has
some different behaviors compared to the regular sector size handling:

- No more page locking
  Now metadata read/write relies on extent io tree locking, rather than
  page locking.
  This is to allow behaviors like read lock one eb while also try to
  read lock another eb in the same page.
  We can't rely on page lock as now we have multiple extent buffers in
  the same page.

- Page status update
  Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
  Instead of just grabbing extent buffer from page::private, we need to
  iterate all dirty extent buffers in the page and submit them.

[CHANGELOG]
v2:
- Rebased to latest misc-next
  No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
  This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming schema for each patch
  No more "extent_io:" or "inode:" schema anymore.

- Move two pure cleanups to the series
  Patch 2~3, originally in RW part.

- Fix one uninitialized variable
  Patch 6.

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
  This removes 2 cleanup patches.

- Add new overview comment for subpage metadata

Qu Wenruo (13):
  btrfs: add sysfs interface for supported sectorsize
  btrfs: use min() to replace open-code in btrfs_invalidatepage()
  btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
  btrfs: refactor how we iterate ordered extent in
btrfs_invalidatepage()
  btrfs: introduce helpers for subpage dirty status
  btrfs: introduce helpers for subpage writeback status
  btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
metadata
  btrfs: support subpage metadata csum calculation at write time
  btrfs: make alloc_extent_buffer() check subpage dirty bitmap
  btrfs: make the page uptodate assert to be subpage compatible
  btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
  btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
compatible
  btrfs: add subpage overview comments

 fs/btrfs/disk-io.c   | 143 ++-
 fs/btrfs/extent_io.c | 127 --
 fs/btrfs/inode.c | 128 ++
 fs/btrfs/subpage.c   | 127 ++
 fs/btrfs/subpage.h   |  17 +
 fs/btrfs/sysfs.c |  15 +
 6 files changed, 441 insertions(+), 116 deletions(-)

--
2.30.1



Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable to do
that, and this patch set is essentially unusable without the rest of
it that does enable full read-write support.


The metadata part is much more stable than data path (almost not touched
for several months), and the metadata part already has some difference
in its behavior, which needs review.

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-02 Thread Qu Wenruo




On 2021/4/2 下午4:33, Ritesh Harjani wrote:

On 21/03/29 10:01AM, Qu Wenruo wrote:



On 2021/3/29 上午4:02, Ritesh Harjani wrote:

On 21/03/25 09:16PM, Qu Wenruo wrote:



On 2021/3/25 下午8:20, Neal Gompa wrote:

On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:


This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patch from
github repo and do fstests/generic tests.

But at least the full RW patchset can pass -g generic/quick -x defrag
for now.

There are some known issues:

- Defrag behavior change
 Since current defrag is doing per-page defrag, to support subpage
 defrag, we need some change in the loop.
 E.g. if a page has both hole and regular extents in it, then defrag
 will rewrite the full 64K page.

 Thus for now, defrag related failure is expected.
 But this should only cause behavior difference, no crash nor hang is
 expected.

- No compression support yet
 There are at least 2 known bugs if forcing compression for subpage
 * Some hard coded PAGE_SIZE screwing up space rsv
 * Subpage ASSERT() triggered
   This is because some compression code is unlocking locked_page by
   calling extent_clear_unlock_delalloc() with locked_page == NULL.
 So for now compression is also disabled.

- Inode nbytes mismatch
 Still debugging.
 The fastest way to trigger is fsx using the following parameters:

   fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx

 Which would cause inode nbytes differs from expected value and
 triggers btrfs check error.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than data part, as it has
some different behaviors compared to the regular sector size handling:

- No more page locking
 Now metadata read/write relies on extent io tree locking, other than
 page locking.
 This is to allow behaviors like read lock one eb while also try to
 read lock another eb in the same page.
 We can't rely on page lock as now we have multiple extent buffers in
 the same page.

- Page status update
 Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
 Instead of just grabbing extent buffer from page::private, we need to
 iterate all dirty extent buffers in the page and submit them.

[CHANGELOG]
v2:
- Rebased to latest misc-next
 No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
 This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming schema for each patch
 No more "extent_io:" or "inode:" schema anymore.

- Move two pure cleanups to the series
 Patch 2~3, originally in RW part.

- Fix one uninitialized variable
 Patch 6.

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
 This removes 2 cleanup patches.

- Add new overview comment for subpage metadata

Qu Wenruo (13):
 btrfs: add sysfs interface for supported sectorsize
 btrfs: use min() to replace open-code in btrfs_invalidatepage()
 btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
 btrfs: refactor how we iterate ordered extent in
   btrfs_invalidatepage()
 btrfs: introduce helpers for subpage dirty status
 btrfs: introduce helpers for subpage writeback status
 btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
   metadata
 btrfs: support subpage metadata csum calculation at write time
 btrfs: make alloc_extent_buffer() check subpage dirty bitmap
 btrfs: make the page uptodate assert to be subpage compatible
 btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
 btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
   compatible
 btrfs: add subpage overview comments

fs/btrfs/disk-io.c   | 143 ++-
fs/btrfs/extent_io.c | 127 --
fs/btrfs/inode.c | 128 ++
fs/btrfs/subpage.c   | 127 ++
fs/btrfs/subpage.h   |  17 +
fs/btrfs/sysfs.c |  15 +
6 files changed, 441 insertions(+), 116 deletions(-)

--
2.30.1



Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable to do
that, and this patch set is essentially unusable without the rest of
it that does enable full read-write support.


The metadata part is much more stable than data path (almost not touched
for several months), and the metadata part already has some difference
in its behavior, which needs review.

Re: [PATCH v3 04/13] btrfs: refactor how we iterate ordered extent in btrfs_invalidatepage()

2021-04-01 Thread Qu Wenruo



On 2021/4/2 上午9:15, Anand Jain wrote:

On 25/03/2021 15:14, Qu Wenruo wrote:

In btrfs_invalidatepage(), we need to iterate through all ordered
extents and finish them.

This involved a loop to exhaust all ordered extents, but that loop is
implemented using again: label and goto.

Refactor the code by:
- Use a while() loop


Just an observation.
At a minimum, the while loop does 2 iterations before breaking, whereas
label and goto could do it without reaching the goto at all for the same
value of %length. So the label-and-goto approach is still faster.


Although it's a dead patch now, I feel it's still worth addressing some
questions here, as even in newer refactors there will be some similar code.


First, the loop only does one iteration of real work.
After one pass through the loop body, @cur will be at page_end, thus exiting the loop.

The loop body never gets executed twice; only the condition is checked
twice, which is the same as the old code.


A question below.


- Extract the code to finish/dec an ordered extent into its own function
    The new function, invalidate_ordered_extent(), will handle the
    extent locking, the extent bit update, and finishing/decrementing the
    ordered extent.

In fact, for the regular sectorsize == PAGE_SIZE case, there can be at
most one ordered extent for one page; the loop handling multiple ordered
extents comes from the ancient subpage preparation patchset.

But there is a bug hidden inside the ordered extent finish/dec part.

This patch will remove the ability to handle multiple ordered extents,
and add extra ASSERT() to make sure for regular sectorsize we won't have
anything wrong.

For the proper subpage support, it will be added in later patches.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/inode.c | 122 +--
  1 file changed, 75 insertions(+), 47 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d777f67d366b..99dcadd31870 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8355,17 +8355,72 @@ static int btrfs_migratepage(struct address_space *mapping,

  }
  #endif
+/*
+ * Helper to finish/dec one ordered extent for btrfs_invalidatepage().
+ *
+ * Return true if the ordered extent is finished.
+ * Return false otherwise
+ */
+static bool invalidate_ordered_extent(struct btrfs_inode *inode,
+  struct btrfs_ordered_extent *ordered,
+  struct page *page,
+  struct extent_state **cached_state,
+  bool inode_evicting)
+{
+    u64 start = page_offset(page);
+    u64 end = page_offset(page) + PAGE_SIZE - 1;
+    u32 len = PAGE_SIZE;
+    bool completed_ordered = false;
+
+    /*
+ * For regular sectorsize == PAGE_SIZE, if the ordered extent covers
+ * the page, then it must cover the full page.
+ */
+    ASSERT(ordered->file_offset <= start &&
+   ordered->file_offset + ordered->num_bytes > end);
+    /*
+ * IO on this page will never be started, so we need to account
+ * for any ordered extents now. Don't clear EXTENT_DELALLOC_NEW
+ * here, must leave that up for the ordered extent completion.
+ */
+    if (!inode_evicting)
+    clear_extent_bit(&inode->io_tree, start, end,
+ EXTENT_DELALLOC | EXTENT_LOCKED |
+ EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG, 1, 0,
+ cached_state);
+    /*
+ * Whoever cleared the private bit is responsible for the
+ * finish_ordered_io
+ */
+    if (TestClearPagePrivate2(page)) {
+    spin_lock_irq(&inode->ordered_tree.lock);
+    set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+    ordered->truncated_len = min(ordered->truncated_len,
+ start - ordered->file_offset);
+    spin_unlock_irq(&inode->ordered_tree.lock);
+
+    if (btrfs_dec_test_ordered_pending(inode, &ordered, start, len, 1)) {
+    btrfs_finish_ordered_io(ordered);
+    completed_ordered = true;
+    }
+    }
+    btrfs_put_ordered_extent(ordered);
+    if (!inode_evicting) {
+    *cached_state = NULL;
+    lock_extent_bits(&inode->io_tree, start, end, cached_state);
+    }
+    return completed_ordered;
+}
+
  static void btrfs_invalidatepage(struct page *page, unsigned int offset,
   unsigned int length)
  {
  struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
  struct extent_io_tree *tree = &inode->io_tree;
-    struct btrfs_ordered_extent *ordered;
  struct extent_state *cached_state = NULL;
  u64 page_start = page_offset(page);
  u64 page_end = page_start + PAGE_SIZE - 1;
-    u64 start;
-    u64 end;
+    u64 cur;
  int inode_evicting = inode->vfs_inode.i_state & I_FREEING;
  bool found_ordered = false;
  bool completed_ordered = false;
@@ -8387,51 +8442,24 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
  if (!inode_evicting)
 

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-01 Thread Qu Wenruo




On 2021/4/2 上午9:39, Anand Jain wrote:

On 29/03/2021 10:01, Qu Wenruo wrote:



On 2021/3/29 上午4:02, Ritesh Harjani wrote:

On 21/03/25 09:16PM, Qu Wenruo wrote:



On 2021/3/25 下午8:20, Neal Gompa wrote:

On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:


This patchset can be fetched from the following github repo, along
with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patch from
github repo and do fstests/generic tests.

But at least the full RW patchset can pass -g generic/quick -x defrag
for now.

There are some known issues:

- Defrag behavior change
    Since current defrag is doing per-page defrag, to support subpage
    defrag, we need some change in the loop.
    E.g. if a page has both hole and regular extents in it, then defrag
    will rewrite the full 64K page.

    Thus for now, defrag related failure is expected.
    But this should only cause behavior difference, no crash nor hang is
    expected.

- No compression support yet
    There are at least 2 known bugs if forcing compression for subpage
    * Some hard coded PAGE_SIZE screwing up space rsv
    * Subpage ASSERT() triggered
      This is because some compression code is unlocking locked_page by
      calling extent_clear_unlock_delalloc() with locked_page == NULL.
    So for now compression is also disabled.

- Inode nbytes mismatch
    Still debugging.
    The fastest way to trigger is fsx using the following parameters:

      fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx

    Which would cause inode nbytes differs from expected value and
    triggers btrfs check error.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than data part, as it has
some different behaviors compared to the regular sector size handling:

- No more page locking
    Now metadata read/write relies on extent io tree locking, other than
    page locking.
    This is to allow behaviors like read lock one eb while also try to
    read lock another eb in the same page.
    We can't rely on page lock as now we have multiple extent buffers in
    the same page.

- Page status update
    Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
    Instead of just grabbing extent buffer from page::private, we need to
    iterate all dirty extent buffers in the page and submit them.

[CHANGELOG]
v2:
- Rebased to latest misc-next
    No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
    This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming schema for each patch
    No more "extent_io:" or "inode:" schema anymore.

- Move two pure cleanups to the series
    Patch 2~3, originally in RW part.

- Fix one uninitialized variable
    Patch 6.

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
    This removes 2 cleanup patches.

- Add new overview comment for subpage metadata

Qu Wenruo (13):
    btrfs: add sysfs interface for supported sectorsize
    btrfs: use min() to replace open-code in btrfs_invalidatepage()
    btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
    btrfs: refactor how we iterate ordered extent in
  btrfs_invalidatepage()
    btrfs: introduce helpers for subpage dirty status
    btrfs: introduce helpers for subpage writeback status
    btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
  metadata
    btrfs: support subpage metadata csum calculation at write time
    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
    btrfs: make the page uptodate assert to be subpage compatible
    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
  compatible
    btrfs: add subpage overview comments

   fs/btrfs/disk-io.c   | 143 ++-
   fs/btrfs/extent_io.c | 127 --
   fs/btrfs/inode.c | 128 ++
   fs/btrfs/subpage.c   | 127 ++
   fs/btrfs/subpage.h   |  17 +
   fs/btrfs/sysfs.c |  15 +
   6 files changed, 441 insertions(+), 116 deletions(-)

--
2.30.1



Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable to do
that, and this patch set is essentially unusable without the rest of
it that does enable full read-write support.


The metadata part is much more stable than data path (almost not touched
for several months), and the metadata part already has some difference
in its behavior, which needs review.

Your point makes some sense, but I still don't believe pushing a super
large patchset does any help.

[PATCH v2] btrfs: use u32 for length related members of btrfs_ordered_extent

2021-04-01 Thread Qu Wenruo
Unlike btrfs_file_extent_item, btrfs_ordered_extent has its length
limit (BTRFS_MAX_EXTENT_SIZE), which is far smaller than U32_MAX.

Using u64 for those length related members is just a waste of memory.

This patch will make the following members u32:
- num_bytes
- disk_num_bytes
- bytes_left
- truncated_len

This will save 16 bytes for btrfs_ordered_extent structure.

For btrfs_add_ordered_extent*() call sites, they are mostly deeply
inside other functions passing u64.
Thus this patch will keep those u64, but do internal ASSERT() to ensure
the correct length values are passed in.

For btrfs_dec_test_.*_ordered_extent() call sites, length related
parameters are converted to u32, with extra ASSERT() added to ensure we
get correct values passed in.

A special conversion is needed in btrfs_remove_ordered_extent(), which
needs s64, as using "-entry->num_bytes" from u32 directly will cause an
underflow.

Signed-off-by: Qu Wenruo 
---
Changelog:
v2:
- Fix an underflow caused by an incorrect conversion from u32 to s64
  This would cause error messages like
  "BTRFS info (device dm-3): at unmount dio bytes count 14478334754816"
  However this error message doesn't trigger any kernel calltrace, thus
  not exposed by previous fstests run.
---
 fs/btrfs/inode.c|  8 ++--
 fs/btrfs/ordered-data.c | 21 ++---
 fs/btrfs/ordered-data.h | 25 ++---
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9b87c3e4fa7b..0113598e6ba1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3070,6 +3070,7 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
struct btrfs_ordered_extent *ordered_extent = NULL;
struct btrfs_workqueue *wq;
 
+   ASSERT(end + 1 - start < U32_MAX);
trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
ClearPagePrivate2(page);
@@ -7969,6 +7970,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
else
wq = fs_info->endio_write_workers;
 
+   ASSERT(bytes < U32_MAX);
while (ordered_offset < offset + bytes) {
last_offset = ordered_offset;
if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
@@ -8415,10 +8417,12 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
if (TestClearPagePrivate2(page)) {
spin_lock_irq(&inode->ordered_tree.lock);
set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-   ordered->truncated_len = min(ordered->truncated_len,
+   ASSERT(start - ordered->file_offset < U32_MAX);
+   ordered->truncated_len = min_t(u32, ordered->truncated_len,
start - ordered->file_offset);
spin_unlock_irq(&inode->ordered_tree.lock);
 
+   ASSERT(end - start + 1 < U32_MAX);
if (btrfs_dec_test_ordered_pending(inode, &ordered,
   start,
   end - start + 1, 1)) {
@@ -8937,7 +8941,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
break;
else {
btrfs_err(root->fs_info,
- "found ordered extent %llu %llu on inode cleanup",
+ "found ordered extent %llu %u on inode cleanup",
  ordered->file_offset, ordered->num_bytes);
btrfs_remove_ordered_extent(inode, ordered);
btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 07b0b4218791..8e6d9d906bdd 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -160,6 +160,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
struct btrfs_ordered_extent *entry;
int ret;
 
+   /*
+* Basic size check, all length related members should be smaller
+* than U32_MAX.
+*/
+   ASSERT(num_bytes < U32_MAX && disk_num_bytes < U32_MAX);
+
if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
/* For nocow write, we can release the qgroup rsv right now */
ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
@@ -186,7 +192,7 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
entry->bytes_left = num_bytes;
entry->inode = igrab(&inode->vfs_inode);
entry->compress_type = compress_type;
-   entry->truncated_len = (u64)-1;
entry->truncated_len = (u32)-1;

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-03-31 Thread Qu Wenruo




On 2021/3/30 上午2:53, David Sterba wrote:

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
   This removes 2 cleanup patches.

- Add new overview comment for subpage metadata


V3 is now in for-next, targeting merge for 5.13. Please post any fixups
as replies to the individual patches, I'll fold them in, rather a full
series resend. Thanks.


Is it possible to drop patch "[PATCH v3 04/13] btrfs: refactor how we
iterate ordered extent in btrfs_invalidatepage()"?

Since in the series, there are no other patches touching it, dropping it
should not involve too much hassle.

The problem here is, how we handle ordered extent really belongs to the
data write path.

Furthermore, after all the data RW related testing, it turns out that
the ordered extent code has several problems:

- Separate indicators for ordered extent
  We use PagePrivate2 to indicate whether we have pending ordered extent
  io.
  But it is not properly integrated into ordered extent code, nor really
  properly documented.

- Complex call sites requirement
  For endio we don't care whether we finished the ordered extent, while
  for invalidatepage, we don't really need to bother if we finished all
  the ordered extents in the range.

  Thus we really don't need to bother who finished the ordered extents,
  but just want to mark the io finished for the range.

- Lack subpage compatibility
  That's why I'm here complaining, especially due to the PagePrivate2
  usage.
  It needs to be converted to a new bitmap.

There will be a refactor on the btrfs_dec_test_*_ordered_pending()
functions soon, and obvious the existing call sites will all be gone.

Thus that fourth patch makes no sense.

If needed, I can resend the patchset without that patch.

Thanks,
Qu


[PATCH] btrfs: use u32 for length related members of btrfs_ordered_extent

2021-03-29 Thread Qu Wenruo
Unlike btrfs_file_extent_item, btrfs_ordered_extent has its length
limit (BTRFS_MAX_EXTENT_SIZE), which is far smaller than U32_MAX.

Using u64 for those length related members is just a waste of memory.

This patch will make the following members u32:
- num_bytes
- disk_num_bytes
- bytes_left
- truncated_len

This will save 16 bytes for btrfs_ordered_extent structure.

For btrfs_add_ordered_extent*() call sites, they are mostly deeply
inside other functions passing u64.
Thus this patch will keep those u64, but do internal ASSERT() to ensure
the correct length values are passed in.

For btrfs_dec_test_.*_ordered_extent() call sites, length related
parameters are converted to u32, with extra ASSERT() added to ensure we
get correct values passed in.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c|  5 -
 fs/btrfs/ordered-data.c | 18 --
 fs/btrfs/ordered-data.h | 25 ++---
 3 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 288c7ce63a32..1278c808c737 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3070,6 +3070,7 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
struct btrfs_ordered_extent *ordered_extent = NULL;
struct btrfs_workqueue *wq;
 
+   ASSERT(end + 1 - start < U32_MAX);
trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
ClearPagePrivate2(page);
@@ -7965,6 +7966,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
else
wq = fs_info->endio_write_workers;
 
+   ASSERT(bytes < U32_MAX);
while (ordered_offset < offset + bytes) {
last_offset = ordered_offset;
if (btrfs_dec_test_first_ordered_pending(inode, &ordered,
@@ -8421,6 +8423,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
ordered->truncated_len = new_len;
spin_unlock_irq(&tree->lock);
 
+   ASSERT(end - start + 1 < U32_MAX);
if (btrfs_dec_test_ordered_pending(inode, &ordered,
   start,
   end - start + 1, 1)) {
@@ -8939,7 +8942,7 @@ void btrfs_destroy_inode(struct inode *vfs_inode)
break;
else {
btrfs_err(root->fs_info,
- "found ordered extent %llu %llu on inode cleanup",
+ "found ordered extent %llu %u on inode cleanup",
  ordered->file_offset, ordered->num_bytes);
btrfs_remove_ordered_extent(inode, ordered);
btrfs_put_ordered_extent(ordered);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 07b0b4218791..386f6ef8fe2f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -160,6 +160,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
struct btrfs_ordered_extent *entry;
int ret;
 
+   /*
+* Basic size check, all length related members should be smaller
+* than U32_MAX.
+*/
+   ASSERT(num_bytes < U32_MAX && disk_num_bytes < U32_MAX);
+
if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) {
/* For nocow write, we can release the qgroup rsv right now */
ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
@@ -186,7 +192,7 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset
entry->bytes_left = num_bytes;
entry->inode = igrab(&inode->vfs_inode);
entry->compress_type = compress_type;
-   entry->truncated_len = (u64)-1;
+   entry->truncated_len = (u32)-1;
entry->qgroup_rsv = ret;
entry->physical = (u64)-1;
entry->disk = NULL;
@@ -320,7 +326,7 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
  */
 bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
   struct btrfs_ordered_extent **finished_ret,
-  u64 *file_offset, u64 io_size, int uptodate)
+  u64 *file_offset, u32 io_size, int uptodate)
 {
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct btrfs_ordered_inode_tree *tree = &inode->ordered_tree;
@@ -330,7 +336,7 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode,
unsigned long flags;
u64 dec_end;
u64 dec_start;
-   u64 to_dec;
+   u32 to_dec;
 
spin_lock_irqsave(&tree->lock, flags);
node = tree_search(tree, *file_offset);

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-03-28 Thread Qu Wenruo




On 2021/3/29 上午4:02, Ritesh Harjani wrote:

On 21/03/25 09:16PM, Qu Wenruo wrote:



On 2021/3/25 下午8:20, Neal Gompa wrote:

On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:


This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patch from
github repo and do fstests/generic tests.

But at least the full RW patchset can pass -g generic/quick -x defrag
for now.

There are some known issues:

- Defrag behavior change
Since current defrag is doing per-page defrag, to support subpage
defrag, we need some change in the loop.
E.g. if a page has both hole and regular extents in it, then defrag
will rewrite the full 64K page.

Thus for now, defrag related failure is expected.
But this should only cause behavior difference, no crash nor hang is
expected.

- No compression support yet
There are at least 2 known bugs if forcing compression for subpage
* Some hard coded PAGE_SIZE screwing up space rsv
* Subpage ASSERT() triggered
  This is because some compression code is unlocking locked_page by
  calling extent_clear_unlock_delalloc() with locked_page == NULL.
So for now compression is also disabled.

- Inode nbytes mismatch
Still debugging.
The fastest way to trigger is fsx using the following parameters:

  fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx

Which would cause inode nbytes differs from expected value and
triggers btrfs check error.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than data part, as it has
some different behaviors compared to the regular sector size handling:

- No more page locking
Now metadata read/write relies on extent io tree locking, other than
page locking.
This is to allow behaviors like read lock one eb while also try to
read lock another eb in the same page.
We can't rely on page lock as now we have multiple extent buffers in
the same page.

- Page status update
Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
Instead of just grabbing extent buffer from page::private, we need to
iterate all dirty extent buffers in the page and submit them.

[CHANGELOG]
v2:
- Rebased to latest misc-next
No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming schema for each patch
No more "extent_io:" or "inode:" schema anymore.

- Move two pure cleanups to the series
Patch 2~3, originally in RW part.

- Fix one uninitialized variable
Patch 6.

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
This removes 2 cleanup patches.

- Add new overview comment for subpage metadata

Qu Wenruo (13):
btrfs: add sysfs interface for supported sectorsize
btrfs: use min() to replace open-code in btrfs_invalidatepage()
btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
btrfs: refactor how we iterate ordered extent in
  btrfs_invalidatepage()
btrfs: introduce helpers for subpage dirty status
btrfs: introduce helpers for subpage writeback status
btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
  metadata
btrfs: support subpage metadata csum calculation at write time
btrfs: make alloc_extent_buffer() check subpage dirty bitmap
btrfs: make the page uptodate assert to be subpage compatible
btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
  compatible
btrfs: add subpage overview comments

   fs/btrfs/disk-io.c   | 143 ++-
   fs/btrfs/extent_io.c | 127 --
   fs/btrfs/inode.c | 128 ++
   fs/btrfs/subpage.c   | 127 ++
   fs/btrfs/subpage.h   |  17 +
   fs/btrfs/sysfs.c |  15 +
   6 files changed, 441 insertions(+), 116 deletions(-)

--
2.30.1



Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable to do
that, and this patch set is essentially unusable without the rest of
it that does enable full read-write support.


The metadata part is much more stable than data path (almost not touched
for several months), and the metadata part already has some difference
in its behavior, which needs review.

Your point makes some sense, but I still don't believe pushing a super
large patchset does any help.

[PATCH 3/3] btrfs-progs: misc-tests: add test to ensure the restored image can be mounted

2021-03-26 Thread Qu Wenruo
This new test case is to make sure the restored image file has been
properly enlarged so that newer kernel won't complain.

Signed-off-by: Qu Wenruo 
---
 .../047-image-restore-mount/test.sh   | 19 +++
 1 file changed, 19 insertions(+)
 create mode 100755 tests/misc-tests/047-image-restore-mount/test.sh

diff --git a/tests/misc-tests/047-image-restore-mount/test.sh 
b/tests/misc-tests/047-image-restore-mount/test.sh
new file mode 100755
index ..7f12afa2bab6
--- /dev/null
+++ b/tests/misc-tests/047-image-restore-mount/test.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Verify that the restored image of an empty btrfs can still be mounted
+
+source "$TEST_TOP/common"
+
+check_prereq btrfs-image
+check_prereq mkfs.btrfs
+check_prereq btrfs
+
+tmp=$(mktemp -d --tmpdir btrfs-progs-image.)
+prepare_test_dev
+
+run_check_mkfs_test_dev
+run_check "$TOP/btrfs-image" "$TEST_DEV" "$tmp/dump"
+run_check "$TOP/btrfs-image" -r "$tmp/dump" "$tmp/restored"
+
+run_check $SUDO_HELPER mount -t btrfs -o loop "$tmp/restored" "$TEST_MNT"
+umount "$TEST_MNT" &> /dev/null
+rm -rf -- "$tmp"
-- 
2.30.1



[PATCH 0/3] btrfs-progs: image: make restored image file to be properly enlarged

2021-03-26 Thread Qu Wenruo
Recent kernels will refuse to mount the restored image, even when the
source fs is empty:
 # mkfs.btrfs -f /dev/test/test
 # btrfs-image /dev/test/test /tmp/dump
 # btrfs-image -r /tmp/dump ~/test.img
 # mount ~/test.img /mnt/btrfs
 mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
 # dmesg -t | tail -n 7
 loop0: detected capacity change from 10592 to 0
 BTRFS info (device loop0): disk space caching is enabled
 BTRFS info (device loop0): has skinny extents
 BTRFS info (device loop0): flagging fs with big metadata feature
 BTRFS error (device loop0): device total_bytes should be at most 5423104 but found 10737418240
 BTRFS error (device loop0): failed to read chunk tree: -22
 BTRFS error (device loop0): open_ctree failed

This is triggered by a recent kernel commit 3a160a933111 ("btrfs: drop never met disk total bytes check in verify_one_dev_extent").

But the root cause is, we didn't enlarge the output file if the source
image only contains single device.

This bug won't affect restore to block device, or the destination file
is already large enough.

This patchset fixes the problem and adds a new test case to detect it.

Also remove one dead code exposed during the development.

Qu Wenruo (3):
  btrfs: image: remove the dead stat() call
  btrfs-progs: image: enlarge the output file if no tree modification is
needed for restore
  btrfs-progs: misc-tests: add test to ensure the restored image can be
mounted

 image/main.c  | 51 ---
 .../047-image-restore-mount/test.sh   | 19 +++
 2 files changed, 62 insertions(+), 8 deletions(-)
 create mode 100755 tests/misc-tests/047-image-restore-mount/test.sh

-- 
2.30.1



[PATCH 2/3] btrfs-progs: image: enlarge the output file if no tree modification is needed for restore

2021-03-26 Thread Qu Wenruo
[BUG]
When restoring a dumped image into a new file, in most cases the kernel will
reject it:

 # mkfs.btrfs -f /dev/test/test
 # btrfs-image /dev/test/test /tmp/dump
 # btrfs-image -r /tmp/dump ~/test.img
 # mount ~/test.img /mnt/btrfs
 mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
 # dmesg -t | tail -n 7
 loop0: detected capacity change from 10592 to 0
 BTRFS info (device loop0): disk space caching is enabled
 BTRFS info (device loop0): has skinny extents
 BTRFS info (device loop0): flagging fs with big metadata feature
 BTRFS error (device loop0): device total_bytes should be at most 5423104 but found 10737418240
 BTRFS error (device loop0): failed to read chunk tree: -22
 BTRFS error (device loop0): open_ctree failed

[CAUSE]
When btrfs-image restores an image into a file and the source image
contains only a single device, we don't need to modify the chunk/device
tree, as the existing chunk/dev tree can be reused without any problem.

This also means that for such a restore we never enlarge the target file.
That used to be fine, as kernels at the time didn't check whether the
device is smaller than the device size recorded in the device tree.

But later kernel commit 3a160a933111 ("btrfs: drop never met disk total
bytes check in verify_one_dev_extent") introduced a new check on the device
size at mount time, rejecting any loop file smaller than the original
device size.

[FIX]
Enlarge the output file as well for single-device restores.

Reported-by: Nikolay Borisov 
Signed-off-by: Qu Wenruo 
---
 image/main.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/image/main.c b/image/main.c
index 24393188e5e3..9933f69d0fdb 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2706,6 +2706,49 @@ static int restore_metadump(const char *input, FILE *out, int old_restore,
close_ctree(info->chunk_root);
if (ret)
goto out;
+   } else {
+   struct btrfs_root *root;
+   struct stat st;
+   u64 dev_size;
+
+   if (!info) {
+   root = open_ctree_fd(fileno(out), target, 0, 0);
+   if (!root) {
+   error("open ctree failed in %s", target);
+   ret = -EIO;
+   goto out;
+   }
+
+   info = root->fs_info;
+
+   dev_size = btrfs_stack_device_total_bytes(
+   &info->super_copy->dev_item);
+   close_ctree(root);
+   info = NULL;
+   } else {
+   dev_size = btrfs_stack_device_total_bytes(
+   &info->super_copy->dev_item);
+   }
+
+   /*
+* We don't need extra tree modification, but if the output is
+* a file, we need to enlarge the output file so that
+* newer kernel won't report error.
+*/
+   ret = fstat(fileno(out), &st);
+   if (ret < 0) {
+   error("failed to stat result image: %m");
+   ret = -errno;
+   goto out;
+   }
+   if (S_ISREG(st.st_mode)) {
+   ret = ftruncate64(fileno(out), dev_size);
+   if (ret < 0) {
+   error("failed to enlarge result image: %m");
+   ret = -errno;
+   goto out;
+   }
+   }
}
 out:
mdrestore_destroy(&mdrestore, num_threads);
-- 
2.30.1



[PATCH 1/3] btrfs: image: remove the dead stat() call

2021-03-26 Thread Qu Wenruo
In restore_metadump(), we call stat() but never use the result.

This call site is a leftover from an earlier refactor; the stat() call has
since been moved into fixup_device_size().

So we can safely remove the call.

Signed-off-by: Qu Wenruo 
---
 image/main.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/image/main.c b/image/main.c
index 48070e52c21f..24393188e5e3 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2690,7 +2690,6 @@ static int restore_metadump(const char *input, FILE *out, int old_restore,
if (!ret && !multi_devices && !old_restore &&
btrfs_super_num_devices(mdrestore.original_super) != 1) {
struct btrfs_root *root;
-   struct stat st;
 
root = open_ctree_fd(fileno(out), target, 0,
  OPEN_CTREE_PARTIAL |
@@ -2703,13 +2702,6 @@ static int restore_metadump(const char *input, FILE *out, int old_restore,
}
info = root->fs_info;
 
-   if (stat(target, &st)) {
-   error("stat %s failed: %m", target);
-   close_ctree(info->chunk_root);
-   free(cluster);
-   return 1;
-   }
-
ret = fixup_chunks_and_devices(info, &mdrestore, fileno(out));
close_ctree(info->chunk_root);
if (ret)
-- 
2.30.1



Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-03-25 Thread Qu Wenruo




On 2021/3/25 8:20 PM, Neal Gompa wrote:

On Thu, Mar 25, 2021 at 3:17 AM Qu Wenruo  wrote:


This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patch from
github repo and do fstests/generic tests.

But at least the full RW patchset can pass -g generic/quick -x defrag
for now.

There are some known issues:

- Defrag behavior change
   Since current defrag is doing per-page defrag, to support subpage
   defrag, we need some change in the loop.
   E.g. if a page has both hole and regular extents in it, then defrag
   will rewrite the full 64K page.

   Thus for now, defrag related failure is expected.
   But this should only cause behavior difference, no crash nor hang is
   expected.

- No compression support yet
   There are at least 2 known bugs if forcing compression for subpage
   * Some hard coded PAGE_SIZE screwing up space rsv
   * Subpage ASSERT() triggered
 This is because some compression code is unlocking locked_page by
 calling extent_clear_unlock_delalloc() with locked_page == NULL.
   So for now compression is also disabled.

- Inode nbytes mismatch
   Still debugging.
   The fastest way to trigger is fsx using the following parameters:

 fsx -l 262144 -o 65536 -S 30073 -N 256 -R -W $mnt/file > /tmp/fsx

   Which would cause inode nbytes differs from expected value and
   triggers btrfs check error.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than data part, as it has
some different behaviors compared to the regular sector size handling:

- No more page locking
   Now metadata read/write relies on extent io tree locking, other than
   page locking.
   This is to allow behaviors like read lock one eb while also try to
   read lock another eb in the same page.
   We can't rely on page lock as now we have multiple extent buffers in
   the same page.

- Page status update
   Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
   Instead of just grabbing extent buffer from page::private, we need to
   iterate all dirty extent buffers in the page and submit them.
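The per-sector status tracking described above can be illustrated with a toy
sketch in plain C. This is not the kernel's btrfs_subpage structure — the
names and the 64K page / 4K sector geometry are assumptions for the
example — but it shows why a bitmap is needed once a page can hold several
extent buffers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative geometry: a 64K page holds 16 sectors of 4K each. */
#define DEMO_PAGE_SIZE	65536u
#define DEMO_SECTORSIZE	4096u

/* Toy stand-in for per-page subpage state: one dirty bit per sector. */
struct subpage_demo {
	uint16_t dirty_bitmap;
};

/* Mark every sector covered by [start, start + len) inside the page dirty. */
static void subpage_set_dirty(struct subpage_demo *sp,
			      unsigned int start, unsigned int len)
{
	unsigned int first = start / DEMO_SECTORSIZE;
	unsigned int last = (start + len - 1) / DEMO_SECTORSIZE;

	sp->dirty_bitmap |= (uint16_t)(((1u << (last - first + 1)) - 1) << first);
}

/* The page as a whole is clean only when no sector in it is dirty. */
static bool subpage_page_clean(const struct subpage_demo *sp)
{
	return sp->dirty_bitmap == 0;
}
```

The same idea extends to writeback and uptodate bits: the page-level flag
can only be cleared once every sector's bit is clear, which is why the
write path has to iterate all dirty extent buffers in a page rather than
grabbing a single one from page::private.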

[CHANGELOG]
v2:
- Rebased to latest misc-next
   No conflicts at all.

- Add new sysfs interface to grab supported RO/RW sectorsize
   This will allow mkfs.btrfs to detect unmountable fs better.

- Use newer naming schema for each patch
   No more "extent_io:" or "inode:" schema anymore.

- Move two pure cleanups to the series
   Patch 2~3, originally in RW part.

- Fix one uninitialized variable
   Patch 6.

v3:
- Rename the sysfs to supported_sectorsizes

- Rebased to latest misc-next branch
   This removes 2 cleanup patches.

- Add new overview comment for subpage metadata

Qu Wenruo (13):
   btrfs: add sysfs interface for supported sectorsize
   btrfs: use min() to replace open-code in btrfs_invalidatepage()
   btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
   btrfs: refactor how we iterate ordered extent in
 btrfs_invalidatepage()
   btrfs: introduce helpers for subpage dirty status
   btrfs: introduce helpers for subpage writeback status
   btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
 metadata
   btrfs: support subpage metadata csum calculation at write time
   btrfs: make alloc_extent_buffer() check subpage dirty bitmap
   btrfs: make the page uptodate assert to be subpage compatible
   btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
   btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
 compatible
   btrfs: add subpage overview comments

  fs/btrfs/disk-io.c   | 143 ++-
  fs/btrfs/extent_io.c | 127 --
  fs/btrfs/inode.c | 128 ++
  fs/btrfs/subpage.c   | 127 ++
  fs/btrfs/subpage.h   |  17 +
  fs/btrfs/sysfs.c |  15 +
  6 files changed, 441 insertions(+), 116 deletions(-)

--
2.30.1



Why wouldn't we just integrate full read-write support with the
caveats as described now? It seems to be relatively reasonable to do
that, and this patch set is essentially unusable without the rest of
it that does enable full read-write support.


The metadata part is much more stable than the data path (almost untouched
for several months), and it already has some behavioral differences that
need review.

Your point makes some sense, but I still don't believe pushing a super
large patchset helps the review.

If you want to test, you can grab the branch from the github repo.
If you want to review, the mails are all here for review.
