Re: Status of RAID5/6

2018-04-01 Thread Zygo Blaxell
On Sun, Apr 01, 2018 at 03:11:04PM -0600, Chris Murphy wrote:
> (I hate it when my palm rubs the trackpad and hits send prematurely...)
> 
> 
> On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy  wrote:
> 
> >> Users can run scrub immediately after _every_ unclean shutdown to
> >> reduce the risk of inconsistent parity and unrecoverable data should
> >> a disk fail later, but this can only prevent future write hole events,
> >> not recover data lost during past events.
> >
> > Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> > such a leaf containing EXTENT_CSUM means that EXTENT_CSUM
> 
> means that EXTENT_CSUM is assumed to be correct. But in fact it could
> be stale. It's just as possible that the metadata and superblock
> updates are what's missing due to the interruption, while both data
> and parity strip writes succeeded. The window for either the data or
> parity write to fail is much shorter than that of the numerous
> metadata writes followed by the superblock update.

csums cannot be wrong due to write interruption.  The data and metadata
blocks are written first, then barrier, then superblock updates pointing
to the data and csums previously written in the same transaction.
Unflushed data is not included in the metadata.  If there is a write
interruption then the superblock update doesn't occur and btrfs reverts
to the previous unmodified data+csum trees.
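
To make the ordering concrete, here is a minimal standalone sketch (my
own illustration, not btrfs code; the file name and offsets are
invented) of the commit sequence described above, with one file where
offset 0 plays the role of the superblock:

/* sketch only: write new blocks, barrier, then the commit record */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("commit-demo.img", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    const char newblocks[]  = "new tree blocks + csums, written to unused space";
    const char superblock[] = "superblock -> new trees";

    /* 1. write all new data/metadata/csum blocks of the transaction */
    pwrite(fd, newblocks, sizeof(newblocks), 4096);

    /* 2. barrier: make step 1 durable before anything points at it */
    fsync(fd);

    /* 3. only now update the "superblock"; a crash before this write
     *    leaves the old superblock pointing at the old, untouched trees,
     *    so there is no way to end up with a stale csum */
    pwrite(fd, superblock, sizeof(superblock), 0);
    fsync(fd);

    close(fd);
    return 0;
}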

This works on non-raid5/6 because all the writes that make up a
single transaction are ordered and independent, and no data from older
transactions is modified during any tree update.

On raid5/6 every RMW operation modifies data from old transactions
by creating data/parity inconsistency.  If there was no data in the
stripe from an old transaction, the operation would be just a write,
no read and modify.  In the write hole case, the csum *is* correct,
it is the data that is wrong.
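
A standalone toy example (not btrfs code; one byte per device stands in
for a whole strip) of why the interrupted RMW damages *old* data:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t d0 = 0xAA;                 /* old data, committed long ago */
    uint8_t d1 = 0x55;                 /* old data about to be overwritten */
    uint8_t p  = d0 ^ d1;              /* parity, currently consistent */

    uint8_t new_d1 = 0x0F;             /* RMW: new contents for d1 */
    uint8_t new_p  = p ^ d1 ^ new_d1;  /* parity the RMW should also write */
    (void)new_p;

    /* write hole: d1 reaches disk, the parity update does not */
    d1 = new_d1;                       /* p keeps its old value */

    /* later the disk holding d0 dies; rebuild d0 from the survivors */
    uint8_t rebuilt_d0 = d1 ^ p;

    printf("original d0 = 0x%02X, rebuilt d0 = 0x%02X (%s)\n",
           (unsigned)d0, (unsigned)rebuilt_d0,
           rebuilt_d0 == d0 ? "ok" : "CORRUPT");
    /* d0's csum was never touched, so the csum is right and the
     * reconstructed data is wrong -- the failure described above */
    return 0;
}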

> In such a case, the
> old metadata is what's pointed to, including EXTENT_CSUM. Therefore
> your scrub would always show a csum error, even if both data and
> parity are correct. You'd have to init-csum in this case, I suppose.

No, the csums are correct.  The data does not match the csum because the
data is corrupted.  Assuming barriers work on your disk, and you're not
having some kind of direct IO data consistency bug, and you can read the
csum tree at all, then the csums are correct, even with write hole.

When write holes and other write interruption patterns affect the csum
tree itself, this results in parent transid verify failures, csum tree
page csum failures, or both.  This forces the filesystem read-only so
it's easy to spot when it happens.

Note that the data blocks with wrong csum from raid5/6 reconstruction
after a write hole event always belong to _old_ transactions damaged
by the write hole.  If the writes are interrupted, the new data blocks
in a RMW stripe will not be committed and will have no csums to verify,
so they can't have _wrong_ csums.  The old data blocks do not have their
csum changed by the write hole (the csum is stored on a separate tree
in a different block group) so the csums are intact.  When a write hole
event corrupts the data reconstruction on a degraded array, the csum
doesn't match because the csum is correct and the data is not.

> Pretty much it's RMW with a (partial) stripe overwrite upending COW,
> and therefore upending the atomicity, and thus the consistency, of
> Btrfs in the raid56 case where any portion of the transaction is
> interrupted.

Not any portion: only the RMW stripe update can produce data loss due
to write interruption (well, that, and fsync() log-tree replay bugs).

If any other part of the transaction is interrupted then btrfs recovers
just fine with its COW tree update algorithm and write barriers.

> And this is amplified if metadata is also raid56.

Data and metadata are mangled the same way.  The difference is the impact.

btrfs tolerates exactly 0 bits of damaged metadata after RAID recovery,
and enforces this intolerance with metadata transids and csums, so write
hole on metadata _always_ breaks the filesystem.

> ZFS avoids the problem at the expense of probably a ton of
> fragmentation, by taking e.g. 4KiB RMW and writing a full length
> stripe of 8KiB fully COW, rather than doing stripe modification with
> an overwrite. And that's because it has dynamic stripe lengths. 

I think that's technically correct but could be clearer.

ZFS never does RMW.  It doesn't need to.  Parity blocks are allocated
at the extent level and RAID stripes are built *inside* the extents (or
"groups of contiguous blocks written in a single transaction" which
seems to be the closest ZFS equivalent of the btrfs extent concept).

Since every ZFS RAID stripe is bespoke sized to exactly fit a single
write operation, no two ZFS transactions can ever share a RAID stripe.
No transactions sharing a stripe means no write hole.
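
A rough sketch of that allocation difference under a deliberately
simplified model (real raidz also rounds allocations and inserts skip
sectors; the 64KiB strip and 4KiB sector sizes are assumptions):

#include <stdio.h>

/* ZFS-style raidz: the stripe is sized per write, so it holds exactly
 * this write's sectors plus ceil(n/ndata) parity sectors and nothing
 * from any other transaction. */
static unsigned raidz_stripe_sectors(unsigned write_sectors, unsigned ndata)
{
    return write_sectors + (write_sectors + ndata - 1) / ndata;
}

/* btrfs raid5: the stripe width is fixed at ndata 64KiB strips plus one
 * parity strip; a small write lands inside a stripe that may already
 * contain sectors from older transactions, hence the RMW. */
static unsigned fixed_stripe_sectors(unsigned ndata)
{
    return (ndata + 1) * (64 / 4);
}

int main(void)
{
    unsigned ndata = 4;   /* e.g. a 5-device raid5 */
    printf("4KiB write: raidz stripe = %u sectors, fixed raid5 stripe = %u sectors\n",
           raidz_stripe_sectors(1, ndata), fixed_stripe_sectors(ndata));
    return 0;
}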

There is no impact on fragmentation on ZFS--space is 

[PATCH] btrfs-progs: Let function find_device to be consistent with kernel

2018-04-01 Thread Gu Jinxiang
Make find_device consistent with the kernel according to commit
35c70103a528 ("btrfs: refactor find_device helper").

Also, modify the compare condition: instead of always requiring both
devid and uuid to match, match on devid alone when no uuid is given,
or on both devid and uuid otherwise, according to commit 8f18cf13396c
("Btrfs: Make the resizer work based on shrinking and growing devices").

Signed-off-by: Gu Jinxiang 
---
 volumes.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/volumes.c b/volumes.c
index edad367b..1320b55b 100644
--- a/volumes.c
+++ b/volumes.c
@@ -54,14 +54,22 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 static LIST_HEAD(fs_uuids);
 
-static struct btrfs_device *__find_device(struct list_head *head, u64 devid,
- u8 *uuid)
+/*
+ * Find a device specified by @devid or @uuid in the list of @fs_devices, or
+ * return NULL.
+ *
+ * If devid and uuid are both specified, the match must be exact, otherwise
+ * only devid is used.
+ */
+static struct btrfs_device *find_device(struct btrfs_fs_devices *fs_devices,
+   u64 devid, u8 *uuid)
 {
+   struct list_head *head = &fs_devices->devices;
struct btrfs_device *dev;
 
list_for_each_entry(dev, head, dev_list) {
if (dev->devid == devid &&
-   !memcmp(dev->uuid, uuid, BTRFS_UUID_SIZE)) {
+   (!uuid || !memcmp(dev->uuid, uuid, BTRFS_UUID_SIZE))) {
return dev;
}
}
@@ -100,7 +108,7 @@ static int device_list_add(const char *path,
fs_devices->lowest_devid = (u64)-1;
device = NULL;
} else {
-   device = __find_device(&fs_devices->devices, devid,
+   device = find_device(fs_devices, devid,
   disk_super->dev_item.uuid);
}
if (!device) {
@@ -1616,8 +1624,7 @@ struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info, u64 devid,
if (!fsid ||
(!memcmp(cur_devices->fsid, fsid, BTRFS_UUID_SIZE) ||
 fs_info->ignore_fsid_mismatch)) {
-   device = __find_device(&cur_devices->devices,
-  devid, uuid);
+   device = find_device(cur_devices, devid, uuid);
if (device)
return device;
}
-- 
2.14.3





Re: [PATCH] btrfs-progs: mkfs rootdir: use lgetxattr() not to follow a symbolic link

2018-04-01 Thread Qu Wenruo


On 2018-04-02 09:59, Misono Tomohiro wrote:
> mkfs-test 016 "rootdir-bad-symbolic-link" fails when selinux is enabled.
> This is because add_xattr_item() uses getxattr() and tries to follow a
> bad symbolic link for the selinux item, which causes an ENOENT error.
> 
> The line above already uses llistxattr() to get the list of xattrs
> without following a symbolic link, so just use lgetxattr() too.
> 
> Signed-off-by: Tomohiro Misono 

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  mkfs/rootdir.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mkfs/rootdir.c b/mkfs/rootdir.c
> index 33c3ff1e..ff00bb0f 100644
> --- a/mkfs/rootdir.c
> +++ b/mkfs/rootdir.c
> @@ -249,7 +249,7 @@ static int add_xattr_item(struct btrfs_trans_handle *trans,
>   cur_name_len = strlen(cur_name);
>   next_location += cur_name_len + 1;
>  
> - ret = getxattr(file_name, cur_name, cur_value, XATTR_SIZE_MAX);
> + ret = lgetxattr(file_name, cur_name, cur_value, XATTR_SIZE_MAX);
>   if (ret < 0) {
>   if (errno == ENOTSUP)
>   return 0;
> 





[PATCH] btrfs-progs: mkfs rootdir: use lgetxattr() not to follow a symbolic link

2018-04-01 Thread Misono Tomohiro
mkfs-test 016 "rootdir-bad-symbolic-link" fails when selinux is enabled.
This is because add_xattr_item() uses getxattr() and tries to follow a
bad symbolic link for the selinux item, which causes an ENOENT error.

The line above already uses llistxattr() to get the list of xattrs
without following a symbolic link, so just use lgetxattr() too.
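
For reference, a minimal standalone demonstration of the difference
(not mkfs code; the link name is made up): getxattr() follows the
dangling link and fails with ENOENT, lgetxattr() operates on the link
itself.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
    char buf[256];

    unlink("dangling");
    symlink("does-not-exist", "dangling");

    if (getxattr("dangling", "security.selinux", buf, sizeof(buf)) < 0)
        printf("getxattr:  %s\n", strerror(errno));  /* ENOENT: target missing */

    if (lgetxattr("dangling", "security.selinux", buf, sizeof(buf)) < 0)
        printf("lgetxattr: %s\n", strerror(errno));  /* ENODATA/ENOTSUP at worst */
    else
        printf("lgetxattr: read the label of the symlink itself\n");

    unlink("dangling");
    return 0;
}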

Signed-off-by: Tomohiro Misono 
---
 mkfs/rootdir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mkfs/rootdir.c b/mkfs/rootdir.c
index 33c3ff1e..ff00bb0f 100644
--- a/mkfs/rootdir.c
+++ b/mkfs/rootdir.c
@@ -249,7 +249,7 @@ static int add_xattr_item(struct btrfs_trans_handle *trans,
cur_name_len = strlen(cur_name);
next_location += cur_name_len + 1;
 
-   ret = getxattr(file_name, cur_name, cur_value, XATTR_SIZE_MAX);
+   ret = lgetxattr(file_name, cur_name, cur_value, XATTR_SIZE_MAX);
if (ret < 0) {
if (errno == ENOTSUP)
return 0;
-- 
2.14.3



Re: bug? fstrim only trims unallocated space, not unused in bg's

2018-04-01 Thread Chris Murphy
[chris@f27h linux]$ git apply -v
~/RESEND-1-2-btrfs-Enhance-btrfs_trim_fs-function-to-handle-error-better.patch
Checking patch fs/btrfs/extent-tree.c...
Hunk #1 succeeded at 10942 (offset -6 lines).
Hunk #2 succeeded at 10962 (offset -6 lines).
Hunk #3 succeeded at 10974 (offset -6 lines).
Hunk #4 succeeded at 10988 (offset -6 lines).
Hunk #5 succeeded at 11025 (offset -6 lines).
Applied patch fs/btrfs/extent-tree.c cleanly.
[chris@f27h linux]$ git apply -v
~/v2.2-2-2-btrfs-Ensure-btrfs_trim_fs-can-trim-the-whole-fs.patch
Checking patch fs/btrfs/extent-tree.c...
Hunk #1 succeeded at 10961 (offset -6 lines).
Checking patch fs/btrfs/ioctl.c...
Hunk #1 succeeded at 364 (offset -1 lines).
Hunk #2 succeeded at 387 (offset -1 lines).
Applied patch fs/btrfs/extent-tree.c cleanly.
Applied patch fs/btrfs/ioctl.c cleanly.

Compiles fine, the test appears to trim more than before, and the file
system still works and scrubs with no errors (scrub was done with
4.15.14.fc28).

[chris@f27h ~]$ uname -r
4.16.0-rc7 (mainline, unpatched)
[chris@f27h ~]$ sudo btrfs fi us /
Overall:
Device size:  70.00GiB
Device allocated:  46.06GiB
Device unallocated:  23.94GiB
Device missing: 0.00B
Used:  43.90GiB
Free (estimated):  25.86GiB(min: 13.89GiB)
Data ratio:  1.00
Metadata ratio:  1.00
Global reserve: 109.94MiB(used: 0.00B)

Data,single: Size:44.00GiB, Used:42.08GiB
   /dev/nvme0n1p9  44.00GiB

Metadata,single: Size:2.00GiB, Used:1.82GiB
   /dev/nvme0n1p9   2.00GiB

System,DUP: Size:32.00MiB, Used:16.00KiB
   /dev/nvme0n1p9  64.00MiB

Unallocated:
   /dev/nvme0n1p9  23.94GiB
[chris@f27h ~]$ sudo fstrim -v /
[sudo] password for chris:
/: 24 GiB (25701646336 bytes) trimmed
[chris@f27h ~]$

==

[chris@f27h ~]$ uname -r
4.16.0-rc7+ (mainline, plus patch)
[chris@f27h ~]$ sudo btrfs fi us /
Overall:
Device size:  70.00GiB
Device allocated:  46.06GiB
Device unallocated:  23.94GiB
Device missing: 0.00B
Used:  43.90GiB
Free (estimated):  25.86GiB(min: 13.89GiB)
Data ratio:  1.00
Metadata ratio:  1.00
Global reserve: 110.06MiB(used: 0.00B)

Data,single: Size:44.00GiB, Used:42.08GiB
   /dev/nvme0n1p9  44.00GiB

Metadata,single: Size:2.00GiB, Used:1.82GiB
   /dev/nvme0n1p9   2.00GiB

System,DUP: Size:32.00MiB, Used:16.00KiB
   /dev/nvme0n1p9  64.00MiB

Unallocated:
   /dev/nvme0n1p9  23.94GiB
[chris@f27h ~]$ sudo fstrim -v /
[sudo] password for chris:
/: 26.5 GiB (28394635264 bytes) trimmed
[chris@f27h ~]$


---
Chris Murphy


Re: Status of RAID5/6

2018-04-01 Thread Chris Murphy
(I hate it when my palm rubs the trackpad and hits send prematurely...)


On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy  wrote:

>> Users can run scrub immediately after _every_ unclean shutdown to
>> reduce the risk of inconsistent parity and unrecoverable data should
>> a disk fail later, but this can only prevent future write hole events,
>> not recover data lost during past events.
>
> Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> such a leaf containing EXTENT_CSUM means that EXTENT_CSUM

means that EXTENT_CSUM is assumed to be correct. But in fact it could
be stale. It's just as possible that the metadata and superblock
updates are what's missing due to the interruption, while both data
and parity strip writes succeeded. The window for either the data or
parity write to fail is much shorter than that of the numerous
metadata writes followed by the superblock update. In such a case, the
old metadata is what's pointed to, including EXTENT_CSUM. Therefore
your scrub would always show a csum error, even if both data and
parity are correct. You'd have to init-csum in this case, I suppose.

Pretty much it's RMW with a (partial) stripe overwrite upending COW,
and therefore upending the atomicity, and thus the consistency, of
Btrfs in the raid56 case where any portion of the transaction is
interrupted.

And this is amplified if metadata is also raid56.

ZFS avoids the problem at the expense of probably a ton of
fragmentation, by taking e.g. 4KiB RMW and writing a full length
stripe of 8KiB fully COW, rather than doing stripe modification with
an overwrite. And that's because it has dynamic stripe lengths. For
Btrfs to always do COW would mean that a 4KiB change goes into a new
full stripe (64KiB * num devices), assuming no other changes are ready
at commit time.
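
Back-of-envelope sketch of that cost, assuming 64KiB strips and a
hypothetical 5-device array:

#include <stdio.h>

int main(void)
{
    unsigned num_devices = 5;                        /* assumption */
    unsigned strip_kib   = 64;
    unsigned stripe_kib  = strip_kib * num_devices;  /* "64KiB * num devices" */

    printf("4KiB change -> %u KiB full-stripe write (%ux write amplification)\n",
           stripe_kib, stripe_kib / 4);
    return 0;
}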

So yeah, avoiding the problem is best. But if it's going to be a
journal, it's going to make things pretty damn slow I'd think, unless
the journal can be explicitly placed on something faster than the
array, like an SSD/NVMe device. And that's what mdadm allows and
expects.



-- 
Chris Murphy


Re: Status of RAID5/6

2018-04-01 Thread Chris Murphy
On Sat, Mar 31, 2018 at 9:45 PM, Zygo Blaxell
 wrote:
> On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:

>> Write hole happens on disk in Btrfs, but the ensuing corruption on
>> rebuild is detected. Corrupt data never propagates.
>
> Data written with nodatasum or nodatacow is corrupted without detection
> (same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
> journal device).

Yeah I guess I'm not very worried about nodatasum/nodatacow if the
user isn't. Perhaps it's not a fair bias, but bias nonetheless.


>
> Metadata always has csums, and files have checksums if they are created
> with default attributes and mount options.  Those cases are covered,
> any corrupted data will give EIO on reads (except once per 4 billion
> blocks, where the corrupted CRC matches at random).
>
>> The problem is that Btrfs gives up when it's detected.
>
> Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
> combinations of recovery blocks for raid6, and earlier kernels than
> those would not recover correctly for raid5 either.  I think this has
> all been fixed in recent kernels but I haven't tested these myself so
> don't quote me on that.

Looks like 4.15
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.15&id2=v4.14

And those parts aren't yet backported to 4.14
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/raid56.c?id=v4.15.15&id2=v4.14.32

And more in 4.16
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.16-rc7&id2=v4.15


>
>> If it assumes just a bit flip - not always a correct assumption but
>> might be reasonable most of the time, it could iterate very quickly.
>
> That is not how write hole works (or csum recovery for that matter).
> Write hole producing a single bit flip would occur extremely rarely
> outside of contrived test cases.

Yes, what I wrote is definitely wrong, and I know better. I guess I
had a torn write in my brain!
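
For reference, a minimal sketch of the bit-flip search idea being
discussed, with zlib's crc32() standing in for the crc32c that btrfs
actually uses (build with -lz). As the quoted reply below points out,
this could only ever repair an isolated flipped bit, not write hole
damage:

#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

#define BLOCK 4096

/* flip each of the 32768 bits in turn; return the bit index whose flip
 * makes the block match the stored csum, or -1 if none does */
static long try_single_bit_flips(unsigned char *blk, uint32_t stored_csum)
{
    for (long bit = 0; bit < BLOCK * 8; bit++) {
        blk[bit / 8] ^= 1u << (bit % 8);
        if ((uint32_t)crc32(0L, blk, BLOCK) == stored_csum)
            return bit;                       /* leave the repaired block as is */
        blk[bit / 8] ^= 1u << (bit % 8);      /* undo, try the next bit */
    }
    return -1;
}

int main(void)
{
    unsigned char blk[BLOCK] = "some block contents";
    uint32_t good = crc32(0L, blk, BLOCK);

    blk[1234] ^= 0x10;                        /* simulate one flipped bit */
    printf("repaired bit index: %ld\n", try_single_bit_flips(blk, good));
    return 0;
}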



> Users can run scrub immediately after _every_ unclean shutdown to
> reduce the risk of inconsistent parity and unrecoverable data should
> a disk fail later, but this can only prevent future write hole events,
> not recover data lost during past events.

Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
such a leaf containing EXTENT_CSUM means that EXTENT_CSUM




>
> If one of the data blocks is not available, its content cannot be
> recomputed from parity due to the inconsistency within the stripe.
> This will likely be detected as a csum failure (unless the data block
> is part of a nodatacow/nodatasum file, in which case corruption occurs
> but is not detected) except for the one time out of 4 billion when
> two CRC32s on random data match at random.
>
> If a damaged block contains btrfs metadata, the filesystem will be
> severely affected:  read-only, up to 100% of data inaccessible, only
> recovery methods involving brute force search will work.
>
>> Flip bit, and recompute and compare checksum. It doesn't have to
>> iterate across 64KiB times the number of devices. It really only has
>> to iterate bit flips on the particular 4KiB block that has failed csum
>> (or in the case of metadata, 16KiB for the default leaf size, up to a
>> max of 64KiB).
>
> Write hole is effectively 32768 possible bit flips in a 4K block--assuming
> only one block is affected, which is not very likely.  Each disk in an
> array can have dozens of block updates in flight when an interruption
> occurs, so there can be millions of bits corrupted in a single write
> interruption event (and dozens of opportunities to encounter the nominally
> rare write hole itself).
>
> An experienced forensic analyst armed with specialized tools, a database
> of file formats, and a recent backup of the filesystem might be able to
> recover the damaged data or deduce what it was.  btrfs, being only mere
> software running in the kernel, cannot.
>
> There are two ways to solve the write hole problem and this is not one
> of them.
>
>> That's a maximum of 4096 iterations and comparisons. It'd be quite
>> fast. And going for two bit flips while a lot slower is probably not
>> all that bad either.
>
> You could use that approach to fix a corrupted parity or data block
> on a degraded array, but not a stripe that has data blocks destroyed
> by an update with a write hole event.  Also this approach assumes that
> whatever is flipping bits in RAM is not in and of itself corrupting data
> or damaging the filesystem in unrecoverable ways, but most RAM-corrupting
> agents in the real world do not limit themselves only to detectable and
> recoverable mischief.
>
> Aside:  As a best practice, if you see one-bit corruptions on your
> btrfs filesystem, it is time to start replacing hardware, possibly also
> finding a new hardware vendor or model (assuming the corruption is coming
> from hardware, not a kernel 

Re: [PATCH RESEND 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better

2018-04-01 Thread Qu Wenruo
Gentle ping?

The patch is small and (with its 2nd patch) should fix trim behavior
inside block groups.

Thanks,
Qu

On 2017-11-28 15:08, Qu Wenruo wrote:
> Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
> an error happens while trimming existing block groups, it will skip the
> remaining block groups and continue to trim unallocated space for each
> device.
> 
> And the return value will only reflect the final error from device
> trimming.
> 
> This patch will fix such behavior by:
> 
> 1) Recording the first error from block group or device trimming
>So the return value will also reflect any error found while trimming,
>making developers more aware of the problem.
> 
> 2) Outputting a btrfs warning message for each trimming failure
>Any error in block group or device trimming will cause a btrfs warning
>kernel message.
> 
> 3) Continuing trimming where we can
>If we fail to trim one block group or device, we can still try the
>next block group or device.
> 
> Such behavior avoids confusion in cases where trimming the first block
> group fails and then only unallocated space gets trimmed.
> 
> Reported-by: Chris Murphy 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/extent-tree.c | 59 
> --
>  1 file changed, 43 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 673ac4e01dd0..f830aa91ac3d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10948,6 +10948,16 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
>   return ret;
>  }
>  
> +/*
> + * Trim the whole fs, by:
> + * 1) Trimming free space in each block group
> + * 2) Trimming unallocated space in each device
> + *
> + * Will try to continue trimming even if we failed to trim one block group or
> + * device.
> + * The return value will be the error return value of the first error.
> + * Or 0 if nothing wrong happened.
> + */
>  int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>  {
>   struct btrfs_block_group_cache *cache = NULL;
> @@ -10958,6 +10968,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>   u64 end;
>   u64 trimmed = 0;
>   u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
> + int bg_ret = 0;
> + int dev_ret = 0;
>   int ret = 0;
>  
>   /*
> @@ -10968,7 +10980,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>   else
>   cache = btrfs_lookup_block_group(fs_info, range->start);
>  
> - while (cache) {
> + for (; cache; cache = next_block_group(fs_info, cache)) {
>   if (cache->key.objectid >= (range->start + range->len)) {
>   btrfs_put_block_group(cache);
>   break;
> @@ -10982,29 +10994,36 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>   if (!block_group_cache_done(cache)) {
>   ret = cache_block_group(cache, 0);
>   if (ret) {
> - btrfs_put_block_group(cache);
> - break;
> + btrfs_warn_rl(fs_info,
> + "failed to cache block group %llu ret %d",
> +cache->key.objectid, ret);
> + if (!bg_ret)
> + bg_ret = ret;
> + continue;
>   }
>   ret = wait_block_group_cache_done(cache);
>   if (ret) {
> - btrfs_put_block_group(cache);
> - break;
> + btrfs_warn_rl(fs_info,
> + "failed to wait cache for block group %llu ret %d",
> +cache->key.objectid, ret);
> + if (!bg_ret)
> + bg_ret = ret;
> + continue;
>   }
>   }
> - ret = btrfs_trim_block_group(cache,
> -  &group_trimmed,
> -  start,
> -  end,
> -  range->minlen);
> + ret = btrfs_trim_block_group(cache, &group_trimmed,
> + start, end, range->minlen);
>  
>   trimmed += group_trimmed;
>   if (ret) {
> - btrfs_put_block_group(cache);
> - 

Re: bug? fstrim only trims unallocated space, not unused in bg's

2018-04-01 Thread Qu Wenruo


On 2018-04-01 11:28, Chris Murphy wrote:
> On Mon, Nov 20, 2017 at 11:10 PM, Qu Wenruo  wrote:
>>
>>
>>> On 2017-11-21 13:58, Chris Murphy wrote:
>>> On Mon, Nov 20, 2017 at 9:58 PM, Qu Wenruo  wrote:


 On 2017-11-21 12:49, Chris Murphy wrote:
> On Mon, Nov 20, 2017 at 9:43 PM, Qu Wenruo  wrote:
>>
>>
>>>
>>> Apply in addition to previous patch? Or apply to clean v4.14?
>>
>> On previous patch.
>
> Refuses to apply with or without previous patch.
>
> $ git apply -v ~/qufstrim3.patch
> Checking patch fs/btrfs/extent-tree.c...
> error: while searching for:
>int dev_ret = 0;
>int ret = 0;
>
>/*
> * try to trim all FS space, our block group may start from 
> non-zero.
> */
>
> error: patch failed: fs/btrfs/extent-tree.c:10972
> error: fs/btrfs/extent-tree.c: patch does not apply
>

 Please try this branch.

 It's just previous patch and diff merged together and applied on v4.14
 tag from torvalds.

 https://github.com/adam900710/linux/tree/tmp
>>>
>>> # fstrim -v /
>>> /: 38 GiB (40767586304 bytes) trimmed
>>> # dmesg
>>>
>>> ..snip...
>>> [   46.408792] BTRFS info (device nvme0n1p8): trimming btrfs, start=0
>>> len=75161927680 minlen=512
>>> [   46.408800] BTRFS info (device nvme0n1p8): bg start=140882477056
>>> len=1073741824
>>> [   46.433867] BTRFS info (device nvme0n1p8): trimming done
>>
>> Great (for the output, not for the trimming failure).
>>
>> And the problem is very obvious now.
>> 140882477056 << First chunk start
>> 75161927680  << length of fstrim_range passed in
>>
>> Obviously, the fstrim_range passed in uses what it assumes to be the
>> filesystem size.
>>
>> Meanwhile we stupidly use the range in fstrim_range without
>> considering the fact that we're dealing with the *btrfs logical
>> address space*, where a chunk can start at any bytenr (well, at least
>> aligned to the sectorsize).
>> When I read the code I also thought the range check was perfectly fine.
>>
>> So the truth here is, we should never try to check against the range
>> from fstrim_range.
>>
>> And the problem means that on a normal btrfs with some usage and after
>> several full balances, fstrim will only trim the unallocated space.
>>
>> Now the fix should not be a hard to craft.
>>
>> Great thanks for all your help to locate the problem.
> 
> I still see this problem in 4.16.0-0.rc7.git1.1.fc29.x86_64. Are there
> any patches to test?
> 
> [chris@f27h mnt]$ sudo btrfs fi us /
> Overall:
> Device size:  70.00GiB
> Device allocated:  60.06GiB
> Device unallocated:   9.94GiB
> Device missing: 0.00B
> Used:  36.89GiB
> Free (estimated):  29.75GiB(min: 24.78GiB)
> Data ratio:  1.00
> Metadata ratio:  1.00
> Global reserve: 107.77MiB(used: 12.38MiB)
> 
> Data,single: Size:55.00GiB, Used:35.19GiB
>/dev/nvme0n1p9  55.00GiB
> 
> Metadata,single: Size:5.00GiB, Used:1.70GiB
>/dev/nvme0n1p9   5.00GiB
> 
> System,DUP: Size:32.00MiB, Used:16.00KiB
>/dev/nvme0n1p9  64.00MiB
> 
> Unallocated:
>/dev/nvme0n1p9   9.94GiB
> [chris@f27h mnt]$ sudo fstrim -v /
> [sudo] password for chris:
> /: 10 GiB (10669260800 bytes) trimmed

The latest version is here
https://patchwork.kernel.org/patch/10078773/
https://patchwork.kernel.org/patch/10083979/

And I didn't see it in misc-next either, I may need to ping it soon.
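
For reference, a minimal standalone reproduction of the arithmetic
behind the bug (not kernel code; the numbers come from the dmesg output
quoted above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t range_start = 0;
    uint64_t range_len   = 75161927680ULL;   /* fstrim_range: the ~70GiB fs size */
    uint64_t bg_start    = 140882477056ULL;  /* first chunk's logical bytenr */

    /* the naive check against fstrim_range skips every block group,
     * because block groups live in the btrfs logical address space */
    if (bg_start >= range_start + range_len)
        printf("block group at %llu skipped: past the %llu byte range\n",
               (unsigned long long)bg_start, (unsigned long long)range_len);
    return 0;
}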

Thanks,
Qu





Re: [PATCH] Btrfs: bail out on error during replay_dir_deletes

2018-04-01 Thread Nikolay Borisov


On 31.03.2018 01:11, Liu Bo wrote:
> If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
> to bail out, otherwise @ret would be forced to be 0 after 'break;' and
> the caller won't be aware of it.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: Nikolay Borisov 

This is also a long-standing bug and needs stable@
> ---
>  fs/btrfs/tree-log.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 4ee9431..11e2c26 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2356,8 +2356,10 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
>   nritems = btrfs_header_nritems(path->nodes[0]);
>   if (path->slots[0] >= nritems) {
>   ret = btrfs_next_leaf(root, path);
> - if (ret)
> + if (ret == 1)
>   break;
> + else if (ret < 0)
> + goto out;
>   }
>   btrfs_item_key_to_cpu(path->nodes[0], &found_key,
> path->slots[0]);


Re: [PATCH] Btrfs: fix NULL pointer dereference in log_dir_items

2018-04-01 Thread Nikolay Borisov


On 31.03.2018 01:11, Liu Bo wrote:
> 0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
> returned, path->nodes[0] could be NULL. log_dir_items lacks such a
> check for <0, so we may run into a null pointer dereference panic.
> 
> Signed-off-by: Liu Bo 
Reviewed-by: Nikolay Borisov 

This bug has been present ever since 2.6.29 (e02119d5a7b4 ("Btrfs: Add a
write ahead tree log to optimize synchronous operations")), so this
needs a stable tag.

> ---
>  fs/btrfs/tree-log.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 4344577..4ee9431 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3518,8 +3518,11 @@ static noinline int log_dir_items(struct btrfs_trans_handle *trans,
>* from this directory and from this transaction
>*/
>   ret = btrfs_next_leaf(root, path);
> - if (ret == 1) {
> - last_offset = (u64)-1;
> + if (ret) {
> + if (ret == 1)
> + last_offset = (u64)-1;
> + else
> + err = ret;
>   goto done;
>   }
>   btrfs_item_key_to_cpu(path->nodes[0], &tmp, path->slots[0]);
> 