date:20180314

This patch enhance the tree block level mismatch by the following
methods:

1) Merge same warning branches into one
   We had two branches showing the same message, and their condition
   is also the same. Merge them

2) Only skip bad slot
   The old code skipped all the remaining slots, here we just skip one
   slot to output as many correct tree blocks as possible.

3) Enhance warning message
   Ouput the parent bytenr and expected and wrong level, so we don't
   need refer to the stdout to get which tree block is corrupted.

Signed-off-by: Qu Wenruo 
---
 print-tree.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/print-tree.c b/print-tree.c
index 45350fea0963..c1e7af65ce4b 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -1389,19 +1389,16 @@ void btrfs_print_tree(struct btrfs_root *root, struct 
extent_buffer *eb, int fol
(unsigned long long)btrfs_header_owner(eb));
continue;
}
-   if (btrfs_is_leaf(next) && btrfs_header_level(eb) != 1) {
-   warning(
-   "eb corrupted: item %d eb level %d next level %d, skipping the rest",
-   i, btrfs_header_level(next),
-   btrfs_header_level(eb));
-   goto out;
-   }
if (btrfs_header_level(next) != btrfs_header_level(eb) - 1) {
warning(
-   "eb corrupted: item %d eb level %d next level %d, skipping the rest",
-   i, btrfs_header_level(next),
-   btrfs_header_level(eb));
-   goto out;
+   "eb corrupted: parent bytenr %llu slot %d level %d child bytenr %llu 
level has %d expect %d, skipping the slot",
+   btrfs_header_bytenr(eb), i,
+   btrfs_header_level(eb),
+   btrfs_header_bytenr(next),
+   btrfs_header_level(next),
+   btrfs_header_level(eb) - 1);
+   free_extent_buffer(next);
+   continue;
}
btrfs_print_tree(root, next, 1);
free_extent_buffer(next);
@@ -1409,6 +1406,5 @@ void btrfs_print_tree(struct btrfs_root *root, struct 
extent_buffer *eb, int fol
 
return;
 
-out:
free_extent_buffer(next);
 }
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: zerofree btrfs support?

2018-03-14 Thread Christoph Anton Mitterer

Hey.


On Wed, 2018-03-14 at 20:38 +0100, David Sterba wrote:
> I have a prototype code for that and after the years, seeing the
> request
> again, I'm not against adding it as long as it's not advertised as a
> security feature.
I'd expect that anyone in the security area should know that securely
deleting data is not done by overwriting it (not even overwriting it
multiple times may be enough).
So I don't think that it would be btrfs' or zerofree's duty do teach
that to users.

The later's manpage doesn't advertise it for such purpose and even
contains a (though perhaps a bit too vague) warning:
>It  may  however  be useful in other situations: for instance it can be
>used to make it more difficult to retrieve deleted  data.  Beware  that
>securely  deleting  sensitive  data  is not in general an easy task and
>usually requires writing several times on the deleted blocks.

They should probably drop the first "can be used to make it difficult"
sentence... and add that even overwriting multiple times is often not
enough.


> The zeroing simply builds on top of the trim code, so it's merely
> adding
> the ioctl interface and passing down the desired operation.
Well I think what would be really mandatory if such support is added to
an 3rd party tools is, that it will definitely continue to work
(without causing corruptions or so), even if btrfs changes.

And if it's just using existing btrfs kernel code (and zerofree itself
would mostly do nothing)... than that seems quite promising. :-)


I personally don't need it that desperate anymore, since I got discard
support working in my qemu... but others may still benefit from it, so
if it's easy, why not!? :-)

Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: Verify extent allocated by find_free_extent() won't overlap with extents from previous transaction



On 2018年03月15日 05:09, David Sterba wrote:
> On Tue, Feb 13, 2018 at 09:13:32AM +0800, Qu Wenruo wrote:
>> There are reports in mail list, even with latest mainline kernel, btrfs
>> can't survive a power loss.
>>
>> Unlike journal based filesystem, btrfs doesn't use journal for such
>> work. (log tree is an optimization for fsync, not to keep fs healthy)
>> In btrfs we use metadata CoW to ensure all tree blocks are as atomic as
>> superblock.
>>
>> This leads to an obvious assumption, some code breaks such metadata CoW
>> makes btrfs no longer bullet-proof against power loss.
>>
>> This patch adds extra runtime selftest to find_free_extent(), which
>> will check the range in commit root of extent tree to ensure there is no
>> overlap at all.
>>
>> And hopes this could help us to catch the cause of the problem.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>> Unfortunately, no new problem exposed yet.
>> ---
>>  fs/btrfs/extent-tree.c | 118 
>> +
>>  1 file changed, 118 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 2f4328511ac8..3b3cd82bce3a 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7458,6 +7458,114 @@ btrfs_release_block_group(struct 
>> btrfs_block_group_cache *cache,
>>  btrfs_put_block_group(cache);
>>  }
>>  
>> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
>> +/*
>> + * Verify if the given extent range [@start, @start + @len) conflicts any
>> + * existing extent in commit root.
>> + *
>> + * Btrfs doesn't use journal, but depends on metadata (and data) CoW to keep
>> + * the whole filesystem consistent against powerloss.
>> + * If we have overwritten any extent used by previous trans (commit root),
>> + * and powerloss happen we will corrupt our filesystem.
>> + *
>> + * Return 0 if nothing wrong.
>> + * Return <0 (including ENOMEM) means we have something wrong.
>> + * Except NOEMEM, this normally means we have extent conflicts with previous
>> + * transaction.
>> + */
>> +static int check_extent_conflicts(struct btrfs_fs_info *fs_info,
>> +  u64 start, u64 len)
>> +{
>> +struct btrfs_key key;
>> +struct btrfs_path *path;
>> +struct btrfs_root *extent_root = fs_info->extent_root;
>> +u64 extent_start;
>> +u64 extent_len;
>> +int ret;
>> +
>> +path = btrfs_alloc_path();
>> +if (!path)
>> +return -ENOMEM;
>> +
>> +
>> +key.objectid = start + len;
>> +key.type = 0;
>> +key.offset = 0;
>> +
>> +/*
>> + * Here we use extent commit root to search for any conflicts.
>> + * If extent commit tree is already overwritten, we will get transid
>> + * error and error out any way.
>> + * If extent commit tree is OK, but other extent conflicts, we will
>> + * find it.
>> + * So anyway, such search should be OK to find the conflicts.
>> + */
>> +path->search_commit_root = true;
>> +ret = btrfs_search_slot(NULL, extent_root, , path, 0, 0);
>> +
>> +if (ret < 0)
>> +goto out;
>> +/* Impossible as no type/offset should be (u64)-1 */
> 
> I don't understand this, can you please clarify? It's not clear where
> does the (u64)-1 come from.

Sorry, left over comment.
The original search key is using (u64)-1.


> 
>> +if (ret == 0) {
>> +ret = -EUCLEAN;
>> +goto out;
>> +}
>> +ret = btrfs_previous_extent_item(extent_root, path, start);
>> +if (ret < 0)
>> +goto out;
>> +if (ret == 0) {
>> +btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
>> +extent_start = key.objectid;
>> +if (key.type == BTRFS_EXTENT_ITEM_KEY)
>> +extent_len = key.offset;
>> +else
>> +extent_len = fs_info->nodesize;
>> +goto report;
>> +}
>> +/*
>> + * Even we didn't found extent starts after @start, we still need to
>> + * ensure previous extent doesn't overlap with [@start, @start + @len)
>> + */
>> +while (1) {
>> +extent_len = 0;
>> +btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
>> +if (key.type == BTRFS_EXTENT_ITEM_KEY)
>> +extent_len = key.offset;
>> +else if (key.type == BTRFS_METADATA_ITEM_KEY)
>> +extent_len = fs_info->nodesize;
>> +
>> +if (extent_len) {
>> +if (extent_len + key.objectid <= start) {
>> +ret = 0;
>> +goto out;
>> +}
>> +extent_start = key.objectid;
>> +goto report;
>> +}
>> +if (path->slots[0] == 0) {
>> +ret = btrfs_prev_leaf(extent_root, path);
>> +if (ret > 0) {
>> +ret = 0;
>> +goto out;
>> +

Re: Ongoing Btrfs stability issues

2018-03-14 Thread Goffredo Baroncelli

On 03/14/2018 08:27 PM, Austin S. Hemmelgarn wrote:
> On 2018-03-14 14:39, Goffredo Baroncelli wrote:
>> On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
>> [...]

 In btrfs, a checksum mismatch creates an -EIO error during the reading. In 
 a conventional filesystem (or a btrfs filesystem w/o datasum) there is no 
 checksum, so this problem doesn't exist.

 I am curious how ZFS solves this problem.
>>> It doesn't support disabling COW or the O_DIRECT flag, so it just never has 
>>> the problem in the first place.
>>
>> I would like to perform some tests: however I think that you are right. if 
>> you make a "double buffering" approach (copy the data in the page cache, 
>> compute the checksum, then write the data to disk), the mismatch should not 
>> happen. Of course this is incompatible with O_DIRECT; but disabling O_DIRECT 
>> is a prerequisite to the "double buffering"; alone it couldn't be 
>> sufficient; what about mmap ? Are we sure that this does a double buffering ?
> There's a whole lot of applications that would be showing some pretty serious 
> issues if checksumming didn't work correctly with mmap(), so I think it does 
> work correctly given that we don't have hordes of angry users and sysadmins 
> beating down the doors.

I tried to do in parallel updating a page and writing in different thread; I 
was unable to reproduce a checksum mismatch; so it seems that mmap are safe 
from this point of view;

>>
>> I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I 
>> prefer this to the checksum mismatch bug.
> This is only reasonable if you are writing to the files.  Checksums appear to 
> be checked on O_DIRECT reads, and outside of databases and VM's, read-only 
> access accounts for a significant percentage of O_DIRECT usage, partly 
> because it is needed for AIO support (nginx for example can serve files using 
> AIO and O_DIRECT and gets a pretty serious performance boost on heavily 
> loaded systems by doing so).
> 

So O_DIRECT should be unsupported/ignored only for the writing ? It could be a 
good compromise...

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] btrfs: Remove extra run_delayed_items call

On Tue, Feb 13, 2018 at 04:16:48PM +0200, Nikolay Borisov wrote:
> Before commit 581227d0d2b8 ("Btrfs: remove the time check in
> btrfs_commit_transaction()") there would be a loop in the transaction
> commit path that will go on until all existing writers ended their
> transaction handles. This loop would try and flush as much pending stuff
> as possible before going into the transaction critical section. However,
> the aforementioned commit removed it, yet continued calling this
> "extra" code for the sake of optimisation.
> 
> As it stands the rules for running the delayed items are a bit of
> a mess. First we have the one call which this patch removes, then we
> have another one, acting as a "safety net" by catching any delayed
> items introduced after the previous call is finished and extwriters
> becoming 0 (meaning no more TRANS_START/TRANS_ATTACHS are possible, only
> joins). Then we have the delayed items running as part of creating a
> snapshot (once per pedning snapshot). Finally, delayed items are run
> once more _after_ snapshots have been created. All in all this amounts
> to 3 + N (N = number of snapshots slated for creation). I think this
> could really be reduced to 2 calls - 1 before create_pending_snapshots
> is called and 1 after it's finished. This suffices to ensure consistent
> fs state during snapshot creation and after they are created.
> 
> I did a full xfstest run to ensure no consistency issues came up. Then
> I picked the tests which exhibitted rather long runtimes and looked
> closer to see if it was as a result of this commit - which it wasn't.
> Finally I did an artifical test with fsstres with only those operations
> that put stress on the delayed items code:
> 
> fsstress -z -f mkdir=25% -f rmdir=25% -f creat=25 -f unlink=25%
> -f attr_set=25% -p5 -n 25000 and also didn't observe any performance
> regressions
> 
> Signed-off-by: Nikolay Borisov 

I don't see anything wrong with the patches, but this is touching the
transaction commit, so I'll keep the review open. Running xfstests
might not be enough, this asks for some stress testing. I'll add the
patches to for-next eventually (probably a separate branch as a reminder
for myself to merge it after we're sure it's ok).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: Verify extent allocated by find_free_extent() won't overlap with extents from previous transaction

On Tue, Feb 13, 2018 at 09:13:32AM +0800, Qu Wenruo wrote:
> There are reports in mail list, even with latest mainline kernel, btrfs
> can't survive a power loss.
> 
> Unlike journal based filesystem, btrfs doesn't use journal for such
> work. (log tree is an optimization for fsync, not to keep fs healthy)
> In btrfs we use metadata CoW to ensure all tree blocks are as atomic as
> superblock.
> 
> This leads to an obvious assumption, some code breaks such metadata CoW
> makes btrfs no longer bullet-proof against power loss.
> 
> This patch adds extra runtime selftest to find_free_extent(), which
> will check the range in commit root of extent tree to ensure there is no
> overlap at all.
> 
> And hopes this could help us to catch the cause of the problem.
> 
> Signed-off-by: Qu Wenruo 
> ---
> Unfortunately, no new problem exposed yet.
> ---
>  fs/btrfs/extent-tree.c | 118 
> +
>  1 file changed, 118 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2f4328511ac8..3b3cd82bce3a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7458,6 +7458,114 @@ btrfs_release_block_group(struct 
> btrfs_block_group_cache *cache,
>   btrfs_put_block_group(cache);
>  }
>  
> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> +/*
> + * Verify if the given extent range [@start, @start + @len) conflicts any
> + * existing extent in commit root.
> + *
> + * Btrfs doesn't use journal, but depends on metadata (and data) CoW to keep
> + * the whole filesystem consistent against powerloss.
> + * If we have overwritten any extent used by previous trans (commit root),
> + * and powerloss happen we will corrupt our filesystem.
> + *
> + * Return 0 if nothing wrong.
> + * Return <0 (including ENOMEM) means we have something wrong.
> + * Except NOEMEM, this normally means we have extent conflicts with previous
> + * transaction.
> + */
> +static int check_extent_conflicts(struct btrfs_fs_info *fs_info,
> +   u64 start, u64 len)
> +{
> + struct btrfs_key key;
> + struct btrfs_path *path;
> + struct btrfs_root *extent_root = fs_info->extent_root;
> + u64 extent_start;
> + u64 extent_len;
> + int ret;
> +
> + path = btrfs_alloc_path();
> + if (!path)
> + return -ENOMEM;
> +
> +
> + key.objectid = start + len;
> + key.type = 0;
> + key.offset = 0;
> +
> + /*
> +  * Here we use extent commit root to search for any conflicts.
> +  * If extent commit tree is already overwritten, we will get transid
> +  * error and error out any way.
> +  * If extent commit tree is OK, but other extent conflicts, we will
> +  * find it.
> +  * So anyway, such search should be OK to find the conflicts.
> +  */
> + path->search_commit_root = true;
> + ret = btrfs_search_slot(NULL, extent_root, , path, 0, 0);
> +
> + if (ret < 0)
> + goto out;
> + /* Impossible as no type/offset should be (u64)-1 */

I don't understand this, can you please clarify? It's not clear where
does the (u64)-1 come from.

> + if (ret == 0) {
> + ret = -EUCLEAN;
> + goto out;
> + }
> + ret = btrfs_previous_extent_item(extent_root, path, start);
> + if (ret < 0)
> + goto out;
> + if (ret == 0) {
> + btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
> + extent_start = key.objectid;
> + if (key.type == BTRFS_EXTENT_ITEM_KEY)
> + extent_len = key.offset;
> + else
> + extent_len = fs_info->nodesize;
> + goto report;
> + }
> + /*
> +  * Even we didn't found extent starts after @start, we still need to
> +  * ensure previous extent doesn't overlap with [@start, @start + @len)
> +  */
> + while (1) {
> + extent_len = 0;
> + btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
> + if (key.type == BTRFS_EXTENT_ITEM_KEY)
> + extent_len = key.offset;
> + else if (key.type == BTRFS_METADATA_ITEM_KEY)
> + extent_len = fs_info->nodesize;
> +
> + if (extent_len) {
> + if (extent_len + key.objectid <= start) {
> + ret = 0;
> + goto out;
> + }
> + extent_start = key.objectid;
> + goto report;
> + }
> + if (path->slots[0] == 0) {
> + ret = btrfs_prev_leaf(extent_root, path);
> + if (ret > 0) {
> + ret = 0;
> + goto out;
> + }
> + if (ret < 0)
> + goto out;
> + } else {
> + path->slots[0]--;
> + }
> +

Re: [PATCH] btrfs: volumes: Remove the meaningless condition of minimal nr_devs when allocating a chunk

On Wed, Jan 31, 2018 at 01:56:15PM +0800, Qu Wenruo wrote:
> When checking the minimal nr_devs, there is one dead and meaningless
> condition:
> 
> if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
> 
> 
> This condition is meaningless, @devs_increment has nothing to do with
> @sub_stripes.
> 
> In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1
> (RAID10) already has the @devs_increment set to 2.
> So no need to multiple it by @sub_stripes.
> 
> And above condition is also dead.
> For RAID10, @devs_increment * @sub_stripes equals 4, which is also the
> @devs_min of RAID10.
> For other profiles, @sub_stripes is always 1, and since @ndevs is
> rounded down to @devs_increment, the condition will always be true.
> 
> Remove the meaningless condition to make later reader wander less.

I think the condition is a leftover from times when we did not have the nice
raid table and the various values were defined in the function.

Reviewed-by: David Sterba 

> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 215e85e22c8e..cb0a8d27661b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4729,7 +4729,7 @@ static int __btrfs_alloc_chunk(struct 
> btrfs_trans_handle *trans,
>   /* round down to number of usable stripes */
>   ndevs = round_down(ndevs, devs_increment);
>  
> - if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
> + if (ndevs < devs_min) {

The redundant condtion is duplicated in the error message a few lines
below:

4840 if (ndevs < devs_min) {
4841 ret = -ENOSPC;
4842 if (btrfs_test_opt(info, ENOSPC_DEBUG)) {
4843 btrfs_debug(info,
4844 "%s: not enough devices with free space: have=%d minimum 
required=%d",
4845 __func__, ndevs, min(devs_min,
4846 devs_increment * sub_stripes));
 

so I'll remove it as well.

4847 }
4848 goto error;
4849 }
>   ret = -ENOSPC;
>   goto error;
>   }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: Document parameters of btrfs_reserve_extent

On Tue, Mar 13, 2018 at 09:56:23PM +0800, Anand Jain wrote:
> 
> 
> On 03/13/2018 06:22 PM, Nikolay Borisov wrote:
> > This function is the entry to the extent allocator and as such has
> > quite a number of parameters. Some of those have subtle effects on the
> > allocation algorithm. Document the parameters.
> > 
> > Signed-off-by: Nikolay Borisov 
> 
>   Reviewed-by: Anand Jain 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: zerofree btrfs support?

On Sat, Mar 10, 2018 at 03:55:25AM +0100, Christoph Anton Mitterer wrote:
> Hi.
> 
> Just wondered... was it ever planned (or is there some equivalent) to
> get support for btrfs in zerofree?

The zerofree was requested, eg. here
https://bugzilla.kernel.org/show_bug.cgi?id=69121

and there was a discussion under proposed patches on fsdevel 3 years
ago, message-id is 1422896713-25367-1-git-send-email-hol...@ahsoftware.de

tl;dr there's nothing like zeofree now, because.

Because the security reasons are not considered good enough and giving
false sense of security. Clearing the qemu images sounds good but
there's another way to do that via trim.

I have a prototype code for that and after the years, seeing the request
again, I'm not against adding it as long as it's not advertised as a
security feature.

The zeroing simply builds on top of the trim code, so it's merely adding
the ioctl interface and passing down the desired operation.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Ongoing Btrfs stability issues

2018-03-14 Thread Austin S. Hemmelgarn

On 2018-03-14 14:39, Goffredo Baroncelli wrote:

On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
[...]

In btrfs, a checksum mismatch creates an -EIO error during the reading. In a
conventional filesystem (or a btrfs filesystem w/o datasum) there is no
checksum, so this problem doesn't exist.

I am curious how ZFS solves this problem.

It doesn't support disabling COW or the O_DIRECT flag, so it just never has the
problem in the first place.

I would like to perform some tests: however I think that you are right. if you make a "double
buffering" approach (copy the data in the page cache, compute the checksum, then write the
data to disk), the mismatch should not happen. Of course this is incompatible with O_DIRECT; but
disabling O_DIRECT is a prerequisite to the "double buffering"; alone it couldn't be
sufficient; what about mmap ? Are we sure that this does a double buffering ?
There's a whole lot of applications that would be showing some pretty
serious issues if checksumming didn't work correctly with mmap(), so I
think it does work correctly given that we don't have hordes of angry
users and sysadmins beating down the doors.

I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I prefer
this to the checksum mismatch bug.
This is only reasonable if you are writing to the files. Checksums
appear to be checked on O_DIRECT reads, and outside of databases and
VM's, read-only access accounts for a significant percentage of O_DIRECT
usage, partly because it is needed for AIO support (nginx for example
can serve files using AIO and O_DIRECT and gets a pretty serious
performance boost on heavily loaded systems by doing so).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: Handle error from btrfs_uuid_tree_rem call in _btrfs_ioctl_set_received_subvol

On Mon, Mar 12, 2018 at 02:48:09PM +0200, Nikolay Borisov wrote:
> As with every function which deals with modifying the btree
> btrfs_uuid_tree_rem can fail for any number of reasons (ie. EIO/ENOMEM).
> Handle return error value from this function gracefully by aborting the
> transaction.
> 
> Fixes: dd5f9615fc5c ("Btrfs: maintain subvolume items in the UUID tree")
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Ongoing Btrfs stability issues

2018-03-14 Thread Goffredo Baroncelli

On 03/14/2018 01:02 PM, Austin S. Hemmelgarn wrote:
[...]
>>
>> In btrfs, a checksum mismatch creates an -EIO error during the reading. In a 
>> conventional filesystem (or a btrfs filesystem w/o datasum) there is no 
>> checksum, so this problem doesn't exist.
>>
>> I am curious how ZFS solves this problem.
> It doesn't support disabling COW or the O_DIRECT flag, so it just never has 
> the problem in the first place.

I would like to perform some tests: however I think that you are right. if you 
make a "double buffering" approach (copy the data in the page cache, compute 
the checksum, then write the data to disk), the mismatch should not happen. Of 
course this is incompatible with O_DIRECT; but disabling O_DIRECT is a 
prerequisite to the "double buffering"; alone it couldn't be sufficient; what 
about mmap ? Are we sure that this does a double buffering ?

I would prefer that btrfs doesn't allow O_DIRECT with the COW files. I prefer 
this to the checksum mismatch bug.


>>
>> However I have to point out that this problem is not solved by the COW. COW 
>> solved only the problem about an interrupted commit of the filesystem, where 
>> the data is update in place (so it is available by the user), but the 
>> metadata not.
> COW is irrelevant if you're bypassing it.  It's only enforced for metadata so 
> that you don't have to check the FS every time you mount it (because the way 
> BTRFS uses it guarantees consistency of the metadata).
>>
>>>
>>> Even if not... I should be only a problem in case of a crash during
>>> that,.. and than I'd still prefer to get the false positive than bad
>>> data.
>>
>> How you can know if it is a "bad data" or a "bad checksum" ?
> You can't directly.  Just like you can't know which copy in a two-device MD 
> RAID1 array is bad when they mismatch.
> 
> That's part of why I'm not all that fond of the idea of having checksums 
> without COW, you need to verify the data using secondary means anyway, so why 
> exactly should you waste time verifying it twice?

This is true

>>
>>>
>>> Anyway... it's not going to happen so the discussion is pointless.
>>> I think people can probably use dm-integrity (which btw: does no CoW
>>> either (IIRC) and still can provide integrity... ;-) ) to see whether
>>> their data is valid.
>>> No nice but since it won't change on btrfs, a possible alternative.
>>
>> Even in this case I am curious about dm-integrity would sole this issue.
> dm-integrity uses journaling, and actually based on the testing I've done, 
> will typically have much worse performance than the overhead of just enabling 
> COW on files on BTRFS and manually defragmenting them on a regular basis.

Good to know
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] btrfs: Use sizeof directly instead of a constant variable

On Tue, Mar 13, 2018 at 10:26:06AM +0200, Nikolay Borisov wrote:
> The kernel would like to have all stack VLA usage removed[1].
> Unfortunately using an integer constant variable as the size of an
> array is still considered a VLA. Instead let's use directly sizeof(var)
> which removes the VLA usage. Use the occasion to remove csum_size
> altogether and use sizeof() also for the size passed to memcmp
> 
> [1]: https://lkml.org/lkml/2018/3/7/621
> 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> V2: 
>  * Use sizeof(result) in memcmp to make it more clear what we are comparing. 

Reviewed-by: David Sterba 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: drop a VLA in btrfs_check_super_csum()

2018-03-14 Thread Salvatore Mesoraca

2018-03-14 16:42 GMT+01:00 David Sterba :
> We already have a patch for that from a few days ago, but thanks anyway.
>
> https://patchwork.kernel.org/patch/10277901/

Oh.. OK

Best regards,

Salvatore
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] btrfs: check for SB checksum when scanned

On Tue, Mar 13, 2018 at 11:06:37PM +0800, Anand Jain wrote:
> We aren't checking the SB csum when the device scanned,
> instead we do that when mounting the device, and if the
> csum fails we fail the mount. How if we check the csum
> when the device is scanned, I can't see any reason for
> why not? any idea?
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/disk-io.c | 9 +
>  fs/btrfs/disk-io.h | 2 ++
>  fs/btrfs/volumes.c | 4 
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 156116655a32..28e71e2aaa92 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -393,8 +393,8 @@ static int verify_parent_transid(struct extent_io_tree 
> *io_tree,
>   * Return 0 if the superblock checksum type matches the checksum value of 
> that
>   * algorithm. Pass the raw disk superblock data.
>   */
> -static int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
> -   char *raw_disk_sb)
> +int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
> +char *raw_disk_sb)
>  {
>   struct btrfs_super_block *disk_sb =
>   (struct btrfs_super_block *)raw_disk_sb;
> @@ -420,8 +420,9 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
> *fs_info,
>   }
>  
>   if (csum_type >= ARRAY_SIZE(btrfs_csum_sizes)) {
> - btrfs_err(fs_info, "unsupported checksum algorithm %u",
> - csum_type);
> + if (fs_info)
> + btrfs_err(fs_info, "unsupported checksum algorithm %u",
> +   csum_type);
>   ret = 1;
>   }
>  
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 70a88d61b547..683e4b0c4e24 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -70,6 +70,8 @@ struct buffer_head *btrfs_read_dev_super(struct 
> block_device *bdev);
>  int btrfs_read_dev_one_super(struct block_device *bdev, int copy_num,
>   struct buffer_head **bh_ret);
>  int btrfs_commit_super(struct btrfs_fs_info *fs_info);
> +int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
> +char *raw_disk_sb);
>  struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root,
> struct btrfs_key *location);
>  int btrfs_init_fs_root(struct btrfs_root *root);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1e72357bdfa8..af34b8a611a0 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -677,6 +677,10 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
> *fs_devices,
>   if (ret)
>   return ret;
>  
> + if (btrfs_check_super_csum(NULL, bh->b_data)) {

I think the error reports during scan should be minimized, in this case
each time the scan will be run on that device, the checksum mismatch
will be reported.

I don't really like to see the NULL passed to btrfs_check_super_csum but
it's still better than adding an extra parameter to set the verbosity.
Documenting that at btrfs_check_super_csum would be good, mentioning
that the function can be called from mount or scan.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: drop a VLA in btrfs_check_super_csum()

On Tue, Mar 13, 2018 at 08:50:22PM +0100, Salvatore Mesoraca wrote:
> Avoid a VLA[1] by using a real constant expression instead of a variable.
> The compiler should be able to optimize the original code and avoid using
> an actual VLA. Anyway this change is useful because it will avoid a false
> positive with -Wvla, it might also help the compiler generating better
> code.
> 
> [1] https://lkml.org/lkml/2018/3/7/621
> 
> Signed-off-by: Salvatore Mesoraca 

We already have a patch for that from a few days ago, but thanks anyway.

https://patchwork.kernel.org/patch/10277901/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] btrfs: check for SB checksum when scanned

On Wed, Mar 14, 2018 at 11:20:41AM +0200, Nikolay Borisov wrote:
> 
> 
> On 13.03.2018 17:06, Anand Jain wrote:
> > We aren't checking the SB csum when the device scanned,
> > instead we do that when mounting the device, and if the
> > csum fails we fail the mount. How if we check the csum
> > when the device is scanned, I can't see any reason for
> > why not? any idea?
> 
> So what problems does this solve?

Devices with partially corrupted superblock with a valid UUID will be
added to list of scanned devices. This will be rejected at mount time so
it's not a problem in the end but there are structures tracking the
device, so we'd get rid of that early.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] btrfs: check for SB checksum when scanned

On Tue, Mar 13, 2018 at 11:06:37PM +0800, Anand Jain wrote:
> We aren't checking the SB csum when the device scanned,
> instead we do that when mounting the device, and if the
> csum fails we fail the mount. How if we check the csum
> when the device is scanned, I can't see any reason for
> why not? any idea?

I think we should do some minimal checks of the sb also during scanning,
as btrfs_open_one_device accesses a few items from unchecked
'disk_super' directly. So at least magic number and checksum plus
validate the checksum type.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: add missing initialization in btrfs_check_shared

On Wed, Mar 14, 2018 at 09:03:11AM -0600, Edmund Nadolski wrote:
> btrfs_check_shared calls find_parent_nodes in a loop. In each
> iteration it passes in a share_check struct to request a check for
> a shared extent.  The share_check::share_count must be re-initialized
> to zero for each iteration in order to avoid using a stale count
> value from the previous iteration, thus causing a false
> shared extent indication.
> 
> Signed-off-by: Edmund Nadolski 

I've replaced the changelog with text from cover letter and added the
Fixes: reference to 3ec4d3238ab that introduces 'shared'.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: add missing initialization in btrfs_check_shared



On 14.03.2018 17:03, Edmund Nadolski wrote:
> btrfs_check_shared calls find_parent_nodes in a loop. In each
> iteration it passes in a share_check struct to request a check for
> a shared extent.  The share_check::share_count must be re-initialized
> to zero for each iteration in order to avoid using a stale count
> value from the previous iteration, thus causing a false
> shared extent indication.
> 
> Signed-off-by: Edmund Nadolski 

Is there a particular commit which introduced this regression (i.e. one
of the backref walking series)? It would be useful to know so that this
can be tagged with a Fixes: tag and perhaps stable.

Also your cover letter should go as a changelog in the patch so that we
have a proper history of how this fix came to be :)

> ---
>  fs/btrfs/backref.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 4e89598..4a33448 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1519,6 +1519,7 @@ int btrfs_check_shared(struct btrfs_root *root, u64 
> inum, u64 bytenr)
>   if (!node)
>   break;
>   bytenr = node->val;
> + shared.share_count = 0;
>   cond_resched();
>   }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] missing initialization in btrfs_check_shared

On Wed, Mar 14, 2018 at 3:03 PM, Edmund Nadolski  wrote:
> This patch addresses an issue that causes fiemap to falsely
> report a shared extent.  The test case is as follows:
>
> # cat do_xfs_io
> xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
> sync
> xfs_io  -c "fiemap -v" /media/scratch/file5
>
> which gives the resulting output:
>
> # . do_xfs_io
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 4 ops; 0. sec (121.359 MiB/sec and 7766.9903 ops/sec)
> /media/scratch/file5:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..127]:24576..24703   128 0x2001
> /media/scratch/file5:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..127]:24576..24703   128   0x1
>
> This is because btrfs_check_shared calls find_parent_nodes
> repeatedly in a loop, passing a share_check struct to report
> the count of shared extent. But btrfs_check_shared does not
> re-initialize the count value to zero for subsequent calls
> from the loop, resulting in a false share count value. This
> is a regressive behavior from 4.13.
>
> With proper re-initialization the test result is as follows:
>
> # . do_xfs_io
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 4 ops; 0. sec (110.035 MiB/sec and 7042.2535 ops/sec)
> /media/scratch/file5:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..127]:24576..24703   128   0x1
> /media/scratch/file5:
>  EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
>0: [0..127]:24576..24703   128   0x1
>
> which corrects the regression.

All this information, about reproducer, which kernel/commit introduced
the regression, etc, should go into the changelog of the patch, not
the cover letter, since the cover letter won't go to the git
repository.

>
> Edmund Nadolski (1):
>   btrfs: add missing initialization in btrfs_check_shared
>
>  fs/btrfs/backref.c | 1 +
>  1 file changed, 1 insertion(+)
>
> --
> 2.10.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] missing initialization in btrfs_check_shared

2018-03-14 Thread Edmund Nadolski

This patch addresses an issue that causes fiemap to falsely
report a shared extent.  The test case is as follows:

# cat do_xfs_io
xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
sync
xfs_io  -c "fiemap -v" /media/scratch/file5

which gives the resulting output:

# . do_xfs_io
wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0. sec (121.359 MiB/sec and 7766.9903 ops/sec)
/media/scratch/file5:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:24576..24703   128 0x2001
/media/scratch/file5:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:24576..24703   128   0x1

This is because btrfs_check_shared calls find_parent_nodes
repeatedly in a loop, passing a share_check struct to report
the count of shared extent. But btrfs_check_shared does not
re-initialize the count value to zero for subsequent calls
from the loop, resulting in a false share count value. This
is a regressive behavior from 4.13.

With proper re-initialization the test result is as follows:

# . do_xfs_io
wrote 65536/65536 bytes at offset 0
64 KiB, 4 ops; 0. sec (110.035 MiB/sec and 7042.2535 ops/sec)
/media/scratch/file5:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:24576..24703   128   0x1
/media/scratch/file5:
 EXT: FILE-OFFSET  BLOCK-RANGE  TOTAL FLAGS
   0: [0..127]:24576..24703   128   0x1

which corrects the regression.

Edmund Nadolski (1):
  btrfs: add missing initialization in btrfs_check_shared

 fs/btrfs/backref.c | 1 +
 1 file changed, 1 insertion(+)

-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs: add missing initialization in btrfs_check_shared

2018-03-14 Thread Edmund Nadolski

btrfs_check_shared calls find_parent_nodes in a loop. In each
iteration it passes in a share_check struct to request a check for
a shared extent.  The share_check::share_count must be re-initialized
to zero for each iteration in order to avoid using a stale count
value from the previous iteration, thus causing a false
shared extent indication.

Signed-off-by: Edmund Nadolski 
---
 fs/btrfs/backref.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 4e89598..4a33448 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1519,6 +1519,7 @@ int btrfs_check_shared(struct btrfs_root *root, u64 inum, 
u64 bytenr)
if (!node)
break;
bytenr = node->val;
+   shared.share_count = 0;
cond_resched();
}
 
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2 v2] Btrfs-progs: add fsck test for filesystem with shared prealloc extents

2018-03-14 Thread fdmanana

From: Filipe Manana 

Verify that a filesystem check operation (fsck) does not report the
following scenario as an error:

An extent is shared between two inodes, as a result of clone/reflink
operation, and for one of the inodes, lets call it inode A, the extent is
referenced through a file extent item as a prealloc extent, while for the
other inode, call it inode B, the extent is referenced through a regular
file extent item, that is, it was written to. The goal of this test is to
make sure a filesystem check operation will not report "odd csum items"
errors for the prealloc extent at inode A, because this scenario is valid
since the extent was written through inode B and therefore it is expected
to have checksum items in the filesystem's checksum btree for that shared
extent.

Such scenario can be created with the following steps for example:

 mkfs.btrfs -f /dev/sdb
 mount /dev/sdb /mnt

 touch /mnt/foo
 xfs_io -c "falloc 0 256K" /mnt/foo
 sync

 xfs_io -c "pwrite -S 0xab 0 256K" /mnt/foo
 touch /mnt/bar
 xfs_io -c "reflink /mnt/foo 0 0 256K" /mnt/bar
 xfs_io -c "fsync" /mnt/bar

 
 mount /dev/sdb /mnt
 umount /mnt

This scenario is fixed by the following patch for the filesystem checker:

 "Btrfs-progs: check, fix false error reports for shared prealloc extents"

Signed-off-by: Filipe Manana 
Reviewed-by: Qu Wenruo 
---

V2: No changes, only added Qu's reviewed-by tag.

 .../reflinked-prealloc-extents.img.xz  | Bin 0 -> 3244 bytes
 .../030-reflinked-prealloc-extents/test.sh |  42 +
 2 files changed, 42 insertions(+)
 create mode 100644 
tests/fsck-tests/030-reflinked-prealloc-extents/reflinked-prealloc-extents.img.xz
 create mode 100755 tests/fsck-tests/030-reflinked-prealloc-extents/test.sh

diff --git 
a/tests/fsck-tests/030-reflinked-prealloc-extents/reflinked-prealloc-extents.img.xz
 
b/tests/fsck-tests/030-reflinked-prealloc-extents/reflinked-prealloc-extents.img.xz
new file mode 100644
index 
..8adf0071328806fa6981f6ef225084e517d1bc3e
GIT binary patch
literal 3244
zcmV;d3{&&{H+ooF000E$*0e?f03iV!G>wRyj;C3^v%$$4d1wo3
zjjaF1$3n~+-*XiD+@YGxDtdHe$qq4$Wo>7O_CGnuIn8OLnT=x4IGVZ-w??mVmRMZ7
z5Ay+V;mHSAoDkyM-tpo^bS;x+v}A`sNQY^!9&~{eDz@XZKXO?8OF)IjGm%(
zDUT1JIsHjj>>}y5qdSJT79-V@5Qr*LP$DwA@XNl^>?ShU!y!KD38l^=LE*nzF2}qs
zeB4zmhwleI8r9LN(p#cQeLEqdB4_Wgl7Qx-M>xaL`0TU`9k;qs6+KP*H>U4>=}6F*5_MhCaIzBlfo8uAWSd+hP%WEflaqMvG9Yi+k#=?=X1;)RqEp2fES1y0!CcOxHBNudB4H~
zg^Q=pEd>6f!;l~S^outD;V9&@RLa{&{9spkk|mHXS%NxkxrGeraxy^{Cx+(
z$hvBW#`Zhts1td;TLkX1RsNg`CWT*I)*I@mk(hGjz?_yyPd-M$ua7B`xni7MSs
zWyl`7wxZ%$wBM$)P<1(dhx?%fv+}{l<|u?L$$QGeT(rUdBI{pRI`}U)a7Fj|kXDbn-0nezRe
zyo)Z$_6XM$t$LFEKukadE#*)^!)pQ&8mYw_kMjNgxd*nQI${ow>pAWyg-
zeY~@@vvyi{r6BFgsls*62dIJ6pHyrL+V9bwi^Y3AmVbJ-rj~|0%3qm%C-loCfyP>s
z$_FFfC0=|!1xk(2zP?u6Z!@qsjy-cb?f8VOdA*CL8lOP3H25+rghFiC_`b!X$
z>P+>~7sXasAE_)+#HFn-4WN^kg_vvy*M&~#=`LtcOqgLlop`KKZ3Jgg1KVc>0+
z<&7l#_?5OdUbH9q)rI1J($!(YD81|JB{3iNd4fYKY~4PtyP7Z3#>vE
zLbdE%`v*0S%thI+abigpJ5|RLJpSeZubQp{B{+2vJAP~H{7!6?%0k~ao{0;mLf$)D
z#8FMC2d0|gj@k<#OyA%!|kb9cc7H-YKcGH#V8vGFVvS$6Fd)cwk
ze|9UrRDX1SMKK2wwAIrYlX^SQ^Gd1TmT8TquvY#QgwM?S~ZIw4!K(Esm|Uq
z`*nSS2CYP0d>BeDe@8%d7RbS$9;pJuD7(EMB+vmZ%gK70s>Oyw9^pFXUsEVku?z;W
zEAMEG2zVPq|IQkY!446b#zBa;-T;gc5rz(FDUjCpB9(^S_qi=4Lhcn8?1nXaN5b_i
z6}`_lNEJMtzqR%|caxD_PZ1dBV+a)jInRU=V%d);03w3e2LWwuKn=GMU>w=2+NG32Rn!YaXN2+v~(kl
zo^@2=_Kn$=vP0Dp_k^q>8Q{e{(7kB%9_mJ|<39=H&0Li1=wcrd{hN7URQks)E8JY=
zgM5#Nt%rY;GY*s%)x8WSes6E8vtAxZ0RAS+(V%}@kdj~mJ5{x#s
zou?Fb;e7HrfU;p4?0sY877;Z1H?}-Ygcc31-G9Y(;OpjA_g*cpG4-G%Xr76TlW!S!
zD_4#GiJ9d6}O$qGTb3Ue-WLHSzqdibPv^Qgkk
zF!gSYFP0o14elE{;)bNl2Z@l^s_=eh+ez))ZImHa$*Ti;e$`XkLqg
zSP(U&{f0R?H;DC~3y7mGE!{R^vpI=_uSE%tjI~g#ih?Ln6gz3n`*V}gKOyr)vUi}w

[PATCH 1/2 v2] Btrfs-progs: check, fix false error reports for shared prealloc extents

2018-03-14 Thread fdmanana

From: Filipe Manana 

Under some cases the filesystem checker reports an error when it finds
checksum items for an extent that is referenced by an inode as a prealloc
extent. Such cases are not an error when the extent is actually shared
(was cloned/reflinked) with other inodes and was written through one of
those other inodes.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/foo
  $ xfs_io -c "falloc 0 256K" /mnt/foo
  $ sync

  $ xfs_io -c "pwrite -S 0xab 0 256K" /mnt/foo
  $ touch /mnt/bar
  $ xfs_io -c "reflink /mnt/foo 0 0 256K" /mnt/bar
  $ xfs_io -c "fsync" /mnt/bar

  
  $ mount /dev/sdb /mnt
  $ umount /mnt

  $ btrfs check /dev/sdc
  Checking filesystem on /dev/sdb
  UUID: 52d3006e-ee3b-40eb-aa21-e56253a03d39
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 257 errors 800, odd csum item
  ERROR: errors found in fs roots
  found 688128 bytes used, error(s) found
  total csum bytes: 256
  total tree bytes: 163840
  total fs tree bytes: 65536
  total extent tree bytes: 16384
  btree space waste bytes: 138819
  file data blocks allocated: 10747904
   referenced 10747904
  $ echo $?
  1

So teach check to not report such cases as errors by checking if the
extent is shared with other inodes and if so, consider it an error the
existence of checksum items only if all those other inodes are referencing
the extent as a prealloc extent.
This case can be hit often when running the generic/475 testcase from
fstests.

A test case will follow in a separate patch.

Signed-off-by: Filipe Manana 
---

V2: Made stuff work with lowmem mode as well.
Added a comment about the limitations of the current check.
Removed filtering by inode number since it was unreliable as we can
have different inodes with same inode number but in different roots
(so opted to drop the filtering instead of filtering by root as well,
to keep it simpler).

 check/main.c|  11 ++-
 check/mode-common.c | 258 
 check/mode-common.h |   3 +
 check/mode-lowmem.c |  14 ++-
 4 files changed, 281 insertions(+), 5 deletions(-)

diff --git a/check/main.c b/check/main.c
index 392195ca..88e6c1e9 100644
--- a/check/main.c
+++ b/check/main.c
@@ -1515,8 +1515,15 @@ static int process_file_extent(struct btrfs_root *root,
if (found < num_bytes)
rec->some_csum_missing = 1;
} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
-   if (found > 0)
-   rec->errors |= I_ERR_ODD_CSUM_ITEM;
+   if (found > 0) {
+   ret = 
check_prealloc_extent_written(root->fs_info,
+   disk_bytenr,
+   num_bytes);
+   if (ret < 0)
+   return ret;
+   if (ret == 0)
+   rec->errors |= I_ERR_ODD_CSUM_ITEM;
+   }
}
}
return 0;
diff --git a/check/mode-common.c b/check/mode-common.c
index 1b56a968..559cd11d 100644
--- a/check/mode-common.c
+++ b/check/mode-common.c
@@ -24,6 +24,264 @@
 #include "check/mode-common.h"
 
 /*
+ * Check if the inode referenced by the given data reference uses the extent
+ * at disk_bytenr as a non-prealloc extent.
+ *
+ * Returns 1 if true, 0 if false and < 0 on error.
+ */
+static int check_prealloc_data_ref(struct btrfs_fs_info *fs_info,
+  u64 disk_bytenr,
+  struct btrfs_extent_data_ref *dref,
+  struct extent_buffer *eb)
+{
+   u64 rootid = btrfs_extent_data_ref_root(eb, dref);
+   u64 objectid = btrfs_extent_data_ref_objectid(eb, dref);
+   u64 offset = btrfs_extent_data_ref_offset(eb, dref);
+   struct btrfs_root *root;
+   struct btrfs_key key;
+   struct btrfs_path path;
+   int ret;
+
+   btrfs_init_path();
+   key.objectid = rootid;
+   key.type = BTRFS_ROOT_ITEM_KEY;
+   key.offset = (u64)-1;
+   root = btrfs_read_fs_root(fs_info, );
+   if (IS_ERR(root)) {
+   ret = PTR_ERR(root);
+   goto out;
+   }
+
+   key.objectid = objectid;
+   key.type = BTRFS_EXTENT_DATA_KEY;
+   key.offset = offset;
+   ret = btrfs_search_slot(NULL, root, , , 0, 0);
+   if (ret > 0) {
+   fprintf(stderr,
+   "Missing file extent item for inode %llu, root %llu, 
offset %llu",
+   objectid, rootid, offset);
+   ret = -ENOENT;
+   }
+   if (ret < 0)
+   goto out;
+
+   while (true) {
+   struct btrfs_file_extent_item *fi;
+

[PATCH] generic/015: Issue sync after deleting the fillup file

This test fails on btrfs due to the presence of delayed processing
of file deletes if the file is smaller than 32mb. Initially commit
97575acd7495b412435d06229a6d94ed9a814ada tried to fix a similar
failure by bumping the size of the filesystem. However that change
had a knock-on effect in that the scratch filesystem created is
larger than 100mb and thus not created in mixed mode. This in turn
causes the fs to have only 20mb for file data (rest is taken by DUP
metadata). Naturally, this leads to file freeing taking up to
"transaction commit interval" (default 30 s) time to properly account
the freed space.

Not standards define when unlink operations should be accounted so
btrfs is well within its right to be implemented in that way. So
to avoid this edge case just issue a sync before taking the 2nd
free space reading.

Signed-off-by: Nikolay Borisov 
---
 tests/generic/015 | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tests/generic/015 b/tests/generic/015
index bdae86dd..cb7252b9 100755
--- a/tests/generic/015
+++ b/tests/generic/015
@@ -115,6 +115,8 @@ fi
 
 echo "check free space:"
 
+sync
+
 free2=`_free`
 if [ -z "$free2" ]
 then
-- 
2.12.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARN when unmounting a subvolume that is being synced

2018-03-14 Thread Tycho Andersen

On Wed, Mar 14, 2018 at 09:46:07AM +0200, Nikolay Borisov wrote:
> 
> 
> On 14.03.2018 05:10, Tycho Andersen wrote:
> > Hi all,
> > 
> > I'm getting the WARN below. I think (?) what I'm doing when I get it
> > is that I'm unmounting a subvolume while it's being synced (concurrent
> > uses of the subvolume, at least, happy to look into it further if the
> > stack trace is not so useful). Anyway, I can fairly reliably reproduce
> > this on 4.16-rc4.
> > 
> > Thoughts?
> 
> Yeah, you've mounted the filesystem with flushoncommit. THe easiest
> thing would be to remount with that option ommitted i.e the default
> behavior. This is a well-known issue, unfortunately fixing it is a bit
> hairy and deadlock prone against fs freeze.

Excellent, thanks!

Tycho
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] btrfs: check for SB checksum when scanned

2018-03-14 Thread Austin S. Hemmelgarn


On 2018-03-14 05:20, Nikolay Borisov wrote:

On 13.03.2018 17:06, Anand Jain wrote:

We aren't checking the SB csum when the device scanned,
instead we do that when mounting the device, and if the
csum fails we fail the mount. How if we check the csum
when the device is scanned, I can't see any reason for
why not? any idea?


So what problems does this solve? Only net "gain" I see is making a
function public (which is a minus in my book) thus expanding the public
interface.

A device with bogus SB's is effectively a missing device as far as the 
mount code is concerned.  AFAICT, this patch results in it being treated 
more like a missing device right from the start.


OTOH, I'm not really sure if this is an improvement or not, as I've 
never had to deal with devices with invalid SB's before.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Ongoing Btrfs stability issues

2018-03-14 Thread Austin S. Hemmelgarn


On 2018-03-13 15:36, Goffredo Baroncelli wrote:

On 03/12/2018 10:48 PM, Christoph Anton Mitterer wrote:

On Mon, 2018-03-12 at 22:22 +0100, Goffredo Baroncelli wrote:

Unfortunately no, the likelihood might be 100%: there are some
patterns which trigger this problem quite easily. See The link which
I posted in my previous email. There was a program which creates a
bad checksum (in COW+DATASUM mode), and the file became unreadable.

But that rather seems like a plain bug?!


You are right, unfortunately it seems that it is catalogate as WONT-FIX :(


No reason that would conceptually make checksumming+notdatacow
impossible.

AFAIU, the conceptual thin would be about:
- data is written in nodatacow
   => thus a checksum must be written as well, so write it
- what can then of course happen is
   - both csum and data are written => fine
   - csum is written but data not and then some crash => csum will show
 that => fine
   - data is written but csum not and then some crash => csum will give
 false positive

Still better few false positives, as many unnoticed data corruptions
and no true raid repair.


A checksum mismatch, is returned as -EIO by a read() syscall. This is an event 
handled badly by most part of the programs.
I.e. suppose that a page of a VM ram image file has a wrong checksum. When the 
VM starts, tries to read the page, got -EIO and aborts. It is even possible 
that it could not print which page is corrupted. In this case, how the user 
understand the problem, and what he could do ?
Check the kernel log on the host system, which should have an error 
message saying which block failed.  If the VM itself actually gets to 
the point of booting into an OS (and properly propagates things like 
-EIO to the guest environment like it should), that OS should also log 
where the error was.


Most of the reason user applications don't tell you where the error was 
is because the kernel already does it on any sensible system, and the 
kernel tells you _exactly_ where the error was (exact block and device 
that threw the error), which user applications can't really do (they 
generally can't get sufficiently low-level information to give you all 
the info the kernel does).





Again, you are assuming that the likelihood of having a bad checksum
is low. Unfortunately this is not true. There are pattern which
exploits this bug with a likelihood=100%.


Okay I don't understand why this would be so and wouldn't assume that
the IO pattern can affect it heavily... but I'm not really btrfs
expert.

My blind assumption would have been that writing an extent of data
takes much longer to complete than writing the corresponding checksum.


The problem is the following: there is a time window between the checksum 
computation and the writing the data on the disk (which is done at the lower 
level via a DMA channel), where if the data is update the checksum would 
mismatch. This happens if we have two threads, where the first commits the data 
on the disk, and the second one updates the data (I think that both VM and 
database could behave so).
Though it only matters if you use O_DIRECT or the files in question are 
NOCOW.


In btrfs, a checksum mismatch creates an -EIO error during the reading. In a 
conventional filesystem (or a btrfs filesystem w/o datasum) there is no 
checksum, so this problem doesn't exist.

I am curious how ZFS solves this problem.
It doesn't support disabling COW or the O_DIRECT flag, so it just never 
has the problem in the first place.


However I have to point out that this problem is not solved by the COW. COW 
solved only the problem about an interrupted commit of the filesystem, where 
the data is update in place (so it is available by the user), but the metadata 
not.
COW is irrelevant if you're bypassing it.  It's only enforced for 
metadata so that you don't have to check the FS every time you mount it 
(because the way BTRFS uses it guarantees consistency of the metadata).




Even if not... I should be only a problem in case of a crash during
that,.. and than I'd still prefer to get the false positive than bad
data.


How you can know if it is a "bad data" or a "bad checksum" ?
You can't directly.  Just like you can't know which copy in a two-device 
MD RAID1 array is bad when they mismatch.


That's part of why I'm not all that fond of the idea of having checksums 
without COW, you need to verify the data using secondary means anyway, 
so why exactly should you waste time verifying it twice?




Anyway... it's not going to happen so the discussion is pointless.
I think people can probably use dm-integrity (which btw: does no CoW
either (IIRC) and still can provide integrity... ;-) ) to see whether
their data is valid.
No nice but since it won't change on btrfs, a possible alternative.


Even in this case I am curious about dm-integrity would sole this issue.
dm-integrity uses journaling, and actually based on the testing I've 
done, will typically have much worse

RE: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-14 Thread David Laight

From: Kees Cook
> Sent: 13 March 2018 22:15
...
> I'll send a "const_max()" which will refuse to work on
> non-constant-values (so it doesn't get accidentally used on variables
> that could be exposed to double-evaluation), and will work for stack
> array declarations (to avoid the overly-sensitive -Wvla checks).

ISTR the definitions were of the form:
char foo[max(sizeof (struct bah), sizeof (struct baz))];
This doesn't generate a 'foo' with the required alignment.
It would be much better to use a union.

David

Re: [PATCH 1/2] Btrfs-progs: check, fix false error reports for shared prealloc extents



On 2018年03月14日 18:33, Filipe Manana wrote:
> On Wed, Mar 14, 2018 at 1:32 AM, Qu Wenruo  wrote:
>>
>>
>> On 2018年03月14日 02:47, fdman...@kernel.org wrote:
>>> From: Filipe Manana 
>>>
>>> Under some cases the filesystem checker reports an error when it finds
>>> checksum items for an extent that is referenced by an inode as a prealloc
>>> extent. Such cases are not an error when the extent is actually shared
>>> (was cloned/reflinked) with other inodes and was written through one of
>>> those other inodes.
>>>
>>> Example:
>>>
>>>   $ mkfs.btrfs -f /dev/sdb
>>>   $ mount /dev/sdb /mnt
>>>
>>>   $ touch /mnt/foo
>>>   $ xfs_io -c "falloc 0 256K" /mnt/foo
>>>   $ sync
>>>
>>>   $ xfs_io -c "pwrite -S 0xab -b 0 256K" /mnt/foo
>>>   $ touch /mnt/bar
>>>   $ xfs_io -c "reflink /mnt/foo 0 0 256K" /mnt/bar
>>>   $ xfs_io -c "fsync" /mnt/bar
>>>
>>>   
>>>   $ mount /dev/sdb /mnt
>>>   $ umount /mnt
>>>
>>>   $ btrfs check /dev/sdc
>>>   Checking filesystem on /dev/sdb
>>>   UUID: 52d3006e-ee3b-40eb-aa21-e56253a03d39
>>>   checking extents
>>>   checking free space cache
>>>   checking fs roots
>>>   root 5 inode 257 errors 800, odd csum item
>>>   ERROR: errors found in fs roots
>>>   found 688128 bytes used, error(s) found
>>>   total csum bytes: 256
>>>   total tree bytes: 163840
>>>   total fs tree bytes: 65536
>>>   total extent tree bytes: 16384
>>>   btree space waste bytes: 138819
>>>   file data blocks allocated: 10747904
>>>referenced 10747904
>>>   $ echo $?
>>>   1
>>>
>>> So tech check to not report such cases as errors by checking if the
>>> extent is shared with other inodes and if so, consider it an error the
>>> existence of checksum items only if all those other inodes are referencing
>>> the extent as a prealloc extent.
>>> This case can be hit often when running the generic/475 testcase from
>>> fstests.
>>>
>>> A test case will follow in a separate patch.
>>>
>>> Signed-off-by: Filipe Manana 
>>
>> Looks good overall, but still some small nitpicks.
>>
>> One is, the fix is only for original mode, while the fix itself has
>> nothing mode dependent, so it's better to put the check into
>> check/mode-original.[ch] so lowmem mode can also benefit from it.
> 
> Ok, I started doing development on a release branch where the split
> doesn't exist, them moved to the development branch and wrongly
> assumed that shared code would be in main.c.
>>
>>> ---
>>>  check/main.c | 270 
>>> ++-
>>>  1 file changed, 268 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/check/main.c b/check/main.c
>>> index 392195ca..bb816311 100644
>>> --- a/check/main.c
>>> +++ b/check/main.c
>>> @@ -1424,6 +1424,264 @@ static int process_inode_extref(struct 
>>> extent_buffer *eb,
>>>
>>>  }
>>>
>>> +/*
>>> + * Check if the inode referenced by the given data reference uses the 
>>> extent
>>> + * at disk_bytenr as a non-prealloc extent.
>>> + *
>>> + * Returns 1 if true, 0 if false and < 0 on error.
>>> + */
>>
>> The check itself (along with check_prealloc_shared_data_ref) is good
>> enough to detect any regular file extents inside the prealloc extent.
>>
>> But when it finds any regular extents, it doesn't check if it's covered
>> by csum correctly.
>>
>> For example:
>>
>> |<-- prealloc extent range -->|
>> |<- the *only* regular extent ->|
>> |< range covered by csum >|
>> |<- ODD ->|
>>
>> Or
>> |<-- prealloc extent range -->|
>> |<- the *only* regular extent ->|
>> |<- csum ->|
>>|<-- ODD --->|
>>
>> So at least in theory, the best solution is not to just check if there
>> is any regular extents inside the large prealloc extent, but also check
>> how many csum bytes the regular extents contribute, and then compare it
>> to the result of count_csum_range().
> 
> Yes, I considered that initially but then moved to this simpler, but
> less accurate solution.
> So the problem is more complex then you think, as a prealloc extent
> can be partially written and then reflinked multiple times and
> fsync'ed.
> For example:
> 
> create inode A with prealloc extent [0, 256k[
> sync
> write into prealloc extent, range [0, 128k[ through inode A
> reflink prealloc extent into inode B
> reflink prealloc extent into inode C
> reflink prealloc extent into inode D
> fsync inodes B, C and D
> power fail
> remount to replay log
> 
> Now using a simple counter to count how many bytes the are covered by
> chekcsum items we would have, we would get 384K (128K * 3), more then
> the size of the extent.
> So we would have to track ranges and then ranges that partially
> overlap (due to non-zero offsets in the extent items).

Yep, I also find that we need that overlap check.

> 
> And this could go even more complex. It's doable, but I wanted to keep
> things simple for now.

I'm mostly

Re: [PATCH 1/2] Btrfs-progs: check, fix false error reports for shared prealloc extents

On Wed, Mar 14, 2018 at 9:19 AM, Nikolay Borisov  wrote:
>
>
> On 13.03.2018 20:47, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Under some cases the filesystem checker reports an error when it finds
>> checksum items for an extent that is referenced by an inode as a prealloc
>> extent. Such cases are not an error when the extent is actually shared
>> (was cloned/reflinked) with other inodes and was written through one of
>> those other inodes.
>>
>> Example:
>>
>>   $ mkfs.btrfs -f /dev/sdb
>>   $ mount /dev/sdb /mnt
>>
>>   $ touch /mnt/foo
>>   $ xfs_io -c "falloc 0 256K" /mnt/foo
>>   $ sync
>>
>>   $ xfs_io -c "pwrite -S 0xab -b 0 256K" /mnt/foo
>   ^^
> nit: You are actually missing the argument for the -b switch, how big
> should the blocks be ? Same thing applies to the commit message of your test

-b is there because originally I had a value there. It's irrelevant
the block size value, I simply forgot to delete -b.

>
>>   $ touch /mnt/bar
>>   $ xfs_io -c "reflink /mnt/foo 0 0 256K" /mnt/bar
>>   $ xfs_io -c "fsync" /mnt/bar
>>
>>   
>>   $ mount /dev/sdb /mnt
>>   $ umount /mnt
>>
>>   $ btrfs check /dev/sdc
>>   Checking filesystem on /dev/sdb
>>   UUID: 52d3006e-ee3b-40eb-aa21-e56253a03d39
>>   checking extents
>>   checking free space cache
>>   checking fs roots
>>   root 5 inode 257 errors 800, odd csum item
>>   ERROR: errors found in fs roots
>>   found 688128 bytes used, error(s) found
>>   total csum bytes: 256
>>   total tree bytes: 163840
>>   total fs tree bytes: 65536
>>   total extent tree bytes: 16384
>>   btree space waste bytes: 138819
>>   file data blocks allocated: 10747904
>>referenced 10747904
>>   $ echo $?
>>   1
>>
>> So tech check to not report such cases as errors by checking if the
>> extent is shared with other inodes and if so, consider it an error the
>> existence of checksum items only if all those other inodes are referencing
>> the extent as a prealloc extent.
>> This case can be hit often when running the generic/475 testcase from
>> fstests.
>>
>> A test case will follow in a separate patch.
>>
>> Signed-off-by: Filipe Manana 
>> ---
>>  check/main.c | 270 
>> ++-
>>  1 file changed, 268 insertions(+), 2 deletions(-)
>>
>> diff --git a/check/main.c b/check/main.c
>> index 392195ca..bb816311 100644
>> --- a/check/main.c
>> +++ b/check/main.c
>> @@ -1424,6 +1424,264 @@ static int process_inode_extref(struct extent_buffer 
>> *eb,
>>
>>  }
>>
>> +/*
>> + * Check if the inode referenced by the given data reference uses the extent
>> + * at disk_bytenr as a non-prealloc extent.
>> + *
>> + * Returns 1 if true, 0 if false and < 0 on error.
>> + */
>> +static int check_prealloc_data_ref(struct btrfs_fs_info *fs_info,
>> +u64 ino,
>> +u64 disk_bytenr,
>> +struct btrfs_extent_data_ref *dref,
>> +struct extent_buffer *eb)
>> +{
>> + u64 rootid = btrfs_extent_data_ref_root(eb, dref);
>> + u64 objectid = btrfs_extent_data_ref_objectid(eb, dref);
>> + u64 offset = btrfs_extent_data_ref_offset(eb, dref);
>> + struct btrfs_root *root;
>> + struct btrfs_key key;
>> + struct btrfs_path path;
>> + int ret;
>> +
>> + if (objectid == ino)
>> + return 0;
>> +
>> + btrfs_init_path();
>> + key.objectid = rootid;
>> + key.type = BTRFS_ROOT_ITEM_KEY;
>> + key.offset = (u64)-1;
>> + root = btrfs_read_fs_root(fs_info, );
>> + if (IS_ERR(root)) {
>> + ret = PTR_ERR(root);
>> + goto out;
>> + }
>> +
>> + key.objectid = objectid;
>> + key.type = BTRFS_EXTENT_DATA_KEY;
>> + key.offset = offset;
>> + ret = btrfs_search_slot(NULL, root, , , 0, 0);
>> + if (ret > 0) {
>> + fprintf(stderr,
>> + "Missing file extent item for inode %llu, root %llu, 
>> offset %llu",
>> + objectid, rootid, offset);
>> + ret = -ENOENT;
>> + }
>> + if (ret < 0)
>> + goto out;
>> +
>> + while (true) {
>> + struct btrfs_file_extent_item *fi;
>> + int extent_type;
>> +
>> + if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
>> + ret = btrfs_next_leaf(fs_info->extent_root, );
>> + if (ret < 0)
>> + goto out;
>> + if (ret > 0)
>> + break;
>> + }
>> +
>> + btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
>> + if (key.objectid != objectid ||
>> + key.type != BTRFS_EXTENT_DATA_KEY)
>> + break;
>> +
>> + fi = btrfs_item_ptr(path.nodes[0], path.slots[0],
>> +

Re: [PATCH 1/2] Btrfs-progs: check, fix false error reports for shared prealloc extents

On Wed, Mar 14, 2018 at 8:19 AM, Nikolay Borisov  wrote:
>
>
> On 13.03.2018 20:47, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Under some cases the filesystem checker reports an error when it finds
>> checksum items for an extent that is referenced by an inode as a prealloc
>> extent. Such cases are not an error when the extent is actually shared
>> (was cloned/reflinked) with other inodes and was written through one of
>> those other inodes.
>>
>> Example:
>>
>>   $ mkfs.btrfs -f /dev/sdb
>>   $ mount /dev/sdb /mnt
>>
>>   $ touch /mnt/foo
>>   $ xfs_io -c "falloc 0 256K" /mnt/foo
>>   $ sync
>>
>>   $ xfs_io -c "pwrite -S 0xab -b 0 256K" /mnt/foo
>>   $ touch /mnt/bar
>>   $ xfs_io -c "reflink /mnt/foo 0 0 256K" /mnt/bar
>>   $ xfs_io -c "fsync" /mnt/bar
>>
>>   
>>   $ mount /dev/sdb /mnt
>>   $ umount /mnt
>>
>>   $ btrfs check /dev/sdc
>>   Checking filesystem on /dev/sdb
>>   UUID: 52d3006e-ee3b-40eb-aa21-e56253a03d39
>>   checking extents
>>   checking free space cache
>>   checking fs roots
>>   root 5 inode 257 errors 800, odd csum item
>>   ERROR: errors found in fs roots
>>   found 688128 bytes used, error(s) found
>>   total csum bytes: 256
>>   total tree bytes: 163840
>>   total fs tree bytes: 65536
>>   total extent tree bytes: 16384
>>   btree space waste bytes: 138819
>>   file data blocks allocated: 10747904
>>referenced 10747904
>>   $ echo $?
>>   1
>>
>> So tech check to not report such cases as errors by checking if the
>> extent is shared with other inodes and if so, consider it an error the
>> existence of checksum items only if all those other inodes are referencing
>> the extent as a prealloc extent.
>> This case can be hit often when running the generic/475 testcase from
>> fstests.
>>
>> A test case will follow in a separate patch.
>>
>> Signed-off-by: Filipe Manana 
>> ---
>>  check/main.c | 270 
>> ++-
>>  1 file changed, 268 insertions(+), 2 deletions(-)
>>
>> diff --git a/check/main.c b/check/main.c
>> index 392195ca..bb816311 100644
>> --- a/check/main.c
>> +++ b/check/main.c
>> @@ -1424,6 +1424,264 @@ static int process_inode_extref(struct extent_buffer 
>> *eb,
>>
>>  }
>>
>> +/*
>> + * Check if the inode referenced by the given data reference uses the extent
>> + * at disk_bytenr as a non-prealloc extent.
>> + *
>> + * Returns 1 if true, 0 if false and < 0 on error.
>> + */
>> +static int check_prealloc_data_ref(struct btrfs_fs_info *fs_info,
>> +u64 ino,
>> +u64 disk_bytenr,
>> +struct btrfs_extent_data_ref *dref,
>> +struct extent_buffer *eb)
>> +{
>> + u64 rootid = btrfs_extent_data_ref_root(eb, dref);
>> + u64 objectid = btrfs_extent_data_ref_objectid(eb, dref);
>
> nit: Shouldn't this ino be equal to the ino passed to the function? If
> so objectid can be removed and when setting up the key for the second
> search you just assigne ino to its objectid

No, we want to skip if it's equal, because we're only interested in
references for inodes other then the one being currently processed.

>
>> + u64 offset = btrfs_extent_data_ref_offset(eb, dref);
>> + struct btrfs_root *root;
>> + struct btrfs_key key;
>> + struct btrfs_path path;
>> + int ret;
>> +
>> + if (objectid == ino)
>> + return 0;
>> +
>> + btrfs_init_path();
>> + key.objectid = rootid;
>> + key.type = BTRFS_ROOT_ITEM_KEY;
>> + key.offset = (u64)-1;
>> + root = btrfs_read_fs_root(fs_info, );
>> + if (IS_ERR(root)) {
>> + ret = PTR_ERR(root);
>> + goto out;
>> + }
>> +
>> + key.objectid = objectid;
>> + key.type = BTRFS_EXTENT_DATA_KEY;
>> + key.offset = offset;
>> + ret = btrfs_search_slot(NULL, root, , , 0, 0);
>> + if (ret > 0) {
>> + fprintf(stderr,
>> + "Missing file extent item for inode %llu, root %llu, 
>> offset %llu",
>> + objectid, rootid, offset);
>> + ret = -ENOENT;
>> + }
>> + if (ret < 0)
>> + goto out;
>> +
>> + while (true) {
>> + struct btrfs_file_extent_item *fi;
>> + int extent_type;
>> +
>> + if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
>> + ret = btrfs_next_leaf(fs_info->extent_root, );
>> + if (ret < 0)
>> + goto out;
>> + if (ret > 0)
>> + break;
>> + }
>> +
>> + btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
>> + if (key.objectid != objectid ||
>> + key.type != BTRFS_EXTENT_DATA_KEY)
>> + break;
>> +
>> + fi = btrfs_item_ptr(path.nodes[0], path.slots[0],
>> +

Re: [PATCH RFC 3/3] fstests: generic: Check the fs after each FUA writes



On 2018年03月14日 18:27, Amir Goldstein wrote:
> On Wed, Mar 14, 2018 at 11:02 AM, Qu Wenruo  wrote:
>> Basic test case which triggers fsstress with dm-log-writes, and then
>> check the fs after each FUA writes.
>> With needed infrastructure and special handlers for journal based fs.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>> In my test, xfs and btrfs survives while ext4 would report error during fsck.
>>
>> My current biggest concern is, we abuse $TEST_DEV and mkfs on it all by
>> ourselves. Not sure if it's allowed.
> 
> Definitely not allowed.
> 
> I suppose you could either create an LVM volume on scratch for the tested
> fs and leave room for a snapshot target on scratch or do something similar
> with LOGWRITES_DEV, which may be more appropriate in the context of
> common/dmlogwrites helper.

This sounds much better.

> 
> I see you are posting this generic series independently from the btrsf 
> specific
> series, but the two series have conflicting changes. How do intent for
> the final merged result to look like?

Btrfs specific one will be discard and merged into this.
The found btrfs specific problem proves to be harmless now.

Thanks,
Qu

> 
> 
>> ---
>>  common/dmlogwrites| 119 
>>  tests/generic/481 | 124 
>> ++
>>  tests/generic/481.out |   2 +
>>  tests/generic/group   |   1 +
>>  4 files changed, 246 insertions(+)
>>  create mode 100755 tests/generic/481
>>  create mode 100644 tests/generic/481.out
>>
>> diff --git a/common/dmlogwrites b/common/dmlogwrites
>> index 467b872e..258f5887 100644
>> --- a/common/dmlogwrites
>> +++ b/common/dmlogwrites
>> @@ -126,3 +126,122 @@ _log_writes_cleanup()
>> $UDEV_SETTLE_PROG >/dev/null 2>&1
>> _log_writes_remove
>>  }
>> +
>> +# Convert log writes mark to entry number
>> +# Result entry number is output to stdout, could be empty if not found
>> +_log_writes_mark_to_entry_number()
>> +{
>> +   local _mark=$1
>> +   local ret
>> +
>> +   [ -z "$_mark" ] && _fatal \
>> +   "mark must be given for _log_writes_mark_to_entry_number"
>> +
>> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
>> +   --end-mark $_mark 2> /dev/null)
>> +   [ -z "$ret" ] && return
>> +   ret=$(echo "$ret" | cut -f1 -d\@)
>> +   echo "mark $_mark has entry number $ret" >> $seqres.full
>> +   echo "$ret"
>> +}
>> +
>> +# Find next fua write entry number
>> +# Result entry number is output to stdout, could be empty if not found
>> +_log_writes_find_next_fua()
>> +{
>> +   local _start_entry=$1
>> +   local ret
>> +
>> +   [ -z "$_start_entry" ] && _start_entry=0
>> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
>> + --next-fua --start-entry $_start_entry 2> /dev/null)
>> +   [ -z "$ret" ] && return
>> +
>> +   ret=$(echo "$ret" | cut -f1 -d\@)
>> +   echo "next fua is entry number $ret" >> $seqres.full
>> +   echo "$ret"
>> +}
>> +
>> +# Replay log range to specified entry
>> +# $1:  End entry. The entry with this number *WILL* be replayed
>> +# $2:  Start entry. If not specified, start from the first entry.
>> +# $3:  Verbose. If set to 'y' do verbose output
>> +_log_writes_replay_log_entry_range()
>> +{
>> +   local _end=$1
>> +   local _start=$2
>> +   local _verbose=$3
>> +
>> +   [ -z "$_end" ] && _fatal \
>> +   "end entry must be specified for _log_writes_replay_log_entry_range"
>> +
>> +   [ "x$verbose" == "xy" ] && _verbose="-v"
> 
> Those xy tricks are really not needed with bash.
> "$verbose" == "y" is nicer and better.
> 
> 
>> +   [ -z "$_start" ] && _start=0
>> +   [ "$_start" -gt "$_end" ] && _fatal \
>> +   "wrong parameter order for 
>> _log_writes_replay_log_entry_range:start=$_start end=$_end"
>> +
>> +   # To ensure we replay the last entry, for _start == 0 case,
>> +   # we need to manually increase the end entry number to ensure
>> +   # it's played
>> +   echo "=== replay from $_start to $_end ===" >> $seqres.full
>> +   $here/src/log-writes/replay-log --log $LOGWRITES_DEV \
>> +   --replay $SCRATCH_DEV --start-entry $_start \
>> +   --limit $(($_end - $_start + 1)) $_verbose \
>> +   >> $seqres.full 2>&1
>> +   [ $? -ne 0 ] && _fatal "replay failed"
>> +}
>> +
>> +_log_writes_cleanup_snapshot()
>> +{
>> +   $UDEV_SETTLE_PROG > /dev/null 2>&1
>> +   $DMSETUP_PROG remove "$DMLOGWRITES_SNAPSHOT_NAME" > /dev/null 2>&1
>> +   $DMSETUP_PROG remove "$DMLOGWRITES_ORIGIN_NAME" > /dev/null 2>&1
>> +}
>> +
>> +# Helper to create snapshot of a the replayed device
>> +# Useful for journal based filesystem such as XFS and Ext4 to replay
>> +# their journal without touching the replay device, so that we can
>> +# continue replaying other than replay from the beginning.
>>

Re: [PATCH 1/2] Btrfs-progs: check, fix false error reports for shared prealloc extents

On Wed, Mar 14, 2018 at 1:32 AM, Qu Wenruo  wrote:
>
>
> On 2018年03月14日 02:47, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Under some cases the filesystem checker reports an error when it finds
>> checksum items for an extent that is referenced by an inode as a prealloc
>> extent. Such cases are not an error when the extent is actually shared
>> (was cloned/reflinked) with other inodes and was written through one of
>> those other inodes.
>>
>> Example:
>>
>>   $ mkfs.btrfs -f /dev/sdb
>>   $ mount /dev/sdb /mnt
>>
>>   $ touch /mnt/foo
>>   $ xfs_io -c "falloc 0 256K" /mnt/foo
>>   $ sync
>>
>>   $ xfs_io -c "pwrite -S 0xab -b 0 256K" /mnt/foo
>>   $ touch /mnt/bar
>>   $ xfs_io -c "reflink /mnt/foo 0 0 256K" /mnt/bar
>>   $ xfs_io -c "fsync" /mnt/bar
>>
>>   
>>   $ mount /dev/sdb /mnt
>>   $ umount /mnt
>>
>>   $ btrfs check /dev/sdc
>>   Checking filesystem on /dev/sdb
>>   UUID: 52d3006e-ee3b-40eb-aa21-e56253a03d39
>>   checking extents
>>   checking free space cache
>>   checking fs roots
>>   root 5 inode 257 errors 800, odd csum item
>>   ERROR: errors found in fs roots
>>   found 688128 bytes used, error(s) found
>>   total csum bytes: 256
>>   total tree bytes: 163840
>>   total fs tree bytes: 65536
>>   total extent tree bytes: 16384
>>   btree space waste bytes: 138819
>>   file data blocks allocated: 10747904
>>referenced 10747904
>>   $ echo $?
>>   1
>>
>> So tech check to not report such cases as errors by checking if the
>> extent is shared with other inodes and if so, consider it an error the
>> existence of checksum items only if all those other inodes are referencing
>> the extent as a prealloc extent.
>> This case can be hit often when running the generic/475 testcase from
>> fstests.
>>
>> A test case will follow in a separate patch.
>>
>> Signed-off-by: Filipe Manana 
>
> Looks good overall, but still some small nitpicks.
>
> One is, the fix is only for original mode, while the fix itself has
> nothing mode dependent, so it's better to put the check into
> check/mode-original.[ch] so lowmem mode can also benefit from it.

Ok, I started doing development on a release branch where the split
doesn't exist, them moved to the development branch and wrongly
assumed that shared code would be in main.c.
>
>> ---
>>  check/main.c | 270 
>> ++-
>>  1 file changed, 268 insertions(+), 2 deletions(-)
>>
>> diff --git a/check/main.c b/check/main.c
>> index 392195ca..bb816311 100644
>> --- a/check/main.c
>> +++ b/check/main.c
>> @@ -1424,6 +1424,264 @@ static int process_inode_extref(struct extent_buffer 
>> *eb,
>>
>>  }
>>
>> +/*
>> + * Check if the inode referenced by the given data reference uses the extent
>> + * at disk_bytenr as a non-prealloc extent.
>> + *
>> + * Returns 1 if true, 0 if false and < 0 on error.
>> + */
>
> The check itself (along with check_prealloc_shared_data_ref) is good
> enough to detect any regular file extents inside the prealloc extent.
>
> But when it finds any regular extents, it doesn't check if it's covered
> by csum correctly.
>
> For example:
>
> |<-- prealloc extent range -->|
> |<- the *only* regular extent ->|
> |< range covered by csum >|
> |<- ODD ->|
>
> Or
> |<-- prealloc extent range -->|
> |<- the *only* regular extent ->|
> |<- csum ->|
>|<-- ODD --->|
>
> So at least in theory, the best solution is not to just check if there
> is any regular extents inside the large prealloc extent, but also check
> how many csum bytes the regular extents contribute, and then compare it
> to the result of count_csum_range().

Yes, I considered that initially but then moved to this simpler, but
less accurate solution.
So the problem is more complex then you think, as a prealloc extent
can be partially written and then reflinked multiple times and
fsync'ed.
For example:

create inode A with prealloc extent [0, 256k[
sync
write into prealloc extent, range [0, 128k[ through inode A
reflink prealloc extent into inode B
reflink prealloc extent into inode C
reflink prealloc extent into inode D
fsync inodes B, C and D
power fail
remount to replay log

Now using a simple counter to count how many bytes the are covered by
chekcsum items we would have, we would get 384K (128K * 3), more then
the size of the extent.
So we would have to track ranges and then ranges that partially
overlap (due to non-zero offsets in the extent items).

And this could go even more complex. It's doable, but I wanted to keep
things simple for now.

thanks

>
> Thanks,
> Qu
>
>> +static int check_prealloc_data_ref(struct btrfs_fs_info *fs_info,
>> +u64 ino,
>> +u64 disk_bytenr,
>> +struct btrfs_extent_data_ref *dref,
>> +

Re: [PATCH RFC 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-14 Thread Amir Goldstein

On Wed, Mar 14, 2018 at 11:02 AM, Qu Wenruo  wrote:
> Basic test case which triggers fsstress with dm-log-writes, and then
> check the fs after each FUA writes.
> With needed infrastructure and special handlers for journal based fs.
>
> Signed-off-by: Qu Wenruo 
> ---
> In my test, xfs and btrfs survives while ext4 would report error during fsck.
>
> My current biggest concern is, we abuse $TEST_DEV and mkfs on it all by
> ourselves. Not sure if it's allowed.

Definitely not allowed.

I suppose you could either create an LVM volume on scratch for the tested
fs and leave room for a snapshot target on scratch or do something similar
with LOGWRITES_DEV, which may be more appropriate in the context of
common/dmlogwrites helper.

I see you are posting this generic series independently from the btrsf specific
series, but the two series have conflicting changes. How do intent for
the final merged result to look like?


> ---
>  common/dmlogwrites| 119 
>  tests/generic/481 | 124 
> ++
>  tests/generic/481.out |   2 +
>  tests/generic/group   |   1 +
>  4 files changed, 246 insertions(+)
>  create mode 100755 tests/generic/481
>  create mode 100644 tests/generic/481.out
>
> diff --git a/common/dmlogwrites b/common/dmlogwrites
> index 467b872e..258f5887 100644
> --- a/common/dmlogwrites
> +++ b/common/dmlogwrites
> @@ -126,3 +126,122 @@ _log_writes_cleanup()
> $UDEV_SETTLE_PROG >/dev/null 2>&1
> _log_writes_remove
>  }
> +
> +# Convert log writes mark to entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_mark_to_entry_number()
> +{
> +   local _mark=$1
> +   local ret
> +
> +   [ -z "$_mark" ] && _fatal \
> +   "mark must be given for _log_writes_mark_to_entry_number"
> +
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> +   --end-mark $_mark 2> /dev/null)
> +   [ -z "$ret" ] && return
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "mark $_mark has entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Find next fua write entry number
> +# Result entry number is output to stdout, could be empty if not found
> +_log_writes_find_next_fua()
> +{
> +   local _start_entry=$1
> +   local ret
> +
> +   [ -z "$_start_entry" ] && _start_entry=0
> +   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
> + --next-fua --start-entry $_start_entry 2> /dev/null)
> +   [ -z "$ret" ] && return
> +
> +   ret=$(echo "$ret" | cut -f1 -d\@)
> +   echo "next fua is entry number $ret" >> $seqres.full
> +   echo "$ret"
> +}
> +
> +# Replay log range to specified entry
> +# $1:  End entry. The entry with this number *WILL* be replayed
> +# $2:  Start entry. If not specified, start from the first entry.
> +# $3:  Verbose. If set to 'y' do verbose output
> +_log_writes_replay_log_entry_range()
> +{
> +   local _end=$1
> +   local _start=$2
> +   local _verbose=$3
> +
> +   [ -z "$_end" ] && _fatal \
> +   "end entry must be specified for _log_writes_replay_log_entry_range"
> +
> +   [ "x$verbose" == "xy" ] && _verbose="-v"

Those xy tricks are really not needed with bash.
"$verbose" == "y" is nicer and better.


> +   [ -z "$_start" ] && _start=0
> +   [ "$_start" -gt "$_end" ] && _fatal \
> +   "wrong parameter order for 
> _log_writes_replay_log_entry_range:start=$_start end=$_end"
> +
> +   # To ensure we replay the last entry, for _start == 0 case,
> +   # we need to manually increase the end entry number to ensure
> +   # it's played
> +   echo "=== replay from $_start to $_end ===" >> $seqres.full
> +   $here/src/log-writes/replay-log --log $LOGWRITES_DEV \
> +   --replay $SCRATCH_DEV --start-entry $_start \
> +   --limit $(($_end - $_start + 1)) $_verbose \
> +   >> $seqres.full 2>&1
> +   [ $? -ne 0 ] && _fatal "replay failed"
> +}
> +
> +_log_writes_cleanup_snapshot()
> +{
> +   $UDEV_SETTLE_PROG > /dev/null 2>&1
> +   $DMSETUP_PROG remove "$DMLOGWRITES_SNAPSHOT_NAME" > /dev/null 2>&1
> +   $DMSETUP_PROG remove "$DMLOGWRITES_ORIGIN_NAME" > /dev/null 2>&1
> +}
> +
> +# Helper to create snapshot of a the replayed device
> +# Useful for journal based filesystem such as XFS and Ext4 to replay
> +# their journal without touching the replay device, so that we can
> +# continue replaying other than replay from the beginning.
> +# $1:  Snapshot device
> +_log_writes_create_snapshot()
> +{
> +   _require_dm_target snapshot
> +
> +   local snapshot_dev=$1
> +   local cow_base=""
> +
> +   [ -z "$snapshot_dev" ] && _fatal \
> +   "@device must be specified for _log_writes_create_snapshot"
> +   local size=$(blockdev --getsz $SCRATCH_DEV)
> +

Re: FS unmountable after RAID/LVM problems



On 2018年03月14日 18:06, Dirk Gouders wrote:
> Qu Wenruo  writes:
> 
>> On 2018年03月14日 17:36, Dirk Gouders wrote:
>>> Qu Wenruo  writes:
>>>
 On 2018年03月14日 16:53, Dirk Gouders wrote:
> Qu Wenruo  writes:
>
>> On 2018年03月13日 22:49, Dirk Gouders wrote:
>> [snip]

 # btrfs inspect dump-tree -b 848986112 /dev/loop0p1
 # btrfs inspect dump-tree -b 72089600 /dev/loop0p1
>>>
>>> OK.
>>>
>>> (This mail gets a bit long but I don't want to snip probably important
>>>  information above.)
>>>
>>
>> Feel free to snip.
>> As the involved tree block is not shown anywhere.
>>
>> So it's not any root node corrupted.
>> It may be some extent tree node corrupted in this case.
>>
>> While to inspect it, we need some new functionality in btrfs inspect 
>> tree.
>>
>> Before that, would you please try the following patch and to see if it
>> helps btrfs-restore to salvage any data?
>
> I tried it and got the following output:
>
> # btrfs restore /dev/loop0p1 /mnt/
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> checksum verify failed on 363069440 found DC09290B wanted C630FD61
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> bytenr mismatch, want=363069440, have=17552567724568668829
> Could not open root, trying backup super
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> checksum verify failed on 363069440 found DC09290B wanted C630FD61
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> bytenr mismatch, want=363069440, have=17552567724568668829
> Could not open root, trying backup super
> ERROR: superblock bytenr 274877906944 is larger than device size 
> 10741612544
> Could not open root, trying backup super

 So it's still something important in the tree.

 Would you please apply this patch?
 https://patchwork.kernel.org/patch/10281329/

 And then dump the tree again using that newly added -f option?
 (Both stdout and stderr is needed)

 The dump command would be:
 # btrfs inspect dump-tree -f -b 

 Needed bytenrs would be:
 848773120  tree root
 848789504  extent root (My primary guess)
>>>
>>> I am currently preparing the diagnosis data but after the above bytenr
>>> the log grew to already 28MB.  Should I send all that data to the list?
>>
>> Nope, stderr is enough.
> 
> OK, I will attach the output to the end.  The output is separated by the
> command lines, so searching for "inspect" helps for navigation.
> 
> For extend root and fs root I provide only stderr, because they grew
> stdout by 28 resp. 150 MB.

Thanks for all your effort!

It's clear the problem is the extent tree.

I'll try to enhance open_ctree() to allow btrfs-restore to continue even
if extent tree is corrupted asap.

Thanks,
Qu

> 
> Thanks,
> 
> Dirk
> 
>>
>>>
>>> Thanks,
>>>
>>> Dirk
>>>
 30408704   dev root
 850509824  fs root (this could contain *FILENAME*, please censor
  them if needed, and it may be large)
 212353024  uuid tree (not really imporatant)

 And if it's extent root, we could enhance btrfs-progs open_ctree() to
 handle it for RO mode (needed by btrfs-restore)

 Thanks,
 Qu
> 



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 2/3] fstests: log-writes: Add support for METADATA flag

2018-03-14 Thread Amir Goldstein

On Wed, Mar 14, 2018 at 11:02 AM, Qu Wenruo  wrote:
> Signed-off-by: Qu Wenruo 
Reviewed-by: Amir Goldstein 

> ---
>  src/log-writes/log-writes.c | 3 ++-
>  src/log-writes/log-writes.h | 9 +
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
> index a872429d..5dc22c24 100644
> --- a/src/log-writes/log-writes.c
> +++ b/src/log-writes/log-writes.c
> @@ -130,7 +130,8 @@ struct flags_to_str_entry {
> DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
> DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
> DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
> -   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
> +   DEFINE_LOG_FLAGS_STR_ENTRY(MARK),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(METADATA)
>  };
>
>  #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> diff --git a/src/log-writes/log-writes.h b/src/log-writes/log-writes.h
> index 35ca3583..75fb8ac0 100644
> --- a/src/log-writes/log-writes.h
> +++ b/src/log-writes/log-writes.h
> @@ -20,10 +20,11 @@ typedef __u32 u32;
>  /*
>   * Constants copied from kernel file drivers/md/dm-log-writes.c
>   */
> -#define LOG_FLUSH_FLAG (1 << 0)
> -#define LOG_FUA_FLAG (1 << 1)
> -#define LOG_DISCARD_FLAG (1 << 2)
> -#define LOG_MARK_FLAG (1 << 3)
> +#define LOG_FLUSH_FLAG (1 << 0)
> +#define LOG_FUA_FLAG   (1 << 1)
> +#define LOG_DISCARD_FLAG   (1 << 2)
> +#define LOG_MARK_FLAG  (1 << 3)
> +#define LOG_METADATA_FLAG  (1 << 4)
>
>  #define WRITE_LOG_VERSION 1
>  #define WRITE_LOG_MAGIC 0x6a736677736872
> --
> 2.15.1
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] fstests: log-writes: Add support to output human readable flags

2018-03-14 Thread Amir Goldstein

On Wed, Mar 14, 2018 at 11:02 AM, Qu Wenruo  wrote:
> Also change the flag numeric output to hex.
>
> Signed-off-by: Qu Wenruo 
Reviewed-by: Amir Goldstein 

> ---
>  src/log-writes/log-writes.c | 70 
> -
>  1 file changed, 63 insertions(+), 7 deletions(-)
>
> diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
> index 09391574..a872429d 100644
> --- a/src/log-writes/log-writes.c
> +++ b/src/log-writes/log-writes.c
> @@ -120,6 +120,58 @@ int log_discard(struct log *log, struct log_write_entry 
> *entry)
> return 0;
>  }
>
> +#define DEFINE_LOG_FLAGS_STR_ENTRY(x)  \
> +   {LOG_##x##_FLAG, #x}
> +
> +struct flags_to_str_entry {
> +   u64 flags;
> +   const char *str;
> +} log_flags_table[] = {
> +   DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
> +   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
> +};
> +
> +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
> +#define LOG_FLAGS_BUF_SIZE 128
> +/*
> + * Convert numeric flags to human readable flags.
> + * @flags: numeric flags
> + * @buf:   output buffer for human readable string.
> + * must have enough space (LOG_FLAGS_BUF_SIZE) to contain all
> + * the string
> + */
> +static void entry_flags_to_str(u64 flags, char *buf)
> +{
> +   int empty = 1;
> +   int left_len;
> +   int i;
> +
> +   buf[0] = '\0';
> +   for (i = 0; i < ARRAY_SIZE(log_flags_table); i++) {
> +   if (flags & log_flags_table[i].flags) {
> +   if (!empty)
> +   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
> +   empty = 0;
> +   strncat(buf, log_flags_table[i].str, 
> LOG_FLAGS_BUF_SIZE);
> +   flags &= ~log_flags_table[i].flags;
> +   }
> +   }
> +   if (flags) {
> +   if (!empty)
> +   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
> +   empty = 0;
> +   left_len = LOG_FLAGS_BUF_SIZE - strnlen(buf,
> +   LOG_FLAGS_BUF_SIZE);
> +   if (left_len > 0)
> +   snprintf(buf + strnlen(buf, LOG_FLAGS_BUF_SIZE),
> +left_len, "UNKNOWN.0x%llx", flags);
> +   }
> +   if (empty)
> +   strncpy(buf, "NONE", LOG_FLAGS_BUF_SIZE);
> +}
> +
>  /*
>   * @log: the log we are replaying.
>   * @entry: entry to be replayed.
> @@ -179,6 +231,7 @@ int log_replay_next_entry(struct log *log, struct 
> log_write_entry *entry,
> size_t read_size = read_data ? log->sectorsize :
> sizeof(struct log_write_entry);
> char *buf;
> +   char flags_buf[LOG_FLAGS_BUF_SIZE];
> ssize_t ret;
> off_t offset;
> int skip = 0;
> @@ -210,19 +263,20 @@ int log_replay_next_entry(struct log *log, struct 
> log_write_entry *entry,
> log->cur_pos += read_size;
> }
>
> +   flags = le64_to_cpu(entry->flags);
> +   entry_flags_to_str(flags, flags_buf);
> skip = log_should_skip(log, entry);
> if (log_writes_verbose > 1 || (log_writes_verbose && !skip)) {
> -   printf("%s %d@%llu: sector %llu, size %llu, flags %llu\n",
> +   printf("%s %d@%llu: sector %llu, size %llu, flags 
> 0x%llx(%s)\n",
>skip ? "skipping" : "replaying",
>(int)log->cur_entry - 1, log->cur_pos / 
> log->sectorsize,
>(unsigned long long)le64_to_cpu(entry->sector),
>(unsigned long long)size,
> -  (unsigned long long)le64_to_cpu(entry->flags));
> +  (unsigned long long)flags, flags_buf);
> }
> if (!size)
> return 0;
>
> -   flags = le64_to_cpu(entry->flags);
> if (flags & LOG_DISCARD_FLAG)
> return log_discard(log, entry);
>
> @@ -301,7 +355,7 @@ int log_seek_entry(struct log *log, u64 entry_num)
> return -1;
> }
> if (log_writes_verbose > 1)
> -   printf("seek entry %d@%llu: %llu, size %llu, flags 
> %llu\n",
> +   printf("seek entry %d@%llu: %llu, size %llu, flags 
> 0x%llx\n",
>(int)i, log->cur_pos / log->sectorsize,
>(unsigned long long)le64_to_cpu(entry.sector),
>(unsigned long 
> long)le64_to_cpu(entry.nr_sectors),
> @@ -339,6 +393,7 @@ int log_seek_next_entry(struct log *log, struct 
> log_write_entry *entry,
> size_t read_size = read_data ? log->sectorsize :
> sizeof(struct log_write_entry);
> u64 flags;
> +   char flags_buf[LOG_FLAGS_BUF_SIZE];
>

Re: FS unmountable after RAID/LVM problems



On 2018年03月14日 17:36, Dirk Gouders wrote:
> Qu Wenruo  writes:
> 
>> On 2018年03月14日 16:53, Dirk Gouders wrote:
>>> Qu Wenruo  writes:
>>>
 On 2018年03月13日 22:49, Dirk Gouders wrote:
 [snip]
>>
>> # btrfs inspect dump-tree -b 848986112 /dev/loop0p1
>> # btrfs inspect dump-tree -b 72089600 /dev/loop0p1
>
> OK.
>
> (This mail gets a bit long but I don't want to snip probably important
>  information above.)
>

 Feel free to snip.
 As the involved tree block is not shown anywhere.

 So it's not any root node corrupted.
 It may be some extent tree node corrupted in this case.

 While to inspect it, we need some new functionality in btrfs inspect tree.

 Before that, would you please try the following patch and to see if it
 helps btrfs-restore to salvage any data?
>>>
>>> I tried it and got the following output:
>>>
>>> # btrfs restore /dev/loop0p1 /mnt/
>>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>>> checksum verify failed on 363069440 found DC09290B wanted C630FD61
>>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>>> bytenr mismatch, want=363069440, have=17552567724568668829
>>> Could not open root, trying backup super
>>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>>> checksum verify failed on 363069440 found DC09290B wanted C630FD61
>>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>>> bytenr mismatch, want=363069440, have=17552567724568668829
>>> Could not open root, trying backup super
>>> ERROR: superblock bytenr 274877906944 is larger than device size 10741612544
>>> Could not open root, trying backup super
>>
>> So it's still something important in the tree.
>>
>> Would you please apply this patch?
>> https://patchwork.kernel.org/patch/10281329/
>>
>> And then dump the tree again using that newly added -f option?
>> (Both stdout and stderr is needed)
>>
>> The dump command would be:
>> # btrfs inspect dump-tree -f -b 
>>
>> Needed bytenrs would be:
>> 848773120tree root
>> 848789504extent root (My primary guess)
> 
> I am currently preparing the diagnosis data but after the above bytenr
> the log grew to already 28MB.  Should I send all that data to the list?

Nope, stderr is enough.

Thanks,
Qu

> 
> Thanks,
> 
> Dirk
> 
>> 30408704 dev root
>> 850509824fs root (this could contain *FILENAME*, please censor
>>  them if needed, and it may be large)
>> 212353024uuid tree (not really imporatant)
>>
>> And if it's extent root, we could enhance btrfs-progs open_ctree() to
>> handle it for RO mode (needed by btrfs-restore)
>>
>> Thanks,
>> Qu



signature.asc
Description: OpenPGP digital signature

Re: FS unmountable after RAID/LVM problems

2018-03-14 Thread Dirk Gouders

Qu Wenruo  writes:

> On 2018年03月14日 16:53, Dirk Gouders wrote:
>> Qu Wenruo  writes:
>> 
>>> On 2018年03月13日 22:49, Dirk Gouders wrote:
>>> [snip]
>
> # btrfs inspect dump-tree -b 848986112 /dev/loop0p1
> # btrfs inspect dump-tree -b 72089600 /dev/loop0p1

 OK.

 (This mail gets a bit long but I don't want to snip probably important
  information above.)

>>>
>>> Feel free to snip.
>>> As the involved tree block is not shown anywhere.
>>>
>>> So it's not any root node corrupted.
>>> It may be some extent tree node corrupted in this case.
>>>
>>> While to inspect it, we need some new functionality in btrfs inspect tree.
>>>
>>> Before that, would you please try the following patch and to see if it
>>> helps btrfs-restore to salvage any data?
>> 
>> I tried it and got the following output:
>> 
>> # btrfs restore /dev/loop0p1 /mnt/
>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>> checksum verify failed on 363069440 found DC09290B wanted C630FD61
>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>> bytenr mismatch, want=363069440, have=17552567724568668829
>> Could not open root, trying backup super
>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>> checksum verify failed on 363069440 found DC09290B wanted C630FD61
>> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
>> bytenr mismatch, want=363069440, have=17552567724568668829
>> Could not open root, trying backup super
>> ERROR: superblock bytenr 274877906944 is larger than device size 10741612544
>> Could not open root, trying backup super
>
> So it's still something important in the tree.
>
> Would you please apply this patch?
> https://patchwork.kernel.org/patch/10281329/
>
> And then dump the tree again using that newly added -f option?
> (Both stdout and stderr is needed)
>
> The dump command would be:
> # btrfs inspect dump-tree -f -b 
>
> Needed bytenrs would be:
> 848773120 tree root
> 848789504 extent root (My primary guess)

I am currently preparing the diagnosis data but after the above bytenr
the log grew to already 28MB.  Should I send all that data to the list?

Thanks,

Dirk

> 30408704  dev root
> 850509824 fs root (this could contain *FILENAME*, please censor
>  them if needed, and it may be large)
> 212353024 uuid tree (not really imporatant)
>
> And if it's extent root, we could enhance btrfs-progs open_ctree() to
> handle it for RO mode (needed by btrfs-restore)
>
> Thanks,
> Qu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 2/2] btrfs: drop optimal argument from find_live_mirror()



On 14.03.2018 10:29, Anand Jain wrote:
> Drop optimal argument from the function find_live_mirror()
> as we can deduce it in the function itself. Also rename
> optimal to preferred_mirror.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: Nikolay Borisov 

> ---
> I thought I have sent v2 to the ML. But now I realize I didn't.
> v1->v2:
>Accepts David's comment to rename %optimal. IMO, %preferred_mirror is
>better than the suggested %fallback.
> 
>  fs/btrfs/volumes.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9beea7c891a7..f1b7efbdcec1 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5276,10 +5276,11 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
>  
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
>   struct map_lookup *map, int first,
> - int optimal, int dev_replace_is_ongoing)
> + int dev_replace_is_ongoing)
>  {
>   int i;
>   int num_stripes;
> + int preferred_mirror;
>   int tolerance;
>   struct btrfs_device *srcdev;
>  
> @@ -5291,6 +5292,8 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>   else
>   num_stripes = map->num_stripes;
>  
> + preferred_mirror = first + current->pid % num_stripes;
> +
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
> @@ -5304,9 +5307,9 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>* mirror is available
>*/
>   for (tolerance = 0; tolerance < 2; tolerance++) {
> - if (map->stripes[optimal].dev->bdev &&
> - (tolerance || map->stripes[optimal].dev != srcdev))
> - return optimal;
> + if (map->stripes[preferred_mirror].dev->bdev &&
> + (tolerance || map->stripes[preferred_mirror].dev != srcdev))
> + return preferred_mirror;
>   for (i = first; i < first + num_stripes; i++) {
>   if (map->stripes[i].dev->bdev &&
>   (tolerance || map->stripes[i].dev != srcdev))
> @@ -5317,7 +5320,7 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>   /* we couldn't find one that doesn't fail.  Just return something
>* and the io error handling code will clean up eventually
>*/
> - return optimal;
> + return preferred_mirror;
>  }
>  
>  static inline int parity_smaller(u64 a, u64 b)
> @@ -5844,7 +5847,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
> *fs_info,
>   stripe_index = mirror_num - 1;
>   else {
>   stripe_index = find_live_mirror(fs_info, map, 0,
> - current->pid % map->num_stripes,
>   dev_replace_is_ongoing);
>   mirror_num = stripe_index + 1;
>   }
> @@ -5872,8 +5874,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
> *fs_info,
>   int old_stripe_index = stripe_index;
>   stripe_index = find_live_mirror(fs_info, map,
> stripe_index,
> -   stripe_index +
> -   current->pid % map->sub_stripes,
> dev_replace_is_ongoing);
>   mirror_num = stripe_index - old_stripe_index + 1;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 1/2] btrfs: drop num argument from find_live_mirror()



On 14.03.2018 10:29, Anand Jain wrote:
> Obtain the stripes info from the map directly and so no need
> to pass it as an argument.
> 
> Signed-off-by: Anand Jain 

LGTM

Reviewed-by: Nikolay Borisov 

> ---
> v1->v2:
>   Accepts David's comment to rename %num to %num_stripes.
> 
>  fs/btrfs/volumes.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1e72357bdfa8..9beea7c891a7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5275,13 +5275,22 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
> *fs_info, u64 logical, u64 len)
>  }
>  
>  static int find_live_mirror(struct btrfs_fs_info *fs_info,
> - struct map_lookup *map, int first, int num,
> + struct map_lookup *map, int first,
>   int optimal, int dev_replace_is_ongoing)
>  {
>   int i;
> + int num_stripes;
>   int tolerance;
>   struct btrfs_device *srcdev;
>  
> + ASSERT((map->type &
> +  (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
> +
> + if (map->type & BTRFS_BLOCK_GROUP_RAID10)
> + num_stripes = map->sub_stripes;
> + else
> + num_stripes = map->num_stripes;
> +
>   if (dev_replace_is_ongoing &&
>   fs_info->dev_replace.cont_reading_from_srcdev_mode ==
>BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
> @@ -5298,7 +5307,7 @@ static int find_live_mirror(struct btrfs_fs_info 
> *fs_info,
>   if (map->stripes[optimal].dev->bdev &&
>   (tolerance || map->stripes[optimal].dev != srcdev))
>   return optimal;
> - for (i = first; i < first + num; i++) {
> + for (i = first; i < first + num_stripes; i++) {
>   if (map->stripes[i].dev->bdev &&
>   (tolerance || map->stripes[i].dev != srcdev))
>   return i;
> @@ -5835,7 +5844,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
> *fs_info,
>   stripe_index = mirror_num - 1;
>   else {
>   stripe_index = find_live_mirror(fs_info, map, 0,
> - map->num_stripes,
>   current->pid % map->num_stripes,
>   dev_replace_is_ongoing);
>   mirror_num = stripe_index + 1;
> @@ -5864,7 +5872,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
> *fs_info,
>   int old_stripe_index = stripe_index;
>   stripe_index = find_live_mirror(fs_info, map,
> stripe_index,
> -   map->sub_stripes, stripe_index +
> +   stripe_index +
> current->pid % map->sub_stripes,
> dev_replace_is_ongoing);
>   mirror_num = stripe_index - old_stripe_index + 1;
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] btrfs: check for SB checksum when scanned



On 13.03.2018 17:06, Anand Jain wrote:
> We aren't checking the SB csum when the device scanned,
> instead we do that when mounting the device, and if the
> csum fails we fail the mount. How if we check the csum
> when the device is scanned, I can't see any reason for
> why not? any idea?

So what problems does this solve? Only net "gain" I see is making a
function public (which is a minus in my book) thus expanding the public
interface.

> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/disk-io.c | 9 +
>  fs/btrfs/disk-io.h | 2 ++
>  fs/btrfs/volumes.c | 4 
>  3 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 156116655a32..28e71e2aaa92 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -393,8 +393,8 @@ static int verify_parent_transid(struct extent_io_tree 
> *io_tree,
>   * Return 0 if the superblock checksum type matches the checksum value of 
> that
>   * algorithm. Pass the raw disk superblock data.
>   */
> -static int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
> -   char *raw_disk_sb)
> +int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
> +char *raw_disk_sb)
>  {
>   struct btrfs_super_block *disk_sb =
>   (struct btrfs_super_block *)raw_disk_sb;
> @@ -420,8 +420,9 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
> *fs_info,
>   }
>  
>   if (csum_type >= ARRAY_SIZE(btrfs_csum_sizes)) {
> - btrfs_err(fs_info, "unsupported checksum algorithm %u",
> - csum_type);
> + if (fs_info)
> + btrfs_err(fs_info, "unsupported checksum algorithm %u",
> +   csum_type);
>   ret = 1;
>   }
>  
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 70a88d61b547..683e4b0c4e24 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -70,6 +70,8 @@ struct buffer_head *btrfs_read_dev_super(struct 
> block_device *bdev);
>  int btrfs_read_dev_one_super(struct block_device *bdev, int copy_num,
>   struct buffer_head **bh_ret);
>  int btrfs_commit_super(struct btrfs_fs_info *fs_info);
> +int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
> +char *raw_disk_sb);
>  struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root,
> struct btrfs_key *location);
>  int btrfs_init_fs_root(struct btrfs_root *root);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1e72357bdfa8..af34b8a611a0 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -677,6 +677,10 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
> *fs_devices,
>   if (ret)
>   return ret;
>  
> + if (btrfs_check_super_csum(NULL, bh->b_data)) {
> + pr_err("BTRFS: superblock checksum mismatch");
> + goto error_brelse;
> + }
>   disk_super = (struct btrfs_super_block *)bh->b_data;
>   devid = btrfs_stack_device_id(_super->dev_item);
>   if (devid != device->devid)
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] Btrfs-progs: check, fix false error reports for shared prealloc extents



On 13.03.2018 20:47, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Under some cases the filesystem checker reports an error when it finds
> checksum items for an extent that is referenced by an inode as a prealloc
> extent. Such cases are not an error when the extent is actually shared
> (was cloned/reflinked) with other inodes and was written through one of
> those other inodes.
> 
> Example:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount /dev/sdb /mnt
> 
>   $ touch /mnt/foo
>   $ xfs_io -c "falloc 0 256K" /mnt/foo
>   $ sync
> 
>   $ xfs_io -c "pwrite -S 0xab -b 0 256K" /mnt/foo
  ^^
nit: You are actually missing the argument for the -b switch, how big
should the blocks be ? Same thing applies to the commit message of your test

>   $ touch /mnt/bar
>   $ xfs_io -c "reflink /mnt/foo 0 0 256K" /mnt/bar
>   $ xfs_io -c "fsync" /mnt/bar
> 
>   
>   $ mount /dev/sdb /mnt
>   $ umount /mnt
> 
>   $ btrfs check /dev/sdc
>   Checking filesystem on /dev/sdb
>   UUID: 52d3006e-ee3b-40eb-aa21-e56253a03d39
>   checking extents
>   checking free space cache
>   checking fs roots
>   root 5 inode 257 errors 800, odd csum item
>   ERROR: errors found in fs roots
>   found 688128 bytes used, error(s) found
>   total csum bytes: 256
>   total tree bytes: 163840
>   total fs tree bytes: 65536
>   total extent tree bytes: 16384
>   btree space waste bytes: 138819
>   file data blocks allocated: 10747904
>referenced 10747904
>   $ echo $?
>   1
> 
> So tech check to not report such cases as errors by checking if the
> extent is shared with other inodes and if so, consider it an error the
> existence of checksum items only if all those other inodes are referencing
> the extent as a prealloc extent.
> This case can be hit often when running the generic/475 testcase from
> fstests.
> 
> A test case will follow in a separate patch.
> 
> Signed-off-by: Filipe Manana 
> ---
>  check/main.c | 270 
> ++-
>  1 file changed, 268 insertions(+), 2 deletions(-)
> 
> diff --git a/check/main.c b/check/main.c
> index 392195ca..bb816311 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -1424,6 +1424,264 @@ static int process_inode_extref(struct extent_buffer 
> *eb,
>  
>  }
>  
> +/*
> + * Check if the inode referenced by the given data reference uses the extent
> + * at disk_bytenr as a non-prealloc extent.
> + *
> + * Returns 1 if true, 0 if false and < 0 on error.
> + */
> +static int check_prealloc_data_ref(struct btrfs_fs_info *fs_info,
> +u64 ino,
> +u64 disk_bytenr,
> +struct btrfs_extent_data_ref *dref,
> +struct extent_buffer *eb)
> +{
> + u64 rootid = btrfs_extent_data_ref_root(eb, dref);
> + u64 objectid = btrfs_extent_data_ref_objectid(eb, dref);
> + u64 offset = btrfs_extent_data_ref_offset(eb, dref);
> + struct btrfs_root *root;
> + struct btrfs_key key;
> + struct btrfs_path path;
> + int ret;
> +
> + if (objectid == ino)
> + return 0;
> +
> + btrfs_init_path();
> + key.objectid = rootid;
> + key.type = BTRFS_ROOT_ITEM_KEY;
> + key.offset = (u64)-1;
> + root = btrfs_read_fs_root(fs_info, );
> + if (IS_ERR(root)) {
> + ret = PTR_ERR(root);
> + goto out;
> + }
> +
> + key.objectid = objectid;
> + key.type = BTRFS_EXTENT_DATA_KEY;
> + key.offset = offset;
> + ret = btrfs_search_slot(NULL, root, , , 0, 0);
> + if (ret > 0) {
> + fprintf(stderr,
> + "Missing file extent item for inode %llu, root %llu, 
> offset %llu",
> + objectid, rootid, offset);
> + ret = -ENOENT;
> + }
> + if (ret < 0)
> + goto out;
> +
> + while (true) {
> + struct btrfs_file_extent_item *fi;
> + int extent_type;
> +
> + if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
> + ret = btrfs_next_leaf(fs_info->extent_root, );
> + if (ret < 0)
> + goto out;
> + if (ret > 0)
> + break;
> + }
> +
> + btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
> + if (key.objectid != objectid ||
> + key.type != BTRFS_EXTENT_DATA_KEY)
> + break;
> +
> + fi = btrfs_item_ptr(path.nodes[0], path.slots[0],
> + struct btrfs_file_extent_item);
> + extent_type = btrfs_file_extent_type(path.nodes[0], fi);
> + if (extent_type != BTRFS_FILE_EXTENT_REG &&
> + extent_type != BTRFS_FILE_EXTENT_PREALLOC)
> + goto next;
> +
> + if (btrfs_file_extent_disk_bytenr(path.nodes[0],

Re: FS unmountable after RAID/LVM problems



On 2018年03月14日 16:53, Dirk Gouders wrote:
> Qu Wenruo  writes:
> 
>> On 2018年03月13日 22:49, Dirk Gouders wrote:
>> [snip]

 # btrfs inspect dump-tree -b 848986112 /dev/loop0p1
 # btrfs inspect dump-tree -b 72089600 /dev/loop0p1
>>>
>>> OK.
>>>
>>> (This mail gets a bit long but I don't want to snip probably important
>>>  information above.)
>>>
>>
>> Feel free to snip.
>> As the involved tree block is not shown anywhere.
>>
>> So it's not any root node corrupted.
>> It may be some extent tree node corrupted in this case.
>>
>> While to inspect it, we need some new functionality in btrfs inspect tree.
>>
>> Before that, would you please try the following patch and to see if it
>> helps btrfs-restore to salvage any data?
> 
> I tried it and got the following output:
> 
> # btrfs restore /dev/loop0p1 /mnt/
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> checksum verify failed on 363069440 found DC09290B wanted C630FD61
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> bytenr mismatch, want=363069440, have=17552567724568668829
> Could not open root, trying backup super
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> checksum verify failed on 363069440 found DC09290B wanted C630FD61
> checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
> bytenr mismatch, want=363069440, have=17552567724568668829
> Could not open root, trying backup super
> ERROR: superblock bytenr 274877906944 is larger than device size 10741612544
> Could not open root, trying backup super

So it's still something important in the tree.

Would you please apply this patch?
https://patchwork.kernel.org/patch/10281329/

And then dump the tree again using that newly added -f option?
(Both stdout and stderr is needed)

The dump command would be:
# btrfs inspect dump-tree -f -b 

Needed bytenrs would be:
848773120   tree root
848789504   extent root (My primary guess)
30408704dev root
850509824   fs root (this could contain *FILENAME*, please censor
 them if needed, and it may be large)
212353024   uuid tree (not really imporatant)

And if it's extent root, we could enhance btrfs-progs open_ctree() to
handle it for RO mode (needed by btrfs-restore)

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature

[PATCH RFC 3/3] fstests: generic: Check the fs after each FUA writes

Basic test case which triggers fsstress with dm-log-writes, and then
check the fs after each FUA writes.
With needed infrastructure and special handlers for journal based fs.

Signed-off-by: Qu Wenruo 
---
In my test, xfs and btrfs survives while ext4 would report error during fsck.

My current biggest concern is, we abuse $TEST_DEV and mkfs on it all by
ourselves. Not sure if it's allowed.
---
 common/dmlogwrites| 119 
 tests/generic/481 | 124 ++
 tests/generic/481.out |   2 +
 tests/generic/group   |   1 +
 4 files changed, 246 insertions(+)
 create mode 100755 tests/generic/481
 create mode 100644 tests/generic/481.out

diff --git a/common/dmlogwrites b/common/dmlogwrites
index 467b872e..258f5887 100644
--- a/common/dmlogwrites
+++ b/common/dmlogwrites
@@ -126,3 +126,122 @@ _log_writes_cleanup()
$UDEV_SETTLE_PROG >/dev/null 2>&1
_log_writes_remove
 }
+
+# Convert log writes mark to entry number
+# Result entry number is output to stdout, could be empty if not found
+_log_writes_mark_to_entry_number()
+{
+   local _mark=$1
+   local ret
+
+   [ -z "$_mark" ] && _fatal \
+   "mark must be given for _log_writes_mark_to_entry_number"
+
+   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
+   --end-mark $_mark 2> /dev/null)
+   [ -z "$ret" ] && return
+   ret=$(echo "$ret" | cut -f1 -d\@)
+   echo "mark $_mark has entry number $ret" >> $seqres.full
+   echo "$ret"
+}
+
+# Find next fua write entry number
+# Result entry number is output to stdout, could be empty if not found
+_log_writes_find_next_fua()
+{
+   local _start_entry=$1
+   local ret
+
+   [ -z "$_start_entry" ] && _start_entry=0
+   ret=$($here/src/log-writes/replay-log --find --log $LOGWRITES_DEV \
+ --next-fua --start-entry $_start_entry 2> /dev/null)
+   [ -z "$ret" ] && return
+
+   ret=$(echo "$ret" | cut -f1 -d\@)
+   echo "next fua is entry number $ret" >> $seqres.full
+   echo "$ret"
+}
+
+# Replay log range to specified entry
+# $1:  End entry. The entry with this number *WILL* be replayed
+# $2:  Start entry. If not specified, start from the first entry.
+# $3:  Verbose. If set to 'y' do verbose output
+_log_writes_replay_log_entry_range()
+{
+   local _end=$1
+   local _start=$2
+   local _verbose=$3
+
+   [ -z "$_end" ] && _fatal \
+   "end entry must be specified for _log_writes_replay_log_entry_range"
+
+   [ "x$verbose" == "xy" ] && _verbose="-v"
+   [ -z "$_start" ] && _start=0
+   [ "$_start" -gt "$_end" ] && _fatal \
+   "wrong parameter order for 
_log_writes_replay_log_entry_range:start=$_start end=$_end"
+
+   # To ensure we replay the last entry, for _start == 0 case,
+   # we need to manually increase the end entry number to ensure
+   # it's played
+   echo "=== replay from $_start to $_end ===" >> $seqres.full
+   $here/src/log-writes/replay-log --log $LOGWRITES_DEV \
+   --replay $SCRATCH_DEV --start-entry $_start \
+   --limit $(($_end - $_start + 1)) $_verbose \
+   >> $seqres.full 2>&1
+   [ $? -ne 0 ] && _fatal "replay failed"
+}
+
+_log_writes_cleanup_snapshot()
+{
+   $UDEV_SETTLE_PROG > /dev/null 2>&1
+   $DMSETUP_PROG remove "$DMLOGWRITES_SNAPSHOT_NAME" > /dev/null 2>&1
+   $DMSETUP_PROG remove "$DMLOGWRITES_ORIGIN_NAME" > /dev/null 2>&1
+}
+
+# Helper to create snapshot of a the replayed device
+# Useful for journal based filesystem such as XFS and Ext4 to replay
+# their journal without touching the replay device, so that we can
+# continue replaying other than replay from the beginning.
+# $1:  Snapshot device
+_log_writes_create_snapshot()
+{
+   _require_dm_target snapshot
+
+   local snapshot_dev=$1
+   local cow_base=""
+
+   [ -z "$snapshot_dev" ] && _fatal \
+   "@device must be specified for _log_writes_create_snapshot"
+   local size=$(blockdev --getsz $SCRATCH_DEV)
+   [ -z "$size" ] && _fatal \
+   "failed to get device size for _log_writes_create_snapshot"
+
+   _log_writes_cleanup_snapshot
+
+   DMLOGWRITES_ORIGIN_NAME="log-writes-origin"
+   DMLOGWRITES_SNAPSHOT_NAME="log-writes-snapshot"
+   DMLOGWRITES_ORIGIN_TABLE="0 $size snapshot-origin $SCRATCH_DEV"
+   DMLOGWRITES_SNAPSHOT_TABLE="0 $size snapshot 
/dev/mapper/$DMLOGWRITES_ORIGIN_NAME $snapshot_dev N 512"
+
+   $UDEV_SETTLE_PROG >/dev/null 2>&1
+   $DMSETUP_PROG create $DMLOGWRITES_ORIGIN_NAME --table 
"$DMLOGWRITES_ORIGIN_TABLE" ||\
+   _fatal "failed to create snapshot-origin of log writes target"
+   $DMSETUP_PROG mknodes > /dev/null 2>&1
+   $UDEV_SETTLE_PROG >/dev/null 2>&1
+   $DMSETUP_PROG create $DMLOGWRITES_SNAPSHOT_NAME --table 
"$DMLOGWRITES_SNAPSHOT_TABLE" ||\
+

[PATCH 2/3] fstests: log-writes: Add support for METADATA flag

Signed-off-by: Qu Wenruo 
---
 src/log-writes/log-writes.c | 3 ++-
 src/log-writes/log-writes.h | 9 +
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
index a872429d..5dc22c24 100644
--- a/src/log-writes/log-writes.c
+++ b/src/log-writes/log-writes.c
@@ -130,7 +130,8 @@ struct flags_to_str_entry {
DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
-   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
+   DEFINE_LOG_FLAGS_STR_ENTRY(MARK),
+   DEFINE_LOG_FLAGS_STR_ENTRY(METADATA)
 };
 
 #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
diff --git a/src/log-writes/log-writes.h b/src/log-writes/log-writes.h
index 35ca3583..75fb8ac0 100644
--- a/src/log-writes/log-writes.h
+++ b/src/log-writes/log-writes.h
@@ -20,10 +20,11 @@ typedef __u32 u32;
 /*
  * Constants copied from kernel file drivers/md/dm-log-writes.c
  */
-#define LOG_FLUSH_FLAG (1 << 0)
-#define LOG_FUA_FLAG (1 << 1)
-#define LOG_DISCARD_FLAG (1 << 2)
-#define LOG_MARK_FLAG (1 << 3)
+#define LOG_FLUSH_FLAG (1 << 0)
+#define LOG_FUA_FLAG   (1 << 1)
+#define LOG_DISCARD_FLAG   (1 << 2)
+#define LOG_MARK_FLAG  (1 << 3)
+#define LOG_METADATA_FLAG  (1 << 4)
 
 #define WRITE_LOG_VERSION 1
 #define WRITE_LOG_MAGIC 0x6a736677736872
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] fstests: log-writes: Add support to output human readable flags

Also change the flag numeric output to hex.

Signed-off-by: Qu Wenruo 
---
 src/log-writes/log-writes.c | 70 -
 1 file changed, 63 insertions(+), 7 deletions(-)

diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
index 09391574..a872429d 100644
--- a/src/log-writes/log-writes.c
+++ b/src/log-writes/log-writes.c
@@ -120,6 +120,58 @@ int log_discard(struct log *log, struct log_write_entry 
*entry)
return 0;
 }
 
+#define DEFINE_LOG_FLAGS_STR_ENTRY(x)  \
+   {LOG_##x##_FLAG, #x}
+
+struct flags_to_str_entry {
+   u64 flags;
+   const char *str;
+} log_flags_table[] = {
+   DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
+   DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
+   DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
+   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
+};
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#define LOG_FLAGS_BUF_SIZE 128
+/*
+ * Convert numeric flags to human readable flags.
+ * @flags: numeric flags
+ * @buf:   output buffer for human readable string.
+ * must have enough space (LOG_FLAGS_BUF_SIZE) to contain all
+ * the string
+ */
+static void entry_flags_to_str(u64 flags, char *buf)
+{
+   int empty = 1;
+   int left_len;
+   int i;
+
+   buf[0] = '\0';
+   for (i = 0; i < ARRAY_SIZE(log_flags_table); i++) {
+   if (flags & log_flags_table[i].flags) {
+   if (!empty)
+   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
+   empty = 0;
+   strncat(buf, log_flags_table[i].str, 
LOG_FLAGS_BUF_SIZE);
+   flags &= ~log_flags_table[i].flags;
+   }
+   }
+   if (flags) {
+   if (!empty)
+   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
+   empty = 0;
+   left_len = LOG_FLAGS_BUF_SIZE - strnlen(buf,
+   LOG_FLAGS_BUF_SIZE);
+   if (left_len > 0)
+   snprintf(buf + strnlen(buf, LOG_FLAGS_BUF_SIZE),
+left_len, "UNKNOWN.0x%llx", flags);
+   }
+   if (empty)
+   strncpy(buf, "NONE", LOG_FLAGS_BUF_SIZE);
+}
+
 /*
  * @log: the log we are replaying.
  * @entry: entry to be replayed.
@@ -179,6 +231,7 @@ int log_replay_next_entry(struct log *log, struct 
log_write_entry *entry,
size_t read_size = read_data ? log->sectorsize :
sizeof(struct log_write_entry);
char *buf;
+   char flags_buf[LOG_FLAGS_BUF_SIZE];
ssize_t ret;
off_t offset;
int skip = 0;
@@ -210,19 +263,20 @@ int log_replay_next_entry(struct log *log, struct 
log_write_entry *entry,
log->cur_pos += read_size;
}
 
+   flags = le64_to_cpu(entry->flags);
+   entry_flags_to_str(flags, flags_buf);
skip = log_should_skip(log, entry);
if (log_writes_verbose > 1 || (log_writes_verbose && !skip)) {
-   printf("%s %d@%llu: sector %llu, size %llu, flags %llu\n",
+   printf("%s %d@%llu: sector %llu, size %llu, flags 0x%llx(%s)\n",
   skip ? "skipping" : "replaying",
   (int)log->cur_entry - 1, log->cur_pos / log->sectorsize,
   (unsigned long long)le64_to_cpu(entry->sector),
   (unsigned long long)size,
-  (unsigned long long)le64_to_cpu(entry->flags));
+  (unsigned long long)flags, flags_buf);
}
if (!size)
return 0;
 
-   flags = le64_to_cpu(entry->flags);
if (flags & LOG_DISCARD_FLAG)
return log_discard(log, entry);
 
@@ -301,7 +355,7 @@ int log_seek_entry(struct log *log, u64 entry_num)
return -1;
}
if (log_writes_verbose > 1)
-   printf("seek entry %d@%llu: %llu, size %llu, flags 
%llu\n",
+   printf("seek entry %d@%llu: %llu, size %llu, flags 
0x%llx\n",
   (int)i, log->cur_pos / log->sectorsize,
   (unsigned long long)le64_to_cpu(entry.sector),
   (unsigned long 
long)le64_to_cpu(entry.nr_sectors),
@@ -339,6 +393,7 @@ int log_seek_next_entry(struct log *log, struct 
log_write_entry *entry,
size_t read_size = read_data ? log->sectorsize :
sizeof(struct log_write_entry);
u64 flags;
+   char flags_buf[LOG_FLAGS_BUF_SIZE];
ssize_t ret;
 
if (log->cur_entry >= log->nr_entries)
@@ -366,14 +421,15 @@ int log_seek_next_entry(struct log *log, struct 
log_write_entry *entry,
} else {
log->cur_pos += read_size;
}
+   flags = le64_to_cpu(entry->flags);
+   entry_flags_to_str(flags, flags_buf);
if (log_writes_verbose > 1)
-

Re: FS unmountable after RAID/LVM problems

2018-03-14 Thread Dirk Gouders

Qu Wenruo  writes:

> On 2018年03月13日 22:49, Dirk Gouders wrote:
> [snip]
>>>
>>> # btrfs inspect dump-tree -b 848986112 /dev/loop0p1
>>> # btrfs inspect dump-tree -b 72089600 /dev/loop0p1
>> 
>> OK.
>> 
>> (This mail gets a bit long but I don't want to snip probably important
>>  information above.)
>> 
>
> Feel free to snip.
> As the involved tree block is not shown anywhere.
>
> So it's not any root node corrupted.
> It may be some extent tree node corrupted in this case.
>
> While to inspect it, we need some new functionality in btrfs inspect tree.
>
> Before that, would you please try the following patch and to see if it
> helps btrfs-restore to salvage any data?

I tried it and got the following output:

# btrfs restore /dev/loop0p1 /mnt/
checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
checksum verify failed on 363069440 found DC09290B wanted C630FD61
checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
bytenr mismatch, want=363069440, have=17552567724568668829
Could not open root, trying backup super
checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
checksum verify failed on 363069440 found DC09290B wanted C630FD61
checksum verify failed on 363069440 found 296FB15A wanted F0AFE59D
bytenr mismatch, want=363069440, have=17552567724568668829
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 10741612544
Could not open root, trying backup super


Because the patch did not apply, I did the modification manually as
follows:

# git diff
diff --git a/cmds-restore.c b/cmds-restore.c
index ade35f0f..e7b96a67 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -1282,7 +1282,7 @@ static struct btrfs_root *open_fs(const char *dev, u64 
root_location,
for (i = super_mirror; i < BTRFS_SUPER_MIRROR_MAX; i++) {
bytenr = btrfs_sb_offset(i);
fs_info = open_ctree_fs_info(dev, bytenr, root_location, 0,
-OPEN_CTREE_PARTIAL);
+OPEN_CTREE_PARTIAL | 
__OPEN_CTREE_RETURN_CHUNK_ROOT);
if (fs_info)
break;
fprintf(stderr, "Could not open root, trying backup super\n");

Thanks,

Dirk
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 2/2] btrfs: drop optimal argument from find_live_mirror()

2018-03-14 Thread Anand Jain

Drop optimal argument from the function find_live_mirror()
as we can deduce it in the function itself. Also rename
optimal to preferred_mirror.

Signed-off-by: Anand Jain 
---
I thought I have sent v2 to the ML. But now I realize I didn't.
v1->v2:
   Accepts David's comment to rename %optimal. IMO, %preferred_mirror is
   better than the suggested %fallback.

 fs/btrfs/volumes.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9beea7c891a7..f1b7efbdcec1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5276,10 +5276,11 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
 
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
struct map_lookup *map, int first,
-   int optimal, int dev_replace_is_ongoing)
+   int dev_replace_is_ongoing)
 {
int i;
int num_stripes;
+   int preferred_mirror;
int tolerance;
struct btrfs_device *srcdev;
 
@@ -5291,6 +5292,8 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
else
num_stripes = map->num_stripes;
 
+   preferred_mirror = first + current->pid % num_stripes;
+
if (dev_replace_is_ongoing &&
fs_info->dev_replace.cont_reading_from_srcdev_mode ==
 BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
@@ -5304,9 +5307,9 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 * mirror is available
 */
for (tolerance = 0; tolerance < 2; tolerance++) {
-   if (map->stripes[optimal].dev->bdev &&
-   (tolerance || map->stripes[optimal].dev != srcdev))
-   return optimal;
+   if (map->stripes[preferred_mirror].dev->bdev &&
+   (tolerance || map->stripes[preferred_mirror].dev != srcdev))
+   return preferred_mirror;
for (i = first; i < first + num_stripes; i++) {
if (map->stripes[i].dev->bdev &&
(tolerance || map->stripes[i].dev != srcdev))
@@ -5317,7 +5320,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
/* we couldn't find one that doesn't fail.  Just return something
 * and the io error handling code will clean up eventually
 */
-   return optimal;
+   return preferred_mirror;
 }
 
 static inline int parity_smaller(u64 a, u64 b)
@@ -5844,7 +5847,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
stripe_index = mirror_num - 1;
else {
stripe_index = find_live_mirror(fs_info, map, 0,
-   current->pid % map->num_stripes,
dev_replace_is_ongoing);
mirror_num = stripe_index + 1;
}
@@ -5872,8 +5874,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
int old_stripe_index = stripe_index;
stripe_index = find_live_mirror(fs_info, map,
  stripe_index,
- stripe_index +
- current->pid % map->sub_stripes,
  dev_replace_is_ongoing);
mirror_num = stripe_index - old_stripe_index + 1;
}
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 1/2] btrfs: drop num argument from find_live_mirror()

2018-03-14 Thread Anand Jain

Obtain the stripes info from the map directly and so no need
to pass it as an argument.

Signed-off-by: Anand Jain 
---
v1->v2:
  Accepts David's comment to rename %num to %num_stripes.

 fs/btrfs/volumes.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1e72357bdfa8..9beea7c891a7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5275,13 +5275,22 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
 }
 
 static int find_live_mirror(struct btrfs_fs_info *fs_info,
-   struct map_lookup *map, int first, int num,
+   struct map_lookup *map, int first,
int optimal, int dev_replace_is_ongoing)
 {
int i;
+   int num_stripes;
int tolerance;
struct btrfs_device *srcdev;
 
+   ASSERT((map->type &
+(BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
+
+   if (map->type & BTRFS_BLOCK_GROUP_RAID10)
+   num_stripes = map->sub_stripes;
+   else
+   num_stripes = map->num_stripes;
+
if (dev_replace_is_ongoing &&
fs_info->dev_replace.cont_reading_from_srcdev_mode ==
 BTRFS_DEV_REPLACE_ITEM_CONT_READING_FROM_SRCDEV_MODE_AVOID)
@@ -5298,7 +5307,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
if (map->stripes[optimal].dev->bdev &&
(tolerance || map->stripes[optimal].dev != srcdev))
return optimal;
-   for (i = first; i < first + num; i++) {
+   for (i = first; i < first + num_stripes; i++) {
if (map->stripes[i].dev->bdev &&
(tolerance || map->stripes[i].dev != srcdev))
return i;
@@ -5835,7 +5844,6 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
stripe_index = mirror_num - 1;
else {
stripe_index = find_live_mirror(fs_info, map, 0,
-   map->num_stripes,
current->pid % map->num_stripes,
dev_replace_is_ongoing);
mirror_num = stripe_index + 1;
@@ -5864,7 +5872,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info,
int old_stripe_index = stripe_index;
stripe_index = find_live_mirror(fs_info, map,
  stripe_index,
- map->sub_stripes, stripe_index +
+ stripe_index +
  current->pid % map->sub_stripes,
  dev_replace_is_ongoing);
mirror_num = stripe_index - old_stripe_index + 1;
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] Btrfs-progs: check, fix false error reports for shared prealloc extents



On 13.03.2018 20:47, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Under some cases the filesystem checker reports an error when it finds
> checksum items for an extent that is referenced by an inode as a prealloc
> extent. Such cases are not an error when the extent is actually shared
> (was cloned/reflinked) with other inodes and was written through one of
> those other inodes.
> 
> Example:
> 
>   $ mkfs.btrfs -f /dev/sdb
>   $ mount /dev/sdb /mnt
> 
>   $ touch /mnt/foo
>   $ xfs_io -c "falloc 0 256K" /mnt/foo
>   $ sync
> 
>   $ xfs_io -c "pwrite -S 0xab -b 0 256K" /mnt/foo
>   $ touch /mnt/bar
>   $ xfs_io -c "reflink /mnt/foo 0 0 256K" /mnt/bar
>   $ xfs_io -c "fsync" /mnt/bar
> 
>   
>   $ mount /dev/sdb /mnt
>   $ umount /mnt
> 
>   $ btrfs check /dev/sdc
>   Checking filesystem on /dev/sdb
>   UUID: 52d3006e-ee3b-40eb-aa21-e56253a03d39
>   checking extents
>   checking free space cache
>   checking fs roots
>   root 5 inode 257 errors 800, odd csum item
>   ERROR: errors found in fs roots
>   found 688128 bytes used, error(s) found
>   total csum bytes: 256
>   total tree bytes: 163840
>   total fs tree bytes: 65536
>   total extent tree bytes: 16384
>   btree space waste bytes: 138819
>   file data blocks allocated: 10747904
>referenced 10747904
>   $ echo $?
>   1
> 
> So tech check to not report such cases as errors by checking if the
> extent is shared with other inodes and if so, consider it an error the
> existence of checksum items only if all those other inodes are referencing
> the extent as a prealloc extent.
> This case can be hit often when running the generic/475 testcase from
> fstests.
> 
> A test case will follow in a separate patch.
> 
> Signed-off-by: Filipe Manana 
> ---
>  check/main.c | 270 
> ++-
>  1 file changed, 268 insertions(+), 2 deletions(-)
> 
> diff --git a/check/main.c b/check/main.c
> index 392195ca..bb816311 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -1424,6 +1424,264 @@ static int process_inode_extref(struct extent_buffer 
> *eb,
>  
>  }
>  
> +/*
> + * Check if the inode referenced by the given data reference uses the extent
> + * at disk_bytenr as a non-prealloc extent.
> + *
> + * Returns 1 if true, 0 if false and < 0 on error.
> + */
> +static int check_prealloc_data_ref(struct btrfs_fs_info *fs_info,
> +u64 ino,
> +u64 disk_bytenr,
> +struct btrfs_extent_data_ref *dref,
> +struct extent_buffer *eb)
> +{
> + u64 rootid = btrfs_extent_data_ref_root(eb, dref);
> + u64 objectid = btrfs_extent_data_ref_objectid(eb, dref);

nit: Shouldn't this ino be equal to the ino passed to the function? If
so objectid can be removed and when setting up the key for the second
search you just assigne ino to its objectid

> + u64 offset = btrfs_extent_data_ref_offset(eb, dref);
> + struct btrfs_root *root;
> + struct btrfs_key key;
> + struct btrfs_path path;
> + int ret;
> +
> + if (objectid == ino)
> + return 0;
> +
> + btrfs_init_path();
> + key.objectid = rootid;
> + key.type = BTRFS_ROOT_ITEM_KEY;
> + key.offset = (u64)-1;
> + root = btrfs_read_fs_root(fs_info, );
> + if (IS_ERR(root)) {
> + ret = PTR_ERR(root);
> + goto out;
> + }
> +
> + key.objectid = objectid;
> + key.type = BTRFS_EXTENT_DATA_KEY;
> + key.offset = offset;
> + ret = btrfs_search_slot(NULL, root, , , 0, 0);
> + if (ret > 0) {
> + fprintf(stderr,
> + "Missing file extent item for inode %llu, root %llu, 
> offset %llu",
> + objectid, rootid, offset);
> + ret = -ENOENT;
> + }
> + if (ret < 0)
> + goto out;
> +
> + while (true) {
> + struct btrfs_file_extent_item *fi;
> + int extent_type;
> +
> + if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
> + ret = btrfs_next_leaf(fs_info->extent_root, );
> + if (ret < 0)
> + goto out;
> + if (ret > 0)
> + break;
> + }
> +
> + btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
> + if (key.objectid != objectid ||
> + key.type != BTRFS_EXTENT_DATA_KEY)
> + break;
> +
> + fi = btrfs_item_ptr(path.nodes[0], path.slots[0],
> + struct btrfs_file_extent_item);
> + extent_type = btrfs_file_extent_type(path.nodes[0], fi);
> + if (extent_type != BTRFS_FILE_EXTENT_REG &&
> + extent_type != BTRFS_FILE_EXTENT_PREALLOC)
> + goto next;
> +
> + if (btrfs_file_extent_disk_bytenr(path.nodes[0],

Re: WARN when unmounting a subvolume that is being synced