Re: [PATCH 0/6] btrfs-progs: Fixes inline ram_bytes related bugs

2018-06-06 Thread Qu Wenruo
To Steve,

This patchset should fix your problem.

Would you please have a try?

Thanks,
Qu

On 2018年06月06日 15:27, Qu Wenruo wrote:
> The patchset can be fetched from github (*):
> https://github.com/adam900710/btrfs-progs/tree/inline_ram_bytes
> 
> It's based on David's devel branch, whose HEAD is:
> commit 0d1c5812e28e286648781c7b35b542311cc01aa4 (david/devel)
> Author: Matthias Benkard 
> Date:   Wed Apr 25 16:34:54 2018 +0200
> 
> btrfs-progs: mkfs: traverse_directory: Reset error code on continue
> 
> Reported-by Steve Leung , his old btrfs (at least
> offending inodes are from 2014) has inline uncompressed extent, while
> its ram_bytes mismatch with item size.
> 
> Latest kernel tree check catches this bug, while we failed to detect by
> dump-tree.
> 
> It turns out that btrfs-progs is doing something evil to avoid reading
> ram_bytes from inline uncompressed extent.
> 
> 
> So this patchset will address all such ram_bytes related problems.
> 
> The 1st patch is a not-so-relative fix for restore, which is using
> ram_bytes for decompress. Although thanks to the compression header, we
> won't read out-of-boundary, but fixing it is never a bad thing.
> 
> The 2nd patch will get rid of the evil btrfs_file_extent_inline_len()
> which hides raw ram_bytes from us, and fooling us for a long long time.
> 
> The 3rd~5th patches introduce check/repair function for both original
> and lowmem mode (although lowmem mode can detect it even before this patch).
> 
> The last one is the test case for it as usual.
> 
> *: Or should I just migrate to gitlab after M$ acquired github?
> 
> Qu Wenruo (6):
>   btrfs-progs: restore: Fix wrong compressed item size for decompress()
>   btrfs-progs: Get rid of the confusing btrfs_file_extent_inline_len()
>   btrfs-progs: check/original: Detect and repair wrong inline ram_bytes
>   btrfs-progs: check/lowmem: Prepare check_file_extent() to handle
> repair
>   btrfs-progs: check/lowmem: Repair wrong inlien ram_bytes for
> uncompressed extent
>   btrfs-progs: fsck-tests: Add test case for corrupted inline ram_bytes
> 
>  check/main.c  |  46 ++-
>  check/mode-lowmem.c   | 120 ++
>  check/mode-original.h |   1 +
>  cmds-restore.c|   5 +-
>  ctree.h   |  22 
>  file.c|   3 +-
>  print-tree.c  |   4 +-
>  .../offset_by_one.img | Bin 0 -> 3072 bytes
>  .../035-inline-bad-ram-bytes/test.sh  |  11 ++
>  9 files changed, 157 insertions(+), 55 deletions(-)
>  create mode 100644 
> tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img
>  create mode 100755 tests/fsck-tests/035-inline-bad-ram-bytes/test.sh
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-06 Thread Qu Wenruo



On 2018年06月07日 12:20, Su Yue wrote:
> 
> 
> On 06/07/2018 11:33 AM, james harvey wrote:
>> =[ NOTE ]=
>>
>> I think I found a buffer over-read error that will come up other places, 
>> unless
>> EACH caller checks bounds themselves.  See "Bonus bug, LEFT FOR READER" 
>> below.
>>
>> =[ PROBLEM ]=
>>
>> Using btrfs-progs v4.16:
>>
>> No extent found at range [10955980800,10955984896)
>>
>> But, this extent exists.  btrfs-debug-tree shows:
>> ...
>> extent tree key (EXTENT_TREE ROOT_ITEM 0)
>> ...
>> leaf 316456960 items 203 free space 3697 generation 84225 owner EXTENT_TREE
>> ...
>> item 197 key (10955931648 EXTENT_ITEM 28672) itemoff 8957 itemsize 37
>> refs 1 gen 62656 flags DATA
>> shared data backref parent 142622720 count 1
>> item 198 key (10955960320 EXTENT_ITEM 4096) itemoff 8920 itemsize 37
>> refs 1 gen 62656 flags DATA
>> shared data backref parent 142622720 count 1
>> item 199 key (10955964416 EXTENT_ITEM 4096) itemoff 8883 itemsize 37
>> refs 1 gen 62656 flags DATA
>> shared data backref parent 142622720 count 1
>> item 200 key (10955968512 EXTENT_ITEM 4096) itemoff 8846 itemsize 37
>> refs 1 gen 62656 flags DATA
>> shared data backref parent 142622720 count 1
>> item 201 key (10955972608 EXTENT_ITEM 4096) itemoff 8809 itemsize 37
>> refs 1 gen 62656 flags DATA
>> shared data backref parent 142655488 count 1
>> item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37
>> refs 1 gen 62656 flags DATA
>> shared data backref parent 142655488 count 1
>> ...
>> leaf 316489728 items 208 free space 3387 generation 84225 owner EXTENT_TREE
>> ...
>> item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37
>> refs 1 gen 62656 flags DATA
>> shared data backref parent 128958464 count 1
>> ...
>> checksum tree key (CSUM_TREE ROOT_ITEM 0)
>> ...
>> item 412 key (EXTENT_CSUM EXTENT_CSUM 10955980800) itemoff 10647
>>itemsize 4
>> range start 10955980800 end 10955984896 length 4096
>> 
>> file tree key (3038 ROOT_ITEM 80009)
>> ...
>> leaf 128958464 items 37 free space 6032 generation 62656 owner FS_TREE
>> ...
>> item 1 key (997645 INODE_ITEM 0) itemoff 15220 itemsize 160
>> generation 62656 transid 62656 size 4943 nbytes 8192
>> block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
>> sequence 5 flags 0x0(none)
>> atime 1478246492.0 (2016-11-04 08:01:32)
>> ctime 1478246493.129060242 (2016-11-04 08:01:33)
>> mtime 1477487995.0 (2016-10-26 13:19:55)
>> otime 1478246493.129060242 (2016-11-04 08:01:33)
>> item 2 key (997645 INODE_REF 299949) itemoff 15200 itemsize 20
>> index 13 namelen 10 name: as_map.hpp
>> item 3 key (997645 EXTENT_DATA 0) itemoff 15147 itemsize 53
>> generation 62656 type 1 (regular)
>> extent data disk byte 10955980800 nr 4096
>> extent data offset 0 nr 8192 ram 8192
>> extent compression 2 (lzo)
>> ...
>>
>> =[ CAUSE ]=
>>
>> In main's first call to map_one_extent(10955980800, 4096, 0):
>>
>> * btrfs_search_slot() looks for (10955980800, 0, 0), and:
>> ** returns 1 because it doesn't exist
>> ** sets path->slots[0] to 203 (for leaf 316456960), where it should go if
>>inserted (pointing after the last existing item)
>> * ret is reset to 0
>> * btrfs_item_key_to_cpu sets key to (10955960320, BTRFS_EXTENT_ITEM_KEY, 
>> 4096)
>> !!! Bonus bug, LEFT FOR READER.  Why is this item #197, 5 items before the 
>> 203
>> given?  I think no bounds checking causes a buffer over-read here.
>> btrfs_item_key_to_cpu() calls btrfs_item_key(), which uses the macro
>> read_eb_member() to call read_extent_buffer() which memcpy's using an out
>> of range index, at least for this leaf.
>> * With (!search_forward && key.objectid > logical) being false, the code 
>> calling
>>   btrfs_next_item() is not run.
>> * logical is set to this too-low key.objectid
>> * !ret, so set *logical_ret and *len_ret with the new values
>>
>> Back in main:
>>
>> * ret is 0, so don't print the first error
>> * ret is still 0, so don't run map_one_extent() with search_forward=1
>> * At the while loop, (10955960320 + 4096 >= 10955980800 && 10955960320 <
>>   10955980800 + 4096) (10955964416 >= 10955980800 && 10955960320 < 
>> 10955984896)
>>   (false && true) (false), so don't call map_one_extent() with 
>> search_forward=1
>>   here, either.
>> * In the while loop, now call map_one_extent() with search_forward=1
>> * !found, so print (somewhat deceptive) error only mentioning the user-given
>>   logical without mentioning it looked 

Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-06 Thread Qu Wenruo


On 2018年06月07日 11:33, james harvey wrote:
> =[ NOTE ]=
> 
> I think I found a buffer over-read error that will come up other places, 
> unless
> EACH caller checks bounds themselves.  See "Bonus bug, LEFT FOR READER" below.
> 
> =[ PROBLEM ]=
> 
> Using btrfs-progs v4.16:
> 
> No extent found at range [10955980800,10955984896)
> 
> But, this extent exists.  btrfs-debug-tree shows:
> ...
> extent tree key (EXTENT_TREE ROOT_ITEM 0)
> ...
> leaf 316456960 items 203 free space 3697 generation 84225 owner EXTENT_TREE
> ...
> item 197 key (10955931648 EXTENT_ITEM 28672) itemoff 8957 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142622720 count 1
> item 198 key (10955960320 EXTENT_ITEM 4096) itemoff 8920 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142622720 count 1
> item 199 key (10955964416 EXTENT_ITEM 4096) itemoff 8883 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142622720 count 1
> item 200 key (10955968512 EXTENT_ITEM 4096) itemoff 8846 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142622720 count 1
> item 201 key (10955972608 EXTENT_ITEM 4096) itemoff 8809 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142655488 count 1
> item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142655488 count 1
> ...
> leaf 316489728 items 208 free space 3387 generation 84225 owner EXTENT_TREE
> ...
> item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 128958464 count 1
> ...
> checksum tree key (CSUM_TREE ROOT_ITEM 0)
> ...
> item 412 key (EXTENT_CSUM EXTENT_CSUM 10955980800) itemoff 10647
>itemsize 4
> range start 10955980800 end 10955984896 length 4096
> 
> file tree key (3038 ROOT_ITEM 80009)
> ...
> leaf 128958464 items 37 free space 6032 generation 62656 owner FS_TREE
> ...
> item 1 key (997645 INODE_ITEM 0) itemoff 15220 itemsize 160
> generation 62656 transid 62656 size 4943 nbytes 8192
> block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> sequence 5 flags 0x0(none)
> atime 1478246492.0 (2016-11-04 08:01:32)
> ctime 1478246493.129060242 (2016-11-04 08:01:33)
> mtime 1477487995.0 (2016-10-26 13:19:55)
> otime 1478246493.129060242 (2016-11-04 08:01:33)
> item 2 key (997645 INODE_REF 299949) itemoff 15200 itemsize 20
> index 13 namelen 10 name: as_map.hpp
> item 3 key (997645 EXTENT_DATA 0) itemoff 15147 itemsize 53
> generation 62656 type 1 (regular)
> extent data disk byte 10955980800 nr 4096
> extent data offset 0 nr 8192 ram 8192
> extent compression 2 (lzo)
> ...
> 
> =[ CAUSE ]=
> 
> In main's first call to map_one_extent(10955980800, 4096, 0):
> 
> * btrfs_search_slot() looks for (10955980800, 0, 0), and:
> ** returns 1 because it doesn't exist
> ** sets path->slots[0] to 203 (for leaf 316456960), where it should go if
>inserted (pointing after the last existing item)
> * ret is reset to 0
> * btrfs_item_key_to_cpu sets key to (10955960320, BTRFS_EXTENT_ITEM_KEY, 4096)
> !!! Bonus bug, LEFT FOR READER.  Why is this item #197, 5 items before the 203
> given?  I think no bounds checking causes a buffer over-read here.
> btrfs_item_key_to_cpu() calls btrfs_item_key(), which uses the macro
> read_eb_member() to call read_extent_buffer() which memcpy's using an out
> of range index, at least for this leaf.

Here the key you got could be garbage.
Either the leaf has enough free space, then you got pending zero into
the key, or the leaf is full, you got some item data into the key.

Either way, the key should be read/trusted at all.

> * With (!search_forward && key.objectid > logical) being false, the code 
> calling
>   btrfs_next_item() is not run.
> * logical is set to this too-low key.objectid
> * !ret, so set *logical_ret and *len_ret with the new values

Personally speaking, the first paragraph of dump tree could already
explain the bug pretty well.
So the explanation is a little longer here.

> 
> Back in main:
> 
> * ret is 0, so don't print the first error
> * ret is still 0, so don't run map_one_extent() with search_forward=1
> * At the while loop, (10955960320 + 4096 >= 10955980800 && 10955960320 <
>   10955980800 + 4096) (10955964416 >= 10955980800 && 10955960320 < 
> 10955984896)
>   (false && true) (false), so don't call map_one_extent() with 
> 

Re: [PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-06 Thread Su Yue



On 06/07/2018 11:33 AM, james harvey wrote:
> =[ NOTE ]=
> 
> I think I found a buffer over-read error that will come up other places, 
> unless
> EACH caller checks bounds themselves.  See "Bonus bug, LEFT FOR READER" below.
> 
> =[ PROBLEM ]=
> 
> Using btrfs-progs v4.16:
> 
> No extent found at range [10955980800,10955984896)
> 
> But, this extent exists.  btrfs-debug-tree shows:
> ...
> extent tree key (EXTENT_TREE ROOT_ITEM 0)
> ...
> leaf 316456960 items 203 free space 3697 generation 84225 owner EXTENT_TREE
> ...
> item 197 key (10955931648 EXTENT_ITEM 28672) itemoff 8957 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142622720 count 1
> item 198 key (10955960320 EXTENT_ITEM 4096) itemoff 8920 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142622720 count 1
> item 199 key (10955964416 EXTENT_ITEM 4096) itemoff 8883 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142622720 count 1
> item 200 key (10955968512 EXTENT_ITEM 4096) itemoff 8846 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142622720 count 1
> item 201 key (10955972608 EXTENT_ITEM 4096) itemoff 8809 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142655488 count 1
> item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 142655488 count 1
> ...
> leaf 316489728 items 208 free space 3387 generation 84225 owner EXTENT_TREE
> ...
> item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37
> refs 1 gen 62656 flags DATA
> shared data backref parent 128958464 count 1
> ...
> checksum tree key (CSUM_TREE ROOT_ITEM 0)
> ...
> item 412 key (EXTENT_CSUM EXTENT_CSUM 10955980800) itemoff 10647
>itemsize 4
> range start 10955980800 end 10955984896 length 4096
> 
> file tree key (3038 ROOT_ITEM 80009)
> ...
> leaf 128958464 items 37 free space 6032 generation 62656 owner FS_TREE
> ...
> item 1 key (997645 INODE_ITEM 0) itemoff 15220 itemsize 160
> generation 62656 transid 62656 size 4943 nbytes 8192
> block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
> sequence 5 flags 0x0(none)
> atime 1478246492.0 (2016-11-04 08:01:32)
> ctime 1478246493.129060242 (2016-11-04 08:01:33)
> mtime 1477487995.0 (2016-10-26 13:19:55)
> otime 1478246493.129060242 (2016-11-04 08:01:33)
> item 2 key (997645 INODE_REF 299949) itemoff 15200 itemsize 20
> index 13 namelen 10 name: as_map.hpp
> item 3 key (997645 EXTENT_DATA 0) itemoff 15147 itemsize 53
> generation 62656 type 1 (regular)
> extent data disk byte 10955980800 nr 4096
> extent data offset 0 nr 8192 ram 8192
> extent compression 2 (lzo)
> ...
> 
> =[ CAUSE ]=
> 
> In main's first call to map_one_extent(10955980800, 4096, 0):
> 
> * btrfs_search_slot() looks for (10955980800, 0, 0), and:
> ** returns 1 because it doesn't exist
> ** sets path->slots[0] to 203 (for leaf 316456960), where it should go if
>inserted (pointing after the last existing item)
> * ret is reset to 0
> * btrfs_item_key_to_cpu sets key to (10955960320, BTRFS_EXTENT_ITEM_KEY, 4096)
> !!! Bonus bug, LEFT FOR READER.  Why is this item #197, 5 items before the 203
> given?  I think no bounds checking causes a buffer over-read here.
> btrfs_item_key_to_cpu() calls btrfs_item_key(), which uses the macro
> read_eb_member() to call read_extent_buffer() which memcpy's using an out
> of range index, at least for this leaf.
> * With (!search_forward && key.objectid > logical) being false, the code 
> calling
>   btrfs_next_item() is not run.
> * logical is set to this too-low key.objectid
> * !ret, so set *logical_ret and *len_ret with the new values
> 
> Back in main:
> 
> * ret is 0, so don't print the first error
> * ret is still 0, so don't run map_one_extent() with search_forward=1
> * At the while loop, (10955960320 + 4096 >= 10955980800 && 10955960320 <
>   10955980800 + 4096) (10955964416 >= 10955980800 && 10955960320 < 
> 10955984896)
>   (false && true) (false), so don't call map_one_extent() with 
> search_forward=1
>   here, either.
> * In the while loop, now call map_one_extent() with search_forward=1
> * !found, so print (somewhat deceptive) error only mentioning the user-given
>   logical without mentioning it looked elsewhere, and give up.
> 
> =[ FIX ]=
> 
> btrfs-map-logical and I are not friends.  The "least code" fix for this
> situation is this patch.

Re: [PATCH v2 1/3] btrfs-progs: check: check symlinks with append/immutable flags

2018-06-06 Thread Su Yue



On 06/07/2018 10:45 AM, Misono Tomohiro wrote:
> 
> 
> On 2018/05/15 10:33, Su Yue wrote:
>> Define new macro I_ERR_ODD_INODE_FLAGS to represents odd inode flags.
>>
>> Symlinks should never have append/immutable flags.
>> While processing inodes, if found a symlink with append/immutable
>> flags, mark the inode record with I_ERR_ODD_INODE_FLAGS.
>>
>> This is for original mode.
>>
>> Signed-off-by: Su Yue 
>> ---
>>  check/main.c  | 7 +++
>>  check/mode-original.h | 1 +
>>  2 files changed, 8 insertions(+)
>>
>> diff --git a/check/main.c b/check/main.c
>> index 68da994f7ae0..c764fc011ded 100644
>> --- a/check/main.c
>> +++ b/check/main.c
>> @@ -576,6 +576,8 @@ static void print_inode_error(struct btrfs_root *root, 
>> struct inode_record *rec)
>>  fprintf(stderr, ", link count wrong");
>>  if (errors & I_ERR_FILE_EXTENT_ORPHAN)
>>  fprintf(stderr, ", orphan file extent");
>> +if (errors & I_ERR_ODD_INODE_FLAGS)
>> +fprintf(stderr, ", odd inode flags");
>>  fprintf(stderr, "\n");
>>  /* Print the orphan extents if needed */
>>  if (errors & I_ERR_FILE_EXTENT_ORPHAN)
>> @@ -805,6 +807,7 @@ static int process_inode_item(struct extent_buffer *eb,
>>  {
>>  struct inode_record *rec;
>>  struct btrfs_inode_item *item;
>> +u64 flags;
>>  
>>  rec = active_node->current;
>>  BUG_ON(rec->ino != key->objectid || rec->refs > 1);
>> @@ -822,6 +825,10 @@ static int process_inode_item(struct extent_buffer *eb,
>>  rec->found_inode_item = 1;
>>  if (rec->nlink == 0)
>>  rec->errors |= I_ERR_NO_ORPHAN_ITEM;
>> +flags = btrfs_inode_flags(eb, item);
>> +if (rec->imode & BTRFS_FT_SYMLINK &&
> 
> Hello,
> 
> I observed that this commit causes test-convert/009 in current kdave/devel 
> branch.
> Since rec->imode uses S_IFLNK (0xa000) for symbolic link and BTRFS_FT_SYMLINK 
> is 7,
> above statement does not work well. Shouldn't we use S_ISLNK(rec->imode) 
> instead?
> 

Oh.. Yep, my bad.
Since the test case is created by hand, the whole patchset should be
modified in next version.
Thanks a lot.

Su

> Thanks,
> Tomohiro Misono
> 
>> +flags & (BTRFS_INODE_IMMUTABLE | BTRFS_INODE_APPEND))
>> +rec->errors |= I_ERR_ODD_INODE_FLAGS;
>>  maybe_free_inode_rec(_node->inode_cache, rec);
>>  return 0;
>>  }
>> diff --git a/check/mode-original.h b/check/mode-original.h
>> index 368de692fdd1..13cfa5b9e1b3 100644
>> --- a/check/mode-original.h
>> +++ b/check/mode-original.h
>> @@ -186,6 +186,7 @@ struct file_extent_hole {
>>  #define I_ERR_LINK_COUNT_WRONG  (1 << 13)
>>  #define I_ERR_FILE_EXTENT_ORPHAN(1 << 14)
>>  #define I_ERR_FILE_EXTENT_TOO_LARGE (1 << 15)
>> +#define I_ERR_ODD_INODE_FLAGS   (1 << 16)
>>  
>>  struct inode_record {
>>  struct list_head backrefs;
>>


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] btrfs-progs: map-logical: look at next leaf if slot > items

2018-06-06 Thread james harvey
=[ NOTE ]=

I think I found a buffer over-read error that will come up other places, unless
EACH caller checks bounds themselves.  See "Bonus bug, LEFT FOR READER" below.

=[ PROBLEM ]=

Using btrfs-progs v4.16:

No extent found at range [10955980800,10955984896)

But, this extent exists.  btrfs-debug-tree shows:
...
extent tree key (EXTENT_TREE ROOT_ITEM 0)
...
leaf 316456960 items 203 free space 3697 generation 84225 owner EXTENT_TREE
...
item 197 key (10955931648 EXTENT_ITEM 28672) itemoff 8957 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 142622720 count 1
item 198 key (10955960320 EXTENT_ITEM 4096) itemoff 8920 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 142622720 count 1
item 199 key (10955964416 EXTENT_ITEM 4096) itemoff 8883 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 142622720 count 1
item 200 key (10955968512 EXTENT_ITEM 4096) itemoff 8846 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 142622720 count 1
item 201 key (10955972608 EXTENT_ITEM 4096) itemoff 8809 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 142655488 count 1
item 202 key (10955976704 EXTENT_ITEM 4096) itemoff 8772 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 142655488 count 1
...
leaf 316489728 items 208 free space 3387 generation 84225 owner EXTENT_TREE
...
item 0 key (10955980800 EXTENT_ITEM 4096) itemoff 16246 itemsize 37
refs 1 gen 62656 flags DATA
shared data backref parent 128958464 count 1
...
checksum tree key (CSUM_TREE ROOT_ITEM 0)
...
item 412 key (EXTENT_CSUM EXTENT_CSUM 10955980800) itemoff 10647
   itemsize 4
range start 10955980800 end 10955984896 length 4096

file tree key (3038 ROOT_ITEM 80009)
...
leaf 128958464 items 37 free space 6032 generation 62656 owner FS_TREE
...
item 1 key (997645 INODE_ITEM 0) itemoff 15220 itemsize 160
generation 62656 transid 62656 size 4943 nbytes 8192
block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
sequence 5 flags 0x0(none)
atime 1478246492.0 (2016-11-04 08:01:32)
ctime 1478246493.129060242 (2016-11-04 08:01:33)
mtime 1477487995.0 (2016-10-26 13:19:55)
otime 1478246493.129060242 (2016-11-04 08:01:33)
item 2 key (997645 INODE_REF 299949) itemoff 15200 itemsize 20
index 13 namelen 10 name: as_map.hpp
item 3 key (997645 EXTENT_DATA 0) itemoff 15147 itemsize 53
generation 62656 type 1 (regular)
extent data disk byte 10955980800 nr 4096
extent data offset 0 nr 8192 ram 8192
extent compression 2 (lzo)
...

=[ CAUSE ]=

In main's first call to map_one_extent(10955980800, 4096, 0):

* btrfs_search_slot() looks for (10955980800, 0, 0), and:
** returns 1 because it doesn't exist
** sets path->slots[0] to 203 (for leaf 316456960), where it should go if
   inserted (pointing after the last existing item)
* ret is reset to 0
* btrfs_item_key_to_cpu sets key to (10955960320, BTRFS_EXTENT_ITEM_KEY, 4096)
!!! Bonus bug, LEFT FOR READER.  Why is this item #197, 5 items before the 203
given?  I think no bounds checking causes a buffer over-read here.
btrfs_item_key_to_cpu() calls btrfs_item_key(), which uses the macro
read_eb_member() to call read_extent_buffer() which memcpy's using an out
of range index, at least for this leaf.
* With (!search_forward && key.objectid > logical) being false, the code calling
  btrfs_next_item() is not run.
* logical is set to this too-low key.objectid
* !ret, so set *logical_ret and *len_ret with the new values

Back in main:

* ret is 0, so don't print the first error
* ret is still 0, so don't run map_one_extent() with search_forward=1
* At the while loop, (10955960320 + 4096 >= 10955980800 && 10955960320 <
  10955980800 + 4096) (10955964416 >= 10955980800 && 10955960320 < 10955984896)
  (false && true) (false), so don't call map_one_extent() with search_forward=1
  here, either.
* In the while loop, now call map_one_extent() with search_forward=1
* !found, so print (somewhat deceptive) error only mentioning the user-given
  logical without mentioning it looked elsewhere, and give up.

=[ FIX ]=

btrfs-map-logical and I are not friends.  The "least code" fix for this
situation is this patch.

Qu's "btrfs-progs: check: Also compare data between mirrors to detect corruption
for NODATASUM extents" uses a simpler way, which makes me wonder if that should
just be modified to replace how btrfs-map-logical works.  But, I'll admit I do
not have my head around the 

in which directions does btrfs send -p | btrfs receive work

2018-06-06 Thread Christoph Anton Mitterer
Hey.

Just wondered about the following:

When I have a btrfs which acts as a master and from which I make copies
 of snapshots on it via send/receive (with using -p at send) to other
btrfs which acts as copies like this:
master +--> copy1
   +--> copy2
   \--> copy3
and if now e.g. the device of master breaks, can I move *with
incremential send -p / receive backups from one of the copies?

Which of the following two would work (or both?):

A) Redesignating a copy to be a new master, e.g.:
   old-copy1/new-master +--> new-disk/new-copy1

   +--> copy2
\--> copy3
   Obviously at least
send/receiving to new-copy1 shoud work,
   but would that work as well
to copy2/copy3 (with -p), since they're
   based on (and probably using
UUIDs) from the snapshot on the old
   broken master?

B) Let a new device be the master and move on from that (kinda creating
   a "send/receive cycle":
   1st:
   copy1 +--> new-disk/new-master

   from then on (when new snapshots should be incrementally sent):
   new-master +--> copy1
  +--> copy2
  \--> copy3
   Again, not sure whether send/receiving to copy2/3 would work, since
   they're based on snapshots/parents from the old broken master.
   And I'm even more unsure, whether this back send/receiving,
   from copy1->new-master->copy1 would work.


Any expert having some definite idea? :-)

Thanks,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: check: Initialize all filed of btrfs_inode_item in insert_inode_item()

2018-06-06 Thread Misono Tomohiro
Initialize all filed of btrfs_inode_item to zero in order to prevent
having some garbage, especially for flags field.

Signed-off-by: Misono Tomohiro 
---
 check/mode-common.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/check/mode-common.c b/check/mode-common.c
index db9e4299..15e2bbd1 100644
--- a/check/mode-common.c
+++ b/check/mode-common.c
@@ -379,18 +379,14 @@ int insert_inode_item(struct btrfs_trans_handle *trans,
time_t now = time(NULL);
int ret;
 
+   memset(, 0, sizeof(ii));
btrfs_set_stack_inode_size(, size);
btrfs_set_stack_inode_nbytes(, nbytes);
btrfs_set_stack_inode_nlink(, nlink);
btrfs_set_stack_inode_mode(, mode);
btrfs_set_stack_inode_generation(, trans->transid);
-   btrfs_set_stack_timespec_nsec(, 0);
btrfs_set_stack_timespec_sec(, now);
-   btrfs_set_stack_timespec_nsec(, 0);
btrfs_set_stack_timespec_sec(, now);
-   btrfs_set_stack_timespec_nsec(, 0);
-   btrfs_set_stack_timespec_sec(, 0);
-   btrfs_set_stack_timespec_nsec(, 0);
 
ret = btrfs_insert_inode(trans, root, ino, );
ASSERT(!ret);
-- 
2.14.4


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/3] btrfs-progs: check: check symlinks with append/immutable flags

2018-06-06 Thread Misono Tomohiro



On 2018/05/15 10:33, Su Yue wrote:
> Define new macro I_ERR_ODD_INODE_FLAGS to represents odd inode flags.
> 
> Symlinks should never have append/immutable flags.
> While processing inodes, if found a symlink with append/immutable
> flags, mark the inode record with I_ERR_ODD_INODE_FLAGS.
> 
> This is for original mode.
> 
> Signed-off-by: Su Yue 
> ---
>  check/main.c  | 7 +++
>  check/mode-original.h | 1 +
>  2 files changed, 8 insertions(+)
> 
> diff --git a/check/main.c b/check/main.c
> index 68da994f7ae0..c764fc011ded 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -576,6 +576,8 @@ static void print_inode_error(struct btrfs_root *root, 
> struct inode_record *rec)
>   fprintf(stderr, ", link count wrong");
>   if (errors & I_ERR_FILE_EXTENT_ORPHAN)
>   fprintf(stderr, ", orphan file extent");
> + if (errors & I_ERR_ODD_INODE_FLAGS)
> + fprintf(stderr, ", odd inode flags");
>   fprintf(stderr, "\n");
>   /* Print the orphan extents if needed */
>   if (errors & I_ERR_FILE_EXTENT_ORPHAN)
> @@ -805,6 +807,7 @@ static int process_inode_item(struct extent_buffer *eb,
>  {
>   struct inode_record *rec;
>   struct btrfs_inode_item *item;
> + u64 flags;
>  
>   rec = active_node->current;
>   BUG_ON(rec->ino != key->objectid || rec->refs > 1);
> @@ -822,6 +825,10 @@ static int process_inode_item(struct extent_buffer *eb,
>   rec->found_inode_item = 1;
>   if (rec->nlink == 0)
>   rec->errors |= I_ERR_NO_ORPHAN_ITEM;
> + flags = btrfs_inode_flags(eb, item);
> + if (rec->imode & BTRFS_FT_SYMLINK &&

Hello,

I observed that this commit causes test-convert/009 in current kdave/devel 
branch.
Since rec->imode uses S_IFLNK (0xa000) for symbolic link and BTRFS_FT_SYMLINK 
is 7,
above statement does not work well. Shouldn't we use S_ISLNK(rec->imode) 
instead?

Thanks,
Tomohiro Misono

> + flags & (BTRFS_INODE_IMMUTABLE | BTRFS_INODE_APPEND))
> + rec->errors |= I_ERR_ODD_INODE_FLAGS;
>   maybe_free_inode_rec(_node->inode_cache, rec);
>   return 0;
>  }
> diff --git a/check/mode-original.h b/check/mode-original.h
> index 368de692fdd1..13cfa5b9e1b3 100644
> --- a/check/mode-original.h
> +++ b/check/mode-original.h
> @@ -186,6 +186,7 @@ struct file_extent_hole {
>  #define I_ERR_LINK_COUNT_WRONG   (1 << 13)
>  #define I_ERR_FILE_EXTENT_ORPHAN (1 << 14)
>  #define I_ERR_FILE_EXTENT_TOO_LARGE  (1 << 15)
> +#define I_ERR_ODD_INODE_FLAGS(1 << 16)
>  
>  struct inode_record {
>   struct list_head backrefs;
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting a unique ino/dev pair to user space

2018-06-06 Thread Mark Fasheh
Hi Ian,

On Thu, Jun 07, 2018 at 08:47:28AM +0800, Ian Kent wrote:
> On Wed, 2018-06-06 at 23:38 +0200, Mark Fasheh wrote:
> > Hi,
> 
> I'm not sure I understand what the problem is.

I'll try to elaborate below.


> > We have an inconsistency in how the kernel is exporting inode number /
> > device pairs for user space. There's of course stat(2) and statx(2),
> > but aside from those we simply dump inode->i_ino and super->s_dev. In
> > some cases, the dumped values differ from what is returned via stat(2)
> > or statx(2). Some filesystems might even show duplicate (but
> > internally different!) pairs when the raw i_ino/s_dev is used.
> 
> How is it that you can dump the raw ino and s_dev if your not in
> kernel code?

If you look below my first paragraph, you'll see a list of places where the
kernel publishes (maybe that's a better word?) ino/dev pairs by printing or
otherwise copying raw ino/s_dev values into user accesible buffers.


> For stat family system calls, if the file system defines the inode
> operation getattr it will be used to fill the stat structure otherwise
> the VFS will fill the stat  structure and it will use inode->i_ino and
> sb->s_dev as you say.

My concern is that those pairs are sometimes not unique and do not line up
with what statx(2) returns. We actually need the true inode / device as is
returned by statx in those places. I go into much more detail in my original
mail.


> > Some examples where we dump raw ino/dev:
> > 
> > - /proc//maps. I've written about how this confuses lsof(8):
> >   https://marc.info/?l=linux-btrfs=130074451403261=2
> > 
> > - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
> >   trace/events/lock.h or trace/events/writeback.h for examples.
> > 
> > - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
> > 
> > - Audit records the raw ino/dev and passes them around. We do seem to
> >   have paths printed from audit as well, but if it's printed with the
> >   wrong ino/dev pair I believe my point still stands.
> > 
> > 
> > This breaks software which expects these pairs to be unique, and can
> > put the user in a situation where they might not be able to find an
> > inode referenced from the kernel. What's even worse - depending on how
> > ino is exported, they might even find the *wrong* inode.

Thanks,
--Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC ver.B] btrfs: scrub: Don't use inode pages for device replace

2018-06-06 Thread james harvey
On Tue, Jun 5, 2018 at 12:36 AM, Qu Wenruo  wrote:
> [BUG]
> Btrfs can easily create compressed extent without checksum (even
> though it shouldn't), and if we then try to replace device containing
> such extent, the result device will contain all the uncompressed data
> instead of the compressed one.
>
> Test case already submitted to fstests:
> https://patchwork.kernel.org/patch/10442353/
>
> [CAUSE]
> When handling compressed extent without checksum, device replace will
> goes into copy_nocow_pages() function.
>
> In that function, btrfs will get all inodes referring to this data
> extents and then use find_or_create_page() to get pages direct from that
> inode.
>
> The problem here is, pages directly from inode are always uncompressed.
> And for compressed data extent, they mismatch with on-disk data.
> Thus this leads to corrupted compressed data extent written to replace
> device.
>
> [FIX]
> In this attempt, we could just remove the "optimization" branch, and let
> unified scrub_pages() to handle it.
>
> Although scrub_pages() won't bother reusing page cache, it will be a
> little slower, but it does the correct csum checking and won't cause
> such data corruption cause by "optimization".
>
> Reported-by: James Harvey 
> Signed-off-by: Qu Wenruo 

Reviewed-by: James Harvey 



As expected, this fixes the problem.

I originally ran into this running btrfs replace on devid 1, on a 3
device RAID1.

I temporarily restored dd images of devid 2 & 3, from before I ran the
original replace.  I was able to run "btrfs replace" with a btrfs
module loaded with this patch.

There's definitely no corruption.  I excessively checked, multiple ways.

I haven't tried with the previous RFC patch since it looks like this
one is getting used instead, but if requested to, I'd try that one.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Exporting a unique ino/dev pair to user space

2018-06-06 Thread Ian Kent
On Wed, 2018-06-06 at 23:38 +0200, Mark Fasheh wrote:
> Hi,

I'm not sure I understand what the problem is.

> 
> We have an inconsistency in how the kernel is exporting inode number /
> device pairs for user space. There's of course stat(2) and statx(2),
> but aside from those we simply dump inode->i_ino and super->s_dev. In
> some cases, the dumped values differ from what is returned via stat(2)
> or statx(2). Some filesystems might even show duplicate (but
> internally different!) pairs when the raw i_ino/s_dev is used.

How is it that you can dump the raw ino and s_dev if your not in
kernel code?

For stat family system calls, if the file system defines the inode
operation getattr it will be used to fill the stat structure otherwise
the VFS will fill the stat  structure and it will use inode->i_ino and
sb->s_dev as you say.

> 
> Some examples where we dump raw ino/dev:
> 
> - /proc//maps. I've written about how this confuses lsof(8):
>   https://marc.info/?l=linux-btrfs=130074451403261=2
> 
> - Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
>   trace/events/lock.h or trace/events/writeback.h for examples.
> 
> - eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()
> 
> - Audit records the raw ino/dev and passes them around. We do seem to
>   have paths printed from audit as well, but if it's printed with the
>   wrong ino/dev pair I believe my point still stands.
> 
> 
> This breaks software which expects these pairs to be unique, and can
> put the user in a situation where they might not be able to find an
> inode referenced from the kernel. What's even worse - depending on how
> ino is exported, they might even find the *wrong* inode.
> 
> I also should point out that we're likely at this point because
> stat(2) has been using an unsigned long for ino. On a 32 bit system,
> it would have been impossible for the user to get the real inode
> number in the first place. So there probably wasn't much we could do.
> 
> These days though, we have statx(2) which will do the right thing on
> all platforms so we no longer have that excuse. The user ought to be
> able to take an ino/dev pair and ultimately find the actual file on
> their system, partially with the help of statx(2).
> 
> 
> Some examples of how ino is manipulated by filesystems:
> 
> - On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
>   what I can tell). stat->dev is not always consistent with super->s_dev.
> 
> - On 32 bits, many filesystems employ a transformation to squeeze a 64
>   bit identifier into 32 bits. The exact methods are fs specific,
>   what's important is that we're losing information, introducing the
>   possibility of duplicate inode numbers.
> 
> - On all platforms, Btrfs and Overlayfs can have duplicate inode
>   numbers. Of course, device can be different across the fs as well
>   with the kernel reporting s_dev and these filesystems returning
>   a different device via stat() or statx().
> 
> 
> Getting the inode number portion of this pair fixed would immediately
> solve the situation for all filesystems except Btrfs and
> Overlayfs - they report a different device from stat.
> 
> Regarding the device portion of the pair, I'm honestly not sure
> whether Overlayfs cares, and my attempts to fix the s_dev situation
> for Btrfs have all come to the same dead ends that I've hit briefly
> looking into this inode number issue - the problems are intrinsically
> linked.
> 
> So my questions are:
> 
> 1) Do we care about this? On one hand, we've had this inconsistency
>for a long time, for various reasons. On the other hand, I can point
>to bugzilla's where these inconsistencies have become a problem.
> 
>In the case that we don't care, any fs-internal solutions are
>likely to be extremely disruptive to the end user.
> 
> 
> 2) If we do care, how do we fix this?
> 
>  2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
>  downsides from a memory usage standpoint. Also it doesn't fully
>  address the issue - we still have a device field that Btrfs and
>  Overlayfs override.
> 
>  We could combine this with an intermediate structure between the
>  inode and super block so s_dev could be abstracted out. See my
>  fs_view patches for an example of how this could be done:
>  https://www.spinics.net/lists/linux-fsdevel/msg125492.html
> 
>  2b) Do we call ->getattr for this information? Plumbing a vfsmount to
>  various regions of the kernel like audit, or some of the deeper
>  tracepoints looks ugly and prone to life-timing issues (which
>  vfsmount do we use?). On the upside, we could probably make it
>  really light by only sending down the STATX_INO flag and letting
>  the filesystem optimize accordingly.
> 
>  2c) I don't think we can really use a dedicated callback without
>  passing the vfsmount through since Overlayfs ->getattr might call
>  the lower fs ->getattr. At that point we might as well 

Exporting a unique ino/dev pair to user space

2018-06-06 Thread Mark Fasheh
Hi,

We have an inconsistency in how the kernel is exporting inode number /
device pairs for user space. There's of course stat(2) and statx(2),
but aside from those we simply dump inode->i_ino and super->s_dev. In
some cases, the dumped values differ from what is returned via stat(2)
or statx(2). Some filesystems might even show duplicate (but
internally different!) pairs when the raw i_ino/s_dev is used.

Some examples where we dump raw ino/dev:

- /proc//maps. I've written about how this confuses lsof(8):
  https://marc.info/?l=linux-btrfs=130074451403261=2

- Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
  trace/events/lock.h or trace/events/writeback.h for examples.

- eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()

- Audit records the raw ino/dev and passes them around. We do seem to
  have paths printed from audit as well, but if it's printed with the
  wrong ino/dev pair I believe my point still stands.


This breaks software which expects these pairs to be unique, and can
put the user in a situation where they might not be able to find an
inode referenced from the kernel. What's even worse - depending on how
ino is exported, they might even find the *wrong* inode.

I also should point out that we're likely at this point because
stat(2) has been using an unsigned long for ino. On a 32 bit system,
it would have been impossible for the user to get the real inode
number in the first place. So there probably wasn't much we could do.

These days though, we have statx(2) which will do the right thing on
all platforms so we no longer have that excuse. The user ought to be
able to take an ino/dev pair and ultimately find the actual file on
their system, partially with the help of statx(2).


Some examples of how ino is manipulated by filesystems:

- On 64 bit i_ino and kstat->ino tend to be filled in correctly (from
  what I can tell). stat->dev is not always consistent with super->s_dev.

- On 32 bits, many filesystems employ a transformation to squeeze a 64
  bit identifier into 32 bits. The exact methods are fs specific,
  what's important is that we're losing information, introducing the
  possibility of duplicate inode numbers.

- On all platforms, Btrfs and Overlayfs can have duplicate inode
  numbers. Of course, device can be different across the fs as well
  with the kernel reporting s_dev and these filesystems returning
  a different device via stat() or statx().


Getting the inode number portion of this pair fixed would immediately
solve the situation for all filesystems except Btrfs and
Overlayfs - they report a different device from stat.

Regarding the device portion of the pair, I'm honestly not sure
whether Overlayfs cares, and my attempts to fix the s_dev situation
for Btrfs have all come to the same dead ends that I've hit briefly
looking into this inode number issue - the problems are intrinsically
linked.

So my questions are:

1) Do we care about this? On one hand, we've had this inconsistency
   for a long time, for various reasons. On the other hand, I can point
   to bugzilla's where these inconsistencies have become a problem.

   In the case that we don't care, any fs-internal solutions are
   likely to be extremely disruptive to the end user.


2) If we do care, how do we fix this?

 2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
 downsides from a memory usage standpoint. Also it doesn't fully
 address the issue - we still have a device field that Btrfs and
 Overlayfs override.

 We could combine this with an intermediate structure between the
 inode and super block so s_dev could be abstracted out. See my
 fs_view patches for an example of how this could be done:
 https://www.spinics.net/lists/linux-fsdevel/msg125492.html

 2b) Do we call ->getattr for this information? Plumbing a vfsmount to
 various regions of the kernel like audit, or some of the deeper
 tracepoints looks ugly and prone to life-timing issues (which
 vfsmount do we use?). On the upside, we could probably make it
 really light by only sending down the STATX_INO flag and letting
 the filesystem optimize accordingly.

 2c) I don't think we can really use a dedicated callback without
 passing the vfsmount through since Overlayfs ->getattr might call
 the lower fs ->getattr. At that point we might as well use getattr.

I'd appreciate any comments or suggestions.

Thanks,
  --Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread james harvey
On Wed, Jun 6, 2018 at 3:06 PM, Marc Lehmann  wrote:
> On Tue, Jun 05, 2018 at 05:52:38PM -0400, james harvey 
>  wrote:
>> >> This is not always reproducible, but when deleting our journal, creating 
>> >> log
>> >> messages for a few hours and then doing the above manually has a ~50% 
>> >> chance of
>> >> corrupting the journal.
>> ...
>>
>> My strong bet is you have a hardware issue.
>
> Strange, what kind of harwdare bug would affect multiple very different
> computers in exactly the same way?

Oops.  I missed when you clearly said: "All of this is reproducible on
two different boxes, so is unlikely to be a hardware issue."  I ran
into all these problems ultimately because of a badly designed Marvell
SATA controller.  I thought I had ruled out hardware issues by having
2 identical systems, and reproducing the problem on both.  Certainly
makes a hardware issue for you much less likely, especially if "very
different computers" means different motherboards.

FWIW, I have dropped caches a lot lately (not nearly as much as your
crons) and haven't had it corrupt anything, even in proximity to heavy
I/O.

>> going bad, bad cables, bad port, etc.  My strong bet is you're also
>> using BTRFS mirroring.
>
> Not sure what exactly you mean with btrfs mirroring (there are many btrfs
> features this could refer to), but the closest thing to that that I use is
> dup for metadata (which is always checksummed), data is always single. All
> btrfs filesystems are on lvm (not mirrored), and most (but not all) are
> encrypted. One affected fs is on a hardware raid controller, one is on an
> ssd. I have a single btrfs fs in that box with raid1 for metadata, as an
> experiment, but I haven't used it for testing yet.

Was referring to any type of data mirroring.  Data dup, btrfs
RAID1/5/6/10.  But, I see that's not the case here.

>> You're describing intermittent data corruption on files that I'm
>> thinking all have NOCOW turned on.
>
> The systemd journal files are nocow (I re-enabled that after I turned it
> off for a while), but the rtorrent directory (and the files in it) are
> not.
>
> I did experiment (a year ago) with nocow for torrent files and, more
> importantly, vm images, but it didn't really solve the "millions of
> fragments slow down" problem with btrfs, so I figured I can keep them cow
> and regularly copy them to defragment them. Thats why I am quite sure cow
> is switched on long before I booted my first 4.14 kernel (and it still
> is).

Yeah, with data single, you wouldn't be seeing intermittent problems
if it was related to the bugs I was talking about.

>> it's done writing to a journal file, but in a way that guarantees it
>> to fail.  This has been reported to systemd at
>> https://github.com/systemd/systemd/issues/9112 but poettering has
>
> I am aware that systemd tries to turn on nocow, and I think this is actually
> a bug, but this wouldn't have an an effect on rtorrent, which has corruption
> problems on a different fs. And boy would it be wonderufl if Debian switched
> away form systemd, I feel I personally ran into every single bug that
> exists...

systemd turning on NOCOW isn't a bug.  systemd 219 intentionally
turned on NOCOW for journal files, attempting to improve performance
on btrfs.  220 made it user-configurable, defaulting to turning on
NOCOW.  But, yeah, the bugs I was talking about wouldn't affect
rtorrent files on a different fs, since you have NOCOW off on them,
and since they're data single.

> However, no matter how much systemd plays with btrfs flags, it shouldn't
> corrupt data.

Yeah, it doesn't in itself.  Just makes them susceptible to one disk
corruption that btrfs would otherwise protect against with data
checksums.  And, if using compression and btrfs replace on current
kernels, guarantees them to be corrupted.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Marc Lehmann
On Tue, Jun 05, 2018 at 05:52:38PM -0400, james harvey 
 wrote:
> >> This is not always reproducible, but when deleting our journal, creating 
> >> log
> >> messages for a few hours and then doing the above manually has a ~50% 
> >> chance of
> >> corrupting the journal.
> ...
> 
> My strong bet is you have a hardware issue.

Strange, what kind of harwdare bug would affect multiple very different
computers in exactly the same way?

> going bad, bad cables, bad port, etc.  My strong bet is you're also
> using BTRFS mirroring.

Not sure what exactly you mean with btrfs mirroring (there are many btrfs
features this could refer to), but the closest thing to that that I use is
dup for metadata (which is always checksummed), data is always single. All
btrfs filesystems are on lvm (not mirrored), and most (but not all) are
encrypted. One affected fs is on a hardware raid controller, one is on an
ssd. I have a single btrfs fs in that box with raid1 for metadata, as an
experiment, but I haven't used it for testing yet.

> You're describing intermittent data corruption on files that I'm
> thinking all have NOCOW turned on.

The systemd journal files are nocow (I re-enabled that after I turned it
off for a while), but the rtorrent directory (and the files in it) are
not.

I did experiment (a year ago) with nocow for torrent files and, more
importantly, vm images, but it didn't really solve the "millions of
fragments slow down" problem with btrfs, so I figured I can keep them cow
and regularly copy them to defragment them. Thats why I am quite sure cow
is switched on long before I booted my first 4.14 kernel (and it still
is).

> it's done writing to a journal file, but in a way that guarantees it
> to fail.  This has been reported to systemd at
> https://github.com/systemd/systemd/issues/9112 but poettering has

I am aware that systemd tries to turn on nocow, and I think this is actually
a bug, but this wouldn't have an an effect on rtorrent, which has corruption
problems on a different fs. And boy would it be wonderufl if Debian switched
away form systemd, I feel I personally ran into every single bug that
exists...

However, no matter how much systemd plays with btrfs flags, it shouldn't
corrupt data.

> The context I ran into this problem was with several other bugs
> interacting, that "btrfs replace" has been guaranteed to corrupt
> non-checksummed (NOCOW) compressed data, which the combination of
> those shouldn't happen, but does in some defragmentation situations
> due to another bug.  In my situation, I don't have a hardware issue.

Yeah, btrfs is full of bugs that I constantly run into, but most of them
are containable, unlikely this problem, which might or might not be a
btrfs bug - especially since all your bets seem to be wrong here.

-- 
The choice of a   Deliantra, the free code+content MORPG
  -==- _GNU_  http://www.deliantra.net
  ==-- _   generation
  ---==---(_)__  __   __  Marc Lehmann
  --==---/ / _ \/ // /\ \/ /  schm...@schmorp.de
  -=/_/_//_/\_,_/ /_/\_\
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: fix race between free_stale_devices and close_fs_devices

2018-06-06 Thread Anand Jain
%fs_devices can be free-ed by btrfs_free_stale_devices() when the
close_fs_devices() drops fs_devices::opened to zero, but close_fs_devices
tries to access the %fs_devices again without the device_list_mutex.

Fix this by bringing the %fs_devices access with in the device_list_mutex.

Stack trace as below.

HEAD commit:716a685fdb89 Merge branch 'x86-hyperv-for-linus' of git://..
::
CPU: 1 PID: 4499 Comm: syz-executor921 Not tainted 4.17.0+ #84
::
WARNING: CPU: 1 PID: 4499 at fs/btrfs/volumes.c:1071 
close_fs_devices+0xbc7/0xfa0 fs/btrfs/volumes.c:1071
Kernel panic - not syncing: panic_on_warn set ...
::
RIP: 0010:close_fs_devices+0xbc7/0xfa0 fs/btrfs/volumes.c:1071
::
 btrfs_close_devices+0x29/0x150 fs/btrfs/volumes.c:1085
 open_ctree+0x589/0x7898 fs/btrfs/disk-io.c:3358
 btrfs_fill_super fs/btrfs/super.c:1202 [inline]
 btrfs_mount_root+0x16df/0x1e70 fs/btrfs/super.c:1593
 mount_fs+0xae/0x328 fs/super.c:1277
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount+0x40/0x60 fs/namespace.c:1027
 btrfs_mount+0x4a1/0x213e fs/btrfs/super.c:1661
 mount_fs+0xae/0x328 fs/super.c:1277

Reported-by: syzbot+ceb2606025ec1cc34...@syzkaller.appspotmail.com
Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c2b7d66192e8..32fba4e24027 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1153,6 +1153,12 @@ static int close_fs_devices(struct btrfs_fs_devices 
*fs_devices)
btrfs_prepare_close_one_device(device);
list_add(>dev_list, _put);
}
+
+   WARN_ON(fs_devices->open_devices);
+   WARN_ON(fs_devices->rw_devices);
+   fs_devices->opened = 0;
+   clear_bit(BTRFS_VOLUME_STATE_SEEDING, _devices->volume_state);
+
mutex_unlock(_devices->device_list_mutex);
 
/*
@@ -1169,11 +1175,6 @@ static int close_fs_devices(struct btrfs_fs_devices 
*fs_devices)
call_rcu(>rcu, free_device_rcu);
}
 
-   WARN_ON(fs_devices->open_devices);
-   WARN_ON(fs_devices->rw_devices);
-   fs_devices->opened = 0;
-   clear_bit(BTRFS_VOLUME_STATE_SEEDING, _devices->volume_state);
-
return 0;
 }
 
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: fi-usage: fix RAID10 raw disk usage

2018-06-06 Thread David Sterba
On Sat, Jun 02, 2018 at 11:26:41PM +0200, Hans van Kranenburg wrote:
> In case of RAID10, fi usage is reporting half the amount of allocated
> space and twice the amount of unallocated disk space. Let's fix this.
> 
> For example, a RAID10 chunk of 3GiB with num_stripes 6, half of the
> stripes (dev extents) are a mirror of the other half, so the 3GiB of
> actual data has to be divided by 3, and not 6 to get the size of each of
> those device extents.
> 
> Signed-off-by: Hans van Kranenburg 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: mkfs: Fix traverse_directory() silently failing on some dirs

2018-06-06 Thread David Sterba
On Sat, Jun 02, 2018 at 11:26:11PM +0300, Yevgeny Popovych wrote:
> When traverse_directory() encounters an inode item that already exists
> and has a normal amount of hardlinks - it just continues with a next one,
> w/o clearing the ret value (set to -EEXIST).
> 
> But, if the last file traverse_directory() processes already has an
> inode item - traverse_directory() will silently exit with a
> bad return code, causing the prints like the following:
> 
> unable to traverse directory initial-directory: 1
> error wihle filling filesystem: 1
> 
> Fix this by clearing ret value before continuing with a next file.
> 
> Signed-off-by: Yevgeny Popovych 

Thanks, there was a fix for that in pull request 124 and is in the devel
branch already.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: mkfs: Fix typos

2018-06-06 Thread David Sterba
On Sat, Jun 02, 2018 at 11:30:22PM +0300, Yevgeny Popovych wrote:
> Signed-off-by: Yevgeny Popovych 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Liu Bo
On Wed, Jun 6, 2018 at 9:44 PM, Chris Mason  wrote:
> On 6 Jun 2018, at 9:38, Liu Bo wrote:
>
>> On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason  wrote:
>>>
>>>
>>>
>>> On 5 Jun 2018, at 16:03, Andrew Morton wrote:
>>>
 (switched to email.  Please respond via emailed reply-to-all, not via
 the
 bugzilla web interface).

 On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org
 wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>
> Bug ID: 199931
>Summary: systemd/rtorrent file data corruption when using
> echo
> 3 >/proc/sys/vm/drop_caches



 A long tale of woe here.  Chris, do you think the pagecache corruption
 is a general thing, or is it possible that btrfs is contributing?

 Also, that 4.4 oom-killer regression sounds very serious.
>>>
>>>
>>>
>>> This week I found a bug in btrfs file write with how we handle stable
>>> pages.
>>> Basically it works like this:
>>>
>>> write(fd, some bytes less than a page)
>>> write(fd, some bytes into the same page)
>>> btrfs prefaults the userland page
>>> lock_and_cleanup_extent_if_need()   <- stable pages
>>> wait for writeback()
>>> clear_page_dirty_for_io()
>>>
>>> At this point we have a page that was dirty and is now clean.  That's
>>> normally fine, unless our prefaulted page isn't in ram anymore.
>>>
>>> iov_iter_copy_from_user_atomic() <--- uh oh
>>>
>>> If the copy_from_user fails, we drop all our locks and retry.  But along
>>> the
>>> way, we completely lost the dirty bit on the page.  If the page is
>>> dropped
>>> by drop_caches, the writes are lost.  We'll just read back the stale
>>> contents of that page during the retry loop.  This won't result in crc
>>> errors because the bytes we lost were never crc'd.
>>>
>>
>> So we're going to carefully redirty the page under the page lock, right?
>
>
> I don't think we actually need to clean it.  We have the page locked,
> writeback won't start until we unlock.
>

My concern is that the buggy thing is similar to compression path,
where we also did the trick of clear_page_dirty_for_io and
redirty_pages to avoid any faults wandering in and changing pages
underneath, but seems here we're fine if pages get changed in between.

>>
>>> It could result in zeros in the file because we're basically reading a
>>> hole,
>>> and those zeros could move around in the page depending on which part of
>>> the
>>> page was dirty when the writes were lost.
>>>
>>
>> I got a question, while re-reading this page, wouldn't it read
>> old/stale on-disk data?
>
>
> If it was never written we should be treating it like a hole, but I'll
> double check.
>

Okay, I think this would also happen in the overwrite case, where
stale data lies on disk.

thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Chris Mason

On 6 Jun 2018, at 9:38, Liu Bo wrote:


On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason  wrote:



On 5 Jun 2018, at 16:03, Andrew Morton wrote:

(switched to email.  Please respond via emailed reply-to-all, not 
via the

bugzilla web interface).

On Tue, 05 Jun 2018 18:01:36 + 
bugzilla-dae...@bugzilla.kernel.org

wrote:


https://bugzilla.kernel.org/show_bug.cgi?id=199931

Bug ID: 199931
   Summary: systemd/rtorrent file data corruption when 
using echo

3 >/proc/sys/vm/drop_caches



A long tale of woe here.  Chris, do you think the pagecache 
corruption

is a general thing, or is it possible that btrfs is contributing?

Also, that 4.4 oom-killer regression sounds very serious.



This week I found a bug in btrfs file write with how we handle stable 
pages.

Basically it works like this:

write(fd, some bytes less than a page)
write(fd, some bytes into the same page)
btrfs prefaults the userland page
lock_and_cleanup_extent_if_need()   <- stable pages
wait for writeback()
clear_page_dirty_for_io()

At this point we have a page that was dirty and is now clean.  That's
normally fine, unless our prefaulted page isn't in ram anymore.

iov_iter_copy_from_user_atomic() <--- uh oh

If the copy_from_user fails, we drop all our locks and retry.  But 
along the
way, we completely lost the dirty bit on the page.  If the page is 
dropped

by drop_caches, the writes are lost.  We'll just read back the stale
contents of that page during the retry loop.  This won't result in 
crc

errors because the bytes we lost were never crc'd.



So we're going to carefully redirty the page under the page lock, 
right?


I don't think we actually need to clean it.  We have the page locked, 
writeback won't start until we unlock.




It could result in zeros in the file because we're basically reading 
a hole,
and those zeros could move around in the page depending on which part 
of the

page was dirty when the writes were lost.



I got a question, while re-reading this page, wouldn't it read
old/stale on-disk data?


If it was never written we should be treating it like a hole, but I'll 
double check.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Liu Bo
On Wed, Jun 6, 2018 at 8:18 AM, Chris Mason  wrote:
>
>
> On 5 Jun 2018, at 16:03, Andrew Morton wrote:
>
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Tue, 05 Jun 2018 18:01:36 + bugzilla-dae...@bugzilla.kernel.org
>> wrote:
>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=199931
>>>
>>> Bug ID: 199931
>>>Summary: systemd/rtorrent file data corruption when using echo
>>> 3 >/proc/sys/vm/drop_caches
>>
>>
>> A long tale of woe here.  Chris, do you think the pagecache corruption
>> is a general thing, or is it possible that btrfs is contributing?
>>
>> Also, that 4.4 oom-killer regression sounds very serious.
>
>
> This week I found a bug in btrfs file write with how we handle stable pages.
> Basically it works like this:
>
> write(fd, some bytes less than a page)
> write(fd, some bytes into the same page)
> btrfs prefaults the userland page
> lock_and_cleanup_extent_if_need()   <- stable pages
> wait for writeback()
> clear_page_dirty_for_io()
>
> At this point we have a page that was dirty and is now clean.  That's
> normally fine, unless our prefaulted page isn't in ram anymore.
>
> iov_iter_copy_from_user_atomic() <--- uh oh
>
> If the copy_from_user fails, we drop all our locks and retry.  But along the
> way, we completely lost the dirty bit on the page.  If the page is dropped
> by drop_caches, the writes are lost.  We'll just read back the stale
> contents of that page during the retry loop.  This won't result in crc
> errors because the bytes we lost were never crc'd.
>

So we're going to carefully redirty the page under the page lock, right?

> It could result in zeros in the file because we're basically reading a hole,
> and those zeros could move around in the page depending on which part of the
> page was dirty when the writes were lost.
>

I got a question, while re-reading this page, wouldn't it read
old/stale on-disk data?

thanks,
liubo

> I spent a morning trying to trigger this with drop_caches and couldn't make
> it happen, even with schedule_timeout()s inserted and other tricks.  But I
> was able to get corruptions if I manually invalidated pages in the critical
> section.
>
> I'm working on a patch, and I'll check and see if any of the other recent
> fixes Dave integrated may have a less exotic explanation.
>
> -chris
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC ver.B] btrfs: scrub: Don't use inode pages for device replace

2018-06-06 Thread David Sterba
On Tue, Jun 05, 2018 at 05:30:18PM +0300, Nikolay Borisov wrote:
> On  5.06.2018 17:24, David Sterba wrote:
> > On Tue, Jun 05, 2018 at 10:14:51PM +0800, Qu Wenruo wrote:
> >>> and then the whole callchain of copy_nocow_pages continues.
> >>
> >> Understood.
> >> I could go this method.
> >>
> >> However I'm a little concerned about such "if (0 &&" usage.
> >>
> >> Although with gcc 8.1 it works without any warning, it still looks
> >> pretty strange, as compiler could one day detect such dead branch and
> >> find copy_nocow_pages() is never used.
> > 
> > I've checked that this does not produce any warnings, gcc 7.3.1. The
> > condition looks strange but does what we want. The whole series will be
> > in 4.18, the first patch in stable versions. If gcc does not warn today,
> > it will not in the future in any of the versions.
> 
> WOuld it be possible to then take Qu's patch which deletes a lot of code
> for 4.19?

Of course, the deletion part will follow, we just need to split it so
the patches do not cause backporting conflicts.  Reading what I wrote
above, it's not clearl but it was the intention.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS losing SE Linux labels on power failure or "reboot -nffd"

2018-06-06 Thread Russell Coker
https://www.spinics.net/lists/linux-btrfs/msg77927.html

Thanks to Hans van Kranenburg and Holger Hoffstätte, the above has the link to 
a message with the patch for this which was already included in kernel 4.16.11 
which was uploaded to Debian on the 27th of May and got into testing about the 
time that my message got to the SE Linux list.

The kernel from Debian/Stable still has the issue.  So using a testing kernel 
might be a good option to deal with this problem at the moment.

On Monday, 4 June 2018 11:14:52 PM AEST Russell Coker wrote:
> The command "reboot -nffd" (kernel reboot without flushing kernel buffers or
> writing status) when run on a BTRFS system with SE Linux will often result
> in /var/log/audit/audit.log being unlabeled.  It also results in some
> systemd- journald files like
> /var/log/journal/c195779d29154ed8bcb4e8444c4a1728/ system.journal being
> unlabeled but that is rarer.  I think that the same problem afflicts both
> systemd-journald and auditd but it's a race condition that on my systems
> (both production and test) is more likely to affect auditd.
> 
> root@stretch:/# xattr -l /var/log/audit/audit.log
> security.selinux:
>    73 79 73 74 65 6D 5F 75 3A 6F 62 6A 65 63 74 5Fsystem_u:object_
> 0010   72 3A 61 75 64 69 74 64 5F 6C 6F 67 5F 74 3A 73r:auditd_log_t:s
> 0020   30 00  0.
> 
> SE Linux uses the xattr "security.selinux", you can see what it's doing with
> xattr(1) but generally using "ls -Z" is easiest.
> 
> If this issue just affected "reboot -nffd" then a solution might be to just
> not run that command.  However this affects systems after a power outage.
> 
> I have reproduced this bug with kernel 4.9.0-6-amd64 (the latest security
> update for Debian/Stretch which is the latest supported release of Debian). 
> I have also reproduced it in an identical manner with kernel 4.16.0-1-amd64
> (the latest from Debian/Unstable).  For testing I reproduced this with a 4G
> filesystem in a VM, but in production it has happened on BTRFS RAID-1
> arrays, both SSD and HDD.
> 
> #!/bin/bash
> set -e
> COUNT=$(ps aux|grep [s]bin/auditd|wc -l)
> date
> if [ "$COUNT" = "1" ]; then
>  echo "all good"
> else
>  echo "failed"
>  exit 1
> fi
> 
> Firstly the above is the script /usr/local/sbin/testit, I test for auditd
> running because it aborts if the context on it's log file is wrong.  When SE
> Linux is in enforcing mode an incorrect/missing label on the audit.log file
> causes auditd to abort.
> 
> root@stretch:~# ls -liZ /var/log/audit/audit.log
> 37952 -rw---. 1 root root system_u:object_r:auditd_log_t:s0 4385230 Jun 
> 1 12:23 /var/log/audit/audit.log
> Above is before I do the tests.
> 
> while ssh stretch /usr/local/sbin/testit ; do
>  ssh btrfs-local "reboot -nffd" > /dev/null 2>&1 &
>  sleep 20
> done
> Above is the shell code I run to do the tests.  Note that the VM in question
> runs on SSD storage which is why it can consistently boot in less than 20
> seconds.
> 
> Fri  1 Jun 12:26:13 UTC 2018
> all good
> Fri  1 Jun 12:26:33 UTC 2018
> failed
> Above is the output from the shell code in question.  After the first reboot
> it fails.  The probability of failure on my test system is greater than
> 50%.
> 
> root@stretch:~# ls -liZ /var/log/audit/audit.log
> 37952 -rw---. 1 root root system_u:object_r:unlabeled_t:s0 4396803 Jun 
> 1 12:26 /var/log/audit/audit.log
> Now the result.  Note that the Inode has not changed.  I could understand a
> newly created file missing an xattr, but this is an existing file which
> shouldn't have had it's xattr changed.  But somehow it gets corrupted.
> 
> The first possibility I considered was that SE Linux code might be at fault.
> I asked on the SE Linux mailing list (I haven't been involved in SE Linux
> kernel code for about 15 years) and was informed that this isn't likely at
> all.  There have been no problems like this reported with other
> filesystems.
> 
> Does anyone have any ideas of other tests I should run?  Anyone want me to
> try a different kernel?  I can give root on a VM to anyone who wants to
> poke at it.


-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-06 Thread Michal Hocko
On Tue 05-06-18 13:03:29, Andrew Morton wrote:
[...]
> > As for why we would do something silly as dropping the caches every hour 
> > (in a
> > cronjob), we started doing this recently because after kernel 4.4, we got
> > frequent OOM kills despite having gigabytes of available memory (e.g. 12GB 
> > in
> > use, 20GB page cache and 16GB empty swap and bang, mysql gets killed). We 
> > found
> > that that the debian 4.9 kernel is unusable, and 4.14 works, *iff* we use 
> > the
> > above as an hourly cron job, so we did that, and afterwards run into
> > rtorrent/journald corruption issues. Without the echo in place, mysql 
> > usually
> > gets oom-killed after a few days of uptime.

Do you have any oom reports to share?
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/6] btrfs-progs: check/original: Detect and repair wrong inline ram_bytes

2018-06-06 Thread Su Yue



On 06/06/2018 04:26 PM, Qu Wenruo wrote:
> It looks like that around 2014, btrfs kernel has a regression that would
> cause offset-by-one ram_bytes for inline extent.
> 
> Add the ability to repair it in original mode.
> 
> Reported-by: Steve Leung 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Su Yue 

> ---
> changelog:
> v2:
>   Add extra output for print_inode_error().
>   Reword the repair output message.
>   Clear the error bitmap if repaired.
> ---
>  check/main.c  | 47 +--
>  check/mode-original.h |  1 +
>  2 files changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/check/main.c b/check/main.c
> index 52ef181add61..c739c094cef3 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -578,6 +578,8 @@ static void print_inode_error(struct btrfs_root *root, 
> struct inode_record *rec)
>   fprintf(stderr, ", orphan file extent");
>   if (errors & I_ERR_ODD_INODE_FLAGS)
>   fprintf(stderr, ", odd inode flags");
> + if (errors & I_ERR_INLINE_RAM_BYTES_WRONG)
> + fprintf(stderr, ", invalid inline ram bytes");
>   fprintf(stderr, "\n");
>   /* Print the orphan extents if needed */
>   if (errors & I_ERR_FILE_EXTENT_ORPHAN)
> @@ -1483,6 +1485,9 @@ static int process_file_extent(struct btrfs_root *root,
>   } else {
>   if (num_bytes > max_inline_size)
>   rec->errors |= I_ERR_FILE_EXTENT_TOO_LARGE;
> + if (btrfs_file_extent_inline_item_len(eb, item) !=
> + num_bytes)
> + rec->errors |= I_ERR_INLINE_RAM_BYTES_WRONG;
>   }
>   rec->found_size += num_bytes;
>   num_bytes = (num_bytes + mask) & ~mask;
> @@ -2534,6 +2539,41 @@ out:
>   return ret;
>  }
>  
> +static int repair_inline_ram_bytes(struct btrfs_trans_handle *trans,
> +struct btrfs_root *root,
> +struct btrfs_path *path,
> +struct inode_record *rec)
> +{
> + struct btrfs_key key;
> + struct btrfs_file_extent_item *fi;
> + struct btrfs_item *i;
> + u64 on_disk_item_len;
> + int ret;
> +
> + key.objectid = rec->ino;
> + key.offset = 0;
> + key.type = BTRFS_EXTENT_DATA_KEY;
> +
> + ret = btrfs_search_slot(trans, root, , path, 0, 1);
> + if (ret > 0)
> + ret = -ENOENT;
> + if (ret < 0)
> + goto out;
> +
> + i = btrfs_item_nr(path->slots[0]);
> + on_disk_item_len = btrfs_file_extent_inline_item_len(path->nodes[0], i);
> + fi = btrfs_item_ptr(path->nodes[0], path->slots[0],
> + struct btrfs_file_extent_item);
> + btrfs_set_file_extent_ram_bytes(path->nodes[0], fi, on_disk_item_len);
> + btrfs_mark_buffer_dirty(path->nodes[0]);
> + printf("Repaired inline ram_bytes for root %llu ino %llu\n",
> + root->objectid, rec->ino);
> + rec->errors &= ~I_ERR_INLINE_RAM_BYTES_WRONG;
> +out:
> + btrfs_release_path(path);
> + return ret;
> +}
> +
>  static int try_repair_inode(struct btrfs_root *root, struct inode_record 
> *rec)
>  {
>   struct btrfs_trans_handle *trans;
> @@ -2545,8 +2585,9 @@ static int try_repair_inode(struct btrfs_root *root, 
> struct inode_record *rec)
>I_ERR_LINK_COUNT_WRONG |
>I_ERR_NO_INODE_ITEM |
>I_ERR_FILE_EXTENT_ORPHAN |
> -  I_ERR_FILE_EXTENT_DISCOUNT|
> -  I_ERR_FILE_NBYTES_WRONG)))
> +  I_ERR_FILE_EXTENT_DISCOUNT |
> +  I_ERR_FILE_NBYTES_WRONG |
> +  I_ERR_INLINE_RAM_BYTES_WRONG)))
>   return rec->errors;
>  
>   /*
> @@ -2575,6 +2616,8 @@ static int try_repair_inode(struct btrfs_root *root, 
> struct inode_record *rec)
>   ret = repair_inode_nlinks(trans, root, , rec);
>   if (!ret && rec->errors & I_ERR_FILE_NBYTES_WRONG)
>   ret = repair_inode_nbytes(trans, root, , rec);
> + if (!ret && rec->errors & I_ERR_INLINE_RAM_BYTES_WRONG)
> + ret = repair_inline_ram_bytes(trans, root, , rec);
>   btrfs_commit_transaction(trans, root);
>   btrfs_release_path();
>   return ret;
> diff --git a/check/mode-original.h b/check/mode-original.h
> index 13cfa5b9e1b3..ec2842e0b3be 100644
> --- a/check/mode-original.h
> +++ b/check/mode-original.h
> @@ -187,6 +187,7 @@ struct file_extent_hole {
>  #define I_ERR_FILE_EXTENT_ORPHAN (1 << 14)
>  #define I_ERR_FILE_EXTENT_TOO_LARGE  (1 << 15)
>  #define I_ERR_ODD_INODE_FLAGS(1 << 16)
> +#define I_ERR_INLINE_RAM_BYTES_WRONG (1 << 17)
>  
>  struct inode_record {
>   struct list_head backrefs;
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to 

[PATCH v2 3/6] btrfs-progs: check/original: Detect and repair wrong inline ram_bytes

2018-06-06 Thread Qu Wenruo
It looks like that around 2014, btrfs kernel has a regression that would
cause offset-by-one ram_bytes for inline extent.

Add the ability to repair it in original mode.

Reported-by: Steve Leung 
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Add extra output for print_inode_error().
  Reword the repair output message.
  Clear the error bitmap if repaired.
---
 check/main.c  | 47 +--
 check/mode-original.h |  1 +
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/check/main.c b/check/main.c
index 52ef181add61..c739c094cef3 100644
--- a/check/main.c
+++ b/check/main.c
@@ -578,6 +578,8 @@ static void print_inode_error(struct btrfs_root *root, 
struct inode_record *rec)
fprintf(stderr, ", orphan file extent");
if (errors & I_ERR_ODD_INODE_FLAGS)
fprintf(stderr, ", odd inode flags");
+   if (errors & I_ERR_INLINE_RAM_BYTES_WRONG)
+   fprintf(stderr, ", invalid inline ram bytes");
fprintf(stderr, "\n");
/* Print the orphan extents if needed */
if (errors & I_ERR_FILE_EXTENT_ORPHAN)
@@ -1483,6 +1485,9 @@ static int process_file_extent(struct btrfs_root *root,
} else {
if (num_bytes > max_inline_size)
rec->errors |= I_ERR_FILE_EXTENT_TOO_LARGE;
+   if (btrfs_file_extent_inline_item_len(eb, item) !=
+   num_bytes)
+   rec->errors |= I_ERR_INLINE_RAM_BYTES_WRONG;
}
rec->found_size += num_bytes;
num_bytes = (num_bytes + mask) & ~mask;
@@ -2534,6 +2539,41 @@ out:
return ret;
 }
 
+static int repair_inline_ram_bytes(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct btrfs_path *path,
+  struct inode_record *rec)
+{
+   struct btrfs_key key;
+   struct btrfs_file_extent_item *fi;
+   struct btrfs_item *i;
+   u64 on_disk_item_len;
+   int ret;
+
+   key.objectid = rec->ino;
+   key.offset = 0;
+   key.type = BTRFS_EXTENT_DATA_KEY;
+
+   ret = btrfs_search_slot(trans, root, , path, 0, 1);
+   if (ret > 0)
+   ret = -ENOENT;
+   if (ret < 0)
+   goto out;
+
+   i = btrfs_item_nr(path->slots[0]);
+   on_disk_item_len = btrfs_file_extent_inline_item_len(path->nodes[0], i);
+   fi = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_file_extent_item);
+   btrfs_set_file_extent_ram_bytes(path->nodes[0], fi, on_disk_item_len);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+   printf("Repaired inline ram_bytes for root %llu ino %llu\n",
+   root->objectid, rec->ino);
+   rec->errors &= ~I_ERR_INLINE_RAM_BYTES_WRONG;
+out:
+   btrfs_release_path(path);
+   return ret;
+}
+
 static int try_repair_inode(struct btrfs_root *root, struct inode_record *rec)
 {
struct btrfs_trans_handle *trans;
@@ -2545,8 +2585,9 @@ static int try_repair_inode(struct btrfs_root *root, 
struct inode_record *rec)
 I_ERR_LINK_COUNT_WRONG |
 I_ERR_NO_INODE_ITEM |
 I_ERR_FILE_EXTENT_ORPHAN |
-I_ERR_FILE_EXTENT_DISCOUNT|
-I_ERR_FILE_NBYTES_WRONG)))
+I_ERR_FILE_EXTENT_DISCOUNT |
+I_ERR_FILE_NBYTES_WRONG |
+I_ERR_INLINE_RAM_BYTES_WRONG)))
return rec->errors;
 
/*
@@ -2575,6 +2616,8 @@ static int try_repair_inode(struct btrfs_root *root, 
struct inode_record *rec)
ret = repair_inode_nlinks(trans, root, , rec);
if (!ret && rec->errors & I_ERR_FILE_NBYTES_WRONG)
ret = repair_inode_nbytes(trans, root, , rec);
+   if (!ret && rec->errors & I_ERR_INLINE_RAM_BYTES_WRONG)
+   ret = repair_inline_ram_bytes(trans, root, , rec);
btrfs_commit_transaction(trans, root);
btrfs_release_path();
return ret;
diff --git a/check/mode-original.h b/check/mode-original.h
index 13cfa5b9e1b3..ec2842e0b3be 100644
--- a/check/mode-original.h
+++ b/check/mode-original.h
@@ -187,6 +187,7 @@ struct file_extent_hole {
 #define I_ERR_FILE_EXTENT_ORPHAN   (1 << 14)
 #define I_ERR_FILE_EXTENT_TOO_LARGE(1 << 15)
 #define I_ERR_ODD_INODE_FLAGS  (1 << 16)
+#define I_ERR_INLINE_RAM_BYTES_WRONG   (1 << 17)
 
 struct inode_record {
struct list_head backrefs;
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/6] btrfs-progs: check/original: Detect and repair wrong inline ram_bytes

2018-06-06 Thread Qu Wenruo


On 2018年06月06日 16:08, Su Yue wrote:
> 
> 
> On 06/06/2018 03:27 PM, Qu Wenruo wrote:
>> It looks like that around 2014, btrfs kernel has a regression that would
>> cause offset-by-one ram_bytes for inline extent.
>>
>> Add the ability to repair it in original mode.
>>
>> Reported-by: Steve Leung 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  check/main.c  | 44 +--
>>  check/mode-original.h |  1 +
>>  2 files changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/check/main.c b/check/main.c
>> index 52ef181add61..1374d8856d72 100644
>> --- a/check/main.c
>> +++ b/check/main.c
>> @@ -1483,6 +1483,9 @@ static int process_file_extent(struct btrfs_root *root,
>>  } else {
>>  if (num_bytes > max_inline_size)
>>  rec->errors |= I_ERR_FILE_EXTENT_TOO_LARGE;
>> +if (btrfs_file_extent_inline_item_len(eb, item) !=
>> +num_bytes)
>> +rec->errors |= I_ERR_INLINE_RAM_BYTES_WRONG;
>>  }
>>  rec->found_size += num_bytes;
>>  num_bytes = (num_bytes + mask) & ~mask;
>> @@ -2534,6 +2537,40 @@ out:
>>  return ret;
>>  }
>>  
>> +static int repair_inline_ram_bytes(struct btrfs_trans_handle *trans,
>> +   struct btrfs_root *root,
>> +   struct btrfs_path *path,
>> +   struct inode_record *rec)
>> +{
>> +struct btrfs_key key;
>> +struct btrfs_file_extent_item *fi;
>> +struct btrfs_item *i;
>> +u64 on_disk_item_len;
>> +int ret;
>> +
>> +key.objectid = rec->ino;
>> +key.offset = 0;
>> +key.type = BTRFS_EXTENT_DATA_KEY;
>> +
>> +ret = btrfs_search_slot(trans, root, , path, 0, 1);
>> +if (ret > 0)
>> +ret = -ENOENT;
>> +if (ret < 0)
>> +goto out;
>> +
>> +i = btrfs_item_nr(path->slots[0]);
>> +on_disk_item_len = btrfs_file_extent_inline_item_len(path->nodes[0], i);
>> +fi = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +struct btrfs_file_extent_item);
>> +btrfs_set_file_extent_ram_bytes(path->nodes[0], fi, on_disk_item_len);
>> +btrfs_mark_buffer_dirty(path->nodes[0]);
>> +printf("Successfully repaired inline ram_bytes for root %llu ino 
>> %llu\n",
>> +root->objectid, rec->ino);
> 
> Miss a line "rec->errors &= ~I_ERR_INLINE_RAM_BYTES_WRONG;"?.
> Although the new test case passes, the fake appearance is due to
> existed bug about returned codes of check_ino_rec().

Oh, I forgot this one, thanks for pointing out.

Thanks,
Qu

> 
> And refer to outputs of existed code, no need for the word
> 'Successfully'.
>> +out:
>> +btrfs_release_path(path);
>> +return ret;
>> +}
>> +
>>  static int try_repair_inode(struct btrfs_root *root, struct inode_record 
>> *rec)
>>  {
>>  struct btrfs_trans_handle *trans;
>> @@ -2545,8 +2582,9 @@ static int try_repair_inode(struct btrfs_root *root, 
>> struct inode_record *rec)
>>   I_ERR_LINK_COUNT_WRONG |
>>   I_ERR_NO_INODE_ITEM |
>>   I_ERR_FILE_EXTENT_ORPHAN |
>> - I_ERR_FILE_EXTENT_DISCOUNT|
>> - I_ERR_FILE_NBYTES_WRONG)))
>> + I_ERR_FILE_EXTENT_DISCOUNT |
>> + I_ERR_FILE_NBYTES_WRONG |
>> + I_ERR_INLINE_RAM_BYTES_WRONG)))
> 
> It's better if you can add otehr infomation into print_inode_error().
> 
> Others look fine to me.
> 
> Thanks,
> Su
>>  return rec->errors;
>>  
>>  /*
>> @@ -2575,6 +2613,8 @@ static int try_repair_inode(struct btrfs_root *root, 
>> struct inode_record *rec)
>>  ret = repair_inode_nlinks(trans, root, , rec);
>>  if (!ret && rec->errors & I_ERR_FILE_NBYTES_WRONG)
>>  ret = repair_inode_nbytes(trans, root, , rec);
>> +if (!ret && rec->errors & I_ERR_INLINE_RAM_BYTES_WRONG)
>> +ret = repair_inline_ram_bytes(trans, root, , rec);
>>  btrfs_commit_transaction(trans, root);
>>  btrfs_release_path();
>>  return ret;
>> diff --git a/check/mode-original.h b/check/mode-original.h
>> index 13cfa5b9e1b3..ec2842e0b3be 100644
>> --- a/check/mode-original.h
>> +++ b/check/mode-original.h
>> @@ -187,6 +187,7 @@ struct file_extent_hole {
>>  #define I_ERR_FILE_EXTENT_ORPHAN(1 << 14)
>>  #define I_ERR_FILE_EXTENT_TOO_LARGE (1 << 15)
>>  #define I_ERR_ODD_INODE_FLAGS   (1 << 16)
>> +#define I_ERR_INLINE_RAM_BYTES_WRONG(1 << 17)
>>  
>>  struct inode_record {
>>  struct list_head backrefs;
>>
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/6] btrfs-progs: check/original: Detect and repair wrong inline ram_bytes

2018-06-06 Thread Su Yue


On 06/06/2018 03:27 PM, Qu Wenruo wrote:
> It looks like that around 2014, btrfs kernel has a regression that would
> cause offset-by-one ram_bytes for inline extent.
> 
> Add the ability to repair it in original mode.
> 
> Reported-by: Steve Leung 
> Signed-off-by: Qu Wenruo 
> ---
>  check/main.c  | 44 +--
>  check/mode-original.h |  1 +
>  2 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/check/main.c b/check/main.c
> index 52ef181add61..1374d8856d72 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -1483,6 +1483,9 @@ static int process_file_extent(struct btrfs_root *root,
>   } else {
>   if (num_bytes > max_inline_size)
>   rec->errors |= I_ERR_FILE_EXTENT_TOO_LARGE;
> + if (btrfs_file_extent_inline_item_len(eb, item) !=
> + num_bytes)
> + rec->errors |= I_ERR_INLINE_RAM_BYTES_WRONG;
>   }
>   rec->found_size += num_bytes;
>   num_bytes = (num_bytes + mask) & ~mask;
> @@ -2534,6 +2537,40 @@ out:
>   return ret;
>  }
>  
> +static int repair_inline_ram_bytes(struct btrfs_trans_handle *trans,
> +struct btrfs_root *root,
> +struct btrfs_path *path,
> +struct inode_record *rec)
> +{
> + struct btrfs_key key;
> + struct btrfs_file_extent_item *fi;
> + struct btrfs_item *i;
> + u64 on_disk_item_len;
> + int ret;
> +
> + key.objectid = rec->ino;
> + key.offset = 0;
> + key.type = BTRFS_EXTENT_DATA_KEY;
> +
> + ret = btrfs_search_slot(trans, root, , path, 0, 1);
> + if (ret > 0)
> + ret = -ENOENT;
> + if (ret < 0)
> + goto out;
> +
> + i = btrfs_item_nr(path->slots[0]);
> + on_disk_item_len = btrfs_file_extent_inline_item_len(path->nodes[0], i);
> + fi = btrfs_item_ptr(path->nodes[0], path->slots[0],
> + struct btrfs_file_extent_item);
> + btrfs_set_file_extent_ram_bytes(path->nodes[0], fi, on_disk_item_len);
> + btrfs_mark_buffer_dirty(path->nodes[0]);
> + printf("Successfully repaired inline ram_bytes for root %llu ino 
> %llu\n",
> + root->objectid, rec->ino);

Miss a line "rec->errors &= ~I_ERR_INLINE_RAM_BYTES_WRONG;"?.
Although the new test case passes, the fake appearance is due to
existed bug about returned codes of check_ino_rec().

And refer to outputs of existed code, no need for the word
'Successfully'.
> +out:
> + btrfs_release_path(path);
> + return ret;
> +}
> +
>  static int try_repair_inode(struct btrfs_root *root, struct inode_record 
> *rec)
>  {
>   struct btrfs_trans_handle *trans;
> @@ -2545,8 +2582,9 @@ static int try_repair_inode(struct btrfs_root *root, 
> struct inode_record *rec)
>I_ERR_LINK_COUNT_WRONG |
>I_ERR_NO_INODE_ITEM |
>I_ERR_FILE_EXTENT_ORPHAN |
> -  I_ERR_FILE_EXTENT_DISCOUNT|
> -  I_ERR_FILE_NBYTES_WRONG)))
> +  I_ERR_FILE_EXTENT_DISCOUNT |
> +  I_ERR_FILE_NBYTES_WRONG |
> +  I_ERR_INLINE_RAM_BYTES_WRONG)))

It's better if you can add otehr infomation into print_inode_error().

Others look fine to me.

Thanks,
Su
>   return rec->errors;
>  
>   /*
> @@ -2575,6 +2613,8 @@ static int try_repair_inode(struct btrfs_root *root, 
> struct inode_record *rec)
>   ret = repair_inode_nlinks(trans, root, , rec);
>   if (!ret && rec->errors & I_ERR_FILE_NBYTES_WRONG)
>   ret = repair_inode_nbytes(trans, root, , rec);
> + if (!ret && rec->errors & I_ERR_INLINE_RAM_BYTES_WRONG)
> + ret = repair_inline_ram_bytes(trans, root, , rec);
>   btrfs_commit_transaction(trans, root);
>   btrfs_release_path();
>   return ret;
> diff --git a/check/mode-original.h b/check/mode-original.h
> index 13cfa5b9e1b3..ec2842e0b3be 100644
> --- a/check/mode-original.h
> +++ b/check/mode-original.h
> @@ -187,6 +187,7 @@ struct file_extent_hole {
>  #define I_ERR_FILE_EXTENT_ORPHAN (1 << 14)
>  #define I_ERR_FILE_EXTENT_TOO_LARGE  (1 << 15)
>  #define I_ERR_ODD_INODE_FLAGS(1 << 16)
> +#define I_ERR_INLINE_RAM_BYTES_WRONG (1 << 17)
>  
>  struct inode_record {
>   struct list_head backrefs;
> 




pEpkey.asc
Description: application/pgp-keys


[PATCH] btrfs: Get rid of the confusing btrfs_file_extent_inline_len()

2018-06-06 Thread Qu Wenruo
We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.

However this function is hiding evil, for compressed extent, it has no
choice but to directly read out ram_bytes from btrfs_file_extent_item.
While for uncompressed extent, it uses item size to calculate the real
data size, and ignoring ram_bytes at all.

In fact, for corrupted ram_bytes, due to above behavior kernel
btrfs_print_leaf() can't even print correct ram_bytes to expose the bug.

Since we have tree-checker to verify all EXTENT_DATA, such mismatch can
be detected pretty easily, thus we can trust ram_bytes without the evil
btrfs_file_extent_inline_len().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h | 26 --
 fs/btrfs/file-item.c |  2 +-
 fs/btrfs/file.c  |  3 +--
 fs/btrfs/inode.c | 12 ++--
 fs/btrfs/print-tree.c|  4 ++--
 fs/btrfs/send.c  | 17 +++--
 fs/btrfs/tree-log.c  | 12 
 include/trace/events/btrfs.h |  2 +-
 8 files changed, 22 insertions(+), 56 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2771cc56a622..70ff1a5b28f6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2427,32 +2427,6 @@ static inline u32 btrfs_file_extent_inline_item_len(
return btrfs_item_size(eb, e) - BTRFS_FILE_EXTENT_INLINE_DATA_START;
 }
 
-/* this returns the number of file bytes represented by the inline item.
- * If an item is compressed, this is the uncompressed size
- */
-static inline u32 btrfs_file_extent_inline_len(const struct extent_buffer *eb,
-   int slot,
-   const struct btrfs_file_extent_item *fi)
-{
-   struct btrfs_map_token token;
-
-   btrfs_init_map_token();
-   /*
-* return the space used on disk if this item isn't
-* compressed or encoded
-*/
-   if (btrfs_token_file_extent_compression(eb, fi, ) == 0 &&
-   btrfs_token_file_extent_encryption(eb, fi, ) == 0 &&
-   btrfs_token_file_extent_other_encoding(eb, fi, ) == 0) {
-   return btrfs_file_extent_inline_item_len(eb,
-btrfs_item_nr(slot));
-   }
-
-   /* otherwise use the ram bytes field */
-   return btrfs_token_file_extent_ram_bytes(eb, fi, );
-}
-
-
 /* btrfs_dev_stats_item */
 static inline u64 btrfs_dev_stats_value(const struct extent_buffer *eb,
const struct btrfs_dev_stats_item *ptr,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index f9dd6d1836a3..8c3cd7072caf 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -942,7 +942,7 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode 
*inode,
btrfs_file_extent_num_bytes(leaf, fi);
} else if (type == BTRFS_FILE_EXTENT_INLINE) {
size_t size;
-   size = btrfs_file_extent_inline_len(leaf, slot, fi);
+   size = btrfs_file_extent_ram_bytes(leaf, fi);
extent_end = ALIGN(extent_start + size,
   fs_info->sectorsize);
}
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f660ba1e5e58..38557cba812c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -833,8 +833,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
btrfs_file_extent_num_bytes(leaf, fi);
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
extent_end = key.offset +
-   btrfs_file_extent_inline_len(leaf,
-path->slots[0], fi);
+   btrfs_file_extent_ram_bytes(leaf, fi);
} else {
/* can't happen */
BUG();
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..32b59cf2fb3a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1437,8 +1437,7 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
nocow = 1;
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
extent_end = found_key.offset +
-   btrfs_file_extent_inline_len(leaf,
-path->slots[0], fi);
+   btrfs_file_extent_ram_bytes(leaf, fi);
extent_end = ALIGN(extent_end,
   fs_info->sectorsize);
} else {
@@ -4543,8 +4542,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
BTRFS_I(inode), leaf, fi,
found_key.offset);
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
- 

[PATCH 4/6] btrfs-progs: check/lowmem: Prepare check_file_extent() to handle repair

2018-06-06 Thread Qu Wenruo
Current check_file_extent() doesn't support later repair work, since it
doesn't accept btrfs_path structure as parameter, thus it can't modify
btrfs trees, or later check will still use the old and wrong path.

Use btrfs_path to replace btrfs_key, extent_buffer and slot parameters,
so we can modify @path directly for repair, and reduce the number of
parameters for check_file_extent().

Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 45 +++--
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index b3198fffa250..3280591396bd 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -1740,18 +1740,18 @@ static int punch_extent_hole(struct btrfs_root *root, 
u64 ino, u64 start,
  * check and update the last offset of the file extent.
  *
  * @root:  the root of fs/file tree.
- * @fkey:  the key of the file extent.
  * @nodatasum: INODE_NODATASUM feature.
  * @size:  the sum of all EXTENT_DATA items size for this inode.
  * @end:   the offset of the last extent.
  *
  * Return 0 if no error occurred.
  */
-static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey,
-struct extent_buffer *node, int slot,
+static int check_file_extent(struct btrfs_root *root, struct btrfs_path *path,
 unsigned int nodatasum, u64 *size, u64 *end)
 {
struct btrfs_file_extent_item *fi;
+   struct btrfs_key fkey;
+   struct extent_buffer *node = path->nodes[0];
u64 disk_bytenr;
u64 disk_num_bytes;
u64 extent_num_bytes;
@@ -1763,10 +1763,12 @@ static int check_file_extent(struct btrfs_root *root, 
struct btrfs_key *fkey,
BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info));
unsigned int extent_type;
unsigned int is_hole;
+   int slot = path->slots[0];
int compressed = 0;
int ret;
int err = 0;
 
+   btrfs_item_key_to_cpu(node, , slot);
fi = btrfs_item_ptr(node, slot, struct btrfs_file_extent_item);
 
/* Check inline extent */
@@ -1781,23 +1783,23 @@ static int check_file_extent(struct btrfs_root *root, 
struct btrfs_key *fkey,
if (extent_num_bytes == 0) {
error(
"root %llu EXTENT_DATA[%llu %llu] has empty inline extent",
-   root->objectid, fkey->objectid, fkey->offset);
+   root->objectid, fkey.objectid, fkey.offset);
err |= FILE_EXTENT_ERROR;
}
if (compressed) {
if (extent_num_bytes > root->fs_info->sectorsize) {
error(
 "root %llu EXTENT_DATA[%llu %llu] too large inline extent ram size, have %llu, 
max: %u",
-   root->objectid, fkey->objectid,
-   fkey->offset, extent_num_bytes,
+   root->objectid, fkey.objectid,
+   fkey.offset, extent_num_bytes,
root->fs_info->sectorsize - 1);
err |= FILE_EXTENT_ERROR;
}
if (item_inline_len > max_inline_extent_size) {
error(
 "root %llu EXTENT_DATA[%llu %llu] too large inline extent on-disk size, have 
%u, max: %u",
-   root->objectid, fkey->objectid,
-   fkey->offset, item_inline_len,
+   root->objectid, fkey.objectid,
+   fkey.offset, item_inline_len,
max_inline_extent_size);
err |= FILE_EXTENT_ERROR;
}
@@ -1805,7 +1807,7 @@ static int check_file_extent(struct btrfs_root *root, 
struct btrfs_key *fkey,
if (extent_num_bytes > max_inline_extent_size) {
error(
  "root %llu EXTENT_DATA[%llu %llu] too large inline extent size, have %llu, 
max: %u",
-   root->objectid, fkey->objectid, fkey->offset,
+   root->objectid, fkey.objectid, fkey.offset,
extent_num_bytes, max_inline_extent_size);
err |= FILE_EXTENT_ERROR;
}
@@ -1813,7 +1815,7 @@ static int check_file_extent(struct btrfs_root *root, 
struct btrfs_key *fkey,
if (!compressed && extent_num_bytes != item_inline_len) {
error(
"root %llu EXTENT_DATA[%llu %llu] wrong inline size, have: 
%llu, expected: %u",
-   root->objectid, fkey->objectid, fkey->offset,
+   root->objectid, fkey.objectid, 

[PATCH 2/6] btrfs-progs: Get rid of the confusing btrfs_file_extent_inline_len()

2018-06-06 Thread Qu Wenruo
[BUG]
If one uncompressed inline extent has incorrect ram_bytes, neither btrfs
check nor dump-tree could detect such corruption.

[CAUSE]
Every caller tries to read inline extent ram_bytes is using
btrfs_file_extent_inline_len(), other than directly calling
btrfs_file_extent_ram_bytes().

For compressed extent, it's just calling btrfs_file_extent_ram_bytes().
However for uncompressed extent, it falls back to
btrfs_file_extent_inline_item_len(), makes us unable to detect anything
wrong in ram_bytes.

[FIX]
Just get rid of such confusing btrfs_file_extent_inline_len() function.

Reported-by: Steve Leung 
Signed-off-by: Qu Wenruo 
---
 check/main.c|  2 +-
 check/mode-lowmem.c |  2 +-
 cmds-restore.c  |  2 +-
 ctree.h | 22 --
 file.c  |  3 +--
 print-tree.c|  4 ++--
 6 files changed, 6 insertions(+), 29 deletions(-)

diff --git a/check/main.c b/check/main.c
index 235db373a86c..52ef181add61 100644
--- a/check/main.c
+++ b/check/main.c
@@ -1472,7 +1472,7 @@ static int process_file_extent(struct btrfs_root *root,
u8 compression = btrfs_file_extent_compression(eb, fi);
struct btrfs_item *item = btrfs_item_nr(slot);
 
-   num_bytes = btrfs_file_extent_inline_len(eb, slot, fi);
+   num_bytes = btrfs_file_extent_ram_bytes(eb, fi);
if (num_bytes == 0)
rec->errors |= I_ERR_BAD_FILE_EXTENT;
if (compression) {
diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index c42dfcc50352..b3198fffa250 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -1776,7 +1776,7 @@ static int check_file_extent(struct btrfs_root *root, 
struct btrfs_key *fkey,
u32 item_inline_len;
 
item_inline_len = btrfs_file_extent_inline_item_len(node, e);
-   extent_num_bytes = btrfs_file_extent_inline_len(node, slot, fi);
+   extent_num_bytes = btrfs_file_extent_ram_bytes(node, fi);
compressed = btrfs_file_extent_compression(node, fi);
if (extent_num_bytes == 0) {
error(
diff --git a/cmds-restore.c b/cmds-restore.c
index 29a329a397c2..9542221fd265 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -302,7 +302,7 @@ static int copy_one_inline(struct btrfs_root *root, int fd,
fi = btrfs_item_ptr(leaf, path->slots[0],
struct btrfs_file_extent_item);
ptr = btrfs_file_extent_inline_start(fi);
-   len = btrfs_file_extent_inline_len(leaf, path->slots[0], fi);
+   len = btrfs_file_extent_ram_bytes(leaf, fi);
inline_item_len = btrfs_file_extent_inline_item_len(leaf, 
btrfs_item_nr(path->slots[0]));
read_extent_buffer(leaf, buf, ptr, inline_item_len);
 
diff --git a/ctree.h b/ctree.h
index de4b1b7e6416..04a77550c715 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2443,28 +2443,6 @@ static inline u32 btrfs_search_header_len(struct 
btrfs_ioctl_search_header *sh)
return get_unaligned_32(>len);
 }
 
-/* this returns the number of file bytes represented by the inline item.
- * If an item is compressed, this is the uncompressed size
- */
-static inline u32 btrfs_file_extent_inline_len(struct extent_buffer *eb,
-  int slot,
-  struct btrfs_file_extent_item 
*fi)
-{
-   /*
-* return the space used on disk if this item isn't
-* compressed or encoded
-*/
-   if (btrfs_file_extent_compression(eb, fi) == 0 &&
-   btrfs_file_extent_encryption(eb, fi) == 0 &&
-   btrfs_file_extent_other_encoding(eb, fi) == 0) {
-   return btrfs_file_extent_inline_item_len(eb,
-btrfs_item_nr(slot));
-   }
-
-   /* otherwise use the ram bytes field */
-   return btrfs_file_extent_ram_bytes(eb, fi);
-}
-
 #define btrfs_fs_incompat(fs_info, opt) \
__btrfs_fs_incompat((fs_info), BTRFS_FEATURE_INCOMPAT_##opt)
 
diff --git a/file.c b/file.c
index f5e645c47bda..056be1045814 100644
--- a/file.c
+++ b/file.c
@@ -255,8 +255,7 @@ int btrfs_read_file(struct btrfs_root *root, u64 ino, u64 
start, int len,
/* Inline extent, one inode should only one inline extent */
if (btrfs_file_extent_type(leaf, fi) ==
BTRFS_FILE_EXTENT_INLINE) {
-   extent_len = btrfs_file_extent_inline_len(leaf, slot,
- fi);
+   extent_len = btrfs_file_extent_ram_bytes(leaf, fi);
if (extent_start + extent_len <= start)
goto next;
read_extent_buffer(leaf, dest,
diff --git a/print-tree.c b/print-tree.c
index dfa7bb6b676f..a09ecfbb28f0 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -357,9 +357,9 @@ static void 

[PATCH 5/6] btrfs-progs: check/lowmem: Repair wrong inlien ram_bytes for uncompressed extent

2018-06-06 Thread Qu Wenruo
Similar to the original mode repair.

Reported-by: Steve Leung 
Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 73 -
 1 file changed, 72 insertions(+), 1 deletion(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 3280591396bd..cb778ce6f789 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -1735,6 +1735,70 @@ static int punch_extent_hole(struct btrfs_root *root, 
u64 ino, u64 start,
return ret;
 }
 
+static int repair_inline_ram_bytes(struct btrfs_root *root,
+  struct btrfs_path *path, u64 *ram_bytes_ret)
+{
+   struct btrfs_trans_handle *trans;
+   struct btrfs_key key;
+   struct btrfs_file_extent_item *fi;
+   struct btrfs_item *i;
+   u32 on_disk_data_len;
+   int ret;
+   int recover_ret;
+
+   trans = btrfs_start_transaction(root, 1);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   return ret;
+   }
+   btrfs_item_key_to_cpu(path->nodes[0], , path->slots[0]);
+   btrfs_release_path(path);
+   ret = btrfs_search_slot(trans, root, , path, 0, 1);
+   /* Not really possible */
+   if (ret > 0) {
+   ret = -ENOENT;
+   btrfs_release_path(path);
+   goto recover;
+   }
+   if (ret < 0) {
+   goto recover;
+   }
+   i = btrfs_item_nr(path->slots[0]);
+   on_disk_data_len = btrfs_file_extent_inline_item_len(path->nodes[0], i);
+
+   fi = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_file_extent_item);
+   if (btrfs_file_extent_type(path->nodes[0], fi) !=
+   BTRFS_FILE_EXTENT_INLINE ||
+   btrfs_file_extent_compression(path->nodes[0], fi) !=
+   BTRFS_COMPRESS_NONE)
+   return -EINVAL;
+   btrfs_set_file_extent_ram_bytes(path->nodes[0], fi, on_disk_data_len);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+
+   ret = btrfs_commit_transaction(trans, root);
+   if (!ret) {
+   printf(
+   "Successfully repaired inline ram_bytes for root %llu ino %llu\n",
+   root->objectid, key.objectid);
+   *ram_bytes_ret = on_disk_data_len;
+   }
+   return ret;
+
+recover:
+   /*
+* CoW search failed, mostly due to the extra CoW work
+* (extent allocation, etc).
+* Since we have a good path before, readonly search should still
+* work, or later checks will fail due to empty path
+*/
+   recover_ret = btrfs_search_slot(NULL, root, , path, 0, 0);
+
+   /* This really shouldn't happen, or we have a big trouble */
+   ASSERT(recover_ret == 0);
+   return ret;
+}
+
 /*
  * Check file extent datasum/hole, update the size of the file extents,
  * check and update the last offset of the file extent.
@@ -1817,7 +1881,14 @@ static int check_file_extent(struct btrfs_root *root, 
struct btrfs_path *path,
"root %llu EXTENT_DATA[%llu %llu] wrong inline size, have: 
%llu, expected: %u",
root->objectid, fkey.objectid, fkey.offset,
extent_num_bytes, item_inline_len);
-   err |= FILE_EXTENT_ERROR;
+   if (repair) {
+   ret = repair_inline_ram_bytes(root, path,
+ 
_num_bytes);
+   if (ret)
+   err |= FILE_EXTENT_ERROR;
+   } else {
+   err |= FILE_EXTENT_ERROR;
+   }
}
*end += extent_num_bytes;
*size += extent_num_bytes;
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] btrfs-progs: fsck-tests: Add test case for corrupted inline ram_bytes

2018-06-06 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 .../035-inline-bad-ram-bytes/offset_by_one.img   | Bin 0 -> 3072 bytes
 .../fsck-tests/035-inline-bad-ram-bytes/test.sh  |  11 +++
 2 files changed, 11 insertions(+)
 create mode 100644 tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img
 create mode 100755 tests/fsck-tests/035-inline-bad-ram-bytes/test.sh

diff --git a/tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img 
b/tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img
new file mode 100644
index 
..2f58208eed71d5e0ee053d825d2780fe6842a3db
GIT binary patch
literal 3072
zcmeH}`8ONr8pka~h^564TeVt2sb#d>Vg%KyR)sRL#x}O9A*eNoYK>9*R)ebA(+OgY
zkw_DJ2es9xEfJJfl+oALh*MKXA|Od*1VXpZ9ytbKd8C-sQZhzWJex%C1DN
z|1`d%k^8tG1+-(+=4j5v1vu8xW3A`;?2H}9PmcA?vA+Fme{Ek0d?oO|LLh49O}KHv
zP=bIeVqZ69EaGYNk~`IzZ#ddd1_bI{Jd?xs^HPaX;&|dxT)T&?tfq;i_6hAp?Hpb+
zod=eB=rgX3Tqg}pbh0Rb%HI&6yq^Pk&;7(FQj6gA1ZjCt5Zm)(yKY^~p&4vjzoV*0
zOq@4w#HE-XZhvuP;*D968)`CTW`ap00P7elRZGco#qT=LzSHu^uSw>s@Q_SPwAB{6
z-l=NBC!ndLr0*mn|GZVh^^oV+Sl*(Tp9cN+Q$37<8Pz)H+BfbeOm$}rmWcRo@h%S^
zXOVfhh@#d-Q8Vkajn-Wz^8KGUmc;;x6!oh*K?dT9spQ3k{aJG)`Qd{Gx9
z|5PwdxIP06KPjyz&40SHFUFw=_u}%$d1xpR#Xv=pRDN4a9#8g0h!SryT6SADH)^co
zw=~CXNoxZwf6$f*xHzISL*=?JV}uCQ;$OjTd^CvO8>Lm;>)+WI2hlUpy3*-vK9tWzn
zWjqnUW1*2a&<7ztt3J=%>`dQ4eVOQzGHYs?u95nZi{`7%ICs%4d02
zhS|?t=;^g`i`S4r^#veoQ-Z<5m;H4a2^BaFq{V!ITXvPEQxjrgPgH+Z2Ne|
zt7?Z#tWTSQ%mr6esLz=WeVa7!4Q_shk|j{2_mSJtSw)ip~~!p+u`4k4viSy2)G
z%Ilq7#O>BT44T8b_m)#rH3|Yk+}U>9
zEGf@)b#XjYRjo0s`<$O`D-d%%)$sF;yu!~HP_DYoSK0EWl{od^AXVg{ByQTvkux(G_=Uc0DPtp{`Pb<`I
zceSY%*O1^dwYj6#^H9E^4-AmfE59bo>viLs*%SW=Qjs@Pf;G-*^dy;sFt=ew`PN+KH)g
z@}WwcgY1{FMS6k+`bi-ZyS9fUx1eSUsT8uz!R!?KvDw?dkeEQs!!X^?y8;1nhaA17
zjI%mO9zQ!_eqNxZQ)TS{?UZBv(1362fKr99)Wc*Gc;RhdF)}Tb#U$)sf!}QEa374r
zkFZ#4_yZT+!;HPFPE+w@k)GjOQNF`)9GI%ba0z#047WaAP-D)FjuY6lS()*|G
zVbru!sW?U{r+r?@<&;ZOk;}W%)F%TkJUlV68Uj%Bj)oXJG**^={3HDhttp://vger.kernel.org/majordomo-info.html


[PATCH 1/6] btrfs-progs: restore: Fix wrong compressed item size for decompress()

2018-06-06 Thread Qu Wenruo
When using decompress() in copy_one_inline(), we're passing the
decompressed extent size (ram_bytes) into decompress().
However we only has @inline_item_len read out, the pending data will be
uninitialized data.

Thankfully, all compression methods supported have some extra data in
its header, thus it won't cause real problem.

Whatever fixing it is never a bad idea.

Signed-off-by: Qu Wenruo 
---
 cmds-restore.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 342f5cc7eabb..29a329a397c2 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -324,7 +324,8 @@ static int copy_one_inline(struct btrfs_root *root, int fd,
return -ENOMEM;
}
 
-   ret = decompress(root, buf, outbuf, len, _size, compress);
+   ret = decompress(root, buf, outbuf, inline_item_len, _size,
+compress);
if (ret) {
free(outbuf);
return ret;
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/6] btrfs-progs: check/original: Detect and repair wrong inline ram_bytes

2018-06-06 Thread Qu Wenruo
It looks like that around 2014, btrfs kernel has a regression that would
cause offset-by-one ram_bytes for inline extent.

Add the ability to repair it in original mode.

Reported-by: Steve Leung 
Signed-off-by: Qu Wenruo 
---
 check/main.c  | 44 +--
 check/mode-original.h |  1 +
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/check/main.c b/check/main.c
index 52ef181add61..1374d8856d72 100644
--- a/check/main.c
+++ b/check/main.c
@@ -1483,6 +1483,9 @@ static int process_file_extent(struct btrfs_root *root,
} else {
if (num_bytes > max_inline_size)
rec->errors |= I_ERR_FILE_EXTENT_TOO_LARGE;
+   if (btrfs_file_extent_inline_item_len(eb, item) !=
+   num_bytes)
+   rec->errors |= I_ERR_INLINE_RAM_BYTES_WRONG;
}
rec->found_size += num_bytes;
num_bytes = (num_bytes + mask) & ~mask;
@@ -2534,6 +2537,40 @@ out:
return ret;
 }
 
+static int repair_inline_ram_bytes(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct btrfs_path *path,
+  struct inode_record *rec)
+{
+   struct btrfs_key key;
+   struct btrfs_file_extent_item *fi;
+   struct btrfs_item *i;
+   u64 on_disk_item_len;
+   int ret;
+
+   key.objectid = rec->ino;
+   key.offset = 0;
+   key.type = BTRFS_EXTENT_DATA_KEY;
+
+   ret = btrfs_search_slot(trans, root, , path, 0, 1);
+   if (ret > 0)
+   ret = -ENOENT;
+   if (ret < 0)
+   goto out;
+
+   i = btrfs_item_nr(path->slots[0]);
+   on_disk_item_len = btrfs_file_extent_inline_item_len(path->nodes[0], i);
+   fi = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_file_extent_item);
+   btrfs_set_file_extent_ram_bytes(path->nodes[0], fi, on_disk_item_len);
+   btrfs_mark_buffer_dirty(path->nodes[0]);
+   printf("Successfully repaired inline ram_bytes for root %llu ino 
%llu\n",
+   root->objectid, rec->ino);
+out:
+   btrfs_release_path(path);
+   return ret;
+}
+
 static int try_repair_inode(struct btrfs_root *root, struct inode_record *rec)
 {
struct btrfs_trans_handle *trans;
@@ -2545,8 +2582,9 @@ static int try_repair_inode(struct btrfs_root *root, 
struct inode_record *rec)
 I_ERR_LINK_COUNT_WRONG |
 I_ERR_NO_INODE_ITEM |
 I_ERR_FILE_EXTENT_ORPHAN |
-I_ERR_FILE_EXTENT_DISCOUNT|
-I_ERR_FILE_NBYTES_WRONG)))
+I_ERR_FILE_EXTENT_DISCOUNT |
+I_ERR_FILE_NBYTES_WRONG |
+I_ERR_INLINE_RAM_BYTES_WRONG)))
return rec->errors;
 
/*
@@ -2575,6 +2613,8 @@ static int try_repair_inode(struct btrfs_root *root, 
struct inode_record *rec)
ret = repair_inode_nlinks(trans, root, , rec);
if (!ret && rec->errors & I_ERR_FILE_NBYTES_WRONG)
ret = repair_inode_nbytes(trans, root, , rec);
+   if (!ret && rec->errors & I_ERR_INLINE_RAM_BYTES_WRONG)
+   ret = repair_inline_ram_bytes(trans, root, , rec);
btrfs_commit_transaction(trans, root);
btrfs_release_path();
return ret;
diff --git a/check/mode-original.h b/check/mode-original.h
index 13cfa5b9e1b3..ec2842e0b3be 100644
--- a/check/mode-original.h
+++ b/check/mode-original.h
@@ -187,6 +187,7 @@ struct file_extent_hole {
 #define I_ERR_FILE_EXTENT_ORPHAN   (1 << 14)
 #define I_ERR_FILE_EXTENT_TOO_LARGE(1 << 15)
 #define I_ERR_ODD_INODE_FLAGS  (1 << 16)
+#define I_ERR_INLINE_RAM_BYTES_WRONG   (1 << 17)
 
 struct inode_record {
struct list_head backrefs;
-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/6] btrfs-progs: Fixes inline ram_bytes related bugs

2018-06-06 Thread Qu Wenruo
The patchset can be fetched from github (*):
https://github.com/adam900710/btrfs-progs/tree/inline_ram_bytes

It's based on David's devel branch, whose HEAD is:
commit 0d1c5812e28e286648781c7b35b542311cc01aa4 (david/devel)
Author: Matthias Benkard 
Date:   Wed Apr 25 16:34:54 2018 +0200

btrfs-progs: mkfs: traverse_directory: Reset error code on continue

Reported-by Steve Leung , his old btrfs (at least
offending inodes are from 2014) has inline uncompressed extent, while
its ram_bytes mismatch with item size.

Latest kernel tree check catches this bug, while we failed to detect by
dump-tree.

It turns out that btrfs-progs is doing something evil to avoid reading
ram_bytes from inline uncompressed extent.


So this patchset will address all such ram_bytes related problems.

The 1st patch is a not-so-relative fix for restore, which is using
ram_bytes for decompress. Although thanks to the compression header, we
won't read out-of-boundary, but fixing it is never a bad thing.

The 2nd patch will get rid of the evil btrfs_file_extent_inline_len()
which hides raw ram_bytes from us, and fooling us for a long long time.

The 3rd~5th patches introduce check/repair function for both original
and lowmem mode (although lowmem mode can detect it even before this patch).

The last one is the test case for it as usual.

*: Or should I just migrate to gitlab after M$ acquired github?

Qu Wenruo (6):
  btrfs-progs: restore: Fix wrong compressed item size for decompress()
  btrfs-progs: Get rid of the confusing btrfs_file_extent_inline_len()
  btrfs-progs: check/original: Detect and repair wrong inline ram_bytes
  btrfs-progs: check/lowmem: Prepare check_file_extent() to handle
repair
  btrfs-progs: check/lowmem: Repair wrong inlien ram_bytes for
uncompressed extent
  btrfs-progs: fsck-tests: Add test case for corrupted inline ram_bytes

 check/main.c  |  46 ++-
 check/mode-lowmem.c   | 120 ++
 check/mode-original.h |   1 +
 cmds-restore.c|   5 +-
 ctree.h   |  22 
 file.c|   3 +-
 print-tree.c  |   4 +-
 .../offset_by_one.img | Bin 0 -> 3072 bytes
 .../035-inline-bad-ram-bytes/test.sh  |  11 ++
 9 files changed, 157 insertions(+), 55 deletions(-)
 create mode 100644 tests/fsck-tests/035-inline-bad-ram-bytes/offset_by_one.img
 create mode 100755 tests/fsck-tests/035-inline-bad-ram-bytes/test.sh

-- 
2.17.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: btrfs_close_devices(): only fsync() if device->writeable

2018-06-06 Thread james harvey
Prevent unnecessary error from failing fsync(), if opened read only.

Performed 'grep "writeable = " *.h *.c' to make sure there were no odd
situations where fsync() might still be desired here.  They're all straight-
forward.  The only situation where writeable will be 0 is if btrfs_open_devices
is given flags without O_RDWR.  There is no situation where a writeable volume
temporarily becomes unwriteable, or anything like that.  Given that it's being
opened O_RDWR, there's no reason to attempt fsync().

utils.c

   int btrfs_add_to_fsid() {
  ...
  device->writeable = 1;

volumes.c

   int btrfs_close_devices() {
  ...
  while (!list_empty(_devices->devices)) {
 ...
 // just after the fsync() being patched
267: device->writeable = 0;
   ...
   int btrfs_open_devices() {
  ...
  list_for_each_entry(device, _devices->devices, dev_list) {
 ...
 if (flags & O_RDWR)
332:device->writeable = 1

kernel btrfs_close_devices() does not have a corresponding fsync() that I see.

Signed-off-by: James Harvey 
---
 volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/volumes.c b/volumes.c
index c6e34321..36d1bde6 100644
--- a/volumes.c
+++ b/volumes.c
@@ -254,7 +254,7 @@ again:
device = list_entry(fs_devices->devices.next,
struct btrfs_device, dev_list);
if (device->fd != -1) {
-   if (fsync(device->fd) == -1) {
+   if (device->writeable && fsync(device->fd) == -1) {
warning("fsync on device %llu failed: %m",
device->devid);
ret = -errno;
--
2.17.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html