Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-10-29 Thread Nikolay Borisov



On 29.10.18 г. 7:53 ч., Qu Wenruo wrote:
> [snip]
>>> The cause sounds valid, however would you please explain more about how
>>> such cleaning on unrelated delalloc range happens?
>>
>> So in my case the following happened - 2 block groups were created as
>> delalloc ranges - 0-1m and 1m-128m. Their respective pages were
>> dirtied, so when page 0 - 0-4k - went into writepage_delalloc,
>> find_lock_delalloc_range would return the range 0-1m. So the call to
>> fill_delalloc instantiates OE 0-1m and writeback continues as expected.
>>
>> Now, when the 2nd page from range 0-1m is again set for writeback and
>> find_lock_delalloc_range is called with delalloc_start == 4096 it will
>> actually return the range 1m-128m.
> 
> IMHO this looks strange and may need extra investigation.
> 
> Normally I would expect it returns the range 0~1M or 4K~1M.

It cannot return 4k-1m since the writeback for the first page has
already dealt with this range. Also take a look at how
find_lock_delalloc_range is called in writepage_delalloc: for 'start'
it's passed the page offset, calculated in __extent_writepage. And when
find_delalloc_range is called it just searches for an extent which ends
after the passed start value. In find_delalloc_range, tree_search is
called first and returns the 1m-128m range; then we enter the while(1)
loop. On the first iteration 'found' is 0, so *end is populated with
128m, 'found' is set to 1 and *start is set to 1m.

On the second iteration the check

  if (found && (state->start != cur_start ||
                (state->state & EXTENT_BOUNDARY)))

is triggered, since the next extent found will have EXTENT_BOUNDARY set,
as it will be the next block group from relocation. EXTENT_BOUNDARY is
set from relocate_file_extent_cluster's main loop:

  if (nr < cluster->nr &&
      page_start + offset == cluster->boundary[nr]) {
          set_extent_bits(&BTRFS_I(inode)->io_tree,
                          page_start, page_end,
                          EXTENT_BOUNDARY);
          nr++;
  }
> 
> But that doesn't affect the fix patch anyway.
> 
> Thanks,
> Qu
> 
>> Then fill_delalloc is called with
>> locked_page belonging to the range which was already instantiated and
>> the start/end arguments pointing to 1m-128m if an error occurred in
>> run_delalloc_range in this case then  btrfs_cleanup_ordered_extents will
>> be called which will clear Private2 bit for pages belonging to 1m-128m
>> range and the OE will be cleared of all but the first page (since the
>> code wrongly assumed locked_page always belongs to the range currently
>> being instantiated).
>>
> 
> 


Re: a new kind of "No space left on device" error

2018-10-29 Thread Henk Slager
On Mon, Oct 29, 2018 at 7:20 AM Dave  wrote:
>
> This is one I have not seen before.
>
> When running a simple, well-tested and well-used script that makes
> backups using btrfs send | receive, I got these two errors:
>
> At subvol snapshot
> ERROR: rename o131621-1091-0 ->
> usr/lib/node_modules/node-gyp/gyp/pylib/gyp/MSVSVersion.py failed: No
> space left on device
>
> At subvol snapshot
> ERROR: rename o259-1095-0 -> myser/.bash_profile failed: No space left on 
> device
>
> I have run this script many, many times and never seen errors like
> this. There is plenty of room on the device:
>
> # btrfs fi df /mnt/
> Data, single: total=18.01GiB, used=16.53GiB
> System, DUP: total=8.00MiB, used=16.00KiB
> Metadata, DUP: total=1.00GiB, used=145.12MiB
> GlobalReserve, single: total=24.53MiB, used=0.00B
>
> # df -h /mnt/
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdc2    54G   17G   36G  33% /mnt
>
> The send | receive appears to have mostly succeeded because the final
> expected size is about 17G, as shown above. That will use only about
> 1/3 of the available disk space, when completed. I don't see any
> reason for "No space left on device" errors, but maybe somebody here
> can spot a problem I am missing.
What kernel and progs versions?
What are the mount options for the filesystem?
Can you tell something about the device /dev/sdc2 (SSD, HDD, SD-card,
USBstick, LANstorage, etc)?
Could it be that your ENOSPACE has the same cause as this:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg81554.html


[PATCH] Btrfs: fix missing data checksums after a ranged fsync (msync)

2018-10-29 Thread fdmanana
From: Filipe Manana 

Recently we got a massive simplification for fsync, where for the fast
path we no longer log new extents while their respective ordered extents
are still running.

However that simplification introduced a subtle regression for the case
where we use a ranged fsync (msync). Consider the following example:

           CPU 0                                CPU 1

  mmap write to range [2Mb, 4Mb[
                                        mmap write to range [512Kb, 1Mb[
                                        msync range [512K, 1Mb[
                                          --> triggers fast fsync
                                              (BTRFS_INODE_NEEDS_FULL_SYNC
                                               not set)
                                          --> creates extent map A for this
                                              range and adds it to list of
                                              modified extents
                                          --> starts ordered extent A for
                                              this range
                                          --> waits for it to complete

  writeback triggered for range
  [2Mb, 4Mb[
    --> create extent map B and
        adds it to the list of
        modified extents
    --> creates ordered extent B

                                          --> start looking for and logging
                                              modified extents
                                          --> logs extent maps A and B
                                          --> finds checksums for extent A
                                              in the csum tree, but not for
                                              extent B
                                        fsync (msync) finishes

  --> ordered extent B
      finishes and its
      checksums are added
      to the csum tree



After replaying the log, we have the extent covering the range [2Mb, 4Mb[
but do not have the data checksum items covering that file range.

This happens because at the very beginning of an fsync (btrfs_sync_file())
we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
wait for any ordered extents in that range to complete before we start
logging the extents. However if right before we start logging the extent
in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
or msync (btrfs_sync_file() starts writeback before acquiring the inode's
lock), an ordered extent is created for that other range and a new extent
map is created to represent that range and added to the inode's list of
modified extents.

That means that we will see that other extent in that list when collecting
extents for logging (done at btrfs_log_changed_extents()) and log the
extent before the respective ordered extent finishes - namely before the
checksum items are added to the checksums tree, which is where
log_extent_csums() looks for the checksums, therefore making us log an
extent without logging its checksums. Before that massive simplification
of fsync, this wasn't a problem because besides looking for checksums in
the checksums tree, we also looked for them in any ordered extent still
running.

The consequence of data checksums missing for a file range is that users
attempting to read the affected file range will get -EIO errors and dmesg
reports the following:

 [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
 [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 
57344 csum 0x98f94189 expected csum 0x mirror 1

So fix this by skipping any extents outside of our logging range at
btrfs_log_changed_extents() and leaving them on the list of modified
extents so that any subsequent ranged fsync may collect them if needed.
Also, if we find a hole extent outside of the range still log it, just
to prevent having gaps between extent items after replaying the log,
otherwise fsck will complain when we are not using the NO_HOLES feature
(fstest btrfs/056 triggers such case).

Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the 
log_one_extent path")
CC: sta...@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana 
---
 fs/btrfs/tree-log.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c86c5dd100b2..d49edd25f2e5 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4394,6 +4394,23 @@ static int btrfs_log_changed_extents(struct 
btrfs_trans_handle *trans,
test_gen = root->fs_info->last_trans_committed;
 
list_for_each_entry_safe(em, n, &tree->modified_extents, list) {
+   /*
+* Skip extents outside our logging range. It's important to do
+* it for correctness because if we don't ignore them, we may
+* log them before their ordered extent completes, and therefore
+* we could log them without logging their respective checksums
+* (the checksum items are added to the csum tree at t

[PATCH] fstests: fix fssum to actually ignore file holes when supposed to

2018-10-29 Thread fdmanana
From: Filipe Manana 

Unless the '-s' option is passed to fssum, it should not detect file holes
and have their existence influence the computed checksum for a file. This
tool was added to test btrfs' send/receive feature, so that it checks for
any metadata and data differences between the original filesystem and the
filesystem that receives send streams.

For a long time the test btrfs/007, which tests btrfs' send/receive with
fsstress, fails sporadically reporting data differences between files.
However the md5sum/sha1sum from the reported files in the original and
new filesystems are the same. The reason why fssum fails is because even
in normal mode it still accounts for number of holes that exist in the
file and their respective lengths. This is done using the SEEK_DATA mode
of lseek. The btrfs send feature does not preserve holes nor prealloc
extents (not supported by the current protocol), so whenever a hole or
prealloc (unwritten) extent is detected in the source filesystem, it
issues a write command full of zeroes, which will translate to a regular
(written) extent in the destination filesystem. This is why fssum reports
a different checksum. A prealloc extent also counts as a hole when using
lseek.

For example when passing a seed of 1540592967 to fsstress in btrfs/007,
the test fails, as file p0/d0/f7 has a prealloc extent in the original
filesystem (in the incr snapshot).

Fix this by making fssum just read the whole file and feed its data to the
digest calculation function when option '-s' is not given. If we ever get
btrfs' send/receive to support holes and fallocate, we can just change
the test and pass the '-s' option to all fssum calls.

Signed-off-by: Filipe Manana 
---
 src/fssum.c | 65 +
 1 file changed, 5 insertions(+), 60 deletions(-)

diff --git a/src/fssum.c b/src/fssum.c
index 5da39abf..f1da72fb 100644
--- a/src/fssum.c
+++ b/src/fssum.c
@@ -224,71 +224,16 @@ int
 sum_file_data_permissive(int fd, sum_t *dst)
 {
int ret;
-   off_t pos;
-   off_t old;
-   int i;
-   uint64_t zeros = 0;
-
-   pos = lseek(fd, 0, SEEK_CUR);
-   if (pos == (off_t)-1)
-   return errno == ENXIO ? 0 : -2;
 
while (1) {
-   old = pos;
-   pos = lseek(fd, pos, SEEK_DATA);
-   if (pos == (off_t)-1) {
-   if (errno == ENXIO) {
-   ret = 0;
-   pos = lseek(fd, 0, SEEK_END);
-   if (pos != (off_t)-1)
-   zeros += pos - old;
-   } else {
-   ret = -2;
-   }
-   break;
-   }
ret = read(fd, buf, sizeof(buf));
-   assert(ret); /* eof found by lseek */
-   if (ret <= 0)
+   if (ret < 0)
+   return -errno;
+   sum_add(dst, buf, ret);
+   if (ret < sizeof(buf))
break;
-   if (old < pos) /* hole */
-   zeros += pos - old;
-   for (i = 0; i < ret; ++i) {
-   for (old = i; buf[i] == 0 && i < ret; ++i)
-   ;
-   if (old < i) /* code like a hole */
-   zeros += i - old;
-   if (i == ret)
-   break;
-   if (zeros) {
-   if (verbose >= 2)
-   fprintf(stderr,
-   "adding %llu zeros to sum\n",
-   (unsigned long long)zeros);
-   sum_add_u64(dst, 0);
-   sum_add_u64(dst, zeros);
-   zeros = 0;
-   }
-   for (old = i; buf[i] != 0 && i < ret; ++i)
-   ;
-   if (verbose >= 2)
-   fprintf(stderr, "adding %u non-zeros to sum\n",
-   i - (int)old);
-   sum_add(dst, buf + old, i - old);
-   }
-   pos += ret;
}
-
-   if (zeros) {
-   if (verbose >= 2)
-   fprintf(stderr,
-   "adding %llu zeros to sum (finishing)\n",
-   (unsigned long long)zeros);
-   sum_add_u64(dst, 0);
-   sum_add_u64(dst, zeros);
-   }
-
-   return ret;
+   return 0;
 }
 
 int
-- 
2.11.0



Re: [PATCH] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-29 Thread Filipe Manana
On Mon, Oct 29, 2018 at 6:31 AM Qu Wenruo  wrote:
>
> For latest kernel, there is a chance that btrfs/057 reports false
> errors.

By latest kernel you mean 4.20?

>
> The false error would look like:
>   btrfs/057 4s ... - output mismatch (see 
> /home/adam/xfstests-dev/results//btrfs/057.out.bad)
>   --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
>   +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad2018-10-29 
> 14:07:28.443651293 +0800
>   @@ -1,3 +1,3 @@
>QA output created by 057
>4096 4096
>   -4096 4096
>   +28672 28672
>
> This is related to the fact that "btrfs subvolume sync" (or
> vanilla sync) will not ensure orphan (unlinked but still exist) files to
> be removed.

So when did that happen, which commit introduced the behaviour change?

>
> In fact, for that false error case, if inspecting the fs after umount,
> its qgroup number is correct and btrfs check won't report qgroup error.
>
> To fix the false alerts, just skip any manual qgroup number comparison,
> and let the fsck run after the test case detect any problems.
>
> This also eliminates the need for specific mount and mkfs
> options, allowing us to improve coverage.
>
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 
> ---
>  tests/btrfs/057 | 17 -
>  tests/btrfs/057.out |  3 +--
>  2 files changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/tests/btrfs/057 b/tests/btrfs/057
> index b019f4e1..0b5a36d3 100755
> --- a/tests/btrfs/057
> +++ b/tests/btrfs/057
> @@ -33,12 +33,9 @@ _require_scratch
>  rm -f $seqres.full
>
>  # use small leaf size to get higher btree height.
> -run_check _scratch_mkfs "-b 1g --nodesize 4096"
> +run_check _scratch_mkfs "-b 1g"
>
> -# inode cache is saved in the FS tree itself for every
> -# individual FS tree,that affects the sizes reported by qgroup show
> -# so we need to explicitly turn it off to get consistent values.
> -_scratch_mount "-o noinode_cache"
> +_scratch_mount
>
>  # -w ensures that the only ops are ones which cause write I/O
>  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
> @@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 
> 1000 \
>  _run_btrfs_util_prog quota enable $SCRATCH_MNT
>  _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>
> -# remove all file/dir other than subvolume
> -rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
> -rm -rf $SCRATCH_MNT/p* >& /dev/null
> -
> -_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
> -units=`_btrfs_qgroup_units`
> -$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
> -   | $AWK_PROG '{print $2" "$3}'
> +echo "Silence is golden"
> +# btrfs check will detect any qgroup number mismatch.
>
>  status=0
>  exit
> diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
> index 60cb92d0..185023c7 100644
> --- a/tests/btrfs/057.out
> +++ b/tests/btrfs/057.out
> @@ -1,3 +1,2 @@
>  QA output created by 057
> -4096 4096
> -4096 4096
> +Silence is golden
> --
> 2.18.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-29 Thread Qu Wenruo


On 2018/10/29 5:52 PM, Filipe Manana wrote:
> On Mon, Oct 29, 2018 at 6:31 AM Qu Wenruo  wrote:
>>
>> For latest kernel, there is a chance that btrfs/057 reports false
>> errors.
> 
> By latest kernel you mean 4.20?

I mean almost all kernels.

> 
>>
>> The false error would look like:
>>   btrfs/057 4s ... - output mismatch (see 
>> /home/adam/xfstests-dev/results//btrfs/057.out.bad)
>>   --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
>>   +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad2018-10-29 
>> 14:07:28.443651293 +0800
>>   @@ -1,3 +1,3 @@
>>QA output created by 057
>>4096 4096
>>   -4096 4096
>>   +28672 28672
>>
>> This is related to the fact that "btrfs subvolume sync" (or
>> vanilla sync) will not ensure orphan (unlinked but still exist) files to
>> be removed.
> 
> So when did that happen, which commit introduced the behaviour change?

No behavior change, it's always the case.
Just not that easy to hit.

Thanks,
Qu

> 
>>
>> In fact, for that false error case, if inspecting the fs after umount,
>> its qgroup number is correct and btrfs check won't report qgroup error.
>>
>> To fix the false alerts, just skip any manual qgroup number comparison,
>> and let the fsck run after the test case detect any problems.
>>
>> This also eliminates the need for specific mount and mkfs
>> options, allowing us to improve coverage.
>>
>> Reported-by: Nikolay Borisov 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  tests/btrfs/057 | 17 -
>>  tests/btrfs/057.out |  3 +--
>>  2 files changed, 5 insertions(+), 15 deletions(-)
>>
>> diff --git a/tests/btrfs/057 b/tests/btrfs/057
>> index b019f4e1..0b5a36d3 100755
>> --- a/tests/btrfs/057
>> +++ b/tests/btrfs/057
>> @@ -33,12 +33,9 @@ _require_scratch
>>  rm -f $seqres.full
>>
>>  # use small leaf size to get higher btree height.
>> -run_check _scratch_mkfs "-b 1g --nodesize 4096"
>> +run_check _scratch_mkfs "-b 1g"
>>
>> -# inode cache is saved in the FS tree itself for every
>> -# individual FS tree,that affects the sizes reported by qgroup show
>> -# so we need to explicitly turn it off to get consistent values.
>> -_scratch_mount "-o noinode_cache"
>> +_scratch_mount
>>
>>  # -w ensures that the only ops are ones which cause write I/O
>>  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
>> @@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 
>> 1000 \
>>  _run_btrfs_util_prog quota enable $SCRATCH_MNT
>>  _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>>
>> -# remove all file/dir other than subvolume
>> -rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
>> -rm -rf $SCRATCH_MNT/p* >& /dev/null
>> -
>> -_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
>> -units=`_btrfs_qgroup_units`
>> -$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
>> -   | $AWK_PROG '{print $2" "$3}'
>> +echo "Silence is golden"
>> +# btrfs check will detect any qgroup number mismatch.
>>
>>  status=0
>>  exit
>> diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
>> index 60cb92d0..185023c7 100644
>> --- a/tests/btrfs/057.out
>> +++ b/tests/btrfs/057.out
>> @@ -1,3 +1,2 @@
>>  QA output created by 057
>> -4096 4096
>> -4096 4096
>> +Silence is golden
>> --
>> 2.18.0
>>
> 
> 





Re: [PATCH] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-29 Thread Filipe Manana
On Mon, Oct 29, 2018 at 11:33 AM Qu Wenruo  wrote:
>
>
>
> On 2018/10/29 5:52 PM, Filipe Manana wrote:
> > On Mon, Oct 29, 2018 at 6:31 AM Qu Wenruo  wrote:
> >>
> >> For latest kernel, there is a chance that btrfs/057 reports false
> >> errors.
> >
> > By latest kernel you mean 4.20?
>
> I mean almost all kernels.

So s/For latest kernel/For any recent kernel/ or something like that
which isn't singular.

>
> >
> >>
> >> The false error would look like:
> >>   btrfs/057 4s ... - output mismatch (see 
> >> /home/adam/xfstests-dev/results//btrfs/057.out.bad)
> >>   --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
> >>   +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad2018-10-29 
> >> 14:07:28.443651293 +0800
> >>   @@ -1,3 +1,3 @@
> >>QA output created by 057
> >>4096 4096
> >>   -4096 4096
> >>   +28672 28672
> >>
> >> This is related to the fact that "btrfs subvolume sync" (or
> >> vanilla sync) will not ensure orphan (unlinked but still exist) files to
> >> be removed.
> >
> > So when did that happen, which commit introduced the behaviour change?
>
> No behavior change, it's always the case.
> Just not that easy to hit.
>
> Thanks,
> Qu
>
> >
> >>
> >> In fact, for that false error case, if inspecting the fs after umount,
> >> its qgroup number is correct and btrfs check won't report qgroup error.
> >>
> >> To fix the false alerts, just skip any manual qgroup number comparison,
> >> and let the fsck run after the test case detect any problems.
> >>
> >> This also eliminates the need for specific mount and mkfs
> >> options, allowing us to improve coverage.
> >>
> >> Reported-by: Nikolay Borisov 
> >> Signed-off-by: Qu Wenruo 

Anyway, looks good to me.

Reviewed-by: Filipe Manana 

> >> ---
> >>  tests/btrfs/057 | 17 -
> >>  tests/btrfs/057.out |  3 +--
> >>  2 files changed, 5 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/tests/btrfs/057 b/tests/btrfs/057
> >> index b019f4e1..0b5a36d3 100755
> >> --- a/tests/btrfs/057
> >> +++ b/tests/btrfs/057
> >> @@ -33,12 +33,9 @@ _require_scratch
> >>  rm -f $seqres.full
> >>
> >>  # use small leaf size to get higher btree height.
> >> -run_check _scratch_mkfs "-b 1g --nodesize 4096"
> >> +run_check _scratch_mkfs "-b 1g"
> >>
> >> -# inode cache is saved in the FS tree itself for every
> >> -# individual FS tree,that affects the sizes reported by qgroup show
> >> -# so we need to explicitly turn it off to get consistent values.
> >> -_scratch_mount "-o noinode_cache"
> >> +_scratch_mount
> >>
> >>  # -w ensures that the only ops are ones which cause write I/O
> >>  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
> >> @@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 
> >> -n 1000 \
> >>  _run_btrfs_util_prog quota enable $SCRATCH_MNT
> >>  _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> >>
> >> -# remove all file/dir other than subvolume
> >> -rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
> >> -rm -rf $SCRATCH_MNT/p* >& /dev/null
> >> -
> >> -_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
> >> -units=`_btrfs_qgroup_units`
> >> -$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n 
> >> '/[0-9]/p' \
> >> -   | $AWK_PROG '{print $2" "$3}'
> >> +echo "Silence is golden"
> >> +# btrfs check will detect any qgroup number mismatch.
> >>
> >>  status=0
> >>  exit
> >> diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
> >> index 60cb92d0..185023c7 100644
> >> --- a/tests/btrfs/057.out
> >> +++ b/tests/btrfs/057.out
> >> @@ -1,3 +1,2 @@
> >>  QA output created by 057
> >> -4096 4096
> >> -4096 4096
> >> +Silence is golden
> >> --
> >> 2.18.0
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-29 Thread Filipe Manana
On Mon, Oct 29, 2018 at 6:35 AM Qu Wenruo  wrote:
>
> For latest kernel, there is a chance that btrfs/057 reports false
> errors.
>
> The false error would look like:
>   btrfs/057 4s ... - output mismatch (see 
> /home/adam/xfstests-dev/results//btrfs/057.out.bad)
>   --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
>   +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad2018-10-29 
> 14:07:28.443651293 +0800
>   @@ -1,3 +1,3 @@
>QA output created by 057
>4096 4096
>   -4096 4096
>   +28672 28672
>
> This is related to the fact that "btrfs subvolume sync" (or
> vanilla sync) will not ensure orphan (unlinked but still exist) files to
> be removed.
>
> In fact, for that false error case, if inspecting the fs after umount,
> its qgroup number is correct and btrfs check won't report qgroup error.
>
> To fix the false alerts, just skip any manual qgroup number comparison,
> and let the fsck run after the test case detect any problems.
>
> This also eliminates the need for specific mount and mkfs
> options, allowing us to improve coverage.
>
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Filipe Manana 

> ---
>  tests/btrfs/057 | 17 -
>  tests/btrfs/057.out |  3 +--
>  2 files changed, 5 insertions(+), 15 deletions(-)
>
> diff --git a/tests/btrfs/057 b/tests/btrfs/057
> index b019f4e1..0b5a36d3 100755
> --- a/tests/btrfs/057
> +++ b/tests/btrfs/057
> @@ -33,12 +33,9 @@ _require_scratch
>  rm -f $seqres.full
>
>  # use small leaf size to get higher btree height.
> -run_check _scratch_mkfs "-b 1g --nodesize 4096"
> +run_check _scratch_mkfs "-b 1g"

The comment above should go away too.

>
> -# inode cache is saved in the FS tree itself for every
> -# individual FS tree,that affects the sizes reported by qgroup show
> -# so we need to explicitly turn it off to get consistent values.
> -_scratch_mount "-o noinode_cache"
> +_scratch_mount
>
>  # -w ensures that the only ops are ones which cause write I/O
>  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
> @@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 
> 1000 \
>  _run_btrfs_util_prog quota enable $SCRATCH_MNT
>  _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>
> -# remove all file/dir other than subvolume
> -rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
> -rm -rf $SCRATCH_MNT/p* >& /dev/null
> -
> -_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
> -units=`_btrfs_qgroup_units`
> -$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
> -   | $AWK_PROG '{print $2" "$3}'
> +echo "Silence is golden"
> +# btrfs check will detect any qgroup number mismatch.
>
>  status=0
>  exit
> diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
> index 60cb92d0..185023c7 100644
> --- a/tests/btrfs/057.out
> +++ b/tests/btrfs/057.out
> @@ -1,3 +1,2 @@
>  QA output created by 057
> -4096 4096
> -4096 4096
> +Silence is golden
> --
> 2.18.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-10-29 Thread Nikolay Borisov



On 29.10.18 г. 9:51 ч., Nikolay Borisov wrote:
> 
> 
> On 29.10.18 г. 7:53 ч., Qu Wenruo wrote:
>> [snip]
 The cause sounds valid, however would you please explain more about how
 such cleaning on unrelated delalloc range happens?
>>>
>>> So in my case the following happened - 2 block groups were created as
>>> delalloc ranges - 0-1m and 1m-128m. Their respective pages were
>>> dirtied, so when page 0 - 0-4k - went into writepage_delalloc,
>>> find_lock_delalloc_range would return the range 0-1m. So the call to
>>> fill_delalloc instantiates OE 0-1m and writeback continues as expected.
>>>
>>> Now, when the 2nd page from range 0-1m is again set for writeback and
>>> find_lock_delalloc_range is called with delalloc_start == 4096 it will
>>> actually return the range 1m-128m.
>>
>> IMHO this looks strange and may need extra investigation.
>>
>> Normally I would expect it returns the range 0~1M or 4K~1M.
> 
> It cannot return 4k-1m since the writeback for the first page has
> already dealt with this range. Also take a look at how
> find_lock_delalloc_range is called in writepage_delalloc: for 'start'
> it's passed the page offset, calculated in __extent_writepage. And when
> find_delalloc_range is called it just searches for an extent which ends
> after the passed start value. In find_delalloc_range, tree_search is
> called first and returns the 1m-128m range; then we enter the while(1)
> loop. On the first iteration 'found' is 0, so *end is populated with
> 128m, 'found' is set to 1 and *start is set to 1m.
>
> On the second iteration the check
>
>   if (found && (state->start != cur_start ||
>                 (state->state & EXTENT_BOUNDARY)))
>
> is triggered, since the next extent found will have EXTENT_BOUNDARY set,
> as it will be the next block group from relocation. EXTENT_BOUNDARY is
> set from relocate_file_extent_cluster's main loop:
>
>   if (nr < cluster->nr &&
>       page_start + offset == cluster->boundary[nr]) {
>           set_extent_bits(&BTRFS_I(inode)->io_tree,
>                           page_start, page_end,
>                           EXTENT_BOUNDARY);
>           nr++;
>   }

So it seems I was wrong w.r.t. the sequence of events that results in
the extra extent being returned.
Here is what I got after further investigation. First let's look at the 
relocation side: 


btrfs-4018  [001]    186.783244: relocate_file_extent_cluster: Setting 
EXTENT_BOUNDARY for page: 74cc47c4 page_offset: 0 end: 4095 <- first 
page of range 0-1m
btrfs-4018  [001]    186.783248: relocate_file_extent_cluster: Setting 
DELALLOC for page: 74cc47c4 page_offset: 0 block group:1104150528 
btrfs-4018  [001]    186.783286: relocate_file_extent_cluster: Setting 
DELALLOC for page: 4a28475a page_offset: 4096 block group:1104150528 <- 
2nd page of range 0-1m 
btrfs-4018  [001]    186.784855: relocate_file_extent_cluster: Setting 
EXTENT_BOUNDARY for page: f58f50dc page_offset: 1048576 end: 1052671 < 
- 1st page of range 1m-128m

Mind the addresses of the given pages; they are all predicated on
btrfs_ino == 260, which is the ino of the relocation inode.

So the assumption is that when writing back page 4a28475a we cannot 
really be processing range >1m since it will
be outside of the range for the page, but this is not the case. Now on the 
writeback side: 

kworker/u12:1-68[002]    188.100471: find_lock_delalloc_range: 
Processing delalloc range: 0 - 1048575 for page: 74cc47c4 < - writeback 
for first page of range 0-1m happens 
so we naturally instantiate this range.

kworker/u12:1-68[002] ...1   188.106523: find_delalloc_range.constprop.25:
1213015261107184058: Got start: 1048576 end 1052671 <- This is the output from
find_delalloc_range when start (the offset of the page passed to
find_lock_delalloc_range) is 4096, i.e. it's the 2nd page of range 0-1m. So we
find 1m - 1m+4k on the first iteration of the loop in find_delalloc_range; at
this point *start = 1048576, *end = 1052671 and cached_state = the extent_state
representing this first 4k range of the larger 1m-128m range.

kworker/u12:1-68[002] ...1   188.106526: find_delalloc_range.constprop.25:
1213015261107184058: Got start: 1052672 end 135266303 <- We loop for the second
time, this time we find the 1m+4k - 128m range and actually trigger the
(total_bytes >= max_bytes) check

kworker/u12:1-68[002] ...1   188.106526: find_delalloc_range.constprop.25: 
1213015261107184058: returning from total_bytes >= max_bytes

kworker/u12:1-68[002]    188.106528: find_delalloc_range.constprop.25: 
1213015261107184058: Returning cached_state->start: 1048576 cached_state->end: 
1052671 *start = 1048576 *end = 135266303 <--- this is what is returned from 
find_delalloc_range - cached_state and *start and *end differ, since *end was 
set to 128m right before we hit the total_by

Re: [PATCH v2] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-10-29 Thread Nikolay Borisov



On 29.10.18 г. 14:21 ч., Nikolay Borisov wrote:
> 
> 
> On 29.10.18 г. 9:51 ч., Nikolay Borisov wrote:
>>
>>
>> On 29.10.18 г. 7:53 ч., Qu Wenruo wrote:
>>> [snip]
> The cause sounds valid, however would you please explain more about how
> such cleaning on unrelated delalloc range happens?

 So in my case the following happened - 2 block groups were created as
 delalloc ranges - 0-1m and 1m-128m. Their respective pages were
 dirtied, so when page 0 - 0-4k - went into writepage_delalloc,
 find_lock_delalloc_range would return the range 0-1m. So the call to
 fill_delalloc instantiates OE 0-1m and writeback continues as expected.

 Now, when the 2nd page from range 0-1m is again set for writeback and
 find_lock_delalloc_range is called with delalloc_start == 4096 it will
 actually return the range 1m-128m.
>>>
>>> IMHO this looks strange and may need extra investigation.
>>>
>>> Normally I would expect it returns the range 0~1M or 4K~1M.
>>
>> It cannot return 4k-1m since the writeback for the first page has
>> already dealt with this range. Also take a look at how
>> find_lock_delalloc_range is called in writepage_delalloc: for 'start'
>> it's passed the page offset, calculated in __extent_writepage. And when
>> find_delalloc_range is called it just searches for an extent which ends
>> after the passed start value. In find_delalloc_range, first tree_search
>> is called, which returns the 1m-128m range, then we go into the while(1)
>> loop; on the first iteration found is 0, so *end is populated with 128m,
>> found is set to 1 and *start is set to 1m.
>>
>> On the second iteration the check  if (found && (state->start !=
>> cur_start || (state->state & EXTENT_BOUNDARY)))
>>
>> is triggered, since the next extent found will have EXTENT_BOUNDARY set,
>> as it will be the next block group from relocation. EXTENT_BOUNDARY is
>> set from relocate_file_extent_cluster's main loop:
>>
>>   if (nr < cluster->nr &&
>>
>> page_start + offset == cluster->boundary[nr]) {
>>
>> set_extent_bits(&BTRFS_I(inode)->io_tree,
>>
>> page_start, page_end,
>>
>> EXTENT_BOUNDARY);
>>
>> nr++;
>>
>> }
> 
> So it seems I was wrong w.r.t. the sequence of events that results in the
> extra extent being returned.
> Here is what I got after further investigation. First let's look at the
> relocation side:
> 
> 
> btrfs-4018  [001]    186.783244: relocate_file_extent_cluster: Setting 
> EXTENT_BOUNDARY for page: 74cc47c4 page_offset: 0 end: 4095 <- first 
> page of range 0-1m
> btrfs-4018  [001]    186.783248: relocate_file_extent_cluster: Setting 
> DELALLOC for page: 74cc47c4 page_offset: 0 block group:1104150528 
> btrfs-4018  [001]    186.783286: relocate_file_extent_cluster: Setting 
> DELALLOC for page: 4a28475a page_offset: 4096 block group:1104150528 
> <- 2nd page of range 0-1m 
> btrfs-4018  [001]    186.784855: relocate_file_extent_cluster: Setting 
> EXTENT_BOUNDARY for page: f58f50dc page_offset: 1048576 end: 1052671 
> < - 1st page of range 1m-128m
> 
> Mind the addresses of the given pages; the traces are all filtered on
> btrfs_ino == 260, which is the ino of the relocation inode.
> 
> So the assumption is that when writing back page 4a28475a we cannot 
> really be processing range >1m since it will
> be outside of the range for the page, but this is not the case. Now on the 
> writeback side: 
> 
> kworker/u12:1-68[002]    188.100471: find_lock_delalloc_range: 
> Processing delalloc range: 0 - 1048575 for page: 74cc47c4 < - 
> writeback for first page of range 0-1m happens 
> so we naturally instantiate this range.
> 
> kworker/u12:1-68[002] ...1   188.106523:
> find_delalloc_range.constprop.25: 1213015261107184058: Got start: 1048576 end
> 1052671 <- This is output from find_delalloc_range when start (the offset of
> the page passed to find_lock_delalloc_range) is 4096, i.e. it's the 2nd page
> of range 0-1m. So we find 1m - 1m + 4k on the first iteration of the loop in
> find_delalloc_range; at this point *start = 1048576 and *end = 1052671, and
> cached_state is the extent_state representing this first 4k range of the
> larger 1m-128m range.
> 
> kworker/u12:1-68[002] ...1   188.106526: 
> find_delalloc_range.constprop.25: 1213015261107184058: Got start: 1052672 end 
> 135266303 - We loop for the second time, this time we find the 1m+4k - 128m 
> range and actually trigger the (total_bytes >= max_bytes) check
> 
> kworker/u12:1-68[002] ...1   188.106526: 
> find_delalloc_range.constprop.25: 1213015261107184058: returning from 
> total_bytes >= max_bytes
> 
> kworker/u12:1-68[002]    188.106528:
> find_delalloc_range.constprop.25: 1213015261107184058: Returning
> cached_state->start: 1048576 cached_state->end: 1052671 *start = 1048576
> *end = 135266303 <- this is what is returned from find_delalloc_range -
> cached_state and *start and *end differ, since *end was set to 128m right
> before we hit the total_bytes >= max_bytes check.

Understanding "btrfs filesystem usage"

2018-10-29 Thread Ulli Horlacher
I want to know how much free space is left and have problems
interpreting the output of:

btrfs filesystem usage
btrfs filesystem df
btrfs filesystem show


In detail:


root@diaspora:~# btrfs filesystem usage /local
Overall:
Device size: 883.51GiB
Device allocated:825.51GiB
Device unallocated:   58.00GiB
Device missing:  0.00B
Used:494.45GiB
Free (estimated):387.94GiB  (min: 387.94GiB)
Data ratio:   1.00
Metadata ratio:   1.00
Global reserve:  320.00MiB  (used: 0.00B)

Data,single: Size:823.47GiB, Used:493.53GiB
   /dev/sda4 823.47GiB

Metadata,single: Size:2.01GiB, Used:938.67MiB
   /dev/sda4   2.01GiB

System,single: Size:32.00MiB, Used:128.00KiB
   /dev/sda4  32.00MiB

Unallocated:
   /dev/sda4  58.00GiB


root@diaspora:~# btrfs filesystem df /local
Data, single: total=823.47GiB, used=493.53GiB
System, single: total=32.00MiB, used=128.00KiB
Metadata, single: total=2.01GiB, used=938.67MiB
GlobalReserve, single: total=320.00MiB, used=0.00B


root@diaspora:~# btrfs filesystem show /local
Label: 'local'  uuid: 11faaa39-5805-4e92-a891-e8ceb4afa9f7
Total devices 1 FS bytes used 494.45GiB
devid1 size 883.51GiB used 825.51GiB path /dev/sda4


What does the last line mean?
825.51GiB used = allocated by (file) data?
In this case I would have only 58 GB free disk space left?

"btrfs filesystem usage" says 494 GB used, but
"btrfs filesystem show" says 825 GB used?!

The man pages do not explain the different meanings of "used" and
"allocated".

And at last:

root@diaspora:~# df -HT /local
Filesystem Type   Size  Used Avail Use% Mounted on
/dev/sda4  btrfs  949G  532G  417G  57% /local



-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20181029181141.ga31...@rus.uni-stuttgart.de>


Re: Understanding "btrfs filesystem usage"

2018-10-29 Thread Remi Gauvin
On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
> I want to know how much free space is left and have problems
> interpreting the output of: 
> 
> btrfs filesystem usage
> btrfs filesystem df
> btrfs filesystem show
> 
>

In my not so humble opinion, the filesystem usage command has the
easiest to understand output.  It lays out all the pertinent information.

You can clearly see 825GiB is allocated, with 494GiB used, therefore,
filesystem show is actually using the "Allocated" value as "Used".
Allocated can be thought of "Reserved For".  As the output of the Usage
command and df command clearly show, you have almost 400GiB space available.

Note that the btrfs commands are clearly and explicitly displaying
values in binary units (the Mi and Gi prefixes).  If you want the
df command to match, use -h instead of -H (see man df).

An observation:

The disparity between 494GiB used and 823GiB allocated is pretty high.
This is probably the result of using an SSD with an older kernel.  If
your kernel is not very recent (sorry, I forget where this was fixed,
somewhere around 4.14 or 4.15), then consider mounting with the nossd
option.  You can improve this by running a balance.

Something like:
btrfs balance start -dusage=55 /local

You do *not* want to end up with all your space allocated to Data, but
not actually used by data.  Bad things can happen if you run out of
Unallocated space for more metadata. (not catastrophic, but awkward and
unexpected downtime that can be a little tricky to sort out.)



Re: Understanding "btrfs filesystem usage"

2018-10-29 Thread Hugo Mills
On Mon, Oct 29, 2018 at 05:57:10PM -0400, Remi Gauvin wrote:
> On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
> > I want to know how much free space is left and have problems
> > interpreting the output of: 
> > 
> > btrfs filesystem usage
> > btrfs filesystem df
> > btrfs filesystem show
> > 
> >
> 
> In my not so humble opinion, the filesystem usage command has the
> easiest to understand output.  It lays out all the pertinent information.

   Opinions are divided. I find it almost impossible to read, and
always use btrfs fi df and btrfs fi show together.

   There are short tutorials on how to read the output in both cases in
the FAQ, which is where I start out by directing people in this
instance.

   Hugo.

> You can clearly see 825GiB is allocated, with 494GiB used, therefore,
> filesystem show is actually using the "Allocated" value as "Used".
> Allocated can be thought of "Reserved For".  As the output of the Usage
> command and df command clearly show, you have almost 400GiB space available.
> 
> Note that the btrfs commands are clearly and explicitly displaying
> values in Binary units, (Mi, and Gi prefix, respectively).  If you want
> df command to match, use -h instead of -H (see man df)
> 
> An observation:
> 
> The disparity between 494GiB used and 823GiB allocated is pretty high.  This is
> probably the result of using an SSD with an older kernel.  If your
> kernel is not very recent, (sorry, I forget where this was fixed,
> somewhere around 4.14 or 4.15), then consider mounting with the nossd
> option.  You can improve this by running a balance.
> 
> Something like:
> btrfs balance start -dusage=55
> 
> You do *not* want to end up with all your space allocated to Data, but
> not actually used by data.  Bad things can happen if you run out of
> Unallocated space for more metadata. (not catastrophic, but awkward and
> unexpected downtime that can be a little tricky to sort out.)
> 
> 



-- 
Hugo Mills | Great oxymorons of the world, no. 8:
hugo@... carfax.org.uk | The Latest In Proven Technology
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




[PATCH] Btrfs: incremental send, fix infinite loop when apply children dir moves

2018-10-29 Thread robbieko
From: Robbie Ko 

In apply_children_dir_moves, we first create an empty list (stack),
then we get an entry from pending_dir_moves and add it to the stack,
but we do not delete the entry from the rb_tree.

So, in add_pending_dir_move, we create a new entry and then use the
parent_ino to look up the corresponding entry in the rb_tree, and if
found, add the new entry to that entry's list.

However, the entry may have been added to the stack, causing new
entries to be added to the stack as well.

As a result, each time we take the first entry from the stack and start
processing it, we end up in an infinite loop.

Fix this problem by removing the node from the pending_dir_moves
rb_tree when it is appended to the stack, so that new pending_dir_moves
are never added to an entry that is already queued for processing.

Signed-off-by: Robbie Ko 
---
 fs/btrfs/send.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 094cc144..5be83b5 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3340,7 +3340,8 @@ static void free_pending_move(struct send_ctx *sctx, struct pending_dir_move *m)
kfree(m);
 }
 
-static void tail_append_pending_moves(struct pending_dir_move *moves,
+static void tail_append_pending_moves(struct send_ctx *sctx,
+ struct pending_dir_move *moves,
  struct list_head *stack)
 {
if (list_empty(&moves->list)) {
@@ -3351,6 +3352,10 @@ static void tail_append_pending_moves(struct pending_dir_move *moves,
list_add_tail(&moves->list, stack);
list_splice_tail(&list, stack);
}
+   if (!RB_EMPTY_NODE(&moves->node)) {
+   rb_erase(&moves->node, &sctx->pending_dir_moves);
+   RB_CLEAR_NODE(&moves->node);
+   }
 }
 
 static int apply_children_dir_moves(struct send_ctx *sctx)
@@ -3365,7 +3370,7 @@ static int apply_children_dir_moves(struct send_ctx *sctx)
return 0;
 
INIT_LIST_HEAD(&stack);
-   tail_append_pending_moves(pm, &stack);
+   tail_append_pending_moves(sctx, pm, &stack);
 
while (!list_empty(&stack)) {
pm = list_first_entry(&stack, struct pending_dir_move, list);
@@ -3376,7 +3381,7 @@ static int apply_children_dir_moves(struct send_ctx *sctx)
goto out;
pm = get_pending_dir_moves(sctx, parent_ino);
if (pm)
-   tail_append_pending_moves(pm, &stack);
+   tail_append_pending_moves(sctx, pm, &stack);
}
return 0;
 
-- 
1.9.1