Re: Problems with btrfs

2018-04-27 Thread Chris Murphy
On Fri, Apr 27, 2018 at 6:20 PM, Qu Wenruo  wrote:
>
>
> On 2018年04月28日 02:38, David C. Partridge wrote:
>> I'm running Ubuntu 16.04.  I rebooted my server today as it wasn't
>> responding.
>>
>> When I rebooted the root FS was read only.
>>
>> I booted a live Ubuntu CD and checked the drive with the results shown in
>> attachment btrfs-check.log.
>>
>> The error was still there after completing the btrfs check --repair :( And
>> when I deleted some old subvolumes from there after that the filesystem went
>> read-only with a bunch errors in dmesg which are in the attached file
>> btrfs-dmesg-errs.log.
>>
>> Of course my last backup was longer ago than I like to think about. Though I
>> could restore back to about 8 months ago, I'd very much prefer not to ...
>>
>> On my bootable CD the Kernel is 4.18.3 and btrfs-progs is 4.7.3-1
>
> Hello time traveler. :)
> Or it's just 4.16.3.

4.8.13 actually according to the dmesg

The problem might have been setup by this problem (and fix in later kernels)
https://patchwork.kernel.org/patch/5449291/

Question if it's an SSD and what the mount options are.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add balance args info during start and resume

2018-04-27 Thread Anand Jain



On 04/27/2018 02:46 PM, Nikolay Borisov wrote:



On 26.04.2018 11:01, Anand Jain wrote:

Balance args info is an important information to be reviewed on the
system under audit. So this patch adds that.


Would have been good to add example output to the change log.


Here are the samples (will add in v2): (only for the illustrations, 
configs may not be ideal)


1. Find raid1 or single convert them to raid5 for both data and metadata:

btrfs bal start -dprofiles='raid1|single',convert=raid5 
-mprofiles='raid1|single',convert=raid5 /btrfs


 kernel: BTRFS info (device sdb): balance: start data 
profiles=raid1|single convert=raid5 metadata profiles=raid1|single 
convert=raid5 system profiles=raid1|single convert=raid5

::
 kernel: BTRFS info (device sdb): balance: ended with status: 0

2. Start, Pause and resume.

btrfs bal start -dprofiles=raid5,convert=single 
-mprofiles='raid1|single',convert=raid5 --background /btrfs && btrfs bal 
pause /btrfs && btrfs bal resume /btrfs


 kernel: BTRFS info (device sdb): balance: start data profiles=raid5 
convert=single metadata profiles=raid1|single convert=raid5 system 
profiles=raid1|single convert=raid5

 ::
 kernel: BTRFS info (device sdb): balance: paused  (<-- I am planning 
to add %age completed here at a later point of time).

 ::
 kernel: BTRFS info (device sdb): balance: resume data profiles=raid5 
convert=single metadata profiles=raid1|single convert=raid5 system 
profiles=raid1|single convert=raid5

 ::
 kernel: BTRFS info (device sdb): balance: ended with status: 0

::


@@ -3770,6 +3796,121 @@ static inline int validate_convert_profile(struct 
btrfs_balance_args *bctl_arg,
 (bctl_arg->target & ~allowed)));
  }
  
+static void get_balance_args(struct btrfs_balance_args *bargs, char *args)

+{
+   char value[64];
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_PROFILES) {
+   strcat(args, "profiles=");
+   get_all_raid_names(bargs->profiles, value);
+   strcat(args, value);
+   strcat(args, " ");
+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_USAGE) {
+   snprintf(value, 64, "usage=%llu ", bargs->usage);
+   strcat(args, value);

+   if (bg_flags & BTRFS_AVAIL_ALLOC_BIT_SINGLE) {
+   if (found) {


 Hmm.. I didn't get this part of your comments.


+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) {
+   snprintf(value, 64, "usage_min=%u usage_max=%u ",
+   bargs->usage_min, bargs->usage_max);
+   strcat(args, value);
+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_DEVID) {
+   snprintf(value, 64, "devid=%llu ", bargs->devid);
+   strcat(args, value);
+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_DRANGE) {
+   snprintf(value, 64, "pstart=%llu pend=%llu ",
+bargs->pstart, bargs->pend);
+   strcat(args, value);
+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_VRANGE) {
+   snprintf(value, 64, "vstart=%llu vend %llu ",
+bargs->vstart, bargs->vend);
+   strcat(args, value);
+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_LIMIT) {
+   snprintf(value, 64, "limit=%llu ", bargs->limit);
+   strcat(args, value);
+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_LIMIT_RANGE) {
+   snprintf(value, 64, "limit_min=%u limit_max=%u ",
+bargs->limit_min, bargs->limit_max);
+   strcat(args, value);
+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_STRIPES_RANGE) {
+   snprintf(value, 64, "stripes_min=%u stripes_max=%u ",
+bargs->stripes_min, bargs->stripes_max);
+   strcat(args, value);
+   }
+
+   if (bargs->flags & BTRFS_BALANCE_ARGS_CONVERT) {
+   int index = btrfs_bg_flags_to_raid_index(bargs->target);
+   snprintf(value, 64, "convert=%s ",
+get_raid_name(index));
+   strcat(args, value);
+   }
+


Having all those ifs as independent statements means we can potentially
have each and every value present, I haven't really counted how long the
string could potentially get. Is it not possible to convert some of them
into if/else if constructs or is it indeed possible to have all of them
present at once?


 It won't help much there are only -usage- and -limit- which are
 mutually exclusive with -usage_range- and -limit_range- respectively.
 Moreover its just one time when balance starts/resumes.

 Further as these are flags, I like to do the correct way, that is,
 depend on the flags to print its args, so that we know if there is
 anything wrong. And there is one..

 btrfs bal start -dusage="5..90",usage=100 /btrfs

   kernel: BTRFS info (device sdb): balance: start data usage=100 <-- 
we pick the -usage- which is correct.


 

Re: [PATCH v2] btrfs: Add test that checks rmdir(2) can delete a subvolume

2018-04-27 Thread Eryu Guan
On Fri, Apr 27, 2018 at 06:26:35PM +0200, David Sterba wrote:
> On Fri, Apr 27, 2018 at 05:02:45PM +0900, Misono Tomohiro wrote:
> > Add btrfs test that checks "rmdir" or "rm -r" command can delete a
> > subvolume like an ordinary drectory.
> > 
> > This behavior has been restricted long time but becomes allowed by
> > following patch in the kernel:
> >   btrfs: Allow rmdir(2) to delete an empty subvolume
> 
> AFAICS this test will fail on older kernels, eg. stable versions, but
> this is not a bug that needs to be fixed by backporting patches. The
> usual way is to use the expunges, but as this is a feature coverage, I
> think the test should detect if the kernel has the feature at all.
> Similar to the fallocate modes etc.

Agreed, if the change is treated as a behavior change not a bug on old
kernels, we probably need a way to detect the case and _notrun if kernel
doesn't support it.

> 
> For the test itself, the scenarios look sufficient, so
> Reviewed-by: David Sterba 

Thanks for the review!

Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NVMe SSD + compression - benchmarking

2018-04-27 Thread Qu Wenruo


On 2018年04月28日 01:41, Brendan Hide wrote:
> Hey, all
> 
> I'm following up on the queries I had last week since I have installed
> the NVMe SSD into the PCI-e adapter. I'm having difficulty knowing
> whether or not I'm doing these benchmarks correctly.
> 
> As a first test, I put together a 4.7GB .tar containing mostly
> duplicated copies of the kernel source code (rather compressible).
> Writing this to the SSD I was seeing repeatable numbers - but noted that
> the new (supposedly faster) zstd compression is noticeably slower than
> all other methods. Perhaps this is partly due to lack of
> multi-threading? No matter, I did also notice a supposedly impossible
> stat when there is no compression, in that it seems to be faster than
> the PCI-E 2.0 bus theoretically can deliver:

I'd say the test method is more like real world usage other than benchmark.
Moreover, the kernel source copying is not that good for compression, as
mostly of the files are smaller than 128K, which means they can't take
much advantage of multi thread split based on 128K.

And kernel source is consistent of multiple small files, and btrfs is
really slow for metadata heavy workload.

I'd recommend to start with simpler workload, then go step by step
towards more complex workload.

Large file sequence write with large block size would be a nice start
point, as it could take all advantage of multithread compression.


Another advice here is, if you really want a super fast storage, and
there is plenty memory, brd module will be your best friend.
And for modern mainstream hardware, brd could provide performance over
1GiB/s:
$ sudo modprobe brd rd_nr=1 rd_size=2097152
$ LANG=C dd if=/dev/zero  bs=1M of=/dev/ram0  count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.45593 s, 1.5 GB/s

Thanks,
Qu

> 
> compression type / write speed / read speed (in GBps)
> zlib / 1.24 / 2.07
> lzo / 1.17 / 2.04
> zstd / 0.75 / 1.97
> no / 1.42 / 2.79
> 
> The SSD is PCI-E 3.0 4-lane capable and is connected to a PCI-E 2.0
> 16-lane slot. lspci -vv confirms it is using 4 lanes. This means it's
> peak throughput *should* be 2.0 GBps - but above you can see the average
> read benchmark is 2.79GBps. :-/
> 
> The crude timing script I've put together does the following:
> - Format the SSD anew with btrfs and no custom settings
> - wait 180 seconds for possible hardware TRIM to settle (possibly
> overkill since the SSD is new)
> - Mount the fs using all defaults except for compression, which could be
> of zlib, lzo, zstd, or no
> - sync
> - Drop all caches
> - Time the following
>  - Copy the file to the test fs (source is a ramdisk)
>  - sync
> - Drop all caches
> - Time the following
>  - Copy back from the test fs to ramdisk
>  - sync
> - unmount
> 
> I can see how, with compression, it *can* be faster than 2 GBps (though
> it isn't). But I cannot see how having no compression could possibly be
> faster than 2 GBps. :-/
> 
> I can of course get more info if it'd help figure out this puzzle:
> 
> Kernel info:
> Linux localhost.localdomain 4.16.3-1-vfio #1 SMP PREEMPT Sun Apr 22
> 12:35:45 SAST 2018 x86_64 GNU/Linux
> ^ Close to the regular ArchLinux kernel - but with vfio, and compiled
> with -arch=native. See https://aur.archlinux.org/pkgbase/linux-vfio/
> 
> CPU model:
> model name    : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
> 
> Motherboard model:
> Product Name: Z68MA-G45 (MS-7676)
> 
> lspci output for the slot:
> 02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe
> SSD Controller SM961/PM961
> ^ The disk id sans serial is Samsung_SSD_960_EVO_1TB
> 
> dmidecode output for the slot:
> Handle 0x001E, DMI type 9, 17 bytes
> System Slot Information
>     Designation: J8B4
>     Type: x16 PCI Express
>     Current Usage: In Use
>     Length: Long
>     ID: 4
>     Characteristics:
>     3.3 V is provided
>     Opening is shared
>     PME signal is supported
>     Bus Address: :02:01.1
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



signature.asc
Description: OpenPGP digital signature


Re: Problems with btrfs

2018-04-27 Thread Qu Wenruo


On 2018年04月28日 02:38, David C. Partridge wrote:
> I'm running Ubuntu 16.04.  I rebooted my server today as it wasn't
> responding.
> 
> When I rebooted the root FS was read only.
> 
> I booted a live Ubuntu CD and checked the drive with the results shown in
> attachment btrfs-check.log.
> 
> The error was still there after completing the btrfs check --repair :( And
> when I deleted some old subvolumes from there after that the filesystem went
> read-only with a bunch errors in dmesg which are in the attached file
> btrfs-dmesg-errs.log.
> 
> Of course my last backup was longer ago than I like to think about. Though I
> could restore back to about 8 months ago, I'd very much prefer not to ...
> 
> On my bootable CD the Kernel is 4.18.3 and btrfs-progs is 4.7.3-1

Hello time traveler. :)
Or it's just 4.16.3.

Btrfs-progs is a little old, and it is not recommended to execute btrfs
check --repair until you know that the problem is.

Fortunately it doesn't cause further damage, although it doesn't fix it
neither.

The offending tree block is 224304857088, please dump the following info
(using latest btrfs-progs if possible)

# btrfs inspect dump-tree -b 224304857088 /dev/mapper/Charon--vg-root
# btrfs inpsect dump-tree /dev/mapper/Charon--vg-root | grep -C 20

Thanks,
Qu

> 
> HELP!!!
> 
> If do have enough space to copy stuff off the root volume but I think I'll
> need hand holding through the recovery process.
> 
> The volume has sub-volumes called @ and @home which are mounted as / and
> /home respectively. 
> 
> Thanks
> David
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: Inconsistent behavior of fsync in btrfs

2018-04-27 Thread Jayashree Mohan
Thanks Chris and Ted for putting down the expected fsync behaviour of
btrfs and ext4 clearly. This sort of information is not documented
anywhere; it would be really useful if all major filesystems
explicitly stated what fsync behavior to expect. Filesystems
definitely seem to provide more guarantees than POSIX and it would of
great help to researchers and developers, if you all documented what
guarantees we can expect from each filesystem.

On Fri, Apr 27, 2018 at 3:53 PM, Theodore Y. Ts'o  wrote:
> On Fri, Apr 27, 2018 at 11:33:29AM -0600, Chris Mason wrote:
>> My goal for the fsync tree log was to make it just do the right thing most
>> of the time.  We mostly got there, thanks to a ton of fixes and test cases
>> from Filipe.
>>
>> fsync(some file) -- all the names for this file will exist, without having
>> to fsync the directory.
>>
>> fsync(some dir) -- ugh, don't fsync the directory.  But if you do, all the
>> files/subdirs will exist, and unlinks will be done and renames will be
>> included.  This is slow and may require a full FS commit, which is why we
>> don't want dirs fsunk.
>
> What ext4 does is this:
>
> fsync(some file) -- for a newly created file, the filename that it was
> created under will exist.  If the file has a hard-link added,
> the hard link is not guarnateed to be written to disk
>
> fsync(some dir) -- all changes to file names in thentee directory will
> exist after the crash.  It does *not* guarantee that any data
> changes to any of files in the directories will persist after
> a crash.
>
> It seems to me that it would be desirable if all of the major file
> systems have roughly the same minimum guarantee for fsync(2), so that
> application writers don't have to make file-system specific
> assumptions.  In general the goal ought to be "the right thing" should
> happen.
>
> The reason why ext4 doesn't sync all possible hard link names is that
> (a) that's not a common requiremnt for most applications, and (b) it's
> too hard to find all of the directories which might contain a hard
> link to a particular file.  But otherwise, the semantics seem to
> largely match up with what Chris as suggested for btrfs.
>
> - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-27 Thread Chris Murphy
On Fri, Apr 27, 2018 at 2:53 PM, Theodore Y. Ts'o  wrote:

> It seems to me that it would be desirable if all of the major file
> systems have roughly the same minimum guarantee for fsync(2), so that
> application writers don't have to make file-system specific
> assumptions.  In general the goal ought to be "the right thing" should
> happen.

Yes please.

I'd also like to know about sync() differences as well, as I've found
not through code eval but through testing that file systems differ in
sync() behavior as well.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-27 Thread Theodore Y. Ts'o
On Fri, Apr 27, 2018 at 11:33:29AM -0600, Chris Mason wrote:
> My goal for the fsync tree log was to make it just do the right thing most
> of the time.  We mostly got there, thanks to a ton of fixes and test cases
> from Filipe.
> 
> fsync(some file) -- all the names for this file will exist, without having
> to fsync the directory.
> 
> fsync(some dir) -- ugh, don't fsync the directory.  But if you do, all the
> files/subdirs will exist, and unlinks will be done and renames will be
> included.  This is slow and may require a full FS commit, which is why we
> don't want dirs fsunk.

What ext4 does is this:

fsync(some file) -- for a newly created file, the filename that it was
created under will exist.  If the file has a hard-link added,
the hard link is not guarnateed to be written to disk

fsync(some dir) -- all changes to file names in thentee directory will
exist after the crash.  It does *not* guarantee that any data
changes to any of files in the directories will persist after
a crash.

It seems to me that it would be desirable if all of the major file
systems have roughly the same minimum guarantee for fsync(2), so that
application writers don't have to make file-system specific
assumptions.  In general the goal ought to be "the right thing" should
happen.

The reason why ext4 doesn't sync all possible hard link names is that
(a) that's not a common requiremnt for most applications, and (b) it's
too hard to find all of the directories which might contain a hard
link to a particular file.  But otherwise, the semantics seem to
largely match up with what Chris as suggested for btrfs.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Jeff Mahoney
On 4/27/18 12:40 PM, David Sterba wrote:
> On Fri, Apr 27, 2018 at 12:02:13PM -0400, Jeff Mahoney wrote:
 +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
 +{
>>>
>>> And this had to be moved upwards as there was earlier use of
>>> btrfs_queue_work that matched following the hunk.
>>
>> Weird.  That must be exactly the kind of mismerge artifact that we were
>> talking about the other day.  In my tree it's in the right spot.
> 
> I've tried current master, upcoming pull request queue (misc-4.17, one
> nonc-onflicting patch) and current misc-next. None of them applies the
> patch cleanly and the function is still added after the first use, so
> this would not compile.
> 
> The result can be found in
> https://github.com/kdave/btrfs-devel/commits/ext/jeffm/qgroup-fixes
> 

Thanks.  The "Fixes" is incorrect there.  I had the right commit message
but not the right commit id.  It should be:

8d9eddad1946 (Btrfs: fix qgroup rescan worker initialization)

-Jeff

-- 
Jeff Mahoney
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Noah Massey
On Fri, Apr 27, 2018 at 11:56 AM, David Sterba  wrote:
> On Thu, Apr 26, 2018 at 03:23:49PM -0400, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>> ended up reintroducing the hang-on-unmount bug that the commit it
>> intended to fix addressed.
>>
>> The race this time is between qgroup_rescan_init setting
>> ->qgroup_rescan_running = true and the worker starting.  There are
>> many scenarios where we initialize the worker and never start it.  The
>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>> This can happen even without involving error handling, since mounting
>> the file system read-only returns between initializing the worker and
>> queueing it.
>>
>> The right place to do it is when we're queuing the worker.  The flag
>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>> a completion.
>>
>> This patch introduces a new helper, queue_rescan_worker, that handles
>> the ->qgroup_rescan_running flag, including any races with umount.
>>
>> While we're at it, ->qgroup_rescan_running is protected only by the
>> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
>> to take the spinlock too.
>>
>> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> Signed-off-by: Jeff Mahoney 
>
> I've added this to misc-next as I'd like to push it to the next rc. The
> Fixes is fixed.
>

I don't see it pushed to misc-next yet, but based on f89fbcd776, could
you update the reference in the first line of the commit to match the
Fixes line?

Thanks,
Noah
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] btrfs-progs: build: use m4_flatten instead of m4_chomp

2018-04-27 Thread Jeff Mahoney
On 4/27/18 2:56 PM, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Commit 2e1932e6a38 (btrfs-progs: build: simplify version tracking)
> started m4_chomp to strip the newlines from the version file.  m4_chomp
> was introduced in autoconf 2.64 but SLE11 ships with autoconf 2.63.
> For purposes of just stripping the newline, m4_flatten is sufficient.

Scratch that.  The previous patch also requires autoconf 2.64.

-Jeff

> Signed-off-by: Jeff Mahoney 
> ---
>  configure.ac | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/configure.ac b/configure.ac
> index 17880206..a0cebf15 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -1,5 +1,5 @@
>  AC_INIT([btrfs-progs],
> - m4_chomp(m4_include([VERSION])),
> + m4_flatten(m4_include([VERSION])),
>   [linux-btrfs@vger.kernel.org],,
>   [http://btrfs.wiki.kernel.org])
>  
> 


-- 
Jeff Mahoney
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] btrfs-progs: build: use m4_flatten instead of m4_chomp

2018-04-27 Thread jeffm
From: Jeff Mahoney 

Commit 2e1932e6a38 (btrfs-progs: build: simplify version tracking)
started m4_chomp to strip the newlines from the version file.  m4_chomp
was introduced in autoconf 2.64 but SLE11 ships with autoconf 2.63.
For purposes of just stripping the newline, m4_flatten is sufficient.

Signed-off-by: Jeff Mahoney 
---
 configure.ac | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index 17880206..a0cebf15 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1,5 +1,5 @@
 AC_INIT([btrfs-progs],
-   m4_chomp(m4_include([VERSION])),
+   m4_flatten(m4_include([VERSION])),
[linux-btrfs@vger.kernel.org],,
[http://btrfs.wiki.kernel.org])
 
-- 
2.12.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs-progs: build: detect whether -std=gnu90 is supported

2018-04-27 Thread jeffm
From: Jeff Mahoney 

Older versions of gcc don't support -std=gnu90 so btrfs-progs won't
build at all on older distros.  We can detect whether the compiler
supports -std=gnu90 and fall back to -std=gnu89 if it doesn't.

Signed-off-by: Jeff Mahoney 
---
 Makefile|  1 -
 Makefile.inc.in |  1 +
 configure.ac|  2 ++
 m4/ax_check_compile_flag.m4 | 74 +
 4 files changed, 77 insertions(+), 1 deletion(-)
 create mode 100644 m4/ax_check_compile_flag.m4

diff --git a/Makefile b/Makefile
index 5ba76d2e..c0eb3d68 100644
--- a/Makefile
+++ b/Makefile
@@ -63,7 +63,6 @@ ABSTOPDIR = $(shell pwd)
 TOPDIR := .
 
 # Common build flags
-CSTD = -std=gnu90
 CFLAGS = $(SUBST_CFLAGS) \
 $(CSTD) \
 -include config.h \
diff --git a/Makefile.inc.in b/Makefile.inc.in
index 52410f69..fc3dec69 100644
--- a/Makefile.inc.in
+++ b/Makefile.inc.in
@@ -4,6 +4,7 @@
 export
 
 CC = @CC@
+CSTD = @CSTD@
 LN_S = @LN_S@
 AR = @AR@
 RM = @RM@
diff --git a/configure.ac b/configure.ac
index 2dea1c64..17880206 100644
--- a/configure.ac
+++ b/configure.ac
@@ -30,6 +30,8 @@ AC_CANONICAL_HOST
 AC_C_CONST
 AC_C_VOLATILE
 AC_C_BIGENDIAN
+AX_CHECK_COMPILE_FLAG([-std=gnu90],[CSTD=-std=gnu90],[CSTD=-std=gnu89])
+AC_SUBST([CSTD])
 
 AC_SYS_LARGEFILE
 
diff --git a/m4/ax_check_compile_flag.m4 b/m4/ax_check_compile_flag.m4
new file mode 100644
index ..dcabb92a
--- /dev/null
+++ b/m4/ax_check_compile_flag.m4
@@ -0,0 +1,74 @@
+# ===
+#  https://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html
+# ===
+#
+# SYNOPSIS
+#
+#   AX_CHECK_COMPILE_FLAG(FLAG, [ACTION-SUCCESS], [ACTION-FAILURE], 
[EXTRA-FLAGS], [INPUT])
+#
+# DESCRIPTION
+#
+#   Check whether the given FLAG works with the current language's compiler
+#   or gives an error.  (Warnings, however, are ignored)
+#
+#   ACTION-SUCCESS/ACTION-FAILURE are shell commands to execute on
+#   success/failure.
+#
+#   If EXTRA-FLAGS is defined, it is added to the current language's default
+#   flags (e.g. CFLAGS) when the check is done.  The check is thus made with
+#   the flags: "CFLAGS EXTRA-FLAGS FLAG".  This can for example be used to
+#   force the compiler to issue an error when a bad flag is given.
+#
+#   INPUT gives an alternative input source to AC_COMPILE_IFELSE.
+#
+#   NOTE: Implementation based on AX_CFLAGS_GCC_OPTION. Please keep this
+#   macro in sync with AX_CHECK_{PREPROC,LINK}_FLAG.
+#
+# LICENSE
+#
+#   Copyright (c) 2008 Guido U. Draheim 
+#   Copyright (c) 2011 Maarten Bosmans 
+#
+#   This program is free software: you can redistribute it and/or modify it
+#   under the terms of the GNU General Public License as published by the
+#   Free Software Foundation, either version 3 of the License, or (at your
+#   option) any later version.
+#
+#   This program is distributed in the hope that it will be useful, but
+#   WITHOUT ANY WARRANTY; without even the implied warranty of
+#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
+#   Public License for more details.
+#
+#   You should have received a copy of the GNU General Public License along
+#   with this program. If not, see .
+#
+#   As a special exception, the respective Autoconf Macro's copyright owner
+#   gives unlimited permission to copy, distribute and modify the configure
+#   scripts that are the output of Autoconf when processing the Macro. You
+#   need not follow the terms of the GNU General Public License when using
+#   or distributing such scripts, even though portions of the text of the
+#   Macro appear in them. The GNU General Public License (GPL) does govern
+#   all other use of the material that constitutes the Autoconf Macro.
+#
+#   This special exception to the GPL applies to versions of the Autoconf
+#   Macro released by the Autoconf Archive. When you make and distribute a
+#   modified version of the Autoconf Macro, you may extend this special
+#   exception to the GPL to apply to your modified version as well.
+
+#serial 5
+
+AC_DEFUN([AX_CHECK_COMPILE_FLAG],
+[AC_PREREQ(2.64)dnl for _AC_LANG_PREFIX and AS_VAR_IF
+AS_VAR_PUSHDEF([CACHEVAR],[ax_cv_check_[]_AC_LANG_ABBREV[]flags_$4_$1])dnl
+AC_CACHE_CHECK([whether _AC_LANG compiler accepts $1], CACHEVAR, [
+  ax_check_save_flags=$[]_AC_LANG_PREFIX[]FLAGS
+  _AC_LANG_PREFIX[]FLAGS="$[]_AC_LANG_PREFIX[]FLAGS $4 $1"
+  AC_COMPILE_IFELSE([m4_default([$5],[AC_LANG_PROGRAM()])],
+[AS_VAR_SET(CACHEVAR,[yes])],
+[AS_VAR_SET(CACHEVAR,[no])])
+  _AC_LANG_PREFIX[]FLAGS=$ax_check_save_flags])
+AS_VAR_IF(CACHEVAR,yes,
+  [m4_default([$2], :)],
+  [m4_default([$3], :)])
+AS_VAR_POPDEF([CACHEVAR])dnl
+])dnl AX_CHECK_COMPILE_FLAGS
-- 
2.12.3

--
To unsubscribe from this list: send 

[PATCH 1/3] btrfs-progs: convert: fix support for e2fsprogs < 1.42

2018-04-27 Thread jeffm
From: Jeff Mahoney 

Commit 324d4c1857a (btrfs-progs: convert: Add larger device support)
introduced new dependencies on the 64-bit API provided by e2fsprogs.
That API was introduced in v1.42 (along with bigalloc).

This patch maps the following to their equivalents in e2fsprogs < 1.42.
- ext2fs_get_block_bitmap_range2
- ext2fs_inode_data_blocks2
- ext2fs_read_ext_attr2

Since we need to detect and define EXT2_FLAG_64BITS for compatibilty
anyway, it makes sense to use that to detect the older e2fsprogs instead
of defining a new flag ourselves.

Signed-off-by: Jeff Mahoney 
---
 configure.ac  |  6 +-
 convert/source-ext2.h | 12 +++-
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/configure.ac b/configure.ac
index af13a959..2dea1c64 100644
--- a/configure.ac
+++ b/configure.ac
@@ -140,11 +140,7 @@ BTRFSCONVERT_EXT2=0
 BTRFSCONVERT_REISERFS=0
 if test "x$enable_convert" = xyes; then
if test "x$with_convert" = "xauto" || echo "$with_convert" | grep -q 
"ext2"; then
-   PKG_CHECK_MODULES(EXT2FS, [ext2fs >= 1.42],,
-   [PKG_CHECK_MODULES(EXT2FS, [ext2fs],
-   [AC_DEFINE([HAVE_OLD_E2FSPROGS], [1],
- [E2fsprogs does not support 
BIGALLOC])]
-   )])
+   PKG_CHECK_MODULES(EXT2FS, [ext2fs])
PKG_CHECK_MODULES(COM_ERR, [com_err])
convertfs="${convertfs:+$convertfs,}ext2"
BTRFSCONVERT_EXT2=1
diff --git a/convert/source-ext2.h b/convert/source-ext2.h
index 80833b21..c3214125 100644
--- a/convert/source-ext2.h
+++ b/convert/source-ext2.h
@@ -33,8 +33,18 @@
  * BIGALLOC.
  * Unlike normal RO compat flag, BIGALLOC affects how e2fsprogs check used
  * space, and btrfs-convert heavily relies on it.
+ *
+ * e2fsprogs 1.42 also introduced the 64-bit API.  Any file system
+ * that requires it will have EXT4_FEATURE_INCOMPAT_64BIT set and
+ * will fail to open with earlier releases.  We can map it to the
+ * older API without risk of corruption.
  */
-#ifdef HAVE_OLD_E2FSPROGS
+#ifndef EXT2_FLAG_64BITS
+#define EXT2_FLAG_64BITS   (0)
+#define ext2fs_get_block_bitmap_range2 ext2fs_get_block_bitmap_range
+#define ext2fs_inode_data_blocks2 ext2fs_inode_data_blocks
+#define ext2fs_read_ext_attr2 ext2fs_read_ext_attr
+#define ext2fs_blocks_count(s) ((s)->s_blocks_count)
 #define EXT2FS_CLUSTER_RATIO(fs)   (1)
 #define EXT2_CLUSTERS_PER_GROUP(s) (EXT2_BLOCKS_PER_GROUP(s))
 #define EXT2FS_B2C(fs, blk)(blk)
-- 
2.12.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] btrfs-progs: build on SLE11

2018-04-27 Thread jeffm
From: Jeff Mahoney 

This patch set allows btrfs-progs to build without further patching
on a stock SLE11 SP4 install.  The major issues were:
- new dependencies on e2fsprogs 1.42
- use of -std=gnu90, which gcc 4.3.4 doesn't support
- use of m4_chomp, introduced in autoconf 2.64.

This patch set is also posted as a GitHub pull request:
https://github.com/kdave/btrfs-progs/pull/125

-Jeff

---

Jeff Mahoney (3):
  btrfs-progs: convert: fix support for e2fsprogs < 1.42
  btrfs-progs: build: detect whether -std=gnu90 is supported
  btrfs-progs: build: use m4_flatten instead of m4_chomp

 Makefile|  1 -
 Makefile.inc.in |  1 +
 configure.ac| 10 +++---
 convert/source-ext2.h   | 12 +++-
 m4/ax_check_compile_flag.m4 | 74 +
 5 files changed, 90 insertions(+), 8 deletions(-)
 create mode 100644 m4/ax_check_compile_flag.m4

-- 
2.12.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Problems with btrfs

2018-04-27 Thread David C. Partridge
I'm running Ubuntu 16.04.  I rebooted my server today as it wasn't
responding.

When I rebooted the root FS was read only.

I booted a live Ubuntu CD and checked the drive with the results shown in
attachment btrfs-check.log.

The error was still there after completing the btrfs check --repair :( And
when I deleted some old subvolumes from there after that the filesystem went
read-only with a bunch errors in dmesg which are in the attached file
btrfs-dmesg-errs.log.

Of course my last backup was longer ago than I like to think about. Though I
could restore back to about 8 months ago, I'd very much prefer not to ...

On my bootable CD the Kernel is 4.18.3 and btrfs-progs is 4.7.3-1

HELP!!!

If do have enough space to copy stuff off the root volume but I think I'll
need hand holding through the recovery process.

The volume has sub-volumes called @ and @home which are mounted as / and
/home respectively. 

Thanks
David




btrfs-check.log
Description: Binary data


btrfs-dmesg-errs.log
Description: Binary data


NVMe SSD + compression - benchmarking

2018-04-27 Thread Brendan Hide

Hey, all

I'm following up on the queries I had last week since I have installed 
the NVMe SSD into the PCI-e adapter. I'm having difficulty knowing 
whether or not I'm doing these benchmarks correctly.


As a first test, I put together a 4.7GB .tar containing mostly 
duplicated copies of the kernel source code (rather compressible). 
Writing this to the SSD I was seeing repeatable numbers - but noted that 
the new (supposedly faster) zstd compression is noticeably slower than 
all other methods. Perhaps this is partly due to lack of 
multi-threading? No matter, I did also notice a supposedly impossible 
stat when there is no compression, in that it seems to be faster than 
the PCI-E 2.0 bus theoretically can deliver:


compression type / write speed / read speed (in GBps)
zlib / 1.24 / 2.07
lzo / 1.17 / 2.04
zstd / 0.75 / 1.97
no / 1.42 / 2.79

The SSD is PCI-E 3.0 4-lane capable and is connected to a PCI-E 2.0 
16-lane slot. lspci -vv confirms it is using 4 lanes. This means it's 
peak throughput *should* be 2.0 GBps - but above you can see the average 
read benchmark is 2.79GBps. :-/


The crude timing script I've put together does the following:
- Format the SSD anew with btrfs and no custom settings
- wait 180 seconds for possible hardware TRIM to settle (possibly 
overkill since the SSD is new)
- Mount the fs using all defaults except for compression, which could be 
of zlib, lzo, zstd, or no

- sync
- Drop all caches
- Time the following
 - Copy the file to the test fs (source is a ramdisk)
 - sync
- Drop all caches
- Time the following
 - Copy back from the test fs to ramdisk
 - sync
- unmount

I can see how, with compression, it *can* be faster than 2 GBps (though 
it isn't). But I cannot see how having no compression could possibly be 
faster than 2 GBps. :-/


I can of course get more info if it'd help figure out this puzzle:

Kernel info:
Linux localhost.localdomain 4.16.3-1-vfio #1 SMP PREEMPT Sun Apr 22 
12:35:45 SAST 2018 x86_64 GNU/Linux
^ Close to the regular ArchLinux kernel - but with vfio, and compiled 
with -arch=native. See https://aur.archlinux.org/pkgbase/linux-vfio/


CPU model:
model name: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

Motherboard model:
Product Name: Z68MA-G45 (MS-7676)

lspci output for the slot:
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe 
SSD Controller SM961/PM961

^ The disk id sans serial is Samsung_SSD_960_EVO_1TB

dmidecode output for the slot:
Handle 0x001E, DMI type 9, 17 bytes
System Slot Information
Designation: J8B4
Type: x16 PCI Express
Current Usage: In Use
Length: Long
ID: 4
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: :02:01.1
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-27 Thread Chris Mason

On 27 Apr 2018, at 10:07, David Sterba wrote:


On Thu, Apr 26, 2018 at 07:59:23PM -0500, Jayashree Mohan wrote:

Thanks for the response. We are using a tool we developed called
CrashMonkey[1] to run crash consistency tests and generate the bug
reports above. We'd be happy to guide you through setting up
CrashMonkey and getting these bugs reproduced. However, if you want 
to

be able to reproduce them with your current setup (xfstest +
dm-flakey), I have the workload scripts attached to the end of the
mail which might make your task simpler.

Interestingly we seem to have found another bug that breaks rename
atomicity and results in a previously fsynced file missing.

Workload:
1. mkdir A
2. creat A/bar (*)
3. fsync A/bar
4. mkdir B
5. creat B/bar
6. rename B/bar A/bar
7. creat A/foo
8. fsync A/foo
9. fsync A
--- crash---

When we recover from the crash, we see that file A/bar goes missing.
If the rename did not persist, we expect to see A/bar(*) created in
step 2 above, or if the rename indeed persisted, we still expect file
A/bar to be present.


I'm no fsync expert and the lack of standard or well defined behaviour
(mentioned elsewhere) leads me to question, on what do you base your
expectations? Not only for this report, but in general during your
testing.

Comparing various filesystems will show that at best it's 
implementation

defined and everybody has their own reasons for doing it one way or
another, or request fsync at particular time etc.

We have a manual page in section 5 that contains general topics of
btrfs, so documenting the fsync specifics would be good.


My goal for the fsync tree log was to make it just do the right thing 
most of the time.  We mostly got there, thanks to a ton of fixes and 
test cases from Filipe.


fsync(some file) -- all the names for this file will exist, without 
having to fsync the directory.


fsync(some dir) -- ugh, don't fsync the directory.  But if you do, all 
the files/subdirs will exist, and unlinks will be done and renames will 
be included.  This is slow and may require a full FS commit, which is 
why we don't want dirs fsunk.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread David Sterba
On Fri, Apr 27, 2018 at 12:02:13PM -0400, Jeff Mahoney wrote:
> >> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
> >> +{
> > 
> > And this had to be moved upwards as there was earlier use of
> > btrfs_queue_work that matched following the hunk.
> 
> Weird.  That must be exactly the kind of mismerge artifact that we were
> talking about the other day.  In my tree it's in the right spot.

I've tried current master, upcoming pull request queue (misc-4.17, one
nonc-onflicting patch) and current misc-next. None of them applies the
patch cleanly and the function is still added after the first use, so
this would not compile.

The result can be found in
https://github.com/kdave/btrfs-devel/commits/ext/jeffm/qgroup-fixes
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: qgroup, don't try to insert status item after ENOMEM in rescan worker

2018-04-27 Thread David Sterba
On Fri, Apr 27, 2018 at 12:11:00PM -0400, Jeff Mahoney wrote:
> If we fail to allocate memory for a path, don't bother trying to
> insert the qgroup status item.  We haven't done anything yet and it'll
> fail also.  Just print an error and be done with it.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/qgroup.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 8de423a0c7e3..b795bad54705 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2648,7 +2648,6 @@ static void btrfs_qgroup_rescan_worker(struct 
> btrfs_work *work)
>   btrfs_end_transaction(trans);
>   }
>  
> -out:
>   btrfs_free_path(path);
>  
>   mutex_lock(_info->qgroup_rescan_lock);
> @@ -2684,13 +2683,13 @@ static void btrfs_qgroup_rescan_worker(struct 
> btrfs_work *work)
>  
>   if (btrfs_fs_closing(fs_info)) {
>   btrfs_info(fs_info, "qgroup scan paused");
> - } else if (err >= 0) {
> + err = 0;
> + } else if (err >= 0)
>   btrfs_info(fs_info, "qgroup scan completed%s",
>   err > 0 ? " (inconsistency flag cleared)" : "");
> - } else {
> +out:
> + if (err < 0)
>   btrfs_err(fs_info, "qgroup scan failed with %d", err);

Ah right, with the err = 0 in the fs_closing check we won't see both
messages reported, "qgroup scan paused" and "qgroup scan failed with %d".

Reviewed-by: David Sterba 

> - }
> -
>  done:
>   mutex_lock(_info->qgroup_rescan_lock);
>   fs_info->qgroup_rescan_running = false;
> -- 
> 2.12.3
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: Add test that checks rmdir(2) can delete a subvolume

2018-04-27 Thread David Sterba
On Fri, Apr 27, 2018 at 05:02:45PM +0900, Misono Tomohiro wrote:
> Add btrfs test that checks "rmdir" or "rm -r" command can delete a
> subvolume like an ordinary drectory.
> 
> This behavior has been restricted long time but becomes allowed by
> following patch in the kernel:
>   btrfs: Allow rmdir(2) to delete an empty subvolume

AFAICS this test will fail on older kernels, eg. stable versions, but
this is not a bug that needs to be fixed by backporting patches. The
usual way is to use the expunges, but as this is a feature coverage, I
think the test should detect if the kernel has the feature at all.
Similar to the fallocate modes etc.

For the test itself, the scenarios look sufficient, so
Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 12/16] btrfs: track running balance in a simpler way

2018-04-27 Thread David Sterba
On Fri, Apr 27, 2018 at 10:10:24AM +0800, Anand Jain wrote:
> 
> 
> On 04/20/2018 12:33 AM, David Sterba wrote:
> > Currently fs_info::balance_running is 0 or 1 and does not use the
> > semantics of atomics. The pause and cancel check for 0, that can happen
> > only after __btrfs_balance exits for whatever reason.
> > 
> > Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple
> > times but will block on the balance_mutex that protects the
> > fs_info::flags bit.
> > 
> > Signed-off-by: David Sterba 
> 
>   Though it appears to me that we could optimize the lock and atomic bit
>   usage a little more, however it can be taken for the next round of
>   cleanups. So for now..
> 
>   Reviewed-by: Anand Jain 

Thanks, patches updated.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: qgroup, don't try to insert status item after ENOMEM in rescan worker

2018-04-27 Thread Jeff Mahoney
If we fail to allocate memory for a path, don't bother trying to
insert the qgroup status item.  We haven't done anything yet and it'll
fail also.  Just print an error and be done with it.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/qgroup.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 8de423a0c7e3..b795bad54705 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2648,7 +2648,6 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work 
*work)
btrfs_end_transaction(trans);
}
 
-out:
btrfs_free_path(path);
 
mutex_lock(_info->qgroup_rescan_lock);
@@ -2684,13 +2683,13 @@ static void btrfs_qgroup_rescan_worker(struct 
btrfs_work *work)
 
if (btrfs_fs_closing(fs_info)) {
btrfs_info(fs_info, "qgroup scan paused");
-   } else if (err >= 0) {
+   err = 0;
+   } else if (err >= 0)
btrfs_info(fs_info, "qgroup scan completed%s",
err > 0 ? " (inconsistency flag cleared)" : "");
-   } else {
+out:
+   if (err < 0)
btrfs_err(fs_info, "qgroup scan failed with %d", err);
-   }
-
 done:
mutex_lock(_info->qgroup_rescan_lock);
fs_info->qgroup_rescan_running = false;
-- 
2.12.3


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-27 Thread David Sterba
On Thu, Apr 26, 2018 at 07:59:23PM -0500, Jayashree Mohan wrote:
> Thanks for the response. We are using a tool we developed called
> CrashMonkey[1] to run crash consistency tests and generate the bug
> reports above. We'd be happy to guide you through setting up
> CrashMonkey and getting these bugs reproduced. However, if you want to
> be able to reproduce them with your current setup (xfstest +
> dm-flakey), I have the workload scripts attached to the end of the
> mail which might make your task simpler.
> 
> Interestingly we seem to have found another bug that breaks rename
> atomicity and results in a previously fsynced file missing.
> 
> Workload:
> 1. mkdir A
> 2. creat A/bar (*)
> 3. fsync A/bar
> 4. mkdir B
> 5. creat B/bar
> 6. rename B/bar A/bar
> 7. creat A/foo
> 8. fsync A/foo
> 9. fsync A
> --- crash---
> 
> When we recover from the crash, we see that file A/bar goes missing.
> If the rename did not persist, we expect to see A/bar(*) created in
> step 2 above, or if the rename indeed persisted, we still expect file
> A/bar to be present.

I'm no fsync expert and the lack of standard or well defined behaviour
(mentioned elsewhere) leads me to question, on what do you base your
expectations? Not only for this report, but in general during your
testing.

Comparing various filesystems will show that at best it's implementation
defined and everybody has their own reasons for doing it one way or
another, or request fsync at particular time etc.

We have a manual page in section 5 that contains general topics of
btrfs, so documenting the fsync specifics would be good.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] btrfs: qgroup, don't try to insert status item after ENOMEM in rescan worker

2018-04-27 Thread Jeff Mahoney
On 4/27/18 11:44 AM, David Sterba wrote:
> On Thu, Apr 26, 2018 at 11:39:50PM +0300, Nikolay Borisov wrote:
>> On 26.04.2018 22:23, je...@suse.com wrote:
>>> From: Jeff Mahoney 
>>>
>>> If we fail to allocate memory for a path, don't bother trying to
>>> insert the qgroup status item.  We haven't done anything yet and it'll
>>> fail also.  Just print an error and be done with it.
>>>
>>> Signed-off-by: Jeff Mahoney 
>>
>> nit: So the code is correct however, having the out label there is
>> really ugly. What about on path alloc failure just have the print in the
>> if branch do goto done?
> 
> Yeah, I don't like jumping to the inner blocks either. I saw this in the
> qgroup code so we should clean it up and not add new instances.
> 
> In this case, only the path allocation failure jumps to the out label,
> so printing the message and then jump to done makes sense to me.
> However, the message would have to be duplicated in the end, and I don't
> see a better way without further restructuring the code.
> 

It doesn't require major surgery.  The else can be disconnected.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Jeff Mahoney
On 4/27/18 11:56 AM, David Sterba wrote:
> On Thu, Apr 26, 2018 at 03:23:49PM -0400, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>> ended up reintroducing the hang-on-unmount bug that the commit it
>> intended to fix addressed.
>>
>> The race this time is between qgroup_rescan_init setting
>> ->qgroup_rescan_running = true and the worker starting.  There are
>> many scenarios where we initialize the worker and never start it.  The
>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>> This can happen even without involving error handling, since mounting
>> the file system read-only returns between initializing the worker and
>> queueing it.
>>
>> The right place to do it is when we're queuing the worker.  The flag
>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>> a completion.
>>
>> This patch introduces a new helper, queue_rescan_worker, that handles
>> the ->qgroup_rescan_running flag, including any races with umount.
>>
>> While we're at it, ->qgroup_rescan_running is protected only by the
>> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
>> to take the spinlock too.
>>
>> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> Signed-off-by: Jeff Mahoney 
> 
> I've added this to misc-next as I'd like to push it to the next rc. The
> Fixes is fixed.
> 
>> +/* qgroup rescan worker is running or queued to run */
>>  bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
> 
> Comments merged.

Thanks.

>>  /* filesystem state */
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index aa259d6986e1..be491b6c020a 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -2072,6 +2072,30 @@ int btrfs_qgroup_account_extents(struct 
>> btrfs_trans_handle *trans,
>>  return ret;
>>  }
>>  
>> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
>> +{
> 
> And this had to be moved upwards as there was earlier use of
> btrfs_queue_work that matched following the hunk.

Weird.  That must be exactly the kind of mismerge artifact that we were
talking about the other day.  In my tree it's in the right spot.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Jeff Mahoney
On 4/27/18 4:48 AM, Filipe Manana wrote:
> On Thu, Apr 26, 2018 at 8:23 PM,   wrote:
>> From: Jeff Mahoney 
>>
>> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
>> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
>> ended up reintroducing the hang-on-unmount bug that the commit it
>> intended to fix addressed.
>>
>> The race this time is between qgroup_rescan_init setting
>> ->qgroup_rescan_running = true and the worker starting.  There are
>> many scenarios where we initialize the worker and never start it.  The
>> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
>> This can happen even without involving error handling, since mounting
>> the file system read-only returns between initializing the worker and
>> queueing it.
>>
>> The right place to do it is when we're queuing the worker.  The flag
>> really just means that btrfs_ioctl_quota_rescan_wait should wait for
>> a completion.
>>
>> This patch introduces a new helper, queue_rescan_worker, that handles
>> the ->qgroup_rescan_running flag, including any races with umount.
>>
>> While we're at it, ->qgroup_rescan_running is protected only by the
>> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
>> to take the spinlock too.
>>
>> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
> 
> The commit id and subjects don't match:
> 
> commit d2c609b834d62f1e91f1635a27dca29f7806d3d6
> Author: Jeff Mahoney 
> Date:   Mon Aug 15 12:10:33 2016 -0400
> 
> btrfs: properly track when rescan worker is running
> 


Thanks.  Fixed.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread David Sterba
On Thu, Apr 26, 2018 at 03:23:49PM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
> ended up reintroducing the hang-on-unmount bug that the commit it
> intended to fix addressed.
> 
> The race this time is between qgroup_rescan_init setting
> ->qgroup_rescan_running = true and the worker starting.  There are
> many scenarios where we initialize the worker and never start it.  The
> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
> This can happen even without involving error handling, since mounting
> the file system read-only returns between initializing the worker and
> queueing it.
> 
> The right place to do it is when we're queuing the worker.  The flag
> really just means that btrfs_ioctl_quota_rescan_wait should wait for
> a completion.
> 
> This patch introduces a new helper, queue_rescan_worker, that handles
> the ->qgroup_rescan_running flag, including any races with umount.
> 
> While we're at it, ->qgroup_rescan_running is protected only by the
> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
> to take the spinlock too.
> 
> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
> Signed-off-by: Jeff Mahoney 

I've added this to misc-next as I'd like to push it to the next rc. The
Fixes is fixed.

> + /* qgroup rescan worker is running or queued to run */
>   bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */

Comments merged.

>   /* filesystem state */
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index aa259d6986e1..be491b6c020a 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2072,6 +2072,30 @@ int btrfs_qgroup_account_extents(struct 
> btrfs_trans_handle *trans,
>   return ret;
>  }
>  
> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
> +{

And this had to be moved upwards as there was earlier use of
btrfs_queue_work that matched following the hunk.

> +}
> +
>  /*
>   * called from commit_transaction. Writes all changed qgroups to disk.
>   */
> @@ -2123,8 +2147,7 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans,
>   ret = qgroup_rescan_init(fs_info, 0, 1);
>   if (!ret) {
>   qgroup_rescan_zero_tracking(fs_info);
> - btrfs_queue_work(fs_info->qgroup_rescan_workers,
> -  _info->qgroup_rescan_work);
> + queue_rescan_worker(fs_info);
>   }
>   ret = 0;
>   }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] btrfs: qgroup, don't try to insert status item after ENOMEM in rescan worker

2018-04-27 Thread David Sterba
On Thu, Apr 26, 2018 at 11:39:50PM +0300, Nikolay Borisov wrote:
> On 26.04.2018 22:23, je...@suse.com wrote:
> > From: Jeff Mahoney 
> > 
> > If we fail to allocate memory for a path, don't bother trying to
> > insert the qgroup status item.  We haven't done anything yet and it'll
> > fail also.  Just print an error and be done with it.
> > 
> > Signed-off-by: Jeff Mahoney 
> 
> nit: So the code is correct however, having the out label there is
> really ugly. What about on path alloc failure just have the print in the
> if branch do goto done?

Yeah, I don't like jumping to the inner blocks either. I saw this in the
qgroup code so we should clean it up and not add new instances.

In this case, only the path allocation failure jumps to the out label,
so printing the message and then jump to done makes sense to me.
However, the message would have to be duplicated in the end, and I don't
see a better way without further restructuring the code.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inconsistent behavior of fsync in btrfs

2018-04-27 Thread Chris Mason

On 26 Apr 2018, at 18:59, Jayashree Mohan wrote:


Hi Chris,

Thanks for the response. We are using a tool we developed called
CrashMonkey[1] to run crash consistency tests and generate the bug
reports above. We'd be happy to guide you through setting up
CrashMonkey and getting these bugs reproduced. However, if you want to
be able to reproduce them with your current setup (xfstest +
dm-flakey), I have the workload scripts attached to the end of the
mail which might make your task simpler.

Interestingly we seem to have found another bug that breaks rename
atomicity and results in a previously fsynced file missing.

Workload:
1. mkdir A
2. creat A/bar (*)
3. fsync A/bar
4. mkdir B
5. creat B/bar
6. rename B/bar A/bar
7. creat A/foo
8. fsync A/foo
9. fsync A


The original workloads have been easy to reproduce/fix so far.  I want 
to make sure my patches aren't slowing things down and I'll send them 
out ~Monday.


This one looks similar but I'll double check the rename handling and 
make sure I've got all the cases.


-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Optimal maintenance for RAID5 array

2018-04-27 Thread Menion
Hi all
I am running a RAID5 array built on 5x8TB HD. The filesystem usage is
aproximatively 6TB now
I rung kernel 4.16.5 and btrfs progs 4.16 (planning to upgrade to
4.16.1) under Ubuntu xenial
I am not sure what is the best/safest way to maintain the array, in
particular which is the best scrub/whatever strategy to apply
The filesystem is 95% used as storage, meaning that few files are
moved around or deleted
Bye
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: Unexport and rename btrfs_invalidate_inodes

2018-04-27 Thread Nikolay Borisov
This function is no longer used outside of inode.c so just make it
static. At the same time give a more becoming name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.

Signed-off-by: Nikolay Borisov 
---

Forgot to rename the function at the actual call site. 
 fs/btrfs/ctree.h |   1 -
 fs/btrfs/inode.c | 130 ---
 2 files changed, 66 insertions(+), 65 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 27aa9b58b001..9536872f2ec4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3231,7 +3231,6 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
  struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
-void btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d35992b8bb5e..d42cd8e6e8a6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4384,6 +4384,71 @@ static noinline int may_destroy_subvol(struct btrfs_root 
*root)
return ret;
 }
 
+/* Delete all dentries for inodes belonging to the root */
+static void btrfs_prune_dentries(struct btrfs_root *root)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct rb_node *node;
+   struct rb_node *prev;
+   struct btrfs_inode *entry;
+   struct inode *inode;
+   u64 objectid = 0;
+
+   if (!test_bit(BTRFS_FS_STATE_ERROR, _info->fs_state))
+   WARN_ON(btrfs_root_refs(>root_item) != 0);
+
+   spin_lock(>inode_lock);
+again:
+   node = root->inode_tree.rb_node;
+   prev = NULL;
+   while (node) {
+   prev = node;
+   entry = rb_entry(node, struct btrfs_inode, rb_node);
+
+   if (objectid < btrfs_ino(BTRFS_I(>vfs_inode)))
+   node = node->rb_left;
+   else if (objectid > btrfs_ino(BTRFS_I(>vfs_inode)))
+   node = node->rb_right;
+   else
+   break;
+   }
+   if (!node) {
+   while (prev) {
+   entry = rb_entry(prev, struct btrfs_inode, rb_node);
+   if (objectid <= btrfs_ino(BTRFS_I(>vfs_inode))) {
+   node = prev;
+   break;
+   }
+   prev = rb_next(prev);
+   }
+   }
+   while (node) {
+   entry = rb_entry(node, struct btrfs_inode, rb_node);
+   objectid = btrfs_ino(BTRFS_I(>vfs_inode)) + 1;
+   inode = igrab(>vfs_inode);
+   if (inode) {
+   spin_unlock(>inode_lock);
+   if (atomic_read(>i_count) > 1)
+   d_prune_aliases(inode);
+   /*
+* btrfs_drop_inode will have it removed from
+* the inode cache when its usage count
+* hits zero.
+*/
+   iput(inode);
+   cond_resched();
+   spin_lock(>inode_lock);
+   goto again;
+   }
+
+   if (cond_resched_lock(>inode_lock))
+   goto again;
+
+   node = rb_next(node);
+   }
+   spin_unlock(>inode_lock);
+}
+
 int btrfs_delete_subvolume(struct inode *dir, struct dentry *dentry)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb);
@@ -4510,7 +4575,7 @@ int btrfs_delete_subvolume(struct inode *dir, struct 
dentry *dentry)
spin_unlock(>root_item_lock);
} else {
d_invalidate(dentry);
-   btrfs_invalidate_inodes(dest);
+   btrfs_prune_dentries(dest);
ASSERT(dest->send_in_progress == 0);
 
/* the last ref */
@@ -5823,69 +5888,6 @@ static void inode_tree_del(struct inode *inode)
}
 }
 
-void btrfs_invalidate_inodes(struct btrfs_root *root)
-{
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct rb_node *node;
-   struct rb_node *prev;
-   struct btrfs_inode *entry;
-   struct inode *inode;
-   u64 objectid = 0;
-
-   if (!test_bit(BTRFS_FS_STATE_ERROR, _info->fs_state))
-   WARN_ON(btrfs_root_refs(>root_item) != 0);
-
-   spin_lock(>inode_lock);
-again:
-   node = root->inode_tree.rb_node;
-   prev = NULL;
-   while (node) {
-   prev = node;
-   entry = rb_entry(node, struct btrfs_inode, 

Re: [PATCH v2] Btrfs: set keep_lock when necessary in btrfs_defrag_leaves

2018-04-27 Thread David Sterba
On Fri, Apr 27, 2018 at 11:23:23AM +0800, Liu Bo wrote:
> >> ---
> >> v2: update commit log with more details.
> >>
> >>  fs/btrfs/tree-defrag.c | 3 +--
> >>  1 file changed, 1 insertion(+), 2 deletions(-)
> >>
> >> diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c
> >> index 3c0987ab587d..c12747904d4c 100644
> >> --- a/fs/btrfs/tree-defrag.c
> >> +++ b/fs/btrfs/tree-defrag.c
> >> @@ -65,8 +65,6 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> >>   memcpy(, >defrag_progress, sizeof(key));
> >>   }
> >>
> >> - path->keep_locks = 1;
> >> -
> >>   ret = btrfs_search_forward(root, , path, 
> >> BTRFS_OLDEST_GENERATION);
> >
> > What does btrfs_search_forward do as the first statement:
> >
> > 5115 int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key 
> > *min_key,
> > 5116  struct btrfs_path *path,
> > 5117  u64 min_trans)
> > 5118 {
> >  declarations
> > 5128
> > 5129 path->keep_locks = 1;
> >
> > So even if removed from above, there will be no change. The value of
> > keep_locks is preserved after btrfs_path_release.
> >
> 
> It will set it back,
> 
> out:
> path->keep_locks = keep_locks;

Oh, right. So much for reading the whole function.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Take trans lock before access running trans in check_delayed_ref

2018-04-27 Thread Nikolay Borisov


On 27.04.2018 12:58, ethanwu wrote:
> In preivous patch:
> Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
> We avoid starting btrfs transaction and get this information from
> fs_info->running_transaction directly.
> 
> When accessing running_transaction in check_delayed_ref, there's a
> chance that current transaction will be freed by commit transaction
> after the NULL pointer check of running_transaction is passed.
> 
> After looking all the other places using fs_info->running_transaction,
> they are either protected by trans_lock or holding the transactions.
> 
> Fix this by using trans_lock and incrementing the use_count.

Yep, looking at how ->running_transaction is set/cleared, holding the
trans_lock is correct here. However there is one point which you need to
fix.

> 
> Signed-off-by: ethanwu 
> ---
>  fs/btrfs/extent-tree.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c1618ab..c8fd37b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3149,7 +3149,11 @@ static noinline int check_delayed_ref(struct 
> btrfs_root *root,
>   struct rb_node *node;
>   int ret = 0;
>  
> + spin_lock(>fs_info->trans_lock);
>   cur_trans = root->fs_info->running_transaction;
> + if (cur_trans)
> + atomic_inc(_trans->use_count);

The ->use_count is using refcount_inc now, not atomic_inc. Please base
all further upstream patches on upstream code

> + spin_unlock(>fs_info->trans_lock);
>   if (!cur_trans)
>   return 0;
>  
> @@ -3158,6 +3162,7 @@ static noinline int check_delayed_ref(struct btrfs_root 
> *root,
>   head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
>   if (!head) {
>   spin_unlock(_refs->lock);
> + btrfs_put_transaction(cur_trans);
>   return 0;
>   }
>  
> @@ -3174,6 +3179,7 @@ static noinline int check_delayed_ref(struct btrfs_root 
> *root,
>   mutex_lock(>mutex);
>   mutex_unlock(>mutex);
>   btrfs_put_delayed_ref_head(head);
> + btrfs_put_transaction(cur_trans);
>   return -EAGAIN;
>   }
>   spin_unlock(_refs->lock);
> @@ -3206,6 +3212,7 @@ static noinline int check_delayed_ref(struct btrfs_root 
> *root,
>   }
>   spin_unlock(>lock);
>   mutex_unlock(>mutex);
> + btrfs_put_transaction(cur_trans);
>   return ret;
>  }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: Take trans lock before access running trans in check_delayed_ref

2018-04-27 Thread ethanwu
In preivous patch:
Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
We avoid starting btrfs transaction and get this information from
fs_info->running_transaction directly.

When accessing running_transaction in check_delayed_ref, there's a
chance that current transaction will be freed by commit transaction
after the NULL pointer check of running_transaction is passed.

After looking all the other places using fs_info->running_transaction,
they are either protected by trans_lock or holding the transactions.

Fix this by using trans_lock and incrementing the use_count.

Signed-off-by: ethanwu 
---
 fs/btrfs/extent-tree.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c1618ab..c8fd37b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3149,7 +3149,11 @@ static noinline int check_delayed_ref(struct btrfs_root 
*root,
struct rb_node *node;
int ret = 0;
 
+   spin_lock(>fs_info->trans_lock);
cur_trans = root->fs_info->running_transaction;
+   if (cur_trans)
+   atomic_inc(_trans->use_count);
+   spin_unlock(>fs_info->trans_lock);
if (!cur_trans)
return 0;
 
@@ -3158,6 +3162,7 @@ static noinline int check_delayed_ref(struct btrfs_root 
*root,
head = btrfs_find_delayed_ref_head(delayed_refs, bytenr);
if (!head) {
spin_unlock(_refs->lock);
+   btrfs_put_transaction(cur_trans);
return 0;
}
 
@@ -3174,6 +3179,7 @@ static noinline int check_delayed_ref(struct btrfs_root 
*root,
mutex_lock(>mutex);
mutex_unlock(>mutex);
btrfs_put_delayed_ref_head(head);
+   btrfs_put_transaction(cur_trans);
return -EAGAIN;
}
spin_unlock(_refs->lock);
@@ -3206,6 +3212,7 @@ static noinline int check_delayed_ref(struct btrfs_root 
*root,
}
spin_unlock(>lock);
mutex_unlock(>mutex);
+   btrfs_put_transaction(cur_trans);
return ret;
 }
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] btrfs: Split btrfs_del_delalloc_inode into 2 functions

2018-04-27 Thread Nikolay Borisov
This is in preparation of fixing delalloc inodes leakage on transaction
abort. Also export the new function.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 15 ---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5b7080d2fbff..27aa9b58b001 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3174,6 +3174,8 @@ noinline int can_nocow_extent(struct inode *inode, u64 
offset, u64 *len,
  u64 *orig_start, u64 *orig_block_len,
  u64 *ram_bytes);
 
+void __btrfs_del_delalloc_inode(struct btrfs_root *root,
+   struct btrfs_inode *inode);
 struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
 int btrfs_set_inode_index(struct btrfs_inode *dir, u64 *index);
 int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 078e02c21825..164950d96c8a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1742,12 +1742,12 @@ static void btrfs_add_delalloc_inodes(struct btrfs_root 
*root,
spin_unlock(>delalloc_lock);
 }
 
-static void btrfs_del_delalloc_inode(struct btrfs_root *root,
-struct btrfs_inode *inode)
+
+void __btrfs_del_delalloc_inode(struct btrfs_root *root,
+   struct btrfs_inode *inode)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
 
-   spin_lock(>delalloc_lock);
if (!list_empty(>delalloc_inodes)) {
list_del_init(>delalloc_inodes);
clear_bit(BTRFS_INODE_IN_DELALLOC_LIST,
@@ -1760,6 +1760,15 @@ static void btrfs_del_delalloc_inode(struct btrfs_root 
*root,
spin_unlock(_info->delalloc_root_lock);
}
}
+}
+
+
+static void btrfs_del_delalloc_inode(struct btrfs_root *root,
+struct btrfs_inode *inode)
+{
+
+   spin_lock(>delalloc_lock);
+   __btrfs_del_delalloc_inode(root, inode);
spin_unlock(>delalloc_lock);
 }
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] btrfs: Add assert in __btrfs_del_delalloc_inode

2018-04-27 Thread Nikolay Borisov
The invariant is that when nr_delalloc_inodes is 0 then the root
mustn't have any inodes on its delalloc inodes list.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 164950d96c8a..d35992b8bb5e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1754,6 +1754,7 @@ void __btrfs_del_delalloc_inode(struct btrfs_root *root,
  >runtime_flags);
root->nr_delalloc_inodes--;
if (!root->nr_delalloc_inodes) {
+   ASSERT(list_empty(>delalloc_inodes));
spin_lock(_info->delalloc_root_lock);
BUG_ON(list_empty(>delalloc_root));
list_del_init(>delalloc_root);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] Fix delalloc inodes leaking on btrfs unmount

2018-04-27 Thread Nikolay Borisov
After investigating crashes on generic/176 it turned that the culprit in fact
is the random failure induced by generic/019. As it happens, if on unmount the 
filesystem is in BTRFS_FS_STATE_ERROR then btrfs_error_commit_super is called. 
This unveiled 2 bugs:
 1. btrfs_destroy_delalloc_inodes's implementation was completely bogus, since
 it only called btrfs_invalidate_inodes which only pruned dentries and didn't 
 do anything to free any inodes with pending delalloc bytes. Once this is fixed 
 with the use of invalide_inode_pages2 the second bug transpired. 
 2. The last call ot run_delayed_iputs is made before btrfs_cleanup_transaction
 is called. The latter in turn could queue up more delayed iputs resulting from 
 invalidates_inode_pages2. 

This series fixes the problem by first fixing btrfs_destroy_delalloc_inode to 
properly cleanup delalloc inodes and as a result cleans up the code a bit. 

I've given it a good bashing through xfstest (4 full xfstest cycles + 100 
iterations of generic/475 since it was hitting some early assertion failures,
which are fixed in the final version) so am pretty confident in the change. 

Nikolay Borisov (5):
  btrfs: Unexport btrfs_alloc_delalloc_work
  btrfs: Split btrfs_del_delalloc_inode into 2 functions
  btrfs: Add assert in __btrfs_del_delalloc_inode
  btrfs: Fix delalloc inodes invalidation during transaction abort
  btrfs: Unexport and rename btrfs_invalidate_inodes

 fs/btrfs/ctree.h   |  12 +
 fs/btrfs/disk-io.c |  22 
 fs/btrfs/inode.c   | 152 ++---
 3 files changed, 99 insertions(+), 87 deletions(-)

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] btrfs: Unexport btrfs_alloc_delalloc_work

2018-04-27 Thread Nikolay Borisov
It's used only in inode.c so makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only this file.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.h | 9 -
 fs/btrfs/inode.c | 8 
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 789c7da2721c..5b7080d2fbff 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3167,15 +3167,6 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode 
*inode,
 struct extent_map *em);
 
 /* inode.c */
-struct btrfs_delalloc_work {
-   struct inode *inode;
-   struct completion completion;
-   struct list_head list;
-   struct btrfs_work work;
-};
-
-struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode);
-
 struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
struct page *page, size_t pg_offset, u64 start,
u64 len, int create);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9b8f2904ad55..078e02c21825 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10145,6 +10145,13 @@ static int btrfs_rename2(struct inode *old_dir, struct 
dentry *old_dentry,
return btrfs_rename(old_dir, old_dentry, new_dir, new_dentry, flags);
 }
 
+struct btrfs_delalloc_work {
+   struct inode *inode;
+   struct completion completion;
+   struct list_head list;
+   struct btrfs_work work;
+};
+
 static void btrfs_run_delalloc_work(struct btrfs_work *work)
 {
struct btrfs_delalloc_work *delalloc_work;
@@ -10162,6 +10169,7 @@ static void btrfs_run_delalloc_work(struct btrfs_work 
*work)
complete(_work->completion);
 }
 
+static
 struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode)
 {
struct btrfs_delalloc_work *work;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] btrfs: Unexport and rename btrfs_invalidate_inodes

2018-04-27 Thread Nikolay Borisov
This function is no longer used outside of inode.c so just make it
static. At the same time give a more becoming name, since it's not
really invalidating the inodes but just calling d_prune_alias. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.h |   1 -
 fs/btrfs/inode.c | 128 ---
 2 files changed, 65 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 27aa9b58b001..9536872f2ec4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3231,7 +3231,6 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
  struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
-void btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d35992b8bb5e..ba29e68a5fdd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4384,6 +4384,71 @@ static noinline int may_destroy_subvol(struct btrfs_root 
*root)
return ret;
 }
 
+/* Delete all dentries for inodes belonging to the root */
+static void btrfs_prune_dentries(struct btrfs_root *root)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct rb_node *node;
+   struct rb_node *prev;
+   struct btrfs_inode *entry;
+   struct inode *inode;
+   u64 objectid = 0;
+
+   if (!test_bit(BTRFS_FS_STATE_ERROR, _info->fs_state))
+   WARN_ON(btrfs_root_refs(>root_item) != 0);
+
+   spin_lock(>inode_lock);
+again:
+   node = root->inode_tree.rb_node;
+   prev = NULL;
+   while (node) {
+   prev = node;
+   entry = rb_entry(node, struct btrfs_inode, rb_node);
+
+   if (objectid < btrfs_ino(BTRFS_I(>vfs_inode)))
+   node = node->rb_left;
+   else if (objectid > btrfs_ino(BTRFS_I(>vfs_inode)))
+   node = node->rb_right;
+   else
+   break;
+   }
+   if (!node) {
+   while (prev) {
+   entry = rb_entry(prev, struct btrfs_inode, rb_node);
+   if (objectid <= btrfs_ino(BTRFS_I(>vfs_inode))) {
+   node = prev;
+   break;
+   }
+   prev = rb_next(prev);
+   }
+   }
+   while (node) {
+   entry = rb_entry(node, struct btrfs_inode, rb_node);
+   objectid = btrfs_ino(BTRFS_I(>vfs_inode)) + 1;
+   inode = igrab(>vfs_inode);
+   if (inode) {
+   spin_unlock(>inode_lock);
+   if (atomic_read(>i_count) > 1)
+   d_prune_aliases(inode);
+   /*
+* btrfs_drop_inode will have it removed from
+* the inode cache when its usage count
+* hits zero.
+*/
+   iput(inode);
+   cond_resched();
+   spin_lock(>inode_lock);
+   goto again;
+   }
+
+   if (cond_resched_lock(>inode_lock))
+   goto again;
+
+   node = rb_next(node);
+   }
+   spin_unlock(>inode_lock);
+}
+
 int btrfs_delete_subvolume(struct inode *dir, struct dentry *dentry)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(dentry->d_sb);
@@ -5823,69 +5888,6 @@ static void inode_tree_del(struct inode *inode)
}
 }
 
-void btrfs_invalidate_inodes(struct btrfs_root *root)
-{
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct rb_node *node;
-   struct rb_node *prev;
-   struct btrfs_inode *entry;
-   struct inode *inode;
-   u64 objectid = 0;
-
-   if (!test_bit(BTRFS_FS_STATE_ERROR, _info->fs_state))
-   WARN_ON(btrfs_root_refs(>root_item) != 0);
-
-   spin_lock(>inode_lock);
-again:
-   node = root->inode_tree.rb_node;
-   prev = NULL;
-   while (node) {
-   prev = node;
-   entry = rb_entry(node, struct btrfs_inode, rb_node);
-
-   if (objectid < btrfs_ino(BTRFS_I(>vfs_inode)))
-   node = node->rb_left;
-   else if (objectid > btrfs_ino(BTRFS_I(>vfs_inode)))
-   node = node->rb_right;
-   else
-   break;
-   }
-   if (!node) {
-   while (prev) {
-   entry = rb_entry(prev, struct btrfs_inode, rb_node);
-

[PATCH 4/5] btrfs: Fix delalloc inodes invalidation during transaction abort

2018-04-27 Thread Nikolay Borisov
When a transaction is aborted btrfs_cleanup_transaction is called to
cleanup all the various in-flight bits and pieces which migth be
active. One of those is delalloc inodes - inodes which have dirty
pages which haven't been persisted yet. Currently the process of
freeing such delalloc inodes in exceptional circumstances such as
transaction abort boiled down to calling btrfs_invalidate_inodes whose
sole job is to invalidate the dentries for all inodes related to a
root. This is in fact wrong and insufficient since such delalloc inodes
will likely have pending pages or ordered-extents and will be linked to
the sb->s_inode_list. This means that unmounting a btrfs instance with
an aborted transaction could potentially lead inodes/their pages
visible to the system long after their superblock has been freed. This
in turn leads to a "use-after-free" situation once page shrink is
triggered. This situation could be simulated by running generic/019
which would cause such inodes to be left hanging, followed by
generic/176 which causes memory pressure and page eviction which lead
to touching the freed super block instance. This situation is
additionally detected by the unmount code of VFS with the following
message:

"VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice 
day..."

Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(>inode_tree));
in free_fs_root for the same reason.

This patch aims to rectify the sitaution by doing the following:

1. Change btrfs_destroy_delalloc_inodes so that it calls
invalidate_inode_pages2 for every inode on the delalloc list, this
ensures that all the pages of the inode are released. This function
boils down to calling btrfs_releasepage. During test I observed cases
where inodes on the delalloc list were having an i_count of 0, so this
necessitates using igrab to be sure we are working on a non-freed inode.

2. Since calling btrfs_releasepage might queue delayed iputs move the
call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
calling run_delayed_iputs for the last time. This is necessary to ensure
that delayed iputs are run.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f50786e19a7a..0f88a551862a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3816,6 +3816,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
set_bit(BTRFS_FS_CLOSING_DONE, _info->flags);
 
btrfs_free_qgroup_config(fs_info);
+   ASSERT(list_empty(_info->delalloc_roots));
 
if (percpu_counter_sum(_info->delalloc_bytes)) {
btrfs_info(fs_info, "at unmount delalloc count %lld",
@@ -4123,15 +4124,15 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info)
 
 static void btrfs_error_commit_super(struct btrfs_fs_info *fs_info)
 {
+   /* cleanup FS via transaction */
+   btrfs_cleanup_transaction(fs_info);
+
mutex_lock(_info->cleaner_mutex);
btrfs_run_delayed_iputs(fs_info);
mutex_unlock(_info->cleaner_mutex);
 
down_write(_info->cleanup_work_sem);
up_write(_info->cleanup_work_sem);
-
-   /* cleanup FS via transaction */
-   btrfs_cleanup_transaction(fs_info);
 }
 
 static void btrfs_destroy_ordered_extents(struct btrfs_root *root)
@@ -4256,19 +4257,19 @@ static void btrfs_destroy_delalloc_inodes(struct 
btrfs_root *root)
list_splice_init(>delalloc_inodes, );
 
while (!list_empty()) {
+   struct inode *inode = NULL;
btrfs_inode = list_first_entry(, struct btrfs_inode,
   delalloc_inodes);
-
-   list_del_init(_inode->delalloc_inodes);
-   clear_bit(BTRFS_INODE_IN_DELALLOC_LIST,
- _inode->runtime_flags);
+   __btrfs_del_delalloc_inode(root, btrfs_inode);
spin_unlock(>delalloc_lock);
 
-   btrfs_invalidate_inodes(btrfs_inode->root);
-
+   inode = igrab(_inode->vfs_inode);
+   if (inode) {
+   invalidate_inode_pages2(inode->i_mapping);
+   iput(inode);
+   }
spin_lock(>delalloc_lock);
}
-
spin_unlock(>delalloc_lock);
 }
 
@@ -4284,7 +4285,6 @@ static void btrfs_destroy_all_delalloc_inodes(struct 
btrfs_fs_info *fs_info)
while (!list_empty()) {
root = list_first_entry(, struct btrfs_root,
 delalloc_root);
-   list_del_init(>delalloc_root);
root = btrfs_grab_fs_root(root);
BUG_ON(!root);
spin_unlock(_info->delalloc_root_lock);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Filipe Manana
On Thu, Apr 26, 2018 at 8:23 PM,   wrote:
> From: Jeff Mahoney 
>
> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
> ended up reintroducing the hang-on-unmount bug that the commit it
> intended to fix addressed.
>
> The race this time is between qgroup_rescan_init setting
> ->qgroup_rescan_running = true and the worker starting.  There are
> many scenarios where we initialize the worker and never start it.  The
> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
> This can happen even without involving error handling, since mounting
> the file system read-only returns between initializing the worker and
> queueing it.
>
> The right place to do it is when we're queuing the worker.  The flag
> really just means that btrfs_ioctl_quota_rescan_wait should wait for
> a completion.
>
> This patch introduces a new helper, queue_rescan_worker, that handles
> the ->qgroup_rescan_running flag, including any races with umount.
>
> While we're at it, ->qgroup_rescan_running is protected only by the
> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
> to take the spinlock too.
>
> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)

The commit id and subjects don't match:

commit d2c609b834d62f1e91f1635a27dca29f7806d3d6
Author: Jeff Mahoney 
Date:   Mon Aug 15 12:10:33 2016 -0400

btrfs: properly track when rescan worker is running


> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/ctree.h  |  1 +
>  fs/btrfs/qgroup.c | 40 
>  2 files changed, 29 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index da308774b8a4..dbba615f4d0f 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1045,6 +1045,7 @@ struct btrfs_fs_info {
> struct btrfs_workqueue *qgroup_rescan_workers;
> struct completion qgroup_rescan_completion;
> struct btrfs_work qgroup_rescan_work;
> +   /* qgroup rescan worker is running or queued to run */
> bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
>
> /* filesystem state */
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index aa259d6986e1..be491b6c020a 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2072,6 +2072,30 @@ int btrfs_qgroup_account_extents(struct 
> btrfs_trans_handle *trans,
> return ret;
>  }
>
> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
> +{
> +   mutex_lock(_info->qgroup_rescan_lock);
> +   if (btrfs_fs_closing(fs_info)) {
> +   mutex_unlock(_info->qgroup_rescan_lock);
> +   return;
> +   }
> +   if (WARN_ON(fs_info->qgroup_rescan_running)) {
> +   btrfs_warn(fs_info, "rescan worker already queued");
> +   mutex_unlock(_info->qgroup_rescan_lock);
> +   return;
> +   }
> +
> +   /*
> +* Being queued is enough for btrfs_qgroup_wait_for_completion
> +* to need to wait.
> +*/
> +   fs_info->qgroup_rescan_running = true;
> +   mutex_unlock(_info->qgroup_rescan_lock);
> +
> +   btrfs_queue_work(fs_info->qgroup_rescan_workers,
> +_info->qgroup_rescan_work);
> +}
> +
>  /*
>   * called from commit_transaction. Writes all changed qgroups to disk.
>   */
> @@ -2123,8 +2147,7 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans,
> ret = qgroup_rescan_init(fs_info, 0, 1);
> if (!ret) {
> qgroup_rescan_zero_tracking(fs_info);
> -   btrfs_queue_work(fs_info->qgroup_rescan_workers,
> -_info->qgroup_rescan_work);
> +   queue_rescan_worker(fs_info);
> }
> ret = 0;
> }
> @@ -2713,7 +2736,6 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 
> progress_objectid,
> sizeof(fs_info->qgroup_rescan_progress));
> fs_info->qgroup_rescan_progress.objectid = progress_objectid;
> init_completion(_info->qgroup_rescan_completion);
> -   fs_info->qgroup_rescan_running = true;
>
> spin_unlock(_info->qgroup_lock);
> mutex_unlock(_info->qgroup_rescan_lock);
> @@ -2785,9 +2807,7 @@ btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info)
>
> qgroup_rescan_zero_tracking(fs_info);
>
> -   btrfs_queue_work(fs_info->qgroup_rescan_workers,
> -_info->qgroup_rescan_work);
> -
> +   queue_rescan_worker(fs_info);
> return 0;
>  }
>
> @@ -2798,9 +2818,7 @@ int btrfs_qgroup_wait_for_completion(struct 
> btrfs_fs_info *fs_info,
> int ret = 0;
>
> mutex_lock(_info->qgroup_rescan_lock);
> -   spin_lock(_info->qgroup_lock);
> running = fs_info->qgroup_rescan_running;
> -   

Re: [PATCH 1/3] btrfs: qgroups, fix rescan worker running races

2018-04-27 Thread Nikolay Borisov


On 26.04.2018 22:23, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Commit d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
> fixed the issue with BTRFS_IOC_QUOTA_RESCAN_WAIT being racy, but
> ended up reintroducing the hang-on-unmount bug that the commit it
> intended to fix addressed.
> 
> The race this time is between qgroup_rescan_init setting
> ->qgroup_rescan_running = true and the worker starting.  There are
> many scenarios where we initialize the worker and never start it.  The
> completion btrfs_ioctl_quota_rescan_wait waits for will never come.
> This can happen even without involving error handling, since mounting
> the file system read-only returns between initializing the worker and
> queueing it.
> 
> The right place to do it is when we're queuing the worker.  The flag
> really just means that btrfs_ioctl_quota_rescan_wait should wait for
> a completion.
> 
> This patch introduces a new helper, queue_rescan_worker, that handles
> the ->qgroup_rescan_running flag, including any races with umount.
> 
> While we're at it, ->qgroup_rescan_running is protected only by the
> ->qgroup_rescan_mutex.  btrfs_ioctl_quota_rescan_wait doesn't need
> to take the spinlock too.
> 
> Fixes: d2c609b834d6 (Btrfs: fix qgroup rescan worker initialization)
> Signed-off-by: Jeff Mahoney 


LGTM.

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h  |  1 +
>  fs/btrfs/qgroup.c | 40 
>  2 files changed, 29 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index da308774b8a4..dbba615f4d0f 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1045,6 +1045,7 @@ struct btrfs_fs_info {
>   struct btrfs_workqueue *qgroup_rescan_workers;
>   struct completion qgroup_rescan_completion;
>   struct btrfs_work qgroup_rescan_work;
> + /* qgroup rescan worker is running or queued to run */
>   bool qgroup_rescan_running; /* protected by qgroup_rescan_lock */
>  
>   /* filesystem state */
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index aa259d6986e1..be491b6c020a 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2072,6 +2072,30 @@ int btrfs_qgroup_account_extents(struct 
> btrfs_trans_handle *trans,
>   return ret;
>  }
>  
> +static void queue_rescan_worker(struct btrfs_fs_info *fs_info)
> +{
> + mutex_lock(_info->qgroup_rescan_lock);
> + if (btrfs_fs_closing(fs_info)) {
> + mutex_unlock(_info->qgroup_rescan_lock);
> + return;
> + }
> + if (WARN_ON(fs_info->qgroup_rescan_running)) {
> + btrfs_warn(fs_info, "rescan worker already queued");
> + mutex_unlock(_info->qgroup_rescan_lock);
> + return;
> + }
> +
> + /*
> +  * Being queued is enough for btrfs_qgroup_wait_for_completion
> +  * to need to wait.
> +  */
> + fs_info->qgroup_rescan_running = true;
> + mutex_unlock(_info->qgroup_rescan_lock);
> +
> + btrfs_queue_work(fs_info->qgroup_rescan_workers,
> +  _info->qgroup_rescan_work);
> +}
> +
>  /*
>   * called from commit_transaction. Writes all changed qgroups to disk.
>   */
> @@ -2123,8 +2147,7 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans,
>   ret = qgroup_rescan_init(fs_info, 0, 1);
>   if (!ret) {
>   qgroup_rescan_zero_tracking(fs_info);
> - btrfs_queue_work(fs_info->qgroup_rescan_workers,
> -  _info->qgroup_rescan_work);
> + queue_rescan_worker(fs_info);
>   }

So here it's not possible to race, since if qgroup_rescan_init returns 0
then we are guaranteed to queue the rescan.

>   ret = 0;
>   }
> @@ -2713,7 +2736,6 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 
> progress_objectid,
>   sizeof(fs_info->qgroup_rescan_progress));
>   fs_info->qgroup_rescan_progress.objectid = progress_objectid;
>   init_completion(_info->qgroup_rescan_completion);
> - fs_info->qgroup_rescan_running = true;
>  
>   spin_unlock(_info->qgroup_lock);
>   mutex_unlock(_info->qgroup_rescan_lock);
> @@ -2785,9 +2807,7 @@ btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info)
>  
>   qgroup_rescan_zero_tracking(fs_info);
>  
> - btrfs_queue_work(fs_info->qgroup_rescan_workers,
> -  _info->qgroup_rescan_work);
> -
> + queue_rescan_worker(fs_info);

Which leaves this to be the only problematic case, in case transaction
joining/commit fails, right?

>   return 0;
>  }
>  
> @@ -2798,9 +2818,7 @@ int btrfs_qgroup_wait_for_completion(struct 
> btrfs_fs_info *fs_info,
>   int ret = 0;
>  
>   mutex_lock(_info->qgroup_rescan_lock);
> - spin_lock(_info->qgroup_lock);
>   running = fs_info->qgroup_rescan_running;
> - spin_unlock(_info->qgroup_lock);
>   

Re: [PATCH v2] btrfs: Add test that checks rmdir(2) can delete a subvolume

2018-04-27 Thread Eryu Guan
On Fri, Apr 27, 2018 at 05:02:45PM +0900, Misono Tomohiro wrote:
> Add btrfs test that checks "rmdir" or "rm -r" command can delete a
> subvolume like an ordinary drectory.
> 
> This behavior has been restricted long time but becomes allowed by
> following patch in the kernel:
>   btrfs: Allow rmdir(2) to delete an empty subvolume
> 
> Signed-off-by: Tomohiro Misono 

Looks good to me, thanks! But still, I'd like to see a Reviewed-by tag
from btrfs developers.

Thanks,
Eryu

> ---
> v1 -> v2
>   - rebase to current master
>   - remove underscore from local function
>   - remove source common/btrfs
>   - remove unnecessary _filter_scratch and echo messages when fail
> 
> As I will take a whole week vacation next week, my replay will be late.
> Thanks,
> Tomohiro Misono
> 
> 
>  tests/btrfs/160 | 141 
> 
>  tests/btrfs/160.out |   2 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 144 insertions(+)
>  create mode 100755 tests/btrfs/160
>  create mode 100644 tests/btrfs/160.out
> 
> diff --git a/tests/btrfs/160 b/tests/btrfs/160
> new file mode 100755
> index ..86d2dca7
> --- /dev/null
> +++ b/tests/btrfs/160
> @@ -0,0 +1,141 @@
> +#! /bin/bash
> +# FS QA Test btrfs/159
> +#
> +# QA test that checks rmdir(2) works for subvolumes like ordinary 
> directories.
> +#
> +# This behavior has been restricted long time but becomes allowed by 
> following
> +# patch in the kernel:
> +#   btrfs: Allow rmdir(2) to delete an empty subvolume
> +#
> +#---
> +# Copyright (c) 2018 Fujitsu. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +create_subvol()
> +{
> + $BTRFS_UTIL_PROG subvolume create $1 >> $seqres.full 2>&1
> +}
> +
> +create_snapshot()
> +{
> + $BTRFS_UTIL_PROG subvolume snapshot $@ >> $seqres.full 2>&1
> +}
> +
> +rmdir_subvol()
> +{
> + rmdir $1 >> $seqres.full 2>&1
> +}
> +
> +rm_r_subvol() {
> + rm -r $1 >> $seqres.full 2>&1
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs > /dev/null 2>&1 || _fail "mkfs failed"
> +_scratch_mount
> +
> +# Check that an empty subvolume can be deleted by rmdir
> +create_subvol $SCRATCH_MNT/sub1
> +rmdir_subvol $SCRATCH_MNT/sub1 || \
> + echo "rmdir should delete an empty subvolume"
> +
> +# Check that non-empty subvolume cannot be deleted by rmdir
> +create_subvol $SCRATCH_MNT/sub2
> +touch $SCRATCH_MNT/sub2/file
> +rmdir_subvol $SCRATCH_MNT/sub2 && \
> + echo "rmdir should fail for non-empty subvolume"
> +rm $SCRATCH_MNT/sub2/file
> +rmdir_subvol $SCRATCH_MNT/sub2 || \
> + echo "rmdir should delete an empty subvolume"
> +
> +# Check that read-only empty subvolume can be deleted by rmdir
> +create_subvol $SCRATCH_MNT/sub3
> +create_snapshot -r $SCRATCH_MNT/sub3 $SCRATCH_MNT/snap
> +$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sub3 ro true >> $seqres.full 2>&1
> +rmdir_subvol $SCRATCH_MNT/sub3 || \
> + echo "rmdir should delete an empty subvolume"
> +rmdir_subvol $SCRATCH_MNT/snap || \
> + echo "rmdir should delete a readonly empty subvolume"
> +
> +# Check that the default subvolume cannot be deleted by rmdir
> +create_subvol $SCRATCH_MNT/sub4
> +subvolid=$(_btrfs_get_subvolid $SCRATCH_MNT sub4)
> +$BTRFS_UTIL_PROG subvolume set-default $subvolid $SCRATCH_MNT \
> + >> $seqres.full 2>&1
> +rmdir_subvol $SCRATCH_MNT/sub4 && \
> + echo "rmdir should fail for the default subvolume"
> +
> +# Check that subvolume stub (created by snapshot) can be deleted by rmdir
> +# (Note: this has been always allowed)
> +create_subvol $SCRATCH_MNT/sub5
> +create_subvol $SCRATCH_MNT/sub5/sub6
> +create_snapshot $SCRATCH_MNT/sub5 

Re: [PATCH] btrfs: modify path->reada in btrfs_shrink_device to READA_BACK

2018-04-27 Thread Nikolay Borisov


On 27.04.2018 11:22, Gu Jinxiang wrote:
> In btrfs_shrink_device, before btrfs_search_slot, path->reada is
> set to READA_FORWARD. But I think READA_BACK is correct.
> Since,
>  1.key.offset is set to (u64)-1
>  2.After btrfs_search_slot, btrfs_previous_item is called.
> So, for readahead previous items, READA_BACK is correct one.

My analysis arrives at the same conclusions so:

Reviewed-by: Nikolay Borisov 

> 
> Signed-off-by: Gu Jinxiang 
> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 292266f6ab9c..1a3506a3003d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4473,7 +4473,7 @@ int btrfs_shrink_device(struct btrfs_device *device, 
> u64 new_size)
>   if (!path)
>   return -ENOMEM;
>  
> - path->reada = READA_FORWARD;
> + path->reada = READA_BACK;
>  
>   mutex_lock(_info->chunk_mutex);
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: modify path->reada in btrfs_shrink_device to READA_BACK

2018-04-27 Thread Gu Jinxiang
In btrfs_shrink_device, before btrfs_search_slot, path->reada is
set to READA_FORWARD. But I think READA_BACK is correct.
Since,
 1.key.offset is set to (u64)-1
 2.After btrfs_search_slot, btrfs_previous_item is called.
So, for readahead previous items, READA_BACK is correct one.

Signed-off-by: Gu Jinxiang 
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 292266f6ab9c..1a3506a3003d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4473,7 +4473,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 
new_size)
if (!path)
return -ENOMEM;
 
-   path->reada = READA_FORWARD;
+   path->reada = READA_BACK;
 
mutex_lock(_info->chunk_mutex);
 
-- 
1.9.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: Add test that checks rmdir(2) can delete a subvolume

2018-04-27 Thread Misono Tomohiro
Add btrfs test that checks "rmdir" or "rm -r" command can delete a
subvolume like an ordinary drectory.

This behavior has been restricted long time but becomes allowed by
following patch in the kernel:
  btrfs: Allow rmdir(2) to delete an empty subvolume

Signed-off-by: Tomohiro Misono 
---
v1 -> v2
  - rebase to current master
  - remove underscore from local function
  - remove source common/btrfs
  - remove unnecessary _filter_scratch and echo messages when fail

As I will take a whole week vacation next week, my replay will be late.
Thanks,
Tomohiro Misono


 tests/btrfs/160 | 141 
 tests/btrfs/160.out |   2 +
 tests/btrfs/group   |   1 +
 3 files changed, 144 insertions(+)
 create mode 100755 tests/btrfs/160
 create mode 100644 tests/btrfs/160.out

diff --git a/tests/btrfs/160 b/tests/btrfs/160
new file mode 100755
index ..86d2dca7
--- /dev/null
+++ b/tests/btrfs/160
@@ -0,0 +1,141 @@
+#! /bin/bash
+# FS QA Test btrfs/159
+#
+# QA test that checks rmdir(2) works for subvolumes like ordinary directories.
+#
+# This behavior has been restricted long time but becomes allowed by following
+# patch in the kernel:
+#   btrfs: Allow rmdir(2) to delete an empty subvolume
+#
+#---
+# Copyright (c) 2018 Fujitsu. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+create_subvol()
+{
+   $BTRFS_UTIL_PROG subvolume create $1 >> $seqres.full 2>&1
+}
+
+create_snapshot()
+{
+   $BTRFS_UTIL_PROG subvolume snapshot $@ >> $seqres.full 2>&1
+}
+
+rmdir_subvol()
+{
+   rmdir $1 >> $seqres.full 2>&1
+}
+
+rm_r_subvol() {
+   rm -r $1 >> $seqres.full 2>&1
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs > /dev/null 2>&1 || _fail "mkfs failed"
+_scratch_mount
+
+# Check that an empty subvolume can be deleted by rmdir
+create_subvol $SCRATCH_MNT/sub1
+rmdir_subvol $SCRATCH_MNT/sub1 || \
+   echo "rmdir should delete an empty subvolume"
+
+# Check that non-empty subvolume cannot be deleted by rmdir
+create_subvol $SCRATCH_MNT/sub2
+touch $SCRATCH_MNT/sub2/file
+rmdir_subvol $SCRATCH_MNT/sub2 && \
+   echo "rmdir should fail for non-empty subvolume"
+rm $SCRATCH_MNT/sub2/file
+rmdir_subvol $SCRATCH_MNT/sub2 || \
+   echo "rmdir should delete an empty subvolume"
+
+# Check that read-only empty subvolume can be deleted by rmdir
+create_subvol $SCRATCH_MNT/sub3
+create_snapshot -r $SCRATCH_MNT/sub3 $SCRATCH_MNT/snap
+$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sub3 ro true >> $seqres.full 2>&1
+rmdir_subvol $SCRATCH_MNT/sub3 || \
+   echo "rmdir should delete an empty subvolume"
+rmdir_subvol $SCRATCH_MNT/snap || \
+   echo "rmdir should delete a readonly empty subvolume"
+
+# Check that the default subvolume cannot be deleted by rmdir
+create_subvol $SCRATCH_MNT/sub4
+subvolid=$(_btrfs_get_subvolid $SCRATCH_MNT sub4)
+$BTRFS_UTIL_PROG subvolume set-default $subvolid $SCRATCH_MNT \
+   >> $seqres.full 2>&1
+rmdir_subvol $SCRATCH_MNT/sub4 && \
+   echo "rmdir should fail for the default subvolume"
+
+# Check that subvolume stub (created by snapshot) can be deleted by rmdir
+# (Note: this has been always allowed)
+create_subvol $SCRATCH_MNT/sub5
+create_subvol $SCRATCH_MNT/sub5/sub6
+create_snapshot $SCRATCH_MNT/sub5 $SCRATCH_MNT/snap2
+rmdir $SCRATCH_MNT/snap2/sub6 || \
+   echo "rmdir should delete a subvolume stub (ino number == 2)"
+
+# Check that rm -r works for both non-snapshot subvolume and snapshot
+create_subvol $SCRATCH_MNT/sub7
+mkdir $SCRATCH_MNT/sub7/dir
+create_subvol $SCRATCH_MNT/sub7/dir/sub8
+touch $SCRATCH_MNT/sub7/dir/sub8/file
+
+create_snapshot $SCRATCH_MNT/sub7 $SCRATCH_MNT/snap3
+create_snapshot -r $SCRATCH_MNT/sub7 $SCRATCH_MNT/snap4
+

Re: [PATCH] btrfs: Add test that checks rmdir(2) can delete a subvolume

2018-04-27 Thread Misono Tomohiro
On 2018/04/27 13:40, Eryu Guan wrote:
> On Thu, Apr 19, 2018 at 04:23:35PM +0900, Misono Tomohiro wrote:
>> Add btrfs test that checks "rmdir" or "rm -r" command can delete a
>> subvolume like an ordinary drectory.
>>
>> This behavior has been restricted long time but becomes allowed by
>> following patch in the kernel:
>>   btrfs: Allow rmdir(2) to delete an empty subvolume
>>
>> Signed-off-by: Tomohiro Misono 
> 
> Looks fine to me overall from fstests' perspective of view, some minor
> issues inline.
> 
> But I'd like a Reviewed-by tag from btrfs developers too.
> 
>> ---
>>  tests/btrfs/159 | 140 
>> 
>>  tests/btrfs/159.out |   2 +
>>  tests/btrfs/group   |   1 +
>>  3 files changed, 143 insertions(+)
>>  create mode 100755 tests/btrfs/159
>>  create mode 100644 tests/btrfs/159.out
>>
>> diff --git a/tests/btrfs/159 b/tests/btrfs/159
>> new file mode 100755
>> index ..2d830502
>> --- /dev/null
>> +++ b/tests/btrfs/159
>> @@ -0,0 +1,140 @@
>> +#! /bin/bash
>> +# FS QA Test btrfs/159
>> +#
>> +# QA test that checks rmdir(2) works for subvolumes like ordinary 
>> directories.
>> +#
>> +# This behavior has been restricted long time but becomes allowed by 
>> following
>> +# patch in the kernel:
>> +#   btrfs: Allow rmdir(2) to delete an empty subvolume
>> +#
>> +#---
>> +# Copyright (c) 2018 Fujitsu. All Rights Reserved.
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program; if not, write the Free Software Foundation,
>> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>> +#---
>> +#
>> +
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1# failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> +cd /
>> +rm -f $tmp.*
>> +}
>> +
>> +_create_subvol()
> 
> There's no need to name these local helper functions with the leading
> underscore.
> 
>> +{
>> +$BTRFS_UTIL_PROG subvolume create $1 2>&1 | \
>> +_filter_scratch >> $seqres.full
> 
> No need to do _filter_scratch if the output won't go to stdout/stderr,
> it's fine or even recommended to not filter the output when dumping them
> to $seqres.full for debug purpose.
> 
> There're a few other places that _filter_scratch can be dropped.
> 
>> +}
>> +
>> +_create_snapshot()
>> +{
>> +$BTRFS_UTIL_PROG subvolume snapshot $@ 2>&1 | \
>> +_filter_scratch >> $seqres.full
>> +}
>> +
>> +_rmdir_subvol()
>> +{
>> +rmdir $1 2>&1 | _filter_scratch
> 
> I notice that we don't expect any output from the test script ("Silence
> is golden" in .out file), so I don't think it's necessary to do
> redirection and filter here, just do "rmdir $1" and any error message
> would break golden image and fail the test.
> 
>> +return ${PIPESTATUS[0]}
> 
> If you only care about the return value, just dump the rmdir output to
> $seqres.full or /dev/null.
> 
>> +}
>> +
>> +_rm_r_subvol() {
>> +rm -r $1 2>&1 | _filter_scratch
>> +return ${PIPESTATUS[0]}
> 
> Same here.
> 
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +. ./common/btrfs
> 
> No need to source common/btrfs, it's already sourced by common/rc based
> on current $FSTYP.
> 
>> +
>> +# remove previous $seqres.full before test
>> +rm -f $seqres.full
>> +
>> +# real QA test starts here
>> +
>> +# Modify as appropriate.
>> +_supported_fs btrfs
>> +_supported_os Linux
>> +_require_scratch
>> +
>> +_scratch_mkfs > /dev/null 2>&1 || _fail "mkfs failed"
>> +_scratch_mount
>> +
>> +# Check that an empty subvolume can be deleted by rmdir
>> +_create_subvol $SCRATCH_MNT/sub1
>> +_rmdir_subvol $SCRATCH_MNT/sub1
>> +
>> +# Check that non-empty subvolume cannot be deleted by rmdir
>> +_create_subvol $SCRATCH_MNT/sub2
>> +touch $SCRATCH_MNT/sub2/file
>> +_rmdir_subvol $SCRATCH_MNT/sub2 >> $seqres.full && \
>> +_fail "rmdir should fail for non-empty subvolume"
> 
> Just echo the error message, instead of exiting the test, test could
> continue. Or even simpler, don't rely on the return status of
> _rmdir_subvol, just put any expected output from _rmdir_subvol to .out
> file, e.g.
> 
> _rmdir_subvol $SCRATCH_MNT/sub2 | _filter_scratch
> 
> So that, 

[PATCH] btrfs: balance dirty metadata pages in btrfs_finish_ordered_io

2018-04-27 Thread Ethan Lien
We should balance dirty metadata pages at the end of
btrfs_finish_ordered_io, since a small, unmergeable random write can
potentially produce dirty metadata which is multiple times larger than
the data itself. For example, a small, unmergeable 4KiB write may
produce:

16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

Although we do call balance dirty pages in write side, but in the
buffered write path, most metadata are dirtied only after we reach the
dirty background limit (which by far onlys counts dirty data pages) and
wakeup the flusher thread. If there are many small, unmergeable random
writes spread in a large btree, we'll find a burst of dirty pages
exceeds the dirty_bytes limit after we wakeup the flusher thread - which
is not what we expect. In our machine, it caused out-of-memory problem
since a page cannot be dropped if it is marked dirty.

Someone may worry about we may sleep in btrfs_btree_balance_dirty(), but
since we do btrfs_finish_ordered_io in a separate worker, it will not
stop the flusher consuming dirty pages. Also, we use different worker for
metadata writeback endio, sleep in btrfs_finish_ordered_io help us
throttle the size of dirty metadata pages.

Signed-off-by: Ethan Lien 
---
 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..d09433ab126e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3151,6 +3151,8 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
/* once for the tree */
btrfs_put_ordered_extent(ordered_extent);
 
+   btrfs_btree_balance_dirty(fs_info);
+
return ret;
 }
 
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] btrfs: add balance args info during start and resume

2018-04-27 Thread Nikolay Borisov


On 26.04.2018 11:01, Anand Jain wrote:
> Balance args info is an important information to be reviewed on the
> system under audit. So this patch adds that.

Would have been good to add example output to the change log.

> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/volumes.c | 153 
> +++--
>  1 file changed, 150 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e688c993197f..3d47b36579b3 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -138,6 +138,32 @@ const char *get_raid_name(enum btrfs_raid_types type)
>   return btrfs_raid_array[type].raid_name;
>  }
>  
> +static void get_all_raid_names(u64 bg_flags, char *raid_types)
> +{
> + int i;
> + bool found = false;
> +
> + for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
> + if (bg_flags & btrfs_raid_array[i].bg_flag) {
> + if (found) {
> + strcat(raid_types, "|");
> + strcat(raid_types, 
> btrfs_raid_array[i].raid_name);
> + } else {
> + found = true;
> + sprintf(raid_types, "%s", 
> btrfs_raid_array[i].raid_name);
> + }
> + }
> + }
> + if (bg_flags & BTRFS_AVAIL_ALLOC_BIT_SINGLE) {
> + if (found) {
> + strcat(raid_types, "|");
> + strcat(raid_types, "single");
> + } else {
> + sprintf(raid_types, "%s", "single");
> + }
> + }
> +}
> +
>  static int init_first_rw_device(struct btrfs_trans_handle *trans,
>   struct btrfs_fs_info *fs_info);
>  static int btrfs_relocate_sys_chunks(struct btrfs_fs_info *fs_info);
> @@ -3770,6 +3796,121 @@ static inline int validate_convert_profile(struct 
> btrfs_balance_args *bctl_arg,
>(bctl_arg->target & ~allowed)));
>  }
>  
> +static void get_balance_args(struct btrfs_balance_args *bargs, char *args)
> +{
> + char value[64];
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_PROFILES) {
> + strcat(args, "profiles=");
> + get_all_raid_names(bargs->profiles, value);
> + strcat(args, value);
> + strcat(args, " ");
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_USAGE) {
> + snprintf(value, 64, "usage=%llu ", bargs->usage);
> + strcat(args, value);
+   if (bg_flags & BTRFS_AVAIL_ALLOC_BIT_SINGLE) {
+   if (found) {
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_USAGE_RANGE) {
> + snprintf(value, 64, "usage_min=%u usage_max=%u ",
> + bargs->usage_min, bargs->usage_max);
> + strcat(args, value);
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_DEVID) {
> + snprintf(value, 64, "devid=%llu ", bargs->devid);
> + strcat(args, value);
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_DRANGE) {
> + snprintf(value, 64, "pstart=%llu pend=%llu ",
> +  bargs->pstart, bargs->pend);
> + strcat(args, value);
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_VRANGE) {
> + snprintf(value, 64, "vstart=%llu vend %llu ",
> +  bargs->vstart, bargs->vend);
> + strcat(args, value);
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_LIMIT) {
> + snprintf(value, 64, "limit=%llu ", bargs->limit);
> + strcat(args, value);
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_LIMIT_RANGE) {
> + snprintf(value, 64, "limit_min=%u limit_max=%u ",
> +  bargs->limit_min, bargs->limit_max);
> + strcat(args, value);
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_STRIPES_RANGE) {
> + snprintf(value, 64, "stripes_min=%u stripes_max=%u ",
> +  bargs->stripes_min, bargs->stripes_max);
> + strcat(args, value);
> + }
> +
> + if (bargs->flags & BTRFS_BALANCE_ARGS_CONVERT) {
> + int index = btrfs_bg_flags_to_raid_index(bargs->target);
> + snprintf(value, 64, "convert=%s ",
> +  get_raid_name(index));
> + strcat(args, value);
> + }
> +

Having all those ifs as independent statements means we can potentially
have each and every value present, I haven't really counted how long the
string could potentially get. Is it not possible to convert some of them
into if/else if constructs or is it indeed possible to have all of them
present at once?

> + /* If space was the last char remove it */
> + if (strlen(args) && (args[strlen(args) - 1] == ' '))
> + args[strlen(args) - 1] = '\0';
> +}
> +
> +static void print_balance_start_or_resume(struct btrfs_fs_info *fs_info)
>