date:20180321

[PATCH V2] Btrfs: replace: cache rbio when rebuild data on missing device

2018-03-21 Thread Liu Bo

Rebuild on missing device is as same as recover, after it's done, rbio
has data which is consistent with on-disk data, so it can be cached to
avoid further reads.

Signed-off-by: Liu Bo 
Signed-off-by: Liu Bo 
---
v2: Add comments to explain why REBUILD needs to be merged.

 fs/btrfs/raid56.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index fcfc20de2df3..154318249265 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1987,7 +1987,13 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
kfree(pointers);
 
 cleanup_io:
-   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
+   /*
+* Similar to READ_REBUILD, REBUILD_MISSING at this point also has a
+* valid rbio which is consistent with ondisk content, thus such a valid
+* rbio can be cached to avoid further disk reads.
+*/
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD ||
+   rbio->operation == BTRFS_RBIO_REBUILD_MISSING) {
/*
 * - In case of two failures, where rbio->failb != -1:
 *
@@ -2008,8 +2014,6 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
else
clear_bit(RBIO_CACHE_READY_BIT, >flags);
 
-   rbio_orig_end_io(rbio, err);
-   } else if (rbio->operation == BTRFS_RBIO_REBUILD_MISSING) {
rbio_orig_end_io(rbio, err);
} else if (err == BLK_STS_OK) {
rbio->faila = -1;
-- 
2.14.1.40.g8e62ba1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Out of space and incorrect size reported

2018-03-21 Thread Shane Walton

Unfortunately this didn’t seem to correct the problem.  Please see below:

> uname -a 
Linux rockstor 4.12.4-1.el7.elrepo.x86_64 #1 SMP Thu Jul 27 20:03:28 EDT 2017 
x86_64 x86_64 x86_64 GNU/Linux

> btrfs —version
btrfs-progs v4.12

> btrfs fi df -H /mnt2/pool_homes
Data, RAID1: total=257.70GB, used=257.46GB
System, RAID1: total=8.39MB, used=65.54kB
Metadata, RAID1: total=7.52GB, used=6.35GB
GlobalReserve, single: total=498.27MB, used=0.00B

> btrfs fi show /mnt2/pool_homes
Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
Total devices 2 FS bytes used 245.69GiB
devid1 size 465.76GiB used 247.01GiB path /dev/sda
devid2 size 465.76GiB used 247.01GiB path /dev/sdb

rockstor mounts everything over and over, even if I manually unmount, so I did 
the following:

> umount /mnt2/pool_homes; mount -o clear_cache /dev/sda /mnt2/pool_home

dmesg shows the following:

[ 3473.848389] BTRFS info (device sda): use no compression
[ 3473.848393] BTRFS info (device sda): disk space caching is enabled
[ 3473.848394] BTRFS info (device sda): has skinny extents
[ 3548.337574] BTRFS info (device sda): force clearing of disk cache
[ 3548.337578] BTRFS info (device sda): disk space caching is enabled
[ 3548.337580] BTRFS info (device sda): has skinny extents

Any help is appreciated!

> On Mar 21, 2018, at 5:56 PM, Hugo Mills  wrote:
> 
> On Wed, Mar 21, 2018 at 09:53:39PM +, Shane Walton wrote:
>>> uname -a
>> Linux rockstor 4.4.5-1.el7.elrepo.x86_64 #1 SMP Thu Mar 10 11:45:51 EST 2016 
>> x86_64 x86_64 x86_64 GNU/Linux
>> 
>>> btrfs —version 
>> btrfs-progs v4.4.1
>> 
>>> btrfs fi df /mnt2/pool_homes
>> Data, RAID1: total=240.00GiB, used=239.78GiB
>> System, RAID1: total=8.00MiB, used=64.00KiB
>> Metadata, RAID1: total=8.00GiB, used=5.90GiB
>> GlobalReserve, single: total=512.00MiB, used=59.31MiB
>> 
>>> btrfs filesystem show /mnt2/pool_homes
>> Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
>>  Total devices 2 FS bytes used 245.75GiB
>>  devid1 size 465.76GiB used 248.01GiB path /dev/sda
>>  devid2 size 465.76GiB used 248.01GiB path /dev/sdb
>> 
>> Why is the line above "Data, RAID1: total=240.00GiB, used=239.78GiB” almost 
>> full and limited to 240 GiB when there is I have 2x 500 GB HDD?  This is all 
>> create/implemented with the Rockstor platform and it says the “share” should 
>> be 400 GB.
>> 
>> What can I do to make this larger or closer to the full size of 465 GiB 
>> (minus the System and Metadata overhead)?
> 
>   Most likely, you need to ugrade your kernel to get past the known
> bug (fixed in about 4.6 or so, if I recall correctly), and then mount
> with -o clear_cache to force the free space cache to be rebuilt.
> 
>   Hugo.
> 
> -- 
> Hugo Mills | Q: What goes, "Pieces of seven! Pieces of seven!"?
> hugo@... carfax.org.uk | A: A parroty error.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |

N�r��yb�X��ǧv�^�)޺{.n�+{�n�߲)w*jg����ݢj/���z�ޖ��2�ޙ&�)ߡ�a�����G���h��j:+v���w��٥

grub_probe/grub-mkimage does not find all drives in BTRFS RAID1

2018-03-21 Thread Matthew Hawn

This is almost definitely a bug in GRUB, but I wanted to get the btrfs mailing 
list opinion first.

Symptoms:
I have a btrfs raid1 /boot and root filesystem.  Ever since I replaced a drive, 
when I run the grub utilities to create my grub.cfg and install to boot sector, 
it only recognizes one of the drives.

$ sudo grub-probe /boot/grub -t device
/dev/mapper/VG_BTRFS2-LV_BOOT2

$ sudo grub-probe /boot/grub -t bios_hints
lvmid/gEfhOx-J9hr-8tkA-OgjD-Aqqu-XR2T-sFB4me/oNnMDp-Rit5-P0qs-QZlf-bQQe-tZU7-Wwmz8z

This also prevents boot if the above drive is disconnected. Grub error in 
locating 
lvmid/gEfhOx-J9hr-8tkA-OgjD-Aqqu-XR2T-sFB4me/oNnMDp-Rit5-P0qs-QZlf-bQQe-tZU7-Wwmz8z

Boot works fine if both drives, or only the above drive is present.

Before drive replacement, the above commands returned both drives that were 
part of the RAID1 mirror.  I never tried booting with a device disconnected, 
but both showed up in my grub.cfg.   Replacement was not standard since the 
prior drive was developing bad sectors, but had not failed. Replacement was 
done by adding a third disk to the mirror, then removing the 1st disk.

Probable Cause:
To determine the boot drive, grub-probe and grub-mkimage make several ioctl  in 
osdep/linux/getroot.c: grub_find_root_devices_from_btrfs
A call to BTRFS_IOC_FS_INFO gets the max_id and num_devices.  It then iterates 
from 1 to max_id, calling BTRFS_IOC_DEV_INFO to get the path.  

For my system, max_id = 3 and num_devices = 2.  Requesting BTRFS_IOC_DEV_INFO 
for device 1 yields a "No Such Device".  Instead of continuing on to device 2 
and 3 (which return without error), grub treats all ioctl errors as fatal, 
exits the btrfs specific code with a failure, then falls back to generic linux 
code that only detects the single drive. 

So, is this a grub bug? If so, any suggestions before I submit to the grub-bug 
list?  Also, as I wait until a fix is published (or I rebuild grub with my own 
patch), any ideas to workaround this?


Info:
Kernel:  4.15.0-12-generic #13-Ubuntu SMP (based on 4.15.7 mainline)
btrfs-progs: 4.15.1-1
Grub:  2.02~beta2-36ubuntu3.17 

$ btrfs fi show
Label: none  uuid: 84c8e78b-9d7f-4451-966d-3c25154e89b8
Total devices 2 FS bytes used 22.16GiB
devid2 size 100.00GiB used 25.03GiB path 
/dev/mapper/VG_BTRFS2-LV_ROOT2
devid3 size 100.00GiB used 25.03GiB path 
/dev/mapper/VG_BTRFS3-LV_ROOT3


Label: none  uuid: 059ab98f-eb63-471d-b099-6561baf39040
Total devices 2 FS bytes used 61.04GiB
devid2 size 200.00GiB used 62.03GiB path 
/dev/mapper/VG_BTRFS2-LV_HOME2
devid3 size 200.00GiB used 62.03GiB path 
/dev/mapper/VG_BTRFS3-LV_HOME3


Label: none  uuid: ffe8b1a0-030c-42c2-94f5-b7e8e54b1439
Total devices 2 FS bytes used 342.04MiB
devid2 size 1.00GiB used 693.62MiB path 
/dev/mapper/VG_BTRFS2-LV_BOOT2
devid3 size 1.00GiB used 693.62MiB path 
/dev/mapper/VG_BTRFS3-LV_BOOT3--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 0/6] Fix long standing -EOPNOTSUPP problem caused by large inline extent

2018-03-21 Thread Qu Wenruo



On 2018年03月21日 23:51, David Sterba wrote:
> On Tue, Mar 20, 2018 at 02:42:23PM +0800, Qu Wenruo wrote:
>> The patch is based on v4.15.1, and is designed to replace the old patch
>> in devel branch.
>>
>> Kernel doesn't support dropping range inside inline extent, and prevents
>> such thing happening by limiting max inline extent size to
>> min(max_inline, sectorsize - 1) in cow_file_range_inline().
>>
>> However btrfs-progs only inherit the BTRFS_MAX_INLINE_DATA_SIZE() macro,
>> which doesn't have sectorsize check.
>> And since btrfs-progs defaults to 16K nodesize, above macro allows large
>> inline extent over 15K size.
>>
>> This leads to unexpected kernel behavior.
>>
>> The bug exists in several parts of btrfs-progs, any tool which creates
>> file extent is involved, including:
>> 1) btrfs-convert
>> 2) mkfs --rootdir
>>
>> This patchset fixes the problems in convert (both ext2 and reiserfs),
>> mkfs --rootdir, then add check support for both original and lowmem
>> mode, and finally adds 2 test cases, one for mkfs and one for convert.
>>
>> For mkfs test case, it can already be exposed by misc/002, but a
>> pin-point test case will be much better.
>>
>> changelog:
>> v2:
>>   Don't modify BTRFS_MAX_INLINE_DATA_SIZE(), but add extra check to
>>   callers who create file extents.
>> v3:
>>   Merge fixes for convert.
>>   Add real commit message for convert fixes.
>>   Use $TEST_TOP to replace cooperate with stand alone test cases.
>>   Use for loops to make the new test case shorter.
>>
>> Qu Wenruo (6):
>>   btrfs-progs: convert: Fix inline file extent creation condition
>>   btrfs-progs: mkfs/rootdir: Fix inline extent creation check
>>   btrfs-progs: check/original mode: Check inline extent size
>>   btrfs-progs: check/lowmem mode: Check inline extent size
>>   btrfs-progs: test/convert: Add test case for invalid large inline data
>> extent
>>   btrfs-progs: test/mkfs: Add test case for rootdir inline extent size
> 
> Applied, thanks. I had to fix the test, fallocate may fail, so a file of
> given size is created directly.

The fix looks good, and I learn a new trick.

But I'm wondering how could it fail.

Nowadays /tmp should be tmpfs created by systemd, which supports fallocate.
Or is the CI things doesn't follow this?

Thanks,
Qu

> 
> I hope this bug is fixed. My plan was to release 4.15.2 with just
> bugfixes but as this took some time to fix, there likely will be no
> minor release before 4.16 as kernel is about to be released next week.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature

Re: Crashes running btrfs scrub

2018-03-21 Thread Qu Wenruo



On 2018年03月22日 01:13, Liu Bo wrote:
> On Tue, Mar 20, 2018 at 7:01 PM, Qu Wenruo  wrote:
>>
>>
>> On 2018年03月21日 01:44, Mike Stevens wrote:
>>>
> 30 devices is really not that much, heck you get 90 disks top load JBOD
> storage chassis these days and BTRFS does sound like an attractive choice
> for things like that.
>>>
 So Mike's case is, that both metadata and data are configured as
 raid6, and the operations, balance and scrub, that he tried, need to
 set the existing block group as readonly (in order to avoid any
 further changes being applied during operations are running), then we
 got into the place where another system chunk is needed.
>>>
 However, I think it'd be better to have some warnings about this when
 doing a) mkfs.btrfs -mraid6, b) btrfs device add.
>>>
 David, any idea?
>>>
>>> I'll certainly vote for a warning, I would have set this up differently had 
>>> I been aware.
>>>
>>> My filesystem check seems to have returned successfully:
>>>
>>> [root@auswscs9903] ~ # btrfs check --readonly /dev/sdb
>>> Checking filesystem on /dev/sdb
>>> UUID: 77afc2bb-f7a8-4ce9-9047-c031f7571150
>>> checking extents
>>> checking free space cache
>>> checking fs roots
>>> checking csums
>>> checking root refs
>>> found 97926270238720 bytes used err is 0
>>> total csum bytes: 95395030288
>>> total tree bytes: 201223503872
>>> total fs tree bytes: 84484636672
>>> total extent tree bytes: 7195869184
>>> btree space waste bytes: 29627784154
>>> file data blocks allocated: 97756261568512
>>>
>>> I've remounted the filesystem and I can at least touch a file.  I'm 
>>> restarting the rsync that was running when it originally went read only.
>>> What is the next step if it drops back to r/o?
>>
>> As already mentioned, if you're using tons of disks and RAID0/10/5/6 as
>> metadata profile, you can just convert your metadata (or just system) to
>> RAID1/DUP.
>>
>> Then there will be more than enough space for system chunk array.
>>
> 
> It's chicken & egg, balance seems to be the only way to switch raid
> profiles however users are stuck here because balance is aborted due
> to failing to allocate an extra system chunk.

Skip_balance to abort current balance and do the new convert.
Since convert will allocate new chunk in new profile, raid1 sys chunk
should be able to fit into superblock.

Thanks,
Qu

> 
> thanks,
> liubo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature

Re: spurious full btrfs corruption

2018-03-21 Thread Christoph Anton Mitterer

Just some addition on this:

On Fri, 2018-03-16 at 01:03 +0100, Christoph Anton Mitterer wrote:
> The issue that newer btrfs-progs/kernel don't restore anything at all
> from my corrupted fs:

4.13.3 seems to be already buggy...

4.7.3 works, but interestingly btrfs-find-super seems to hang on it
forever with 100% CPU but apparently no disc IO (works in later
versions, where it finishes in a few seconds).

Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Out of space and incorrect size reported

2018-03-21 Thread Hugo Mills

On Wed, Mar 21, 2018 at 09:53:39PM +, Shane Walton wrote:
> > uname -a
> Linux rockstor 4.4.5-1.el7.elrepo.x86_64 #1 SMP Thu Mar 10 11:45:51 EST 2016 
> x86_64 x86_64 x86_64 GNU/Linux
> 
> > btrfs —version 
> btrfs-progs v4.4.1
> 
> > btrfs fi df /mnt2/pool_homes
> Data, RAID1: total=240.00GiB, used=239.78GiB
> System, RAID1: total=8.00MiB, used=64.00KiB
> Metadata, RAID1: total=8.00GiB, used=5.90GiB
> GlobalReserve, single: total=512.00MiB, used=59.31MiB
> 
> > btrfs filesystem show /mnt2/pool_homes
> Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
>   Total devices 2 FS bytes used 245.75GiB
>   devid1 size 465.76GiB used 248.01GiB path /dev/sda
>   devid2 size 465.76GiB used 248.01GiB path /dev/sdb
> 
> Why is the line above "Data, RAID1: total=240.00GiB, used=239.78GiB” almost 
> full and limited to 240 GiB when there is I have 2x 500 GB HDD?  This is all 
> create/implemented with the Rockstor platform and it says the “share” should 
> be 400 GB.
> 
> What can I do to make this larger or closer to the full size of 465 GiB 
> (minus the System and Metadata overhead)?

   Most likely, you need to ugrade your kernel to get past the known
bug (fixed in about 4.6 or so, if I recall correctly), and then mount
with -o clear_cache to force the free space cache to be rebuilt.

   Hugo.

-- 
Hugo Mills | Q: What goes, "Pieces of seven! Pieces of seven!"?
hugo@... carfax.org.uk | A: A parroty error.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature

Out of space and incorrect size reported

2018-03-21 Thread Shane Walton

> uname -a
Linux rockstor 4.4.5-1.el7.elrepo.x86_64 #1 SMP Thu Mar 10 11:45:51 EST 2016 
x86_64 x86_64 x86_64 GNU/Linux

> btrfs —version 
btrfs-progs v4.4.1

> btrfs fi df /mnt2/pool_homes
Data, RAID1: total=240.00GiB, used=239.78GiB
System, RAID1: total=8.00MiB, used=64.00KiB
Metadata, RAID1: total=8.00GiB, used=5.90GiB
GlobalReserve, single: total=512.00MiB, used=59.31MiB

> btrfs filesystem show /mnt2/pool_homes
Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
Total devices 2 FS bytes used 245.75GiB
devid1 size 465.76GiB used 248.01GiB path /dev/sda
devid2 size 465.76GiB used 248.01GiB path /dev/sdb

Why is the line above "Data, RAID1: total=240.00GiB, used=239.78GiB” almost 
full and limited to 240 GiB when there is I have 2x 500 GB HDD?  This is all 
create/implemented with the Rockstor platform and it says the “share” should be 
400 GB.

What can I do to make this larger or closer to the full size of 465 GiB (minus 
the System and Metadata overhead)?

Thanks!

Shane

Re: [PATCH] btrfs: Allow non-privileged user to delete empty subvolume by default

2018-03-21 Thread Goffredo Baroncelli

On 03/21/2018 12:47 PM, Austin S. Hemmelgarn wrote:
> I agree as well, with the addendum that I'd love to see a new ioctl that does 
> proper permissions checks.  While letting rmdir(2) work for an empty 
> subvolume with the appropriate permissions would be great (it will let rm -r 
> work correctly), it doesn't address the usefulness of being able to just 
> `btrfs subvolume delete` and not have to wait for the command to finish 
> before you can reuse the name.

How this could work ? 

If you want to check all the subvolumes files permissions, this will require 
some time: you need to traverse all the subvolume-filesystem; and only if all 
the checks are passed, you can delete the subvolume.

Unfortunately I think that only two options exist:
- don't check permissions, and you can quick remove a subvolume
- check all the permissions, i.e. check all the files permissions, and only if 
all the permissions are OK, you can delete the subvolume. However this cannot 
be a "quick" subvolume delete

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Status of RAID5/6

2018-03-21 Thread Menion

I am on 4.15.5 :)
Yes I agree that Journaling is better on the same array,  still should
be unit failure tolerant, so maybe it should go in a RAID1 scheme.
Will a raid56 array built with older kernel be compatible with the new
forecoming code?
Bye

2018-03-21 18:24 GMT+01:00 Liu Bo :
> On Wed, Mar 21, 2018 at 9:50 AM, Menion  wrote:
>> Hi all
>> I am trying to understand the status of RAID5/6 in BTRFS
>> I know that there are some discussion ongoing on the RFC patch
>> proposed by Liu bo
>> But it seems that everything stopped last summary. Also it mentioned
>> about a "separate disk for journal", does it mean that the final
>> implementation of RAID5/6 will require a dedicated HDD for the
>> journaling?
>
> Thanks for the interest on btrfs and raid56.
>
> The patch set is to plug write hole, which is very rare in practice, tbh.
> The feedback is to use existing space instead of another dedicate
> "fast device" as the journal in order to get some extent of raid
> protection.  I'd need some time to pick it up.
>
> With that being said, we have several data reconstruction fixes for
> raid56 (esp. raid6) in 4.15, I'd say please deploy btrfs with the
> upstream kernel or some distros which do kernel updates frequently,
> the most important one is
>
> 8810f7517a3b Btrfs: make raid6 rebuild retry more
> https://patchwork.kernel.org/patch/10091755/
>
> AFAIK, no other data corruptions showed up.
>
> thanks,
> liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Status of RAID5/6

2018-03-21 Thread Christoph Anton Mitterer

Hey.

Some things would IMO be nice to get done/clarified (i.e. documented in
the Wiki and manpages) from users'/admin's  POV:

Some basic questions:
- Starting with which kernels (including stable kernel versions) does
it contain the fixes for the bigger issues from some time ago?

- Exactly what does not work yet (only the write hole?)?
  What's the roadmap for such non-working things?

- Ideally some explicit confirmations of what's considered to work,
  like:
  - compression+raid?
  - rebuild / replace of devices?
  - changing raid lvls?
  - repairing data (i.e. picking the right block according to csums in
case of silent data corruption)?
  - scrub (and scrub+repair)?
  - anything to consider with raid when doing snapshots, send/receive
or defrag?
  => and for each of these: for which raid levels?

  Perhaps also confirmation for previous issues:
  - I vaguely remember there were issues with either device delete or
replace and that one of them was possibly super-slow?
  - I also remember there were cases in which a fs could end up in
permanent read-only state?


- Clarifying questions on what is expected to work and how things are
  expected to behave, e.g.:
  - Can one plug a device (without deleting/removing it first) just
under operation and will btrfs survive it?
  - If an error is found (e.g. silent data corruption based on csums),
when will it repair (fix = write the repaired data) the data?
On the read that finds the bad data?
Only on scrub (i.e. do users need to regularly run scrubs)? 
  - What happens if error cannot be repaired, e.g. no csum information
or all blocks bad?
EIO? Or are there cases where it gives no EIO (I guess at least in
nodatacow case)
  - What happens if data cannot be fixed (i.e. trying to write the
repaired block again fails)?
And if the repaired block is written, will it be immediately
checked again (to find cases of blocks that give different results
again)?
  - Will a scrub check only the data on "one" device... or will it
check all the copies (or parity blocks) on all devices in the raid?
  - Does a fsck check all devices or just one?
  - Does a balance implicitly contain a scrub?
  - If a rebuild/repair/reshape is performed... can these be
interrupted? What if they are forcibly interrupted (power loss)?


- Explaining common workflows:
  - Replacing a faulty or simply an old disk.
How to stop btrfs from using a device (without bricking the fs)?
How to do the rebuild.
  - Best practices, like: should one do regular balances (and if so, as
asked above, do these include the scrubs, so basically: is it
enough to do one of them)
  - How to grow/shrink raid btrfs... and if this is done... how to
replicate the data already on the fs to the newly added disks (or
is this done automatically - and if so, how to see that it's
finished)?
  - What will actually trigger repairs? (i.e. one wants to get silent
block errors fixed ASAP and not only when the data is read - and
when it's possibly to late)
  - In the rebuild/repair phase (e.g. one replaces a device): Can one
somehow give priority to the rebuild/repair? (e.g. in case of a
degraded raid, one may want to get that solved ASAP and rather slow
down other reads or stop them completely.
  - Is there anything to notice when btrfs raid is placed above dm-
crypt from a security PoV?
With MD raid that wasn't much of a problem as it's typically placed
below dm-crypt... but btrfs raid would need to be placed above it.
So maybe there are some known attacks against crypto modes, if
equal (RAID 1 / 10) or similar/equal (RAID 5/6) data is written
above multiple crypto devices? (Probably something one would need
to ask their experts).


- Maintenance tools
  - How to get the status of the RAID? (Querying kernel logs is IMO
rather a bad way for this)
This includes:
- Is the raid degraded or not?
- Are scrubs/repairs/rebuilds/reshapes in progress and how far are
  they? (Reshape would be: if the raid level is changed or the raid
  grown/shrinked: has all data been replicated enough to be
  "complete" for the desired raid lvl/number of devices/size?
   - What should one regularly do? scrubs? balance? How often?
 Do we get any automatic (but configurable) tools for this?
   - There should be support in commonly used tools, e.g. Icinga/Nagios
 check_raid
   - Ideally there should also be some desktop notification tool, which
 tells about raid (and btrfs errors in general) as small
 installations with raids typically run no Icinga/Nagios but rely
 on e.g. email or gui notifications.

I think especially for such tools it's important that these are
maintained by upstream (and yes I know you guys are rather fs
developers not)... but since these tools are so vital, having them done
3rd party can easily lead to the situation where something changes in

Re: [PATCH] btrfs-progs: mkfs: add uuid and otime to ROOT_ITEM of FS_TREE

2018-03-21 Thread David Sterba

On Mon, Mar 19, 2018 at 05:16:42PM +0900, Misono, Tomohiro wrote:
> Currently, the top-level subvolume lacks the UUID. As a result, both
> non-snapshot subvolume and snapshot of top-level subvolume do not have
> Parent UUID and cannot be distinguisued. Therefore "fi show" of
> top-level lists all the subvolumes which lacks the UUID in
> "Snapshot(s)" filed.  Also, it lacks the otime information.
> 
> Fix this by adding the UUID and otime at the mkfs time.  As a
> consequence, snapshots of top-level subvolume now have a Parent UUID and
> UUID tree will create an entry for top-level subvolume at mount time.
> This should not cause the problem for current kernel, but user program
> which relies on the empty Parent UUID may be affected by this change.
> 
> Signed-off-by: Tomohiro Misono 

So this adds uuid, ctime and otime to FS_TEEE but also to UUID_TREE and
DATA_RELOC_TREE. This is harmelss, but would be nice to mention in the
changelog, I'll apply the patch add that. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Status of RAID5/6

2018-03-21 Thread Liu Bo

On Wed, Mar 21, 2018 at 9:50 AM, Menion  wrote:
> Hi all
> I am trying to understand the status of RAID5/6 in BTRFS
> I know that there are some discussion ongoing on the RFC patch
> proposed by Liu bo
> But it seems that everything stopped last summary. Also it mentioned
> about a "separate disk for journal", does it mean that the final
> implementation of RAID5/6 will require a dedicated HDD for the
> journaling?

Thanks for the interest on btrfs and raid56.

The patch set is to plug write hole, which is very rare in practice, tbh.
The feedback is to use existing space instead of another dedicate
"fast device" as the journal in order to get some extent of raid
protection.  I'd need some time to pick it up.

With that being said, we have several data reconstruction fixes for
raid56 (esp. raid6) in 4.15, I'd say please deploy btrfs with the
upstream kernel or some distros which do kernel updates frequently,
the most important one is

8810f7517a3b Btrfs: make raid6 rebuild retry more
https://patchwork.kernel.org/patch/10091755/

AFAIK, no other data corruptions showed up.

thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Crashes running btrfs scrub

2018-03-21 Thread Liu Bo

On Tue, Mar 20, 2018 at 7:01 PM, Qu Wenruo  wrote:
>
>
> On 2018年03月21日 01:44, Mike Stevens wrote:
>>
 30 devices is really not that much, heck you get 90 disks top load JBOD
 storage chassis these days and BTRFS does sound like an attractive choice
 for things like that.
>>
>>> So Mike's case is, that both metadata and data are configured as
>>> raid6, and the operations, balance and scrub, that he tried, need to
>>> set the existing block group as readonly (in order to avoid any
>>> further changes being applied during operations are running), then we
>>> got into the place where another system chunk is needed.
>>
>>> However, I think it'd be better to have some warnings about this when
>>> doing a) mkfs.btrfs -mraid6, b) btrfs device add.
>>
>>> David, any idea?
>>
>> I'll certainly vote for a warning, I would have set this up differently had 
>> I been aware.
>>
>> My filesystem check seems to have returned successfully:
>>
>> [root@auswscs9903] ~ # btrfs check --readonly /dev/sdb
>> Checking filesystem on /dev/sdb
>> UUID: 77afc2bb-f7a8-4ce9-9047-c031f7571150
>> checking extents
>> checking free space cache
>> checking fs roots
>> checking csums
>> checking root refs
>> found 97926270238720 bytes used err is 0
>> total csum bytes: 95395030288
>> total tree bytes: 201223503872
>> total fs tree bytes: 84484636672
>> total extent tree bytes: 7195869184
>> btree space waste bytes: 29627784154
>> file data blocks allocated: 97756261568512
>>
>> I've remounted the filesystem and I can at least touch a file.  I'm 
>> restarting the rsync that was running when it originally went read only.
>> What is the next step if it drops back to r/o?
>
> As already mentioned, if you're using tons of disks and RAID0/10/5/6 as
> metadata profile, you can just convert your metadata (or just system) to
> RAID1/DUP.
>
> Then there will be more than enough space for system chunk array.
>

It's chicken & egg, balance seems to be the only way to switch raid
profiles however users are stuck here because balance is aborted due
to failing to allocate an extra system chunk.

thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs-progs: Beautify owner when printing leaf/nodes

2018-03-21 Thread David Sterba

On Tue, Mar 20, 2018 at 08:52:22PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年03月20日 16:45, Nikolay Borisov wrote:
> > Currently we print the raw values of the owner field of leaf/nodes.
> > This can result in output like the following:
> > 
> > leaf 30490624 items 2 free space 16061 generation 4 owner 
> > 18446744073709551607
> > 
> > With the patch applied the same leaf looks like:
> > 
> > leaf 30490624 items 2 free space 16061 generation 4 owner DATA_RELOC_TREE
> > 
> > Signed-off-by: Nikolay Borisov 
> 
> Looks good.
> 
> Reviewed-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Status of RAID5/6

2018-03-21 Thread Menion

Hi all
I am trying to understand the status of RAID5/6 in BTRFS
I know that there are some discussion ongoing on the RFC patch
proposed by Liu bo
But it seems that everything stopped last summary. Also it mentioned
about a "separate disk for journal", does it mean that the final
implementation of RAID5/6 will require a dedicated HDD for the
journaling?
Bye
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs-progs: inspect-dump-tree: add '-f|--follow' options to print all children tree blocks for '-b'

2018-03-21 Thread David Sterba

On Tue, Mar 20, 2018 at 01:19:21PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年03月20日 02:44, David Sterba wrote:
> > On Wed, Mar 14, 2018 at 09:05:38AM +0800, Qu Wenruo wrote:
> >> When debuging with "btrfs inspect dump-tree", it's not that handy if we
> >> want to iterate all child tree blocks starting from a specified block.
> >>
> >> -b can only print a single block, while without -b "btrfs inspect 
> >> dump-tree"
> >> will need extra tree roots fulfilled to continue, which is not possible
> >> for some damaged filesystem.
> >>
> >> Add a new option '-f|--follow' to iterate a sub-tree starting from block
> >> specified by '-b|--block', so we would have less limitation.
> >>
> >> Signed-off-by: Qu Wenruo 
> > 
> > I've dropped the shot option for now and updated the changelog. While
> > '-f' is probably a good choice, I'd postpone adding that if we want to
> > use it for other purposes.
> 
> I'm fine with this.
> 
> Just curious if this means we should also double check the short option
> usage or this is just for "-f"?

You don't need to add a short option from the beginning, this can be
added later when we can decide by the usage if it's worth.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 0/6] Fix long standing -EOPNOTSUPP problem caused by large inline extent

2018-03-21 Thread David Sterba

On Tue, Mar 20, 2018 at 02:42:23PM +0800, Qu Wenruo wrote:
> The patch is based on v4.15.1, and is designed to replace the old patch
> in devel branch.
> 
> Kernel doesn't support dropping range inside inline extent, and prevents
> such thing happening by limiting max inline extent size to
> min(max_inline, sectorsize - 1) in cow_file_range_inline().
> 
> However btrfs-progs only inherit the BTRFS_MAX_INLINE_DATA_SIZE() macro,
> which doesn't have sectorsize check.
> And since btrfs-progs defaults to 16K nodesize, above macro allows large
> inline extent over 15K size.
> 
> This leads to unexpected kernel behavior.
> 
> The bug exists in several parts of btrfs-progs, any tool which creates
> file extent is involved, including:
> 1) btrfs-convert
> 2) mkfs --rootdir
> 
> This patchset fixes the problems in convert (both ext2 and reiserfs),
> mkfs --rootdir, then add check support for both original and lowmem
> mode, and finally adds 2 test cases, one for mkfs and one for convert.
> 
> For mkfs test case, it can already be exposed by misc/002, but a
> pin-point test case will be much better.
> 
> changelog:
> v2:
>   Don't modify BTRFS_MAX_INLINE_DATA_SIZE(), but add extra check to
>   callers who create file extents.
> v3:
>   Merge fixes for convert.
>   Add real commit message for convert fixes.
>   Use $TEST_TOP to replace cooperate with stand alone test cases.
>   Use for loops to make the new test case shorter.
> 
> Qu Wenruo (6):
>   btrfs-progs: convert: Fix inline file extent creation condition
>   btrfs-progs: mkfs/rootdir: Fix inline extent creation check
>   btrfs-progs: check/original mode: Check inline extent size
>   btrfs-progs: check/lowmem mode: Check inline extent size
>   btrfs-progs: test/convert: Add test case for invalid large inline data
> extent
>   btrfs-progs: test/mkfs: Add test case for rootdir inline extent size

Applied, thanks. I had to fix the test, fallocate may fail, so a file of
given size is created directly.

I hope this bug is fixed. My plan was to release 4.15.2 with just
bugfixes but as this took some time to fix, there likely will be no
minor release before 4.16 as kernel is about to be released next week.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-21 Thread Eryu Guan

On Wed, Mar 21, 2018 at 02:22:29PM +0200, Amir Goldstein wrote:
> > +
> > +_log_writes_mount
> > +$FSSTRESS_PROG $fsstress_args > /dev/null 2>&1
> 
> You should run fsstress with run_check() so output will go to $seqres.full
> this way if you are able to catch a bug, you can take the random seed
> from fsstress output and repeat the same event sequence, which
> doesn't guaranty, but can increase the chances of reproducing the bug.

I suggested dropping run_check, as I don't think we care about the
fsstress return value here (and I always try to avoid new run_check
usage), but I agree that it might be useful to save the fsstress output
to $seqres.full.

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs-progs: wipe all copies of the stale superblock

2018-03-21 Thread Nikolay Borisov



On 21.03.2018 12:19, Anand Jain wrote:
> Recovering from the other copies of the superblock is fundamental
> to BTRFS, which provides resilient against single LBA failure in
nit: s/resilient/resilience.
> the DUP group profile.

Furthermore, the number of superblock copies doesn't really depend on
the group profile of either the data or metadata so the second part of
that statement is incorrect. All in all I don't think this paragraph
really brings any value so I suggest removing it altogether.

> 
> Further, in the test case [1] it shows a good but stale superblock
> at copy#2. This will lead to confusion during auto/manual recovery.
> So strictly speaking if a device has three copies of the superblock
> and if we have permission to wipe it (-f option) then we have to wipe
> all the copies of the superblock.

The problem is not really described in human-readable format. The issue
is that if we re-create a smaller btrfs filesystem on a driver which
contained a previous filesystem then it's possible to have old fs and
new fs superblocks to co-exist. Also the title of the patch needs
rewording to put the changes in context i.e. :

"Wipe possible superblock stale copies on mkfs" or some such.

> 
> If there is any objection to writing beyond mkfs.btrfs -b , then
> we could fail the mkfs/dev-add/dev-replace operation and ask the user
> the wipe using dd command manually, as anyway there is no use of keeping
> only the copy#2 when the new FS has been written.

This sentence also doesn't belong in the changelog, you are asking of
our opinion which is fine, but put it under the scissors line in that case.

> 
> Test case: Note that copy#2 fsid is different from primary and copy#1
> fsid.
> 
> mkfs.btrfs -qf /dev/mapper/vg-lv && \
> mkfs.btrfs -qf -b1G /dev/mapper/vg-lv && \
> btrfs in dump-super -a /dev/mapper/vg-lv | grep '.fsid|superblock:'
> 
> superblock: bytenr=65536, device=/dev/mapper/vg-lv
> dev_item.fsid ebc67d01-7fc5-43f0-90b4-d1925002551e [match]
> superblock: bytenr=67108864, device=/dev/mapper/vg-lv
> dev_item.fsid ebc67d01-7fc5-43f0-90b4-d1925002551e [match]
> superblock: bytenr=274877906944, device=/dev/mapper/vg-lv
> dev_item.fsid b97a9206-593b-4933-a424-c6a6ee23fe7c [match]
> 
> Signed-off-by: Anand Jain 
> ---
>  Hope with this we can patch the kernel to auto recover from the
>  failed primary SB. In the earlier discussion on that, I think we
>  are scrutinizing the wrong side (kernel) of the problem.
>  Also, we need to fail the mount if all the copies of the SB do
>  not have the same fsid.
> 
>  utils.c | 35 +++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/utils.c b/utils.c
> index 00020e1d6bdf..9c027f77d9c1 100644
> --- a/utils.c
> +++ b/utils.c
> @@ -365,6 +365,41 @@ int btrfs_prepare_device(int fd, const char *file, u64 
> *block_count_ret,
>   return 1;
>   }
>  
> + /*
> +  * Check for the BTRFS SB copies up until btrfs_device_size() and zero
> +  * it. So that kernel (or user for the manual recovery) don't have to
> +  * confuse with the stale SB copy during recovery.
> +  */
> + if (block_count != btrfs_device_size(fd, )) {
> + for (i = 1; i < BTRFS_SUPER_MIRROR_MAX; i++) {
> + struct btrfs_super_block *disk_super;
> + char buf[BTRFS_SUPER_INFO_SIZE];
> + disk_super = (struct btrfs_super_block *)buf;
> +
> + /* Already zeroed above */
> + if (btrfs_sb_offset(i) < block_count)
> + continue;
> +
> + /* Beyond actual disk size */
> + if (btrfs_sb_offset(i) >= btrfs_device_size(fd, ))
> + continue;
> +
> + /* Does not contain any stale SB */
> + if (btrfs_read_dev_super(fd, disk_super,
> +  btrfs_sb_offset(i), 0))
> + continue;
> +
> + ret = zero_dev_clamped(fd, btrfs_sb_offset(i),
> + BTRFS_SUPER_INFO_SIZE,
> + btrfs_device_size(fd, ));
> + if (ret < 0) {
> + error("failed to zero device '%s' bytenr %llu: 
> %s",
> + file, btrfs_sb_offset(i), 
> strerror(-ret));
> + return 1;
> + }
> + }
> + }

A better place for this code is really inside btrfs_wipe_existing_sb.
Furthermore looking at libblkid it supports a way to wipe multiple
superblocks by way of: blkid_do_wipe. I.e
https://mirrors.edge.kernel.org/pub/linux/utils/util-linux/v2.21/libblkid-docs/libblkid-Low-level-tags.html#blkid-do-wipe

Apparently this functionality has been in libblkid for quite some time
according to this blog post -

Re: [PATCH 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-21 Thread Amir Goldstein

On Wed, Mar 21, 2018 at 10:01 AM, Qu Wenruo  wrote:
> Basic test case which triggers fsstress with dm-log-writes, and then
> check the fs after each FUA writes.
> With needed infrastructure and special handlers for journal based fs.
>
> Signed-off-by: Qu Wenruo 
> ---
> Unfortunately, neither xfs nor ext4 survies this test for even single
> success, while btrfs survives.
> (Although not what I want, I'm just trying my luck
> to reproduce a serious btrfs corruption situation)
>
> Although btrfs may be the fastest fs for the test, since it has fixed
> small amount of write in mkfs and almost nothing to replay, it still
> takes about 240s~300s to finish (the same level using snapshot).
>
> It would take longer time for ext4 for its large amount of write during
> mkfs, if it can survive the test in the first space.
>
> As a comparison, btrfs takes about 5 seconds to replay log, mount,
> unmount and run fsck at the end of the test.
> But for ext4, it already takes about 5 seconds to do the same thing
> before triggering fsck error.
>
> Fsck fail for ext4:
> _check_generic_filesystem: filesystem on /dev/mapper/test-scratch1 is 
> inconsistent
> *** fsck.ext4 output ***
> fsck from util-linux 2.31.1
> e2fsck 1.43.8 (1-Jan-2018)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 131076 extent tree (at level 1) could be shorter.  Fix? no
>
> Inode 131262, i_size is 0, should be 258048.  Fix? no
>
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
>
> For xfs:
> _check_xfs_filesystem: filesystem on /dev/mapper/test-scratch1 is 
> inconsistent (r)
> *** xfs_repair -n output ***
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - zero log...
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan (but don't clear) agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> - agno = 1
> - agno = 2
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> - agno = 3
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 3
> - agno = 2
> entry "lb" in shortform directory 8409188 references free inode 8409190
> would have junked entry "lb" in directory inode 8409188
> bad symlink header ino 8409190, file block 0, disk block 1051147
> problem with symbolic link in inode 8409190
> would have cleared inode 8409190
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
> - traversing filesystem ...
> entry "lb" in shortform directory inode 8409188 points to free inode 8409190
> would junk entry
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> Maximum metadata LSN (1:396) is ahead of log (1:63).
> Would format log to cycle 4.
> No modify flag set, skipping filesystem flush and exiting.
>
> And special note for XFS guys, I also hit an XFS internal metadata
> warning during journal replay:
> [ 7901.423659] XFS (dm-4): Starting recovery (logdev: internal)
> [ 7901.577511] XFS (dm-4): Metadata corruption detected at 
> xfs_dinode_verify+0x467/0x570 [xfs], inode 0x805067 dinode
> [ 7901.580529] XFS (dm-4): Unmount and run xfs_repair
> [ 7901.581901] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
> [ 7901.583205] b8963f41: 49 4e a1 ff 03 02 00 00 00 00 00 00 00 00 00 
> 00  IN..
> [ 7901.584659] f35a50e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00  
> [ 7901.586075] 386eea9e: 5a b2 0e 69 0a f3 43 27 5a b2 0e 69 0a f3 43 
> 27  Z..i..C'Z..i..C'
> [ 7901.587561] ac636661: 5a b2 0e 69 0d 92 bc 00 00 00 00 00 00 00 00 
> 00  Z..i
> [ 7901.588969] d75f9093: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00  
> [ 7901.590475] d2af5688: 00 00 00 02 00 00 00 00 00 00 00 00 84 d7 6a 
> e9  ..j.
> [ 7901.591907] e8dd8211: ff ff ff ff 34 93 a9 3a 00 00 00 00 00 00 00 
> 04  4..:
> [ 7901.593319] b7610e4e: 00 00 00 01 00 00 00 45 00 00 00 00 00 00 00 
> 00  ...E
>
> ---
>  common/dmlogwrites|  56 +++
>  tests/generic/481 | 104 
> ++
>  tests/generic/481.out |   2 +
>  tests/generic/group   |   1 +
>  4 files changed, 163 insertions(+)
>  create mode 100755 tests/generic/481
>  create mode 100644 tests/generic/481.out
>
> diff --git a/common/dmlogwrites

Re: [PATCH] btrfs: Allow non-privileged user to delete empty subvolume by default

2018-03-21 Thread Austin S. Hemmelgarn


On 2018-03-21 03:46, Nikolay Borisov wrote:



On 20.03.2018 22:06, Goffredo Baroncelli wrote:

On 03/20/2018 07:45 AM, Misono, Tomohiro wrote:

Deletion of subvolume by non-privileged user is completely restricted
by default because we can delete a subvolume even if it is not empty
and may cause data loss. In other words, when user_subvol_rm_allowed
mount option is used, a user can delete a subvolume containing the
directory which cannot be deleted directly by the user.

However, there should be no harm to allow users to delete empty subvolumes
when rmdir(2) would have been allowed if they were normal directories.
This patch allows deletion of empty subvolume by default.


Instead of modifying the ioctl, what about allowing rmdir(2) to work for an 
_empty_ subvolume (and all the permission check are satisfied) ?


I'm inclined to agree with Goffredo. user_subvol_rm_allowed flag really
looks like a hack ontop of the ioctl. I'd rather we modify the generic
behavior.
I agree as well, with the addendum that I'd love to see a new ioctl that 
does proper permissions checks.  While letting rmdir(2) work for an 
empty subvolume with the appropriate permissions would be great (it will 
let rm -r work correctly), it doesn't address the usefulness of being 
able to just `btrfs subvolume delete` and not have to wait for the 
command to finish before you can reuse the name.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: Remove obsolete function declaration

2018-03-21 Thread David Sterba

On Wed, Mar 21, 2018 at 09:03:41AM +0200, Nikolay Borisov wrote:
> This function was removed in 24bc2843edd5 ("btrfs:
> Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()") what
> remains is a defunct declaration. Remove it.
> 
> Signed-off-by: Nikolay Borisov 
> ---
> David, 
> 
> This is a fixlet to the aforementioned patch which is only in your misc-next, 
> I'd suggest just squash this fixlet into the original one. 

Yes, makes sense. Done.

If that happens in the future again, sending a separate mail (not
necessarily a full blown patch with changelog etc) works fine, namely
for patches that are deep in the devel queue like that one. So, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: wipe all copies of the stale superblock

2018-03-21 Thread Anand Jain

Recovering from the other copies of the superblock is fundamental
to BTRFS, which provides resilient against single LBA failure in
the DUP group profile.

Further, in the test case [1] it shows a good but stale superblock
at copy#2. This will lead to confusion during auto/manual recovery.
So strictly speaking if a device has three copies of the superblock
and if we have permission to wipe it (-f option) then we have to wipe
all the copies of the superblock.

If there is any objection to writing beyond mkfs.btrfs -b , then
we could fail the mkfs/dev-add/dev-replace operation and ask the user
the wipe using dd command manually, as anyway there is no use of keeping
only the copy#2 when the new FS has been written.

Test case: Note that copy#2 fsid is different from primary and copy#1
fsid.

mkfs.btrfs -qf /dev/mapper/vg-lv && \
mkfs.btrfs -qf -b1G /dev/mapper/vg-lv && \
btrfs in dump-super -a /dev/mapper/vg-lv | grep '.fsid|superblock:'

superblock: bytenr=65536, device=/dev/mapper/vg-lv
dev_item.fsid   ebc67d01-7fc5-43f0-90b4-d1925002551e [match]
superblock: bytenr=67108864, device=/dev/mapper/vg-lv
dev_item.fsid   ebc67d01-7fc5-43f0-90b4-d1925002551e [match]
superblock: bytenr=274877906944, device=/dev/mapper/vg-lv
dev_item.fsid   b97a9206-593b-4933-a424-c6a6ee23fe7c [match]

Signed-off-by: Anand Jain 
---
 Hope with this we can patch the kernel to auto recover from the
 failed primary SB. In the earlier discussion on that, I think we
 are scrutinizing the wrong side (kernel) of the problem.
 Also, we need to fail the mount if all the copies of the SB do
 not have the same fsid.

 utils.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/utils.c b/utils.c
index 00020e1d6bdf..9c027f77d9c1 100644
--- a/utils.c
+++ b/utils.c
@@ -365,6 +365,41 @@ int btrfs_prepare_device(int fd, const char *file, u64 
*block_count_ret,
return 1;
}
 
+   /*
+* Check for the BTRFS SB copies up until btrfs_device_size() and zero
+* it. So that kernel (or user for the manual recovery) don't have to
+* confuse with the stale SB copy during recovery.
+*/
+   if (block_count != btrfs_device_size(fd, )) {
+   for (i = 1; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+   struct btrfs_super_block *disk_super;
+   char buf[BTRFS_SUPER_INFO_SIZE];
+   disk_super = (struct btrfs_super_block *)buf;
+
+   /* Already zeroed above */
+   if (btrfs_sb_offset(i) < block_count)
+   continue;
+
+   /* Beyond actual disk size */
+   if (btrfs_sb_offset(i) >= btrfs_device_size(fd, ))
+   continue;
+
+   /* Does not contain any stale SB */
+   if (btrfs_read_dev_super(fd, disk_super,
+btrfs_sb_offset(i), 0))
+   continue;
+
+   ret = zero_dev_clamped(fd, btrfs_sb_offset(i),
+   BTRFS_SUPER_INFO_SIZE,
+   btrfs_device_size(fd, ));
+   if (ret < 0) {
+   error("failed to zero device '%s' bytenr %llu: 
%s",
+   file, btrfs_sb_offset(i), 
strerror(-ret));
+   return 1;
+   }
+   }
+   }
+
*block_count_ret = block_count;
return 0;
 }
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: Remove obsolete function declaration

2018-03-21 Thread David Sterba

On Wed, Mar 21, 2018 at 03:16:25PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年03月21日 15:03, Nikolay Borisov wrote:
> > This function was removed in 24bc2843edd5 ("btrfs:
> > Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()") what
> > remains is a defunct declaration. Remove it.
> 
> Nice find, thanks for get this.
> 
> 
> But strangely, I just fetched David's repo, but didn't find that commit
> hash anywhere.
> 
> I'm fetching from git.kernel.org repo, or is that repo only updated when
> pushing to Linus?

The workflow hasn't changed for a few months, but after the last round
of complaints and surprises where are the development patches, I've
updated the wiki page:

https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories#Kernel_module

If there's information missing or unclear, please let me know.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 5/8] btrfs: Open-code add_delayed_data_ref

2018-03-21 Thread Nikolay Borisov

Now that the initialization part and the critical section code have
been split it's a lot easier to open code add_delayed_data_ref. Do
so in the following manner:

1. The common init function is put immediately after memory-to-be-init
is allocated, followed by the specific data ref initialization.

2. The only piece of code that remains in the critical section is
insert_delayed_ref call.

3. Tracing and memory freeing code is moved outside of the critical
section.

No functional changes, just an overall shorter critical section.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 63 ++
 1 file changed, 22 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 54bed13ac8f9..8d8a86784fd1 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -709,45 +709,6 @@ static void init_delayed_ref_common(struct btrfs_fs_info 
*fs_info,
 }
 
 /*
- * helper to insert a delayed data ref into the rbtree.
- */
-static noinline void
-add_delayed_data_ref(struct btrfs_trans_handle *trans,
-struct btrfs_delayed_ref_head *head_ref,
-struct btrfs_delayed_ref_node *ref, u64 bytenr,
-u64 num_bytes, u64 parent, u64 ref_root, u64 owner,
-u64 offset, int action)
-{
-   struct btrfs_delayed_data_ref *full_ref;
-   struct btrfs_delayed_ref_root *delayed_refs;
-   u8 ref_type;
-   int ret;
-
-
-   delayed_refs = >transaction->delayed_refs;
-   full_ref = btrfs_delayed_node_to_data_ref(ref);
-   if (parent)
-   ref_type = BTRFS_SHARED_DATA_REF_KEY;
-   else
-   ref_type = BTRFS_EXTENT_DATA_REF_KEY;
-
-   init_delayed_ref_common(trans->fs_info, ref, bytenr, num_bytes,
-   ref_root, action, ref_type);
-   full_ref->root = ref_root;
-   full_ref->parent = parent;
-   full_ref->objectid = owner;
-   full_ref->offset = offset;
-
-   trace_add_delayed_data_ref(trans->fs_info, ref, full_ref,
-  action == BTRFS_ADD_DELAYED_EXTENT ?
-  BTRFS_ADD_DELAYED_REF : action);
-
-   ret = insert_delayed_ref(trans, delayed_refs, head_ref, ref);
-   if (ret > 0)
-   kmem_cache_free(btrfs_delayed_data_ref_cachep, full_ref);
-}
-
-/*
  * add a delayed tree ref.  This does all of the accounting required
  * to make sure the delayed ref is eventually processed before this
  * transaction commits.
@@ -846,11 +807,25 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
int qrecord_inserted;
+   int ret;
+   u8 ref_type;
 
ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
if (!ref)
return -ENOMEM;
 
+   if (parent)
+   ref_type = BTRFS_SHARED_DATA_REF_KEY;
+   else
+   ref_type = BTRFS_EXTENT_DATA_REF_KEY;
+   init_delayed_ref_common(fs_info, >node, bytenr, num_bytes,
+   ref_root, action, ref_type);
+   ref->root = ref_root;
+   ref->parent = parent;
+   ref->objectid = owner;
+   ref->offset = offset;
+
+
head_ref = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
if (!head_ref) {
kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
@@ -882,10 +857,16 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
action, 1, _inserted,
old_ref_mod, new_ref_mod);
 
-   add_delayed_data_ref(trans, head_ref, >node, bytenr, num_bytes,
-parent, ref_root, owner, offset, action);
+   ret = insert_delayed_ref(trans, delayed_refs, head_ref, >node);
spin_unlock(_refs->lock);
 
+   trace_add_delayed_data_ref(trans->fs_info, >node, ref,
+  action == BTRFS_ADD_DELAYED_EXTENT ?
+  BTRFS_ADD_DELAYED_REF : action);
+   if (ret > 0)
+   kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
+
+
if (qrecord_inserted)
return btrfs_qgroup_trace_extent_post(fs_info, record);
return 0;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/8] btrfs: Open-code add_delayed_tree_ref

2018-03-21 Thread Nikolay Borisov

Now that the initialization part and the critical section code have
been split it's a lot easier to open code add_delayed_tree_ref. Do
so in the following manner:

1. The commin init code is put immediately after memory-to-be-init is
allocate, followed by the ref-specific member initialization.

2. The only piece of code that remains in the critical section is
insert_delayed_ref call.

3. Tracing and memory freeing code is put outside of the critical
section as well.

The only real change here is an overall shorter critical section when
dealing with delayed tree refs. From functional point of view - the
code is unchanged.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 66 --
 1 file changed, 21 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 7ec2282d2fe0..54bed13ac8f9 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -709,49 +709,6 @@ static void init_delayed_ref_common(struct btrfs_fs_info 
*fs_info,
 }
 
 /*
- * helper to insert a delayed tree ref into the rbtree.
- */
-static noinline void
-add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
-struct btrfs_trans_handle *trans,
-struct btrfs_delayed_ref_head *head_ref,
-struct btrfs_delayed_ref_node *ref, u64 bytenr,
-u64 num_bytes, u64 parent, u64 ref_root, int level,
-int action)
-{
-   struct btrfs_delayed_tree_ref *full_ref;
-   struct btrfs_delayed_ref_root *delayed_refs;
-   u8 ref_type;
-   int ret;
-
-   delayed_refs = >transaction->delayed_refs;
-   full_ref = btrfs_delayed_node_to_tree_ref(ref);
-   if (parent)
-   ref_type = BTRFS_SHARED_BLOCK_REF_KEY;
-   else
-   ref_type = BTRFS_TREE_BLOCK_REF_KEY;
-
-   init_delayed_ref_common(fs_info, ref, bytenr, num_bytes, ref_root,
-   action, ref_type);
-   full_ref->root = ref_root;
-   full_ref->parent = parent;
-   full_ref->level = level;
-
-   trace_add_delayed_tree_ref(fs_info, ref, full_ref,
-  action == BTRFS_ADD_DELAYED_EXTENT ?
-  BTRFS_ADD_DELAYED_REF : action);
-
-   ret = insert_delayed_ref(trans, delayed_refs, head_ref, ref);
-
-   /*
-* XXX: memory should be freed at the same level allocated.
-* But bad practice is anywhere... Follow it now. Need cleanup.
-*/
-   if (ret > 0)
-   kmem_cache_free(btrfs_delayed_tree_ref_cachep, full_ref);
-}
-
-/*
  * helper to insert a delayed data ref into the rbtree.
  */
 static noinline void
@@ -807,12 +764,25 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
int qrecord_inserted;
+   int ret;
+   u8 ref_type = parent ? BTRFS_SHARED_BLOCK_REF_KEY :
+   BTRFS_TREE_BLOCK_REF_KEY;
 
BUG_ON(extent_op && extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
if (!ref)
return -ENOMEM;
 
+   if (parent)
+   ref_type = BTRFS_SHARED_BLOCK_REF_KEY;
+   else
+   ref_type = BTRFS_TREE_BLOCK_REF_KEY;
+   init_delayed_ref_common(fs_info, >node, bytenr, num_bytes,
+   ref_root, action, ref_type);
+   ref->root = ref_root;
+   ref->parent = parent;
+   ref->level = level;
+
head_ref = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
if (!head_ref)
goto free_ref;
@@ -838,10 +808,16 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
_inserted, old_ref_mod,
new_ref_mod);
 
-   add_delayed_tree_ref(fs_info, trans, head_ref, >node, bytenr,
-num_bytes, parent, ref_root, level, action);
+
+   ret = insert_delayed_ref(trans, delayed_refs, head_ref, >node);
spin_unlock(_refs->lock);
 
+   trace_add_delayed_tree_ref(fs_info, >node, ref,
+  action == BTRFS_ADD_DELAYED_EXTENT ?
+  BTRFS_ADD_DELAYED_REF : action);
+   if (ret > 0)
+   kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
+
if (qrecord_inserted)
btrfs_qgroup_trace_extent_post(fs_info, record);
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 8/8] btrfs: split delayed ref head initialization and addition

2018-03-21 Thread Nikolay Borisov

add_delayed_ref_head really performed 2 independent operations -
initialisting the ref head and adding it to a list. Now that the init
part is in a separate function let's complete the separation between
both operations. This results in a lot simpler interface for
add_delayed_ref_head since the function now deals solely with either
adding the newly initialised delayed ref head or merging it into an
existing delayed ref head. This results in vastly simplified function
signature since 5 arguments are dropped. The only other thing worth
mentioning is that due to this split the WARN_ON catching reinit of
existing. In this patch the condition is extended such that:

  qrecord && head_ref->qgroup_ref_root && head_ref->qgroup_reserved

is added. This is done because the two qgroup_* prefixed member are
set only if both ref_root and reserved are passed. So functionally
it's equivalent to the old WARN_ON and allows to remove the two args
from add_delayed_ref_head.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 38 +-
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 0198d2e3ec05..227c1336cc02 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -619,8 +619,7 @@ static noinline struct btrfs_delayed_ref_head *
 add_delayed_ref_head(struct btrfs_trans_handle *trans,
 struct btrfs_delayed_ref_head *head_ref,
 struct btrfs_qgroup_extent_record *qrecord,
-u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
-int action, int is_data, int *qrecord_inserted_ret,
+int action, int *qrecord_inserted_ret,
 int *old_ref_mod, int *new_ref_mod)
 {
struct btrfs_delayed_ref_head *existing;
@@ -628,8 +627,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
int qrecord_inserted = 0;
 
delayed_refs = >transaction->delayed_refs;
-   init_delayed_ref_head(head_ref, qrecord, bytenr, num_bytes, ref_root,
- reserved, action, is_data);
+
/* Record qgroup extent info if provided */
if (qrecord) {
if (btrfs_qgroup_trace_extent_nolock(trans->fs_info,
@@ -644,7 +642,9 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
existing = htree_insert(_refs->href_root,
_ref->href_node);
if (existing) {
-   WARN_ON(ref_root && reserved && existing->qgroup_ref_root
+   WARN_ON(qrecord && head_ref->qgroup_ref_root
+   && head_ref->qgroup_reserved
+   && existing->qgroup_ref_root
&& existing->qgroup_reserved);
update_existing_head_ref(delayed_refs, existing, head_ref,
 old_ref_mod);
@@ -657,8 +657,8 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
} else {
if (old_ref_mod)
*old_ref_mod = 0;
-   if (is_data && head_ref->ref_mod < 0)
-   delayed_refs->pending_csums += num_bytes;
+   if (head_ref->is_data && head_ref->ref_mod < 0)
+   delayed_refs->pending_csums += head_ref->num_bytes;
delayed_refs->num_heads++;
delayed_refs->num_heads_ready++;
atomic_inc(_refs->num_entries);
@@ -668,6 +668,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
*qrecord_inserted_ret = qrecord_inserted;
if (new_ref_mod)
*new_ref_mod = head_ref->total_ref_mod;
+
return head_ref;
 }
 
@@ -769,6 +770,8 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
goto free_head_ref;
}
 
+   init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
+ ref_root, 0, action, false);
head_ref->extent_op = extent_op;
 
delayed_refs = >transaction->delayed_refs;
@@ -778,11 +781,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
 * insert both the head node and the new ref without dropping
 * the spin lock
 */
-   head_ref = add_delayed_ref_head(trans, head_ref, record, bytenr,
-   num_bytes, 0, 0, action, 0,
-   _inserted, old_ref_mod,
-   new_ref_mod);
-
+   head_ref = add_delayed_ref_head(trans, head_ref, record,
+   action, _inserted,
+   old_ref_mod, new_ref_mod);
 
ret = insert_delayed_ref(trans, delayed_refs, head_ref, >node);
spin_unlock(_refs->lock);
@@ -857,6 +858,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
}
}
 
+

[PATCH 6/8] btrfs: Introduce init_delayed_ref_head

2018-03-21 Thread Nikolay Borisov

add_delayed_ref_head implements the logic to both initialize a head_ref
structure as well as perform the necessary operations to add it to the
delayed ref machinery. This has resulted in a very cumebrsome interface
with loads of parameters and code, which at first glance, looks very
unwieldy. Begin untangling it by first extracting the initialization
only code in its own function. It's more or less verbatim copy of the
first part of add_delayed_ref_head.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 68 ++
 1 file changed, 68 insertions(+)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 8d8a86784fd1..47b1a534f95f 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -542,6 +542,74 @@ update_existing_head_ref(struct btrfs_delayed_ref_root 
*delayed_refs,
spin_unlock(>lock);
 }
 
+
+static void init_delayed_ref_head(struct btrfs_delayed_ref_head *head_ref,
+ struct btrfs_qgroup_extent_record *qrecord,
+ u64 bytenr, u64 num_bytes, u64 ref_root,
+ u64 reserved, int action, bool is_data)
+{
+
+   int count_mod = 1;
+   int must_insert_reserved = 0;
+
+   /* If reserved is provided, it must be a data extent. */
+   BUG_ON(!is_data && reserved);
+
+   /*
+* the head node stores the sum of all the mods, so dropping a ref
+* should drop the sum in the head node by one.
+*/
+   if (action == BTRFS_UPDATE_DELAYED_HEAD)
+   count_mod = 0;
+   else if (action == BTRFS_DROP_DELAYED_REF)
+   count_mod = -1;
+
+   /*
+* BTRFS_ADD_DELAYED_EXTENT means that we need to update
+* the reserved accounting when the extent is finally added, or
+* if a later modification deletes the delayed ref without ever
+* inserting the extent into the extent allocation tree.
+* ref->must_insert_reserved is the flag used to record
+* that accounting mods are required.
+*
+* Once we record must_insert_reserved, switch the action to
+* BTRFS_ADD_DELAYED_REF because other special casing is not required.
+*/
+   if (action == BTRFS_ADD_DELAYED_EXTENT)
+   must_insert_reserved = 1;
+   else
+   must_insert_reserved = 0;
+
+
+   refcount_set(_ref->refs, 1);
+   head_ref->bytenr = bytenr;
+   head_ref->num_bytes = num_bytes;
+   head_ref->ref_mod = count_mod;
+   head_ref->must_insert_reserved = must_insert_reserved;
+   head_ref->is_data = is_data;
+   head_ref->ref_tree = RB_ROOT;
+   INIT_LIST_HEAD(_ref->ref_add_list);
+   RB_CLEAR_NODE(_ref->href_node);
+   head_ref->processing = 0;
+   head_ref->total_ref_mod = count_mod;
+   head_ref->qgroup_reserved = 0;
+   head_ref->qgroup_ref_root = 0;
+   spin_lock_init(_ref->lock);
+   mutex_init(_ref->mutex);
+
+
+   if (qrecord) {
+   if (ref_root && reserved) {
+   head_ref->qgroup_ref_root = ref_root;
+   head_ref->qgroup_reserved = reserved;
+   }
+
+   qrecord->bytenr = bytenr;
+   qrecord->num_bytes = num_bytes;
+   qrecord->old_roots = NULL;
+   }
+}
+
 /*
  * helper function to actually insert a head node into the rbtree.
  * this does all the dirty work in terms of maintaining the correct
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/8] Delayed refs cleanups/streamlining

2018-03-21 Thread Nikolay Borisov

This patchkit aims to simplify and streamline the logic involved in 
initialising and adding delayed refs/delayed head structures which deal with 
modification of the extent tree. Currently the logic for init and add was 
contained in one function for each type of structure. This resulted in very 
awkward interface with shitloads of arguments, this in turn made it really hard
to understand the gist of the code. This series rectifies the situation by 
extracting common parts and using those rather than open coding duplicated
logic for every type of delayed ref (tree or data ref). 

The first 5 patches deal with the delayed_ref structs. Each patch is 
incremental and makes the code bisectable. The last tree factor out the init 
code for delayed_ref_head into a separate function and begin to use it. The 
final completely splits the two. 

The net result is both a cleaner interface as well as somewhat reduced 
critical section under delayed_refs->lock spinlock. 

Nikolay Borisov (8):
  btrfs: Factor out common delayed refs init code
  btrfs: Use init_delayed_ref_common in add_delayed_tree_ref
  btrfs: Use init_delayed_ref_common in add_delayed_data_ref
  btrfs: Open-code add_delayed_tree_ref
  btrfs: Open-code add_delayed_data_ref
  btrfs: Introduce init_delayed_ref_head
  btrfs: Use init_delayed_ref_head in add_delayed_ref_head
  btrfs: split delayed ref head initialization and addition

 fs/btrfs/delayed-ref.c | 239 -
 1 file changed, 119 insertions(+), 120 deletions(-)

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/8] btrfs: Use init_delayed_ref_common in add_delayed_tree_ref

2018-03-21 Thread Nikolay Borisov

Use the newly introduced common helper.  No functional changes

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 35 +++
 1 file changed, 11 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index d41d979f277b..4e009b3f0ec9 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -721,38 +721,25 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 {
struct btrfs_delayed_tree_ref *full_ref;
struct btrfs_delayed_ref_root *delayed_refs;
-   u64 seq = 0;
+   u8 ref_type;
int ret;
 
-   if (action == BTRFS_ADD_DELAYED_EXTENT)
-   action = BTRFS_ADD_DELAYED_REF;
-
-   if (is_fstree(ref_root))
-   seq = atomic64_read(_info->tree_mod_seq);
delayed_refs = >transaction->delayed_refs;
-
-   /* first set the basic ref node struct up */
-   refcount_set(>refs, 1);
-   ref->bytenr = bytenr;
-   ref->num_bytes = num_bytes;
-   ref->ref_mod = 1;
-   ref->action = action;
-   ref->is_head = 0;
-   ref->in_tree = 1;
-   ref->seq = seq;
-   RB_CLEAR_NODE(>ref_node);
-   INIT_LIST_HEAD(>add_list);
-
full_ref = btrfs_delayed_node_to_tree_ref(ref);
-   full_ref->parent = parent;
-   full_ref->root = ref_root;
if (parent)
-   ref->type = BTRFS_SHARED_BLOCK_REF_KEY;
+   ref_type = BTRFS_SHARED_BLOCK_REF_KEY;
else
-   ref->type = BTRFS_TREE_BLOCK_REF_KEY;
+   ref_type = BTRFS_TREE_BLOCK_REF_KEY;
+
+   init_delayed_ref_common(fs_info, ref, bytenr, num_bytes, ref_root,
+   action, ref_type);
+   full_ref->root = ref_root;
+   full_ref->parent = parent;
full_ref->level = level;
 
-   trace_add_delayed_tree_ref(fs_info, ref, full_ref, action);
+   trace_add_delayed_tree_ref(fs_info, ref, full_ref,
+  action == BTRFS_ADD_DELAYED_EXTENT ?
+  BTRFS_ADD_DELAYED_REF : action);
 
ret = insert_delayed_ref(trans, delayed_refs, head_ref, ref);
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 7/8] btrfs: Use init_delayed_ref_head in add_delayed_ref_head

2018-03-21 Thread Nikolay Borisov

Use the newly introduced function when initialising the head_ref in
add_delayed_ref_head. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 62 --
 1 file changed, 4 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 47b1a534f95f..0198d2e3ec05 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -625,68 +625,14 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 {
struct btrfs_delayed_ref_head *existing;
struct btrfs_delayed_ref_root *delayed_refs;
-   int count_mod = 1;
-   int must_insert_reserved = 0;
int qrecord_inserted = 0;
 
-   /* If reserved is provided, it must be a data extent. */
-   BUG_ON(!is_data && reserved);
-
-   /*
-* the head node stores the sum of all the mods, so dropping a ref
-* should drop the sum in the head node by one.
-*/
-   if (action == BTRFS_UPDATE_DELAYED_HEAD)
-   count_mod = 0;
-   else if (action == BTRFS_DROP_DELAYED_REF)
-   count_mod = -1;
-
-   /*
-* BTRFS_ADD_DELAYED_EXTENT means that we need to update
-* the reserved accounting when the extent is finally added, or
-* if a later modification deletes the delayed ref without ever
-* inserting the extent into the extent allocation tree.
-* ref->must_insert_reserved is the flag used to record
-* that accounting mods are required.
-*
-* Once we record must_insert_reserved, switch the action to
-* BTRFS_ADD_DELAYED_REF because other special casing is not required.
-*/
-   if (action == BTRFS_ADD_DELAYED_EXTENT)
-   must_insert_reserved = 1;
-   else
-   must_insert_reserved = 0;
-
delayed_refs = >transaction->delayed_refs;
-
-   refcount_set(_ref->refs, 1);
-   head_ref->bytenr = bytenr;
-   head_ref->num_bytes = num_bytes;
-   head_ref->ref_mod = count_mod;
-   head_ref->must_insert_reserved = must_insert_reserved;
-   head_ref->is_data = is_data;
-   head_ref->ref_tree = RB_ROOT;
-   INIT_LIST_HEAD(_ref->ref_add_list);
-   RB_CLEAR_NODE(_ref->href_node);
-   head_ref->processing = 0;
-   head_ref->total_ref_mod = count_mod;
-   head_ref->qgroup_reserved = 0;
-   head_ref->qgroup_ref_root = 0;
-   spin_lock_init(_ref->lock);
-   mutex_init(_ref->mutex);
-
+   init_delayed_ref_head(head_ref, qrecord, bytenr, num_bytes, ref_root,
+ reserved, action, is_data);
/* Record qgroup extent info if provided */
if (qrecord) {
-   if (ref_root && reserved) {
-   head_ref->qgroup_ref_root = ref_root;
-   head_ref->qgroup_reserved = reserved;
-   }
-
-   qrecord->bytenr = bytenr;
-   qrecord->num_bytes = num_bytes;
-   qrecord->old_roots = NULL;
-
-   if(btrfs_qgroup_trace_extent_nolock(trans->fs_info,
+   if (btrfs_qgroup_trace_extent_nolock(trans->fs_info,
delayed_refs, qrecord))
kfree(qrecord);
else
@@ -711,7 +657,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
} else {
if (old_ref_mod)
*old_ref_mod = 0;
-   if (is_data && count_mod < 0)
+   if (is_data && head_ref->ref_mod < 0)
delayed_refs->pending_csums += num_bytes;
delayed_refs->num_heads++;
delayed_refs->num_heads_ready++;
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/8] btrfs: Factor out common delayed refs init code

2018-03-21 Thread Nikolay Borisov

THe majority of the init code for struct btrfs_delayed_ref_node is
duplicated in add_delayed_data_ref and add_delayed_tree_ref. Factor
out the common bits in init_delayed_ref_common. This function is
going to be used in future patches to clean that up. No functional
changes

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 51 ++
 1 file changed, 51 insertions(+)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index f9b236c82f27..d41d979f277b 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -658,6 +658,57 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 }
 
 /*
+ * init_delayed_ref_common - Initialize the structure which represents a
+ *  modification to a an extent.
+ *
+ * @fs_info:Internal to the mounted filesystem mount structure.
+ *
+ * @ref:   The structure which is going to be initialized.
+ *
+ * @bytenr:The logical address of the extent for which a modification is
+ * going to be recorded.
+ *
+ * @num_bytes:  Size of the extent whose modification is being recorded.
+ *
+ * @ref_root:  The id of the root where this modification has originated, this
+ * can be either one of the well-known metadata trees or the
+ * subvolume id which references this extent.
+ *
+ * @action:Can be one of BTRFS_ADD_DELAYED_REF/BTRFS_DROP_DELAYED_REF or
+ * BTRFS_ADD_DELAYED_EXTENT
+ *
+ * @ref_type:  Holds the type of the extent which is being recorded, can be
+ * one of BTRFS_SHARED_BLOCK_REF_KEY/BTRFS_TREE_BLOCK_REF_KEY
+ * when recording a metadata extent or BTRFS_SHARED_DATA_REF_KEY/
+ * BTRFS_EXTENT_DATA_REF_KEY when recording data extent
+ */
+static void init_delayed_ref_common(struct btrfs_fs_info *fs_info,
+   struct btrfs_delayed_ref_node *ref,
+   u64 bytenr, u64 num_bytes, u64 ref_root,
+   int action, u8 ref_type)
+{
+   u64 seq = 0;
+
+   if (action == BTRFS_ADD_DELAYED_EXTENT)
+   action = BTRFS_ADD_DELAYED_REF;
+
+   if (is_fstree(ref_root))
+   seq = atomic64_read(_info->tree_mod_seq);
+
+   refcount_set(>refs, 1);
+   ref->bytenr = bytenr;
+   ref->num_bytes = num_bytes;
+   ref->ref_mod = 1;
+   ref->action = action;
+   ref->is_head = 0;
+   ref->in_tree = 1;
+   ref->seq = seq;
+   ref->type = ref_type;
+   RB_CLEAR_NODE(>ref_node);
+   INIT_LIST_HEAD(>add_list);
+}
+
+/*
  * helper to insert a delayed tree ref into the rbtree.
  */
 static noinline void
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/8] btrfs: Use init_delayed_ref_common in add_delayed_data_ref

2018-03-21 Thread Nikolay Borisov

Use the newly introduced helper and remove the duplicate code.
No functional changes

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 34 ++
 1 file changed, 10 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 4e009b3f0ec9..7ec2282d2fe0 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -763,41 +763,27 @@ add_delayed_data_ref(struct btrfs_trans_handle *trans,
 {
struct btrfs_delayed_data_ref *full_ref;
struct btrfs_delayed_ref_root *delayed_refs;
-   u64 seq = 0;
+   u8 ref_type;
int ret;
 
-   if (action == BTRFS_ADD_DELAYED_EXTENT)
-   action = BTRFS_ADD_DELAYED_REF;
 
delayed_refs = >transaction->delayed_refs;
-
-   if (is_fstree(ref_root))
-   seq = atomic64_read(>fs_info->tree_mod_seq);
-
-   /* first set the basic ref node struct up */
-   refcount_set(>refs, 1);
-   ref->bytenr = bytenr;
-   ref->num_bytes = num_bytes;
-   ref->ref_mod = 1;
-   ref->action = action;
-   ref->is_head = 0;
-   ref->in_tree = 1;
-   ref->seq = seq;
-   RB_CLEAR_NODE(>ref_node);
-   INIT_LIST_HEAD(>add_list);
-
full_ref = btrfs_delayed_node_to_data_ref(ref);
-   full_ref->parent = parent;
-   full_ref->root = ref_root;
if (parent)
-   ref->type = BTRFS_SHARED_DATA_REF_KEY;
+   ref_type = BTRFS_SHARED_DATA_REF_KEY;
else
-   ref->type = BTRFS_EXTENT_DATA_REF_KEY;
+   ref_type = BTRFS_EXTENT_DATA_REF_KEY;
 
+   init_delayed_ref_common(trans->fs_info, ref, bytenr, num_bytes,
+   ref_root, action, ref_type);
+   full_ref->root = ref_root;
+   full_ref->parent = parent;
full_ref->objectid = owner;
full_ref->offset = offset;
 
-   trace_add_delayed_data_ref(trans->fs_info, ref, full_ref, action);
+   trace_add_delayed_data_ref(trans->fs_info, ref, full_ref,
+  action == BTRFS_ADD_DELAYED_EXTENT ?
+  BTRFS_ADD_DELAYED_REF : action);
 
ret = insert_delayed_ref(trans, delayed_refs, head_ref, ref);
if (ret > 0)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 1/2] btrfs: cleanup btrfs_check_super_csum() to check the csum_type to the type

2018-03-21 Thread Anand Jain

%csum_type is being checked to check the number of csum types we support,
which is 1 as of now. Instead just check if the type matches with the
only type we support, that is BTRFS_CSUM_TYPE_CRC32. And further adds
cleanup.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 38 +-
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1657d6aa4fa6..b9b435579527 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -399,32 +399,28 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
*fs_info,
struct btrfs_super_block *disk_sb =
(struct btrfs_super_block *)raw_disk_sb;
u16 csum_type = btrfs_super_csum_type(disk_sb);
-   int ret = 0;
-
-   if (csum_type == BTRFS_CSUM_TYPE_CRC32) {
-   u32 crc = ~(u32)0;
-   char result[sizeof(crc)];
-
-   /*
-* The super_block structure does not span the whole
-* BTRFS_SUPER_INFO_SIZE range, we expect that the unused space
-* is filled with zeros and is included in the checksum.
-*/
-   crc = btrfs_csum_data(raw_disk_sb + BTRFS_CSUM_SIZE,
-   crc, BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
-   btrfs_csum_final(crc, result);
-
-   if (memcmp(raw_disk_sb, result, sizeof(result)))
-   ret = 1;
-   }
+   u32 crc = ~(u32)0;
+   char result[sizeof(crc)];
 
if (csum_type >= ARRAY_SIZE(btrfs_csum_sizes)) {
btrfs_err(fs_info, "unsupported checksum algorithm %u",
-   csum_type);
-   ret = 1;
+ csum_type);
+   return 1;
}
 
-   return ret;
+   /*
+* The super_block structure does not span the whole
+* BTRFS_SUPER_INFO_SIZE range, we expect that the unused space
+* is filled with zeros and is included in the checksum.
+*/
+   crc = btrfs_csum_data(raw_disk_sb + BTRFS_CSUM_SIZE,
+ crc, BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
+   btrfs_csum_final(crc, result);
+
+   if (memcmp(raw_disk_sb, result, sizeof(result)))
+   return 1;
+
+   return 0;
 }
 
 /*
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 2/2] btrfs: return required error from btrfs_check_super_csum

2018-03-21 Thread Anand Jain

Return the required -EINVAL and -EUCLEAN from the function
btrfs_check_super_csum(). And more the error log into the
parent function.

Signed-off-by: Anand Jain 
---
 fs/btrfs/disk-io.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b9b435579527..062b3e0c7fd1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -392,9 +392,10 @@ static int verify_parent_transid(struct extent_io_tree 
*io_tree,
 /*
  * Return 0 if the superblock checksum type matches the checksum value of that
  * algorithm. Pass the raw disk superblock data.
+ * Otherwise: -EINVAL  if csum type is not found
+ *   -EUCLEAN if csum does not match
  */
-static int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
- char *raw_disk_sb)
+static int btrfs_check_super_csum(char *raw_disk_sb)
 {
struct btrfs_super_block *disk_sb =
(struct btrfs_super_block *)raw_disk_sb;
@@ -402,11 +403,8 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
*fs_info,
u32 crc = ~(u32)0;
char result[sizeof(crc)];
 
-   if (csum_type >= ARRAY_SIZE(btrfs_csum_sizes)) {
-   btrfs_err(fs_info, "unsupported checksum algorithm %u",
- csum_type);
-   return 1;
-   }
+   if (csum_type >= ARRAY_SIZE(btrfs_csum_sizes))
+   return -EINVAL;
 
/*
 * The super_block structure does not span the whole
@@ -418,7 +416,7 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
*fs_info,
btrfs_csum_final(crc, result);
 
if (memcmp(raw_disk_sb, result, sizeof(result)))
-   return 1;
+   return -EUCLEAN;
 
return 0;
 }
@@ -2570,9 +2568,14 @@ int open_ctree(struct super_block *sb,
 * We want to check superblock checksum, the type is stored inside.
 * Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
 */
-   if (btrfs_check_super_csum(fs_info, bh->b_data)) {
-   btrfs_err(fs_info, "superblock checksum mismatch");
-   err = -EINVAL;
+   err = btrfs_check_super_csum(bh->b_data);
+   if (err) {
+   if (err == -EINVAL)
+   pr_err("BTRFS error (device %pg): unsupported checksum 
algorithm",
+   fs_devices->latest_bdev);
+   else if (err == -EUCLEAN)
+   pr_err("BTRFS error (device %pg): superblock checksum 
mismatch",
+   fs_devices->latest_bdev);
brelse(bh);
goto fail_alloc;
}
-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 0/2] Preparatory to add the csum check in the scan context

2018-03-21 Thread Anand Jain

v2->v3:
 1/2: Keep the check %(csum_type >= ARRAY_SIZE(btrfs_csum_sizes))
  because it is inline to support future expansion, further
  btrfs-progs is already copied it.
 2/2: Drop pr_err for default error which won't be possible.

v1->v2:
  Merge 2/3 and 3/3
  Use -EUCLEAN for csum mismatch
  Move the error logging to the parent function
  Drop the struct btrfs_fsinfo as the argument for
btrfs_check_super_csum()
  Do not make btrfs_check_super_csum() nonstatic here, it will be part
of the patchset which will use the csum in the scan context

Anand Jain (2):
  btrfs: cleanup btrfs_check_super_csum() to check the csum_type to the
type
  btrfs: return required error from btrfs_check_super_csum

 fs/btrfs/disk-io.c | 53 ++---
 1 file changed, 26 insertions(+), 27 deletions(-)

-- 
2.15.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] fstests: log-writes: Add support for METADATA flag

2018-03-21 Thread Qu Wenruo

Signed-off-by: Qu Wenruo 
---
 src/log-writes/log-writes.c | 3 ++-
 src/log-writes/log-writes.h | 9 +
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
index a872429d..5dc22c24 100644
--- a/src/log-writes/log-writes.c
+++ b/src/log-writes/log-writes.c
@@ -130,7 +130,8 @@ struct flags_to_str_entry {
DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
-   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
+   DEFINE_LOG_FLAGS_STR_ENTRY(MARK),
+   DEFINE_LOG_FLAGS_STR_ENTRY(METADATA)
 };
 
 #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
diff --git a/src/log-writes/log-writes.h b/src/log-writes/log-writes.h
index 35ca3583..75fb8ac0 100644
--- a/src/log-writes/log-writes.h
+++ b/src/log-writes/log-writes.h
@@ -20,10 +20,11 @@ typedef __u32 u32;
 /*
  * Constants copied from kernel file drivers/md/dm-log-writes.c
  */
-#define LOG_FLUSH_FLAG (1 << 0)
-#define LOG_FUA_FLAG (1 << 1)
-#define LOG_DISCARD_FLAG (1 << 2)
-#define LOG_MARK_FLAG (1 << 3)
+#define LOG_FLUSH_FLAG (1 << 0)
+#define LOG_FUA_FLAG   (1 << 1)
+#define LOG_DISCARD_FLAG   (1 << 2)
+#define LOG_MARK_FLAG  (1 << 3)
+#define LOG_METADATA_FLAG  (1 << 4)
 
 #define WRITE_LOG_VERSION 1
 #define WRITE_LOG_MAGIC 0x6a736677736872
-- 
2.15.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] fstests: generic: Check the fs after each FUA writes

2018-03-21 Thread Qu Wenruo

Basic test case which triggers fsstress with dm-log-writes, and then
check the fs after each FUA writes.
With needed infrastructure and special handlers for journal based fs.

Signed-off-by: Qu Wenruo 
---
Unfortunately, neither xfs nor ext4 survies this test for even single
success, while btrfs survives.
(Although not what I want, I'm just trying my luck
to reproduce a serious btrfs corruption situation)

Although btrfs may be the fastest fs for the test, since it has fixed
small amount of write in mkfs and almost nothing to replay, it still
takes about 240s~300s to finish (the same level using snapshot).

It would take longer time for ext4 for its large amount of write during
mkfs, if it can survive the test in the first space.

As a comparison, btrfs takes about 5 seconds to replay log, mount,
unmount and run fsck at the end of the test.
But for ext4, it already takes about 5 seconds to do the same thing
before triggering fsck error.

Fsck fail for ext4:
_check_generic_filesystem: filesystem on /dev/mapper/test-scratch1 is 
inconsistent
*** fsck.ext4 output ***
fsck from util-linux 2.31.1
e2fsck 1.43.8 (1-Jan-2018)
Pass 1: Checking inodes, blocks, and sizes
Inode 131076 extent tree (at level 1) could be shorter.  Fix? no

Inode 131262, i_size is 0, should be 258048.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

For xfs:
_check_xfs_filesystem: filesystem on /dev/mapper/test-scratch1 is inconsistent 
(r)
*** xfs_repair -n output ***
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
bad symlink header ino 8409190, file block 0, disk block 1051147
problem with symbolic link in inode 8409190
would have cleared inode 8409190
- agno = 3
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 3
- agno = 2
entry "lb" in shortform directory 8409188 references free inode 8409190
would have junked entry "lb" in directory inode 8409188
bad symlink header ino 8409190, file block 0, disk block 1051147
problem with symbolic link in inode 8409190
would have cleared inode 8409190
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
entry "lb" in shortform directory inode 8409188 points to free inode 8409190
would junk entry
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
Maximum metadata LSN (1:396) is ahead of log (1:63).
Would format log to cycle 4.
No modify flag set, skipping filesystem flush and exiting.

And special note for XFS guys, I also hit an XFS internal metadata
warning during journal replay:
[ 7901.423659] XFS (dm-4): Starting recovery (logdev: internal)
[ 7901.577511] XFS (dm-4): Metadata corruption detected at 
xfs_dinode_verify+0x467/0x570 [xfs], inode 0x805067 dinode
[ 7901.580529] XFS (dm-4): Unmount and run xfs_repair
[ 7901.581901] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
[ 7901.583205] b8963f41: 49 4e a1 ff 03 02 00 00 00 00 00 00 00 00 00 
00  IN..
[ 7901.584659] f35a50e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00  
[ 7901.586075] 386eea9e: 5a b2 0e 69 0a f3 43 27 5a b2 0e 69 0a f3 43 
27  Z..i..C'Z..i..C'
[ 7901.587561] ac636661: 5a b2 0e 69 0d 92 bc 00 00 00 00 00 00 00 00 
00  Z..i
[ 7901.588969] d75f9093: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00  
[ 7901.590475] d2af5688: 00 00 00 02 00 00 00 00 00 00 00 00 84 d7 6a 
e9  ..j.
[ 7901.591907] e8dd8211: ff ff ff ff 34 93 a9 3a 00 00 00 00 00 00 00 
04  4..:
[ 7901.593319] b7610e4e: 00 00 00 01 00 00 00 45 00 00 00 00 00 00 00 
00  ...E

---
 common/dmlogwrites|  56 +++
 tests/generic/481 | 104 ++
 tests/generic/481.out |   2 +
 tests/generic/group   |   1 +
 4 files changed, 163 insertions(+)
 create mode 100755 tests/generic/481
 create mode 100644 tests/generic/481.out

diff --git a/common/dmlogwrites b/common/dmlogwrites
index 467b872e..bf643a77 100644
--- a/common/dmlogwrites
+++ b/common/dmlogwrites
@@ -126,3 +126,59 @@ _log_writes_cleanup()
$UDEV_SETTLE_PROG >/dev/null 2>&1
_log_writes_remove
 }
+
+# Convert log writes mark to entry number
+# Result entry number is output to stdout,

[PATCH 1/3] fstests: log-writes: Add support to output human readable flags

2018-03-21 Thread Qu Wenruo

Also change the flag numeric output to hex.

Signed-off-by: Qu Wenruo 
---
 src/log-writes/log-writes.c | 70 -
 1 file changed, 63 insertions(+), 7 deletions(-)

diff --git a/src/log-writes/log-writes.c b/src/log-writes/log-writes.c
index 09391574..a872429d 100644
--- a/src/log-writes/log-writes.c
+++ b/src/log-writes/log-writes.c
@@ -120,6 +120,58 @@ int log_discard(struct log *log, struct log_write_entry 
*entry)
return 0;
 }
 
+#define DEFINE_LOG_FLAGS_STR_ENTRY(x)  \
+   {LOG_##x##_FLAG, #x}
+
+struct flags_to_str_entry {
+   u64 flags;
+   const char *str;
+} log_flags_table[] = {
+   DEFINE_LOG_FLAGS_STR_ENTRY(FLUSH),
+   DEFINE_LOG_FLAGS_STR_ENTRY(FUA),
+   DEFINE_LOG_FLAGS_STR_ENTRY(DISCARD),
+   DEFINE_LOG_FLAGS_STR_ENTRY(MARK)
+};
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#define LOG_FLAGS_BUF_SIZE 128
+/*
+ * Convert numeric flags to human readable flags.
+ * @flags: numeric flags
+ * @buf:   output buffer for human readable string.
+ * must have enough space (LOG_FLAGS_BUF_SIZE) to contain all
+ * the string
+ */
+static void entry_flags_to_str(u64 flags, char *buf)
+{
+   int empty = 1;
+   int left_len;
+   int i;
+
+   buf[0] = '\0';
+   for (i = 0; i < ARRAY_SIZE(log_flags_table); i++) {
+   if (flags & log_flags_table[i].flags) {
+   if (!empty)
+   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
+   empty = 0;
+   strncat(buf, log_flags_table[i].str, 
LOG_FLAGS_BUF_SIZE);
+   flags &= ~log_flags_table[i].flags;
+   }
+   }
+   if (flags) {
+   if (!empty)
+   strncat(buf, "|", LOG_FLAGS_BUF_SIZE);
+   empty = 0;
+   left_len = LOG_FLAGS_BUF_SIZE - strnlen(buf,
+   LOG_FLAGS_BUF_SIZE);
+   if (left_len > 0)
+   snprintf(buf + strnlen(buf, LOG_FLAGS_BUF_SIZE),
+left_len, "UNKNOWN.0x%llx", flags);
+   }
+   if (empty)
+   strncpy(buf, "NONE", LOG_FLAGS_BUF_SIZE);
+}
+
 /*
  * @log: the log we are replaying.
  * @entry: entry to be replayed.
@@ -179,6 +231,7 @@ int log_replay_next_entry(struct log *log, struct 
log_write_entry *entry,
size_t read_size = read_data ? log->sectorsize :
sizeof(struct log_write_entry);
char *buf;
+   char flags_buf[LOG_FLAGS_BUF_SIZE];
ssize_t ret;
off_t offset;
int skip = 0;
@@ -210,19 +263,20 @@ int log_replay_next_entry(struct log *log, struct 
log_write_entry *entry,
log->cur_pos += read_size;
}
 
+   flags = le64_to_cpu(entry->flags);
+   entry_flags_to_str(flags, flags_buf);
skip = log_should_skip(log, entry);
if (log_writes_verbose > 1 || (log_writes_verbose && !skip)) {
-   printf("%s %d@%llu: sector %llu, size %llu, flags %llu\n",
+   printf("%s %d@%llu: sector %llu, size %llu, flags 0x%llx(%s)\n",
   skip ? "skipping" : "replaying",
   (int)log->cur_entry - 1, log->cur_pos / log->sectorsize,
   (unsigned long long)le64_to_cpu(entry->sector),
   (unsigned long long)size,
-  (unsigned long long)le64_to_cpu(entry->flags));
+  (unsigned long long)flags, flags_buf);
}
if (!size)
return 0;
 
-   flags = le64_to_cpu(entry->flags);
if (flags & LOG_DISCARD_FLAG)
return log_discard(log, entry);
 
@@ -301,7 +355,7 @@ int log_seek_entry(struct log *log, u64 entry_num)
return -1;
}
if (log_writes_verbose > 1)
-   printf("seek entry %d@%llu: %llu, size %llu, flags 
%llu\n",
+   printf("seek entry %d@%llu: %llu, size %llu, flags 
0x%llx\n",
   (int)i, log->cur_pos / log->sectorsize,
   (unsigned long long)le64_to_cpu(entry.sector),
   (unsigned long 
long)le64_to_cpu(entry.nr_sectors),
@@ -339,6 +393,7 @@ int log_seek_next_entry(struct log *log, struct 
log_write_entry *entry,
size_t read_size = read_data ? log->sectorsize :
sizeof(struct log_write_entry);
u64 flags;
+   char flags_buf[LOG_FLAGS_BUF_SIZE];
ssize_t ret;
 
if (log->cur_entry >= log->nr_entries)
@@ -366,14 +421,15 @@ int log_seek_next_entry(struct log *log, struct 
log_write_entry *entry,
} else {
log->cur_pos += read_size;
}
+   flags = le64_to_cpu(entry->flags);
+   entry_flags_to_str(flags, flags_buf);
if (log_writes_verbose > 1)
-

Re: [PATCH] btrfs: Allow non-privileged user to delete empty subvolume by default

2018-03-21 Thread Nikolay Borisov



On 20.03.2018 22:06, Goffredo Baroncelli wrote:
> On 03/20/2018 07:45 AM, Misono, Tomohiro wrote:
>> Deletion of subvolume by non-privileged user is completely restricted
>> by default because we can delete a subvolume even if it is not empty
>> and may cause data loss. In other words, when user_subvol_rm_allowed
>> mount option is used, a user can delete a subvolume containing the
>> directory which cannot be deleted directly by the user.
>>
>> However, there should be no harm to allow users to delete empty subvolumes
>> when rmdir(2) would have been allowed if they were normal directories.
>> This patch allows deletion of empty subvolume by default.
> 
> Instead of modifying the ioctl, what about allowing rmdir(2) to work for an 
> _empty_ subvolume (and all the permission check are satisfied) ?

I'm inclined to agree with Goffredo. user_subvol_rm_allowed flag really
looks like a hack ontop of the ioctl. I'd rather we modify the generic
behavior.

> 
> 
> 
>>
>> Note that user_subvol_rm_allowed option requires write+exec permission
>> of the subvolume to be deleted, but they are not required for empty
>> subvolume.
>>
>> The comment in the code is also updated accordingly.
>>
>> Signed-off-by: Tomohiro Misono 
>> ---
>>  fs/btrfs/ioctl.c | 55 
>> +++
>>  1 file changed, 31 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 111ee282b777..838406a7a7f5 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -2366,36 +2366,43 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
>> file *file,
>>  dest = BTRFS_I(inode)->root;
>>  if (!capable(CAP_SYS_ADMIN)) {
>>  /*
>> - * Regular user.  Only allow this with a special mount
>> - * option, when the user has write+exec access to the
>> - * subvol root, and when rmdir(2) would have been
>> - * allowed.
>> + * By default, regular user is only allowed to delete
>> + * empty subvols when rmdir(2) would have been allowed
>> + * if they were normal directories.
>>   *
>> - * Note that this is _not_ check that the subvol is
>> - * empty or doesn't contain data that we wouldn't
>> + * If the mount option 'user_subvol_rm_allowed' is set,
>> + * it allows users to delete non-empty subvols when the
>> + * user has write+exec access to the subvol root and when
>> + * rmdir(2) would have been allowed (except the emptiness
>> + * check).
>> + *
>> + * Note that this option does _not_ check that if the subvol
>> + * is empty or doesn't contain data that the user wouldn't
>>   * otherwise be able to delete.
>>   *
>> - * Users who want to delete empty subvols should try
>> - * rmdir(2).
>> + * Users who want to delete empty subvols created by
>> + * snapshot (ino number == 2) can use rmdir(2).
>>   */
>> -err = -EPERM;
>> -if (!btrfs_test_opt(fs_info, USER_SUBVOL_RM_ALLOWED))
>> -goto out_dput;
>> +err = -ENOTEMPTY;
>> +if (inode->i_size != BTRFS_EMPTY_DIR_SIZE) {
>> +if (!btrfs_test_opt(fs_info, USER_SUBVOL_RM_ALLOWED))
>> +goto out_dput;
>>  
>> -/*
>> - * Do not allow deletion if the parent dir is the same
>> - * as the dir to be deleted.  That means the ioctl
>> - * must be called on the dentry referencing the root
>> - * of the subvol, not a random directory contained
>> - * within it.
>> - */
>> -err = -EINVAL;
>> -if (root == dest)
>> -goto out_dput;
>> +/*
>> + * Do not allow deletion if the parent dir is the same
>> + * as the dir to be deleted.  That means the ioctl
>> + * must be called on the dentry referencing the root
>> + * of the subvol, not a random directory contained
>> + * within it.
>> + */
>> +err = -EINVAL;
>> +if (root == dest)
>> +goto out_dput;
>>  
>> -err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
>> -if (err)
>> -goto out_dput;
>> +err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
>> +if (err)
>> +goto out_dput;
>> +}
>>  }
>>  
>>  /* check if subvolume may be deleted by a user */
>>
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to

Re: [PATCH] btrfs: Remove obsolete function declaration

2018-03-21 Thread Qu Wenruo



On 2018年03月21日 15:38, Nikolay Borisov wrote:
> 
> 
> On 21.03.2018 09:16, Qu Wenruo wrote:
>>
>>
>> On 2018年03月21日 15:03, Nikolay Borisov wrote:
>>> This function was removed in 24bc2843edd5 ("btrfs:
>>> Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()") what
>>> remains is a defunct declaration. Remove it.
>>
>> Nice find, thanks for get this.
>>
>>
>> But strangely, I just fetched David's repo, but didn't find that commit
>> hash anywhere.
>>
>> I'm fetching from git.kernel.org repo, or is that repo only updated when
>> pushing to Linus?
> 
> It's update only when david pushes and preps the pull reqs. I base all
> my development on his misc-next branch on github.

No wonder.

Thanks,
Qu

>>
>> Thanks,
>> Qu
>>
>>>
>>> Signed-off-by: Nikolay Borisov 
>>> ---
>>> David, 
>>>
>>> This is a fixlet to the aforementioned patch which is only in your 
>>> misc-next, 
>>> I'd suggest just squash this fixlet into the original one. 
>>>
>>>  fs/btrfs/ctree.h | 1 -
>>>  1 file changed, 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index fbe51edb9cac..03be51d77357 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -2801,7 +2801,6 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
>>> struct fstrim_range *range);
>>>  int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
>>>  int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
>>>  struct btrfs_fs_info *fs_info);
>>> -int __get_raid_index(u64 flags);
>>>  int btrfs_start_write_no_snapshotting(struct btrfs_root *root);
>>>  void btrfs_end_write_no_snapshotting(struct btrfs_root *root);
>>>  void btrfs_wait_for_snapshot_creation(struct btrfs_root *root);
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



signature.asc
Description: OpenPGP digital signature

Re: [PATCH] btrfs: Remove obsolete function declaration

2018-03-21 Thread Nikolay Borisov



On 21.03.2018 09:16, Qu Wenruo wrote:
> 
> 
> On 2018年03月21日 15:03, Nikolay Borisov wrote:
>> This function was removed in 24bc2843edd5 ("btrfs:
>> Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()") what
>> remains is a defunct declaration. Remove it.
> 
> Nice find, thanks for get this.
> 
> 
> But strangely, I just fetched David's repo, but didn't find that commit
> hash anywhere.
> 
> I'm fetching from git.kernel.org repo, or is that repo only updated when
> pushing to Linus?

It's update only when david pushes and preps the pull reqs. I base all
my development on his misc-next branch on github.
> 
> Thanks,
> Qu
> 
>>
>> Signed-off-by: Nikolay Borisov 
>> ---
>> David, 
>>
>> This is a fixlet to the aforementioned patch which is only in your 
>> misc-next, 
>> I'd suggest just squash this fixlet into the original one. 
>>
>>  fs/btrfs/ctree.h | 1 -
>>  1 file changed, 1 deletion(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index fbe51edb9cac..03be51d77357 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -2801,7 +2801,6 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
>> struct fstrim_range *range);
>>  int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
>>  int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
>>   struct btrfs_fs_info *fs_info);
>> -int __get_raid_index(u64 flags);
>>  int btrfs_start_write_no_snapshotting(struct btrfs_root *root);
>>  void btrfs_end_write_no_snapshotting(struct btrfs_root *root);
>>  void btrfs_wait_for_snapshot_creation(struct btrfs_root *root);
>>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: Remove obsolete function declaration

2018-03-21 Thread Qu Wenruo



On 2018年03月21日 15:03, Nikolay Borisov wrote:
> This function was removed in 24bc2843edd5 ("btrfs:
> Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()") what
> remains is a defunct declaration. Remove it.

Nice find, thanks for get this.


But strangely, I just fetched David's repo, but didn't find that commit
hash anywhere.

I'm fetching from git.kernel.org repo, or is that repo only updated when
pushing to Linus?

Thanks,
Qu

> 
> Signed-off-by: Nikolay Borisov 
> ---
> David, 
> 
> This is a fixlet to the aforementioned patch which is only in your misc-next, 
> I'd suggest just squash this fixlet into the original one. 
> 
>  fs/btrfs/ctree.h | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index fbe51edb9cac..03be51d77357 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2801,7 +2801,6 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct 
> fstrim_range *range);
>  int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
>  int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
>struct btrfs_fs_info *fs_info);
> -int __get_raid_index(u64 flags);
>  int btrfs_start_write_no_snapshotting(struct btrfs_root *root);
>  void btrfs_end_write_no_snapshotting(struct btrfs_root *root);
>  void btrfs_wait_for_snapshot_creation(struct btrfs_root *root);
> 



signature.asc
Description: OpenPGP digital signature

[PATCH] btrfs: Remove obsolete function declaration

2018-03-21 Thread Nikolay Borisov

This function was removed in 24bc2843edd5 ("btrfs:
Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()") what
remains is a defunct declaration. Remove it.

Signed-off-by: Nikolay Borisov 
---
David, 

This is a fixlet to the aforementioned patch which is only in your misc-next, 
I'd suggest just squash this fixlet into the original one. 

 fs/btrfs/ctree.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fbe51edb9cac..03be51d77357 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2801,7 +2801,6 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct 
fstrim_range *range);
 int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
 int btrfs_delayed_refs_qgroup_accounting(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info);
-int __get_raid_index(u64 flags);
 int btrfs_start_write_no_snapshotting(struct btrfs_root *root);
 void btrfs_end_write_no_snapshotting(struct btrfs_root *root);
 void btrfs_wait_for_snapshot_creation(struct btrfs_root *root);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/2] btrfs: defer adding raid type kobject until after chunk relocation

2018-03-21 Thread Nikolay Borisov



On 20.03.2018 21:25, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Any time the first block group of a new type is created, we add a new
> kobject to sysfs to hold the attributes for that type.  Kobject-internal
> allocations always use GFP_KERNEL, making them prone to fs-reclaim races.
> While it appears as if this can occur any time a block group is created,
> the only times the first block group of a new type can be created in
> memory is at mount and when we create the first new block group during
> raid conversion.
> 
> This patch adds a new list to track pending kobject additions and then
> handles them after we do chunk relocation.  Between relocating the
> target chunk (or forcing allocation of a new chunk in the case of data)
> and removing the old chunk, we're in a safe place for fs-reclaim to
> occur.  We're holding the volume mutex, which is already held across
> page faults, and the delete_unused_bgs_mutex, which will only stall
> the cleaner thread.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  fs/btrfs/ctree.h   |  6 -
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/extent-tree.c | 59 
> +++---
>  fs/btrfs/sysfs.c   |  2 +-
>  fs/btrfs/volumes.c | 12 ++
>  5 files changed, 61 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index ffbb05aa66fa..75dbdf1bbead 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -381,8 +381,9 @@ struct btrfs_dev_replace {
>  
>  /* For raid type sysfs entries */
>  struct raid_kobject {
> - int raid_type;
> + u64 flags;
>   struct kobject kobj;
> + struct list_head list;
>  };
>  
>  struct btrfs_space_info {
> @@ -938,6 +939,8 @@ struct btrfs_fs_info {
>   int thread_pool_size;
>  
>   struct kobject *space_info_kobj;
> + struct list_head pending_raid_kobjs;
> + spinlock_t pending_raid_kobjs_lock; /* uncontended */
>  
>   u64 total_pinned;
>  
> @@ -2688,6 +2691,7 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, 
> u64 bytenr);
>  int btrfs_make_block_group(struct btrfs_trans_handle *trans,
>  struct btrfs_fs_info *fs_info, u64 bytes_used,
>  u64 type, u64 chunk_offset, u64 size);
> +void btrfs_add_raid_kobjects(struct btrfs_fs_info *fs_info);
>  struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
>   struct btrfs_fs_info *fs_info,
>   const u64 chunk_offset);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index eb6bb3169a9e..d5e1c2ff71ff 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2447,6 +2447,8 @@ int open_ctree(struct super_block *sb,
>   INIT_LIST_HEAD(_info->delayed_iputs);
>   INIT_LIST_HEAD(_info->delalloc_roots);
>   INIT_LIST_HEAD(_info->caching_block_groups);
> + INIT_LIST_HEAD(_info->pending_raid_kobjs);
> + spin_lock_init(_info->pending_raid_kobjs_lock);
>   spin_lock_init(_info->delalloc_root_lock);
>   spin_lock_init(_info->trans_lock);
>   spin_lock_init(_info->fs_roots_radix_lock);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0e9a21230217..bb5368faa937 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9908,9 +9908,39 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>   return 0;
>  }
>  
> +/* link_block_group will queue up kobjects to add when we're reclaim-safe */
> +void btrfs_add_raid_kobjects(struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_space_info *space_info;
> + struct raid_kobject *rkobj;
> + LIST_HEAD(list);
> + int index;
> + int ret = 0;
> +
> + spin_lock(_info->pending_raid_kobjs_lock);
> + list_splice_init(_info->pending_raid_kobjs, );
> + spin_unlock(_info->pending_raid_kobjs_lock);
> +
> + list_for_each_entry(rkobj, , list) {
> + space_info = __find_space_info(fs_info, rkobj->flags);
> + index = __get_raid_index(rkobj->flags);

This function is no more. It was refactored in 24bc2843edd5 ("btrfs:
Refactor __get_raid_index() to btrfs_bg_flags_to_raid_index()")


So I guess you should use btrfs_bg_flags_to_raid_index instead.

> +
> + ret = kobject_add(>kobj, _info->kobj,
> +   "%s", get_raid_name(index));
> + if (ret) {
> + kobject_put(>kobj);
> + break;
> + }
> + }
> + if (ret)
> + btrfs_warn(fs_info,
> +"failed to add kobject for block cache, ignoring");
> +}
> +
>  static void link_block_group(struct btrfs_block_group_cache *cache)
>  {
>   struct btrfs_space_info *space_info = cache->space_info;
> + struct btrfs_fs_info *fs_info = cache->fs_info;
>   int index = get_block_group_index(cache);
>   bool first = false;
>  
> @@ -9921,27 +9951,19 @@ static void

Re: [PATCH 1/2] btrfs: remove dead create_space_info calls

2018-03-21 Thread Nikolay Borisov



On 20.03.2018 21:25, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Since commit 2be12ef79 (btrfs: Separate space_info create/update), we've
> separated out the creation and updating of the space info structures.
> That commit was a straightforward refactoring of the two parts of
> update_space_info, but we can go a step further.  Since commits
> c59021f84 (Btrfs: fix OOPS of empty filesystem after balance) and
> b742bb82f (Btrfs: Link block groups of different raid types), we know
> that the space_info structures will be created at mount and there will
> only ever be, at most, three of them.
> 
> This patch cleans out the create_space_info calls after __find_space_info
> returns NULL since __find_space_info *can't* return NULL.
> 
> The initial cause for reviewing this was the kobject_add calls from
> create_space_info occuring in sites where fs-reclaim wasn't allowed.  Now
> we are certain they occur only early in the mount process and are safe.
> 
> Signed-off-by: Jeff Mahoney 

So when I did this cleanup I really wasn't sure under what conditions
space_info was allocated apart from mount time. I'm glad you were able
to prove it's not really done at any other time.

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h   |  6 +++---
>  fs/btrfs/extent-tree.c | 16 ++--
>  2 files changed, 5 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index da308774b8a4..ffbb05aa66fa 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -952,9 +952,9 @@ struct btrfs_fs_info {
>   struct btrfs_fs_devices *fs_devices;
>  
>   /*
> -  * the space_info list is almost entirely read only.  It only changes
> -  * when we add a new raid type to the FS, and that happens
> -  * very rarely.  RCU is used to protect it.
> +  * The space_info list is effectively read only after initial
> +  * setup.  It is populated at mount time and cleaned up after
> +  * all block groups are removed.  RCU is used to protect it.
>*/
>   struct list_head space_info;
>  
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 2d5e963fae88..0e9a21230217 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4603,11 +4603,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
> *trans,
>   return -ENOSPC;
>  
>   space_info = __find_space_info(fs_info, flags);
> - if (!space_info) {
> - ret = create_space_info(fs_info, flags, _info);
> - if (ret)
> - return ret;
> - }
> + ASSERT(space_info);
>  
>  again:
>   spin_lock(_info->lock);
> @@ -10255,15 +10251,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
> *trans,
>* with its ->space_info set.
>*/
>   cache->space_info = __find_space_info(fs_info, cache->flags);
> - if (!cache->space_info) {
> - ret = create_space_info(fs_info, cache->flags,
> ->space_info);
> - if (ret) {
> - btrfs_remove_free_space_cache(cache);
> - btrfs_put_block_group(cache);
> - return ret;
> - }
> - }
> + ASSERT(cache->space_info);
>  
>   ret = btrfs_add_block_group_cache(fs_info, cache);
>   if (ret) {
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 2/2] btrfs: return required error from btrfs_check_super_csum

2018-03-21 Thread Nikolay Borisov



On 21.03.2018 01:33, Anand Jain wrote:
> Return the required -EINVAL and -EUCLEAN from the function
> btrfs_check_super_csum(). And more the error log into the
> parent function.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/disk-io.c | 28 +---
>  1 file changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 582ed6af3c50..4c6de2743250 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -392,9 +392,10 @@ static int verify_parent_transid(struct extent_io_tree 
> *io_tree,
>  /*
>   * Return 0 if the superblock checksum type matches the checksum value of 
> that
>   * algorithm. Pass the raw disk superblock data.
> + * Otherwise: -EINVAL  if csum type is not found
> + * -EUCLEAN if csum does not match
>   */
> -static int btrfs_check_super_csum(struct btrfs_fs_info *fs_info,
> -   char *raw_disk_sb)
> +static int btrfs_check_super_csum(char *raw_disk_sb)
>  {
>   struct btrfs_super_block *disk_sb =
>   (struct btrfs_super_block *)raw_disk_sb;
> @@ -403,11 +404,8 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
> *fs_info,
>   char result[sizeof(crc)];
>  
>   /* We support csum type crc32 only as of now */
> - if (csum_type != BTRFS_CSUM_TYPE_CRC32) {
> - btrfs_err(fs_info, "unsupported checksum algorithm %u",
> -   csum_type);
> - return 1;
> - }
> + if (csum_type != BTRFS_CSUM_TYPE_CRC32)
> + return -EINVAL;
>  
>   /*
>* The super_block structure does not span the whole
> @@ -419,7 +417,7 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
> *fs_info,
>   btrfs_csum_final(crc, result);
>  
>   if (memcmp(raw_disk_sb, result, sizeof(result)))
> - return 1;
> + return -EUCLEAN;
>  
>   return 0;
>  }
> @@ -2571,9 +2569,17 @@ int open_ctree(struct super_block *sb,
>* We want to check superblock checksum, the type is stored inside.
>* Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
>*/
> - if (btrfs_check_super_csum(fs_info, bh->b_data)) {
> - btrfs_err(fs_info, "superblock checksum mismatch");
> - err = -EINVAL;
> + err = btrfs_check_super_csum(bh->b_data);
> + if (err) {
> + if (err == -EINVAL)
> + pr_err("BTRFS error (device %pg): unsupported checksum 
> algorithm",
> + fs_devices->latest_bdev);
> + else if (err == -EUCLEAN)
> + pr_err("BTRFS error (device %pg): superblock checksum 
> mismatch",
> + fs_devices->latest_bdev);
> + else
> + pr_err("BTRFS error (device %pg): checksum check failed 
> %d",
> + fs_devices->latest_bdev, err);

This is redundant, it's never going to be executed since only
EINVAL/EUCLEAN is returned.


>   brelse(bh);
>   goto fail_alloc;
>   }
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 1/2] btrfs: cleanup btrfs_check_super_csum() to check the csum_type to the type

2018-03-21 Thread Nikolay Borisov



On 21.03.2018 01:33, Anand Jain wrote:
> %csum_type is being checked to check the number of csum types we support,
> which is 1 as of now. Instead just check if the type matches with the
> only type we support, that is BTRFS_CSUM_TYPE_CRC32. And further adds
> cleanup.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/disk-io.c | 41 +++--
>  1 file changed, 19 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 1657d6aa4fa6..582ed6af3c50 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -399,32 +399,29 @@ static int btrfs_check_super_csum(struct btrfs_fs_info 
> *fs_info,
>   struct btrfs_super_block *disk_sb =
>   (struct btrfs_super_block *)raw_disk_sb;
>   u16 csum_type = btrfs_super_csum_type(disk_sb);
> - int ret = 0;
> -
> - if (csum_type == BTRFS_CSUM_TYPE_CRC32) {
> - u32 crc = ~(u32)0;
> - char result[sizeof(crc)];
> -
> - /*
> -  * The super_block structure does not span the whole
> -  * BTRFS_SUPER_INFO_SIZE range, we expect that the unused space
> -  * is filled with zeros and is included in the checksum.
> -  */
> - crc = btrfs_csum_data(raw_disk_sb + BTRFS_CSUM_SIZE,
> - crc, BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
> - btrfs_csum_final(crc, result);
> -
> - if (memcmp(raw_disk_sb, result, sizeof(result)))
> - ret = 1;
> - }
> + u32 crc = ~(u32)0;
> + char result[sizeof(crc)];
>  
> - if (csum_type >= ARRAY_SIZE(btrfs_csum_sizes)) {
> + /* We support csum type crc32 only as of now */
> + if (csum_type != BTRFS_CSUM_TYPE_CRC32) {
>   btrfs_err(fs_info, "unsupported checksum algorithm %u",
> - csum_type);
> - ret = 1;
> +   csum_type);
> + return 1;
>   }
>  
> - return ret;
> + /*
> +  * The super_block structure does not span the whole
> +  * BTRFS_SUPER_INFO_SIZE range, we expect that the unused space
> +  * is filled with zeros and is included in the checksum.
> +  */
> + crc = btrfs_csum_data(raw_disk_sb + BTRFS_CSUM_SIZE,
> +   crc, BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
> + btrfs_csum_final(crc, result);
> +
> + if (memcmp(raw_disk_sb, result, sizeof(result)))
> + return 1;
> +
> + return 0;
>  }
>  
>  /*
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

49 matches

Mail list logo