"decompress failed" in 1-2 files always causes kernel oops, check/scrub pass

2018-05-11 Thread james harvey
100% reproducible, booting from disk, or even Arch installation ISO.
Kernel 4.16.7.  btrfs-progs v4.16.

Reading one of two journalctl files causes a kernel oops.  Initially
ran into it from "journalctl --list-boots", but cat'ing the file does
it too.  I believe this shows there's compressed data that is invalid,
but its btrfs checksum is valid.  I've cat'ed every file on the
disk, and luckily have the problems narrowed down to only these 2
files in /var/log/journal.

This volume has always been mounted with lzo compression.

Scrub has never found anything, and I have run it since the oops.

Found a user a few years ago who also ran into this, without
resolution, at:
https://www.spinics.net/lists/linux-btrfs/msg52218.html

1. Cat'ing a (non-essential) file shouldn't be able to bring down the system.

2. If this is in fact invalid compressed data, there should be a way to
check for that.  Btrfs check and scrub pass.

Hardware is fine.  Passes memtest86+ in SMP mode.  Works fine on all
other files.



[  381.869940] BUG: unable to handle kernel paging request at 00390e50
[  381.870881] BTRFS: decompress failed
[  381.891775] IP: rebalance_domains+0x8a/0x2c0
[  381.891776] PGD 0 P4D 0
[  381.891780] Oops:  [#1] PREEMPT SMP PTI
[  381.891782] Modules linked in:
[  381.891784] BTRFS: decompress failed
[  381.891784]  8021q mrp wl(PO) btrfs dm_thin_pool ast
[  381.891788] BTRFS: decompress failed
[  381.891789]  dm_persistent_data dm_bio_prison dm_bufio libcrc32c
i2c_algo_bit crc32c_generic intel_rapl ttm sb_edac zstd_compress
drm_kms_helper xor x86_pkg_temp_thermal intel_powerclamp drm raid6_pq
raid1 agpgart coretemp md_mod cfg80211 syscopyarea sysfillrect
kvm_intel dm_mod sysimgblt kvm fb_sys_fops joydev irqbypass rfkill
iTCO_wdt iTCO_vendor_support crct10dif_pclmul ghash_clmulni_intel
ipmi_ssif rtc_cmos ipmi_si intel_cstate mei_me ipmi_devintf
intel_uncore ipmi_msghandler shpchp pcspkr mousedev input_leds
led_class psmouse intel_rapl_perf lpc_ich mei i2c_i801 evdev mac_hid
ip_tables x_tables overlay squashfs zstd_decompress xxhash loop isofs
sr_mod cdrom sd_mod
[  381.891835] BTRFS: decompress failed
[  381.891835]  hid_generic usbhid hid uas usb_storage
[  381.891838] BTRFS: decompress failed
[  381.891838]  serio_raw atkbd libps2 crc32_pclmul
[  381.891840] BTRFS: decompress failed
[  381.891841]  crc32c_intel isci ahci aesni_intel
[  381.891843] BTRFS: decompress failed
[  381.891843]  aes_x86_64 libsas libahci crypto_simd
[  381.891845] BTRFS: decompress failed
[  381.891845]  ehci_pci ehci_hcd cryptd glue_helper
[  381.891847] BTRFS: decompress failed
[  381.891847]  libata scsi_transport_sas e1000e mlx4_core usbcore ptp
pps_core scsi_mod usb_common devlink wmi i8042 serio
[  381.891855] CPU: 11 PID: 0 Comm: swapper/11 Tainted: P   O
   4.16.7-1-ARCH #1
[  381.891856] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./EP2C602, BIOS P1.80 12/09/2013
[  381.891858] RIP: 0010:rebalance_domains+0x8a/0x2c0
[  381.891859] RSP: 0018:8e6c5f2c3f08 EFLAGS: 00010206
[  381.891860] RAX:  RBX: 00390de8 RCX: 0005
[  381.891861] RDX: 00015ff2 RSI: 024d RDI: 00133340
[  381.891862] RBP: 00015ff4 R08:  R09: 0001
[  381.891863] R10:  R11:  R12: 0001
[  381.891863] R13:  R14: 0001 R15: 00bd7801f8e8a9c8
[  381.891865] FS:  () GS:8e6c5f2c()
knlGS:
[  381.891865] CS:  0010 DS:  ES:  CR0: 80050033
[  381.891866] CR2: 00390e50 CR3: 000e6100a004 CR4: 000606e0
[  381.891867] Call Trace:
[  381.891870]  
[  381.891875]  __do_softirq+0xf1/0x2e0
[  381.891880]  irq_exit+0xc9/0xe0
[  381.903429] BTRFS: decompress failed
[  381.916574]  smp_apic_timer_interrupt+0x73/0x160
[  381.916576]  apic_timer_interrupt+0xf/0x20
[  381.916578]  
[  381.916581] RIP: 0010:cpuidle_enter_state+0xb6/0x2e0
[  381.916582] RSP: 0018:939f863fbea8 EFLAGS: 0246 ORIG_RAX:
ff12
[  381.916583] RAX: 8e6c5f2c RBX: 0058e9388d6f RCX: 001f
[  381.916584] RDX: 0058e9388d6f RSI: 96e70d54 RDI: 96e70fb2
[  381.916585] RBP: 8e6c5f2ebe00 R08: 02044b2e9556 R09: 337b
[  381.916585] R10: 471b R11: 8e6c5f2e07c4 R12: 0003
[  381.916586] R13: 970ae338 R14: 0058e9215560 R15: 
[  381.916591]  ? cpuidle_enter_state+0x94/0x2e0
[  381.916593]  do_idle+0x193/0x1b0
[  381.916595]  cpu_startup_entry+0x6f/0x80
[  381.916599]  start_secondary+0x1a5/0x200
[  381.916602]  secondary_startup_64+0xa5/0xb0
[  381.916603] Code: 46 00 00 48 03 04 d5 40 f4
[  381.924842] BTRFS: decompress failed
[  381.937936] ee 96 48 8b 98 c0 09 00 00 48 85 db 0f 84 32 02 00 00
45 31 ff 45 31 f6 45 31 e4 48 8b 15 e6 aa f4 00 <48> 8b 43 68 48 39 53
70 79 2e 48 89 c2 41 be 01 00 00 00 48 c1

Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-11 Thread Duncan
Darrick J. Wong posted on Fri, 11 May 2018 17:06:34 -0700 as excerpted:

> On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
>> Right now we return EINVAL if a process does not have permission to dedupe a
>> file. This was an oversight on my part. EPERM gives a true description of
>> the nature of our error, and EINVAL is already used for the case that the
>> filesystem does not support dedupe.
>> 
>> Signed-off-by: Mark Fasheh 
>> ---
>>  fs/read_write.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 77986a2e2a3b..8edef43a182c 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>>  info->status = -EINVAL;
>>  } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
>>   uid_eq(current_fsuid(), dst->i_uid))) {
>> -info->status = -EINVAL;
>> +info->status = -EPERM;
> 
> Hmm, are we allowed to change this aspect of the kabi after the fact?
> 
> Granted, we're only trading one error code for another, but will the
> existing users of this care?  xfs_io won't and I assume duperemove won't
> either, but what about bees? :)

From the 0/2 cover letter:

>>> This has also popped up in duperemove, mostly in the form of cryptic
>>> error messages. Because this is a code returned to userspace, I did
>>> check the other users of extent-same that I could find. Both 'bees'
>>> and 'rust-btrfs' do the same as duperemove and simply report the error
>>> (as they should).

> --D
> 
>>  } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>>  info->status = -EXDEV;
>>  } else if (S_ISDIR(dst->i_mode)) {
>> -- 
>> 2.15.1
>>

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-11 Thread Amir Goldstein
On Sat, May 12, 2018 at 3:06 AM, Darrick J. Wong wrote:
> On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
>> Right now we return EINVAL if a process does not have permission to dedupe a
>> file. This was an oversight on my part. EPERM gives a true description of
>> the nature of our error, and EINVAL is already used for the case that the
>> filesystem does not support dedupe.
>>
>> Signed-off-by: Mark Fasheh 
>> ---
>>  fs/read_write.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 77986a2e2a3b..8edef43a182c 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>>   info->status = -EINVAL;
>>   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
>>uid_eq(current_fsuid(), dst->i_uid))) {
>> - info->status = -EINVAL;
>> + info->status = -EPERM;
>
> Hmm, are we allowed to change this aspect of the kabi after the fact?
>
> Granted, we're only trading one error code for another, but will the
> existing users of this care?  xfs_io won't and I assume duperemove won't
> either, but what about bees? :)
>

Relaxing -EINVAL is the common case with kabi.
Every new flag we add support for does that, and it is also common
that a newly supported flag is restricted to certain capabilities, so
EINVAL is replaced with EPERM.
BTW, the man page doesn't say anything about the is_admin case.

IMO EPERM makes perfect sense here and btw, we also return
EPERM from overlayfs if dst is on lower layer.

Mark,

Please be aware that these changes conflict with Miklos' dedupe-cleanup
patches, so I suggest you collaborate on that
https://marc.info/?l=linux-fsdevel&m=152568128031031&w=2

Thanks,
Amir.
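
For userspace callers of the dedupe ioctl, the practical effect of the
change discussed in this thread is one more negative status value to
expect. A minimal sketch of how a tool might report it, assuming a
hypothetical helper dedupe_status_str() (not taken from duperemove, bees
or any other tool) and locally defined constants that mirror
FILE_DEDUPE_RANGE_SAME and FILE_DEDUPE_RANGE_DIFFERS from <linux/fs.h>:

```c
#include <errno.h>
#include <string.h>

/*
 * Local stand-ins (hypothetical names) for FILE_DEDUPE_RANGE_SAME and
 * FILE_DEDUPE_RANGE_DIFFERS, so the sketch builds without kernel headers.
 */
enum {
	SKETCH_DEDUPE_SAME = 0,
	SKETCH_DEDUPE_DIFFERS = 1,
};

/*
 * Hypothetical helper: describe the per-destination status reported by
 * the dedupe ioctl. Negative values are negated errno codes.
 */
static const char *dedupe_status_str(int status)
{
	if (status == SKETCH_DEDUPE_SAME)
		return "range deduped";
	if (status == SKETCH_DEDUPE_DIFFERS)
		return "data differs";
	if (status == -EPERM)
		return "permission denied";
	if (status == -EINVAL)
		return "invalid or unsupported (also permission, before this patch)";
	if (status < 0)
		return strerror(-status);
	return "unexpected status";
}
```

A caller that only prints the error, as the cover letter says the known
tools do, would simply start showing a clearer message once the kernel
returns -EPERM.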


Re: [PATCH 1/2] vfs: allow dedupe of user owned read-only files

2018-05-11 Thread Adam Borowski
On Fri, May 11, 2018 at 12:26:50PM -0700, Mark Fasheh wrote:
> The permission check in vfs_dedupe_file_range() is too coarse - We
> only allow dedupe of the destination file if the user is root, or
> they have the file open for write.
> 
> This effectively limits a non-root user from deduping their own
> read-only files. As file data during a dedupe does not change,
> this is unexpected behavior and this has caused a number of issue
> reports. For an example, see:
> 
> https://github.com/markfasheh/duperemove/issues/129
> 
> So change the check so we allow dedupe on the target if:
> 
> - the root or admin is asking for it
> - the owner of the file is asking for the dedupe
> - the process has write access

I submitted a similar patch in May 2016, yet it has never been applied
despite multiple pings, with no NAK.  My version allowed dedupe if:
- the root or admin is asking for it
- the file has w permission (on the inode -- ie, could have been opened rw)

There was a request to include in xfstests a test case for the ETXTBSY race
this patch fixes, but there's no reasonable way to make such a test case:
the race condition is not a bug, it's write-xor-exec working as designed.

Another idea discussed was about possibly just allowing everyone who can
open the file to deduplicate it, as the file contents are not modified in
any way.  Zygo Blaxell expressed a concern that it could be used by an
unprivileged user who can trigger a crash to abuse writeout bugs.

I like this new version better than mine: "root or owner or w" is more
Unixy than "could have been opened w".
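
The policies compared above can be modelled in userspace to see where
they differ. This is only a sketch: struct dedupe_req and the three
function names are hypothetical, and the booleans stand in for the
kernel expressions named in the comments.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical userspace stand-in for the state the kernel check reads. */
struct dedupe_req {
	bool is_admin;         /* caller is root or admin */
	bool opened_for_write; /* dst_file->f_mode & FMODE_WRITE */
	bool inode_writable;   /* inode_permission(dst, MAY_WRITE) == 0 */
	uid_t caller_fsuid;    /* current_fsuid() */
	uid_t owner_uid;       /* dst->i_uid */
};

/* Check before either patch: root/admin, or file open for write. */
static bool allowed_before(const struct dedupe_req *r)
{
	return r->is_admin || r->opened_for_write;
}

/* Mark Fasheh's version: additionally allow the file's owner. */
static bool allowed_owner_variant(const struct dedupe_req *r)
{
	return r->is_admin || r->opened_for_write ||
	       r->caller_fsuid == r->owner_uid;
}

/* The May 2016 version: root/admin, or write permission on the inode. */
static bool allowed_inode_perm_variant(const struct dedupe_req *r)
{
	return r->is_admin || r->inode_writable;
}
```

For a non-root owner who holds a read-only file open read-only, only
allowed_owner_variant() returns true, which is the duperemove case the
patch description refers to.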

> Signed-off-by: Mark Fasheh 
> ---
>  fs/read_write.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index c4eabbfc90df..77986a2e2a3b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -2036,7 +2036,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  
>   if (info->reserved) {
>   info->status = -EINVAL;
> - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
> + } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
> +  uid_eq(current_fsuid(), dst->i_uid))) {
I had:
  + } else if (!(is_admin || !inode_permission(dst, MAY_WRITE))) {
>   info->status = -EINVAL;
>   } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>   info->status = -EXDEV;
> -- 

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 1/2 V2] hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs

2018-05-11 Thread Darrick J. Wong
On Fri, May 11, 2018 at 04:41:45PM +0200, David Sterba wrote:
> On Fri, May 11, 2018 at 09:36:09AM -0500, Eric Sandeen wrote:
> > On 5/11/18 9:32 AM, Chris Mason wrote:
> > > On 11 May 2018, at 10:10, David Sterba wrote:
> > > 
> > >> On Thu, May 10, 2018 at 08:16:09PM +0100, Al Viro wrote:
> > >>> On Thu, May 10, 2018 at 01:13:57PM -0500, Eric Sandeen wrote:
> > >>>> Move the btrfs label ioctls up to the vfs for general use.
> > >>>>
> > >>>> This retains 256 chars as the maximum size through the interface, which
> > >>>> is the btrfs limit and AFAIK exceeds any other filesystem's maximum
> > >>>> label size.
> > >>>>
> > >>>> Signed-off-by: Eric Sandeen
> > >>>> Reviewed-by: Andreas Dilger
> > >>>> Reviewed-by: David Sterba
> > >>>
> > >>> No objections (and it obviously ought to go through btrfs tree).
> > >>
> > >> I can take it through my tree, but Eric mentioned that there's a patch
> > >> for xfs that depends on it. In this case it would make sense to take
> > >> both patches at once via the xfs tree. There are no pending conflicting
> > >> changes in btrfs.
> > > 
> > > Probably easiest to just have a separate pull dedicated just for this 
> > > series.  That way it doesn't really matter which tree it goes through.
> > 
> > > Actually, I just realized that the changes to include/uapi/linux/fs.h are
> > > completely independent of any btrfs changes, right - there's nothing wrong
> > > w/ redefining the common ioctl under a different name in btrfs.  So the
> > > fs.h patch could go first, through the xfs tree since it'll be using it.
> > > 
> > > Once the common ioctl definition goes in, then btrfs can change to define
> > > its ioctls to the common ioctls, or act on them directly as my patch did,
> > > etc.  Would that be a better plan?  IOWs there's no urgent need to
> > > coordinate a btrfs change.
> 
> Agreed, I like that plan.

Ok, I'll await a new series with all the patches that Eric wants to
squeeze through the xfs tree.  I don't mind carrying the btrfs changes
too, so long as they're one-liners and the btrfs maintainers ack/rvb it.

--D
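
For readers who want to see what the hoisted ioctl looks like from
userspace, here is a minimal sketch. The names FS_IOC_GETFSLABEL and
FSLABEL_MAX do not appear in this thread; they are, as far as I know,
what the definitions are called in <linux/fs.h> once the series is
applied, so treat them (and the fallback values, which mirror the btrfs
ioctl number and the 256-char limit mentioned above) as assumptions.
read_fs_label() is a hypothetical helper.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Fallbacks for headers that predate the hoisted definitions (assumed values). */
#ifndef FSLABEL_MAX
#define FSLABEL_MAX 256
#endif
#ifndef FS_IOC_GETFSLABEL
#define FS_IOC_GETFSLABEL _IOR(0x94, 49, char[FSLABEL_MAX])
#endif

/*
 * Hypothetical helper: read the label of the filesystem containing path.
 * Returns 0 on success, -1 with errno set on failure (for example ENOTTY
 * on a filesystem that does not implement the ioctl).
 */
static int read_fs_label(const char *path, char label[FSLABEL_MAX])
{
	int fd, ret, saved_errno;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_GETFSLABEL, label);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);
	errno = saved_errno;

	return ret < 0 ? -1 : 0;
}
```

Because the request number is the same one btrfs already uses, the same
call should keep working on btrfs whichever name the header exposes.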


Re: [PATCH] btrfs: qgroup: Search commit root for rescan to avoid missing extent

2018-05-11 Thread Qu Wenruo


On 2018-05-12 01:08, Jeff Mahoney wrote:
> On 5/3/18 3:20 AM, Qu Wenruo wrote:
>> When doing qgroup rescan using the following script (modified from
>> btrfs/017 test case), we can sometimes hit qgroup corruption.
>>
>> --
>> umount $dev &> /dev/null
>> umount $mnt &> /dev/null
>>
>> mkfs.btrfs -f -n 64k $dev
>> mount $dev $mnt
>>
>> extent_size=8192
>>
>> xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
>> btrfs subvolume snapshot $mnt $mnt/snap
>>
>> xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
>> xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
>> xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/null
>> btrfs quota enable $mnt
>>
>>  # -W is the new option to only wait rescan while not starting new one
>> btrfs quota rescan -W $mnt
>> btrfs qgroup show -prce $mnt
>>
>>  # Need to patch btrfs-progs to report qgroup mismatch as error
>> btrfs check $dev || _fail
>> --
>>
>> For fast machine, we can hit some corruption which missed accounting
>> tree blocks:
>> --
>> qgroupid         rfer         excl     max_rfer     max_excl parent  child
>> --------         ----         ----     --------     -------- ------  -----
>> 0/5           8.00KiB        0.00B         none         none ---     ---
>> 0/257         8.00KiB        0.00B         none         none ---     ---
>> --
>>
>> This is due to the fact that we're always searching commit root for
>> btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
>> from current transaction, not commit root.
>>
>> And if our tree blocks get modified in current transaction, we won't
>> find any owner in commit root, thus causing the corruption.
>>
>> Fix it by searching commit root for extent tree for
>> qgroup_rescan_leaf().
>>
>> Reported-by: Nikolay Borisov 
>> Signed-off-by: Qu Wenruo 
>> ---
>>
>> Please keep in mind that it is possible to hit another type of race
>> which double accounting tree blocks:
>> --
>> qgroupid         rfer         excl     max_rfer     max_excl parent  child
>> --------         ----         ----     --------     -------- ------  -----
>> 0/5         136.00KiB    128.00KiB         none         none ---     ---
>> 0/257       136.00KiB    128.00KiB         none         none ---     ---
>> --
>> For this type of corruption, this patch could reduce the possibility,
>> but the root cause is race between transaction commit and qgroup rescan,
>> which needs to be addressed in another patch.
>> ---
>>  fs/btrfs/qgroup.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
>> index 4baa4ba2d630..829e8fe5c97e 100644
>> --- a/fs/btrfs/qgroup.c
>> +++ b/fs/btrfs/qgroup.c
>> @@ -2681,6 +2681,11 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
>>  path = btrfs_alloc_path();
>>  if (!path)
>>  goto out;
>> +/*
>> + * Rescan should only search for commit root, and any later difference
>> + * should be recorded by qgroup
>> + */
>> +path->search_commit_root = 1;
>>  
>>  err = 0;
>>  while (!err && !btrfs_fs_closing(fs_info)) {
>>
> 
> If we're searching the commit root here, do we need the tree mod
> sequence number dance in qgroup_rescan_leaf anymore?

No, so I'll remove it in next version.

Thanks for pointing this out,
Qu

> 
> -Jeff
> 





Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-11 Thread Darrick J. Wong
On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
> Right now we return EINVAL if a process does not have permission to dedupe a
> file. This was an oversight on my part. EPERM gives a true description of
> the nature of our error, and EINVAL is already used for the case that the
> filesystem does not support dedupe.
> 
> Signed-off-by: Mark Fasheh 
> ---
>  fs/read_write.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 77986a2e2a3b..8edef43a182c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>   info->status = -EINVAL;
>   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
>uid_eq(current_fsuid(), dst->i_uid))) {
> - info->status = -EINVAL;
> + info->status = -EPERM;

Hmm, are we allowed to change this aspect of the kabi after the fact?

Granted, we're only trading one error code for another, but will the
existing users of this care?  xfs_io won't and I assume duperemove won't
either, but what about bees? :)

--D

>   } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>   info->status = -EXDEV;
>   } else if (S_ISDIR(dst->i_mode)) {
> -- 
> 2.15.1
> 


Re: [PATCH 1/2] vfs: allow dedupe of user owned read-only files

2018-05-11 Thread Darrick J. Wong
On Fri, May 11, 2018 at 12:26:50PM -0700, Mark Fasheh wrote:
> The permission check in vfs_dedupe_file_range() is too coarse - We
> only allow dedupe of the destination file if the user is root, or
> they have the file open for write.
> 
> This effectively limits a non-root user from deduping their own
> read-only files. As file data during a dedupe does not change,
> this is unexpected behavior and this has caused a number of issue
> reports. For an example, see:
> 
> https://github.com/markfasheh/duperemove/issues/129
> 
> So change the check so we allow dedupe on the target if:
> 
> - the root or admin is asking for it
> - the owner of the file is asking for the dedupe
> - the process has write access
> 
> Signed-off-by: Mark Fasheh 

Sounds fine I guess
Acked-by: Darrick J. Wong 

--D

> ---
>  fs/read_write.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index c4eabbfc90df..77986a2e2a3b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -2036,7 +2036,8 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  
>   if (info->reserved) {
>   info->status = -EINVAL;
> - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
> + } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
> +  uid_eq(current_fsuid(), dst->i_uid))) {
>   info->status = -EINVAL;
>   } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>   info->status = -EXDEV;
> -- 
> 2.15.1
> 


Re: [RFC v2 3/4] ext4: add verifier check for symlink with append/immutable flags

2018-05-11 Thread Jan Kara
On Thu 10-05-18 16:13:58, Luis R. Rodriguez wrote:
> The Linux VFS does not allow a way to set append/immutable
> attributes to symlinks, this is just not possible. If this is
> detected inform the user as the filesystem must be corrupted.
> 
> Signed-off-by: Luis R. Rodriguez 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext4/inode.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 37a2f7a2b66a..6acf0dd6b6e6 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4947,6 +4947,13 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>   inode->i_op = &ext4_dir_inode_operations;
>   inode->i_fop = &ext4_dir_operations;
>   } else if (S_ISLNK(inode->i_mode)) {
> + /* VFS does not allow setting these so must be corruption */
> + if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
> + EXT4_ERROR_INODE(inode,
> +   "immutable or append flags not allowed on symlinks");
> + ret = -EFSCORRUPTED;
> + goto bad_inode;
> + }
>   if (ext4_encrypted_inode(inode)) {
>   inode->i_op = &ext4_encrypted_symlink_inode_operations;
>   ext4_set_aops(inode);
> -- 
> 2.17.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 06:16:15PM +0100, Filipe Manana wrote:
> On Fri, May 11, 2018 at 5:49 PM, David Sterba  wrote:
> > On Fri, May 11, 2018 at 05:25:50PM +0100, Filipe Manana wrote:
> >> On Fri, May 11, 2018 at 4:57 PM, David Sterba  wrote:
> >> > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> >> > arrays can be 32KiB large. To avoid allocation failures due to
> >> > fragmented memory, use the allocation with fallback to vmalloc.
> >> >
> >> > Signed-off-by: David Sterba 
> >> > ---
> >> >
> >> > This depends on the patches that remove the 16MiB restriction in the
> >> > dedupe ioctl, but contextually can be applied to the current code too.
> >> >
> >> > https://patchwork.kernel.org/patch/10374941/
> >> >
> >> >  fs/btrfs/ioctl.c | 4 ++--
> >> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >> > index b572e38b4b64..a7f517009cd7 100644
> >> > --- a/fs/btrfs/ioctl.c
> >> > +++ b/fs/btrfs/ioctl.c
> >> > @@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
> >> >  * locking. We use an array for the page pointers. Size of the array is
> >> >  * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
> >> >  */
> >> > -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> >> > -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> >> > +   cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> >> > +   cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> >>
> >> kvzalloc() takes 2 parameters, not 3.
> >
> > And the right function is kvmalloc_array.
> >
> >> Also, aren't the corresponding kvfree() calls missing?
> >
> > Yes, thanks for catching it. The updated version:
> >
> > From: David Sterba 
> > Subject: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data
> >
> > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> > arrays can be 32KiB large. To avoid allocation failures due to
> > fragmented memory, use the allocation with fallback to vmalloc.
> >
> > Signed-off-by: David Sterba 
> > ---
> >  fs/btrfs/ioctl.c | 16 +---
> >  1 file changed, 9 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index b572e38b4b64..4fcfa05ed960 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -3178,12 +3178,13 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
> >  * locking. We use an array for the page pointers. Size of the array is
> >  * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
> >  */
> > -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> > -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> > +   cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> > +  GFP_KERNEL);
> > +   cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> > +  GFP_KERNEL);
> > if (!cmp.src_pages || !cmp.dst_pages) {
> > -   kfree(cmp.src_pages);
> > -   kfree(cmp.dst_pages);
> > -   return -ENOMEM;
> > +   ret = -ENOMEM;
> > +   goto out_free;
> > }
> >
> > if (same_inode)
> > @@ -3211,8 +3212,9 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
> > else
> > btrfs_double_inode_unlock(src, dst);
> >
> > -   kfree(cmp.src_pages);
> > -   kfree(cmp.dst_pages);
> > +out_free:
> > +   kvfree(cmp.src_pages);
> > +   kvfree(cmp.dst_pages);
> 
> kvfree() missing at btrfs_cmp_data_free() too.

Bah, no more quick fixes from me today. I'll drop the patch from the
branch and will have a fresh look another day.
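
For reference, the 32KiB figure in the commit message follows directly
from the numbers it gives. A small sketch of the arithmetic, with
page_array_bytes() being a hypothetical helper written for this note:

```c
#include <stddef.h>

/*
 * Bytes needed for one array of page pointers covering range_bytes of
 * file data: one pointer per page, with the page count rounded up.
 */
static size_t page_array_bytes(size_t range_bytes, size_t page_size,
			       size_t ptr_size)
{
	size_t num_pages = (range_bytes + page_size - 1) / page_size;

	return num_pages * ptr_size;
}
```

page_array_bytes(16 * 1024 * 1024, 4096, 8) gives 4096 pointers of 8
bytes each, i.e. 32768 bytes per array. With kcalloc() that is eight
physically contiguous 4KiB pages for each of src_pages and dst_pages,
which is the allocation the patch wants to let fall back to vmalloc.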


Re: [PATCH 1/3] fs: add initial bh_result->b_private value to __blockdev_direct_IO()

2018-05-11 Thread Omar Sandoval
On Fri, May 11, 2018 at 09:32:28PM +0100, Al Viro wrote:
> On Fri, May 11, 2018 at 01:30:01PM -0700, Omar Sandoval wrote:
> > On Fri, May 11, 2018 at 09:05:38PM +0100, Al Viro wrote:
> > > On Thu, May 10, 2018 at 11:30:10PM -0700, Omar Sandoval wrote:
> > > >  do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> > > >   struct block_device *bdev, struct iov_iter *iter,
> > > >   get_block_t get_block, dio_iodone_t end_io,
> > > > - dio_submit_t submit_io, int flags)
> > > > + dio_submit_t submit_io, int flags, void *private)
> > > 
> > > Oh, dear...  That's what, 9 arguments?  I agree that the hack in question
> > > is obscene, but so is this ;-/
> > 
> > So looking at these one by one, obviously needed:
> > 
> > - iocb
> > - inode
> > - iter
> > 
> > bdev is almost always inode->i_sb->s_bdev, except for Btrfs :(
> > 
> > These could _maybe_ go in struct kiocb:
> > 
> > - flags could maybe be folded into ki_flags
> > - private could maybe go in iocb->private, but I haven't yet read
> >   through to figure out if we're already using iocb->private for direct
> >   I/O

Modifying kiocb isn't going to pan out, it's constructed way up in the
stack so that'd be a mess.

> > That leaves the callbacks, get_block, end_io, and submit_io. Perhaps we
> > can add those to inode_operations?
> 
> Or, perhaps, btrfs shouldn't be using the common helper?  The question
> is not where to stash the bits and pieces - it's how unreadable the callers
> are and how much boilerplate/hidden information is involved...

I need to call through to do_blockdev_direct_IO() eventually, I'm sure
no one wants me to reimplement the 200 lines in there :) so I'd be happy
to add a separate helper that only Btrfs uses, but if we're going to
call do_blockdev_direct_IO() eventually then we still need the 9
arguments in some form. Am I misunderstanding?


Re: [PATCH 1/3] fs: add initial bh_result->b_private value to __blockdev_direct_IO()

2018-05-11 Thread Al Viro
On Fri, May 11, 2018 at 01:30:01PM -0700, Omar Sandoval wrote:
> On Fri, May 11, 2018 at 09:05:38PM +0100, Al Viro wrote:
> > On Thu, May 10, 2018 at 11:30:10PM -0700, Omar Sandoval wrote:
> > >  do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> > > struct block_device *bdev, struct iov_iter *iter,
> > > get_block_t get_block, dio_iodone_t end_io,
> > > -   dio_submit_t submit_io, int flags)
> > > +   dio_submit_t submit_io, int flags, void *private)
> > 
> > Oh, dear...  That's what, 9 arguments?  I agree that the hack in question
> > is obscene, but so is this ;-/
> 
> So looking at these one by one, obviously needed:
> 
> - iocb
> - inode
> - iter
> 
> bdev is almost always inode->i_sb->s_bdev, except for Btrfs :(
> 
> These could _maybe_ go in struct kiocb:
> 
> - flags could maybe be folded into ki_flags
> - private could maybe go in iocb->private, but I haven't yet read
>   through to figure out if we're already using iocb->private for direct
>   I/O
> 
> That leaves the callbacks, get_block, end_io, and submit_io. Perhaps we
> can add those to inode_operations?

Or, perhaps, btrfs shouldn't be using the common helper?  The question
is not where to stash the bits and pieces - it's how unreadable the callers
are and how much boilerplate/hidden information is involved...


Re: [PATCH 1/3] fs: add initial bh_result->b_private value to __blockdev_direct_IO()

2018-05-11 Thread Omar Sandoval
On Fri, May 11, 2018 at 09:05:38PM +0100, Al Viro wrote:
> On Thu, May 10, 2018 at 11:30:10PM -0700, Omar Sandoval wrote:
> >  do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> >   struct block_device *bdev, struct iov_iter *iter,
> >   get_block_t get_block, dio_iodone_t end_io,
> > - dio_submit_t submit_io, int flags)
> > + dio_submit_t submit_io, int flags, void *private)
> 
> Oh, dear...  That's what, 9 arguments?  I agree that the hack in question
> is obscene, but so is this ;-/

So looking at these one by one, obviously needed:

- iocb
- inode
- iter

bdev is almost always inode->i_sb->s_bdev, except for Btrfs :(

These could _maybe_ go in struct kiocb:

- flags could maybe be folded into ki_flags
- private could maybe go in iocb->private, but I haven't yet read
  through to figure out if we're already using iocb->private for direct
  I/O

That leaves the callbacks, get_block, end_io, and submit_io. Perhaps we
can add those to inode_operations?

Thoughts?


[PATCH v4 06/12] Btrfs: delete dead code in btrfs_orphan_commit_root()

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_orphan_commit_root() tries to delete an orphan item for a
subvolume in the tree root, but we don't actually insert that item in
the first place. See commit 0a0d4415e338 ("Btrfs: delete dead code in
btrfs_orphan_add()"). We can get rid of it.

Reviewed-by: Josef Bacik 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9ef20d28fa9e..84d7dd3a30f9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3302,7 +3302,6 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 {
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *block_rsv;
-   int ret;
 
if (atomic_read(&root->orphan_inodes) ||
root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
@@ -3323,17 +3322,6 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
root->orphan_block_rsv = NULL;
spin_unlock(&root->orphan_lock);
 
-   if (test_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED, &root->state) &&
-   btrfs_root_refs(&root->root_item) > 0) {
-   ret = btrfs_del_orphan_item(trans, fs_info->tree_root,
-   root->root_key.objectid);
-   if (ret)
-   btrfs_abort_transaction(trans, ret);
-   else
-   clear_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED,
- &root->state);
-   }
-
if (block_rsv) {
WARN_ON(block_rsv->size > 0);
btrfs_free_block_rsv(fs_info, block_rsv);
-- 
2.17.0



[PATCH v4 09/12] Btrfs: fix ENOSPC caused by orphan items reservations

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Currently, we keep space reserved for all inode orphan items until the
inode is evicted (i.e., all references to it are dropped). We hit an
issue where an application would keep a bunch of deleted files open (by
design) and thus keep a large amount of space reserved, causing ENOSPC
errors when other operations tried to reserve space. This long-standing
reservation isn't absolutely necessary for a couple of reasons:

- We can almost always make the reservation we need or steal from the
  global reserve for the orphan item
- If we can't, it's not the end of the world if we drop the orphan item
  on the floor and let the next mount clean it up

So, get rid of persistent reservation and just reserve space in
btrfs_evict_inode().
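
As a rough illustration of the accounting difference, the following
user-space model compares holding one unit per deleted-but-open file with
reserving only briefly at eviction time. struct space_pool and all of the
helpers are hypothetical and exist only for this sketch; they are not btrfs
functions, and real reservations are sized in bytes rather than units.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical pool of reservable metadata units. */
struct space_pool {
	unsigned long free;
};

int pool_reserve(struct space_pool *p, unsigned long n)
{
	if (p->free < n)
		return -ENOSPC;
	p->free -= n;
	return 0;
}

void pool_release(struct space_pool *p, unsigned long n)
{
	p->free += n;
}

/* Old scheme: reserve at unlink and hold the unit until eviction. */
int unlink_persistent(struct space_pool *p)
{
	return pool_reserve(p, 1);
}

void evict_persistent(struct space_pool *p)
{
	pool_release(p, 1);
}

/*
 * New scheme: nothing is held while the file stays open. At eviction we
 * reserve briefly; if that fails, the orphan item is simply left for the
 * next mount to clean up. Returns 1 if the item was deleted, 0 if left.
 */
int evict_transient(struct space_pool *p)
{
	if (pool_reserve(p, 1))
		return 0;
	pool_release(p, 1);
	return 1;
}
```

With three units free and three deleted files held open, the old scheme
leaves nothing for anyone else, while the new scheme leaves the pool
untouched until eviction actually happens.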

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 158 ---
 1 file changed, 38 insertions(+), 120 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0a753f3a3321..efa67284ebb6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3331,77 +3331,16 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 /*
  * This creates an orphan entry for the given inode in case something goes wrong
  * in the middle of an unlink.
- *
- * NOTE: caller of this function should reserve 5 units of metadata for
- *  this function.
  */
 int btrfs_orphan_add(struct btrfs_trans_handle *trans,
-   struct btrfs_inode *inode)
+struct btrfs_inode *inode)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
-   struct btrfs_root *root = inode->root;
-   struct btrfs_block_rsv *block_rsv = NULL;
-   int reserve = 0;
int ret;
 
-   if (!root->orphan_block_rsv) {
-   block_rsv = btrfs_alloc_block_rsv(fs_info,
- BTRFS_BLOCK_RSV_TEMP);
-   if (!block_rsv)
-   return -ENOMEM;
-   }
-
-   if (!test_and_set_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
- &inode->runtime_flags))
-   reserve = 1;
-
-   spin_lock(&root->orphan_lock);
-   /* If someone has created ->orphan_block_rsv, be happy to use it. */
-   if (!root->orphan_block_rsv) {
-   root->orphan_block_rsv = block_rsv;
-   } else if (block_rsv) {
-   btrfs_free_block_rsv(fs_info, block_rsv);
-   block_rsv = NULL;
-   }
-
-   atomic_inc(&root->orphan_inodes);
-   spin_unlock(&root->orphan_lock);
-
-   /* grab metadata reservation from transaction handle */
-   if (reserve) {
-   ret = btrfs_orphan_reserve_metadata(trans, inode);
-   ASSERT(!ret);
-   if (ret) {
-   /*
-* dec doesn't need spin_lock as ->orphan_block_rsv
-* would be released only if ->orphan_inodes is
-* zero.
-*/
-   atomic_dec(&root->orphan_inodes);
-   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
- &inode->runtime_flags);
-   return ret;
-   }
-   }
-
-   /* insert an orphan item to track this unlinked file */
-   ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
-   if (ret) {
-   if (reserve) {
-   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
- &inode->runtime_flags);
-   btrfs_orphan_release_metadata(inode);
-   }
-   /*
-* btrfs_orphan_commit_root may race with us and set
-* ->orphan_block_rsv to zero, in order to avoid that,
-* decrease ->orphan_inodes after everything is done.
-*/
-   atomic_dec(&root->orphan_inodes);
-   if (ret != -EEXIST) {
-   btrfs_abort_transaction(trans, ret);
-   return ret;
-   }
+   ret = btrfs_insert_orphan_item(trans, inode->root, btrfs_ino(inode));
+   if (ret && ret != -EEXIST) {
+   btrfs_abort_transaction(trans, ret);
+   return ret;
}
 
return 0;
@@ -3414,24 +3353,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
 static int btrfs_orphan_del(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
 {
-   struct btrfs_root *root = inode->root;
-   int ret = 0;
-
-   if (trans)
-   ret = btrfs_del_orphan_item(trans, root, btrfs_ino(inode));
-
-   if (test_and_clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
-  &inode->runtime_flags))
-   btrfs_orphan_release_metadata(inode);
-
-   /*
-* btrfs_orphan_commit_root may race with us and set ->orphan_block_rsv
-   

[PATCH v4 12/12] Btrfs: reserve space for O_TMPFILE orphan item deletion

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but
it doesn't reserve space to do so. Even before the removal of the
orphan_block_rsv it wasn't using it.
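
The sizing rule in the diff below can be read as a tiny pure function. This
is only a sketch: link_items_to_reserve() is a hypothetical name, and it
assumes (as the patch's i_nlink test does) that an O_TMPFILE inode has a
link count of zero until it is linked.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the patch's expression:
 *   2 items for inode and inode ref
 *   2 items for dir items
 *   1 item for parent inode
 *   1 item for orphan item deletion if the inode has no links yet
 */
unsigned int link_items_to_reserve(unsigned int nlink)
{
	return nlink ? 5 : 6;
}
```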

Fixes: ef3b9af50bfa ("Btrfs: implement inode_operations callback tmpfile")
Reviewed-by: Filipe Manana 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5499b4e8a522..fefa665e64da 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6465,8 +6465,9 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 * 2 items for inode and inode ref
 * 2 items for dir items
 * 1 item for parent inode
+* 1 item for orphan item deletion if O_TMPFILE
 */
-   trans = btrfs_start_transaction(root, 5);
+   trans = btrfs_start_transaction(root, inode->i_nlink ? 5 : 6);
if (IS_ERR(trans)) {
err = PTR_ERR(trans);
trans = NULL;
-- 
2.17.0



[PATCH v4 10/12] Btrfs: get rid of unused orphan infrastructure

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can get rid of it, along with:

- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
  in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
  and does nothing else

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h   |  8 
 fs/btrfs/disk-io.c |  9 -
 fs/btrfs/extent-tree.c | 38 -
 fs/btrfs/inode.c   | 43 +-
 fs/btrfs/transaction.c |  1 -
 6 files changed, 1 insertion(+), 99 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index cb7dc0aa4253..4807cde0313d 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -20,7 +20,6 @@
  * new data the application may have written before commit.
  */
 #define BTRFS_INODE_ORDERED_DATA_CLOSE 0
-#define BTRFS_INODE_ORPHAN_META_RESERVED   1
 #define BTRFS_INODE_DUMMY  2
 #define BTRFS_INODE_IN_DEFRAG  3
 #define BTRFS_INODE_HAS_ASYNC_EXTENT   5
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2771cc56a622..51408de11af2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1219,9 +1219,6 @@ struct btrfs_root {
spinlock_t log_extents_lock[2];
struct list_head logged_list[2];
 
-   spinlock_t orphan_lock;
-   atomic_t orphan_inodes;
-   struct btrfs_block_rsv *orphan_block_rsv;
int orphan_cleanup_state;
 
spinlock_t inode_lock;
@@ -2764,9 +2761,6 @@ void btrfs_delalloc_release_space(struct inode *inode,
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
u64 len);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
-int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
- struct btrfs_inode *inode);
-void btrfs_orphan_release_metadata(struct btrfs_inode *inode);
 int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 struct btrfs_block_rsv *rsv,
 int nitems,
@@ -3238,8 +3232,6 @@ int btrfs_update_inode_fallback(struct btrfs_trans_handle *trans,
 int btrfs_orphan_add(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode);
 int btrfs_orphan_cleanup(struct btrfs_root *root);
-void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
- struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60caa68c3618..4a40bfdddabc 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1185,7 +1185,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->inode_tree = RB_ROOT;
INIT_RADIX_TREE(&root->delayed_nodes_tree, GFP_ATOMIC);
root->block_rsv = NULL;
-   root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
INIT_LIST_HEAD(&root->root_list);
@@ -1195,7 +1194,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&root->ordered_root);
INIT_LIST_HEAD(&root->logged_list[0]);
INIT_LIST_HEAD(&root->logged_list[1]);
-   spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
spin_lock_init(&root->delalloc_lock);
spin_lock_init(&root->ordered_extent_lock);
@@ -1216,7 +1214,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
atomic_set(&root->log_batch, 0);
-   atomic_set(&root->orphan_inodes, 0);
refcount_set(&root->refs, 1);
atomic_set(&root->will_be_snapshotted, 0);
root->log_transid = 0;
@@ -3674,8 +3671,6 @@ static void free_fs_root(struct btrfs_root *root)
 {
iput(root->ino_cache_inode);
WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
-   btrfs_free_block_rsv(root->fs_info, root->orphan_block_rsv);
-   root->orphan_block_rsv = NULL;
if (root->anon_dev)
free_anon_bdev(root->anon_dev);
if (root->subv_writers)
@@ -3766,7 +3761,6 @@ int btrfs_commit_super(struct btrfs_fs_info *fs_info)
 
 void close_ctree(struct btrfs_fs_info *fs_info)
 {
-   struct btrfs_root *root = fs_info->tree_root;
int ret;
 
set_bit(BTRFS_FS_CLOSING_START, &fs_info->fla

[PATCH v4 07/12] Btrfs: don't return ino to ino cache if inode item removal fails

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
item will still be in the tree but we still return the ino to the ino
cache. That will blow up later when someone tries to allocate that ino,
so don't return it to the cache.

Fixes: 581bb050941b ("Btrfs: Cache free inode numbers in memory")
Reviewed-by: Josef Bacik 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 84d7dd3a30f9..ad4b7fb62f46 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5330,13 +5330,18 @@ void btrfs_evict_inode(struct inode *inode)
trans->block_rsv = rsv;
 
ret = btrfs_truncate_inode_items(trans, root, inode, 0, 0);
-   if (ret != -ENOSPC && ret != -EAGAIN)
+   if (ret) {
+   trans->block_rsv = &fs_info->trans_block_rsv;
+   btrfs_end_transaction(trans);
+   btrfs_btree_balance_dirty(fs_info);
+   if (ret != -ENOSPC && ret != -EAGAIN) {
+   btrfs_orphan_del(NULL, BTRFS_I(inode));
+   btrfs_free_block_rsv(fs_info, rsv);
+   goto no_delete;
+   }
+   } else {
break;
-
-   trans->block_rsv = &fs_info->trans_block_rsv;
-   btrfs_end_transaction(trans);
-   trans = NULL;
-   btrfs_btree_balance_dirty(fs_info);
+   }
}
 
btrfs_free_block_rsv(fs_info, rsv);
@@ -5345,12 +5350,8 @@ void btrfs_evict_inode(struct inode *inode)
 * Errors here aren't a big deal, it just means we leave orphan items
 * in the tree.  They will be cleaned up on the next mount.
 */
-   if (ret == 0) {
-   trans->block_rsv = root->orphan_block_rsv;
-   btrfs_orphan_del(trans, BTRFS_I(inode));
-   } else {
-   btrfs_orphan_del(NULL, BTRFS_I(inode));
-   }
+   trans->block_rsv = root->orphan_block_rsv;
+   btrfs_orphan_del(trans, BTRFS_I(inode));
 
trans->block_rsv = &fs_info->trans_block_rsv;
if (!(root == fs_info->tree_root ||
-- 
2.17.0



[PATCH v4 05/12] Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Now that we don't add orphan items for truncate, there can't be races on
adding or deleting an orphan item, so this bit is unnecessary.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/inode.c   | 76 +++---
 2 files changed, 20 insertions(+), 57 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 234bae55b85d..cb7dc0aa4253 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -23,7 +23,6 @@
 #define BTRFS_INODE_ORPHAN_META_RESERVED   1
 #define BTRFS_INODE_DUMMY  2
 #define BTRFS_INODE_IN_DEFRAG  3
-#define BTRFS_INODE_HAS_ORPHAN_ITEM  4
 #define BTRFS_INODE_HAS_ASYNC_EXTENT   5
 #define BTRFS_INODE_NEEDS_FULL_SYNC  6
 #define BTRFS_INODE_COPY_EVERYTHING  7
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 110ccd40987e..9ef20d28fa9e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3354,7 +3354,6 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
struct btrfs_root *root = inode->root;
struct btrfs_block_rsv *block_rsv = NULL;
int reserve = 0;
-   bool insert = false;
int ret;
 
if (!root->orphan_block_rsv) {
@@ -3364,10 +3363,6 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
return -ENOMEM;
}
 
-   if (!test_and_set_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
- &inode->runtime_flags))
-   insert = true;
-
if (!test_and_set_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
  &inode->runtime_flags))
reserve = 1;
@@ -3381,8 +3376,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
block_rsv = NULL;
}
 
-   if (insert)
-   atomic_inc(&root->orphan_inodes);
+   atomic_inc(&root->orphan_inodes);
spin_unlock(&root->orphan_lock);
 
/* grab metadata reservation from transaction handle */
@@ -3398,36 +3392,28 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
atomic_dec(&root->orphan_inodes);
clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
  &inode->runtime_flags);
-   if (insert)
-   clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
- &inode->runtime_flags);
return ret;
}
}
 
/* insert an orphan item to track this unlinked file */
-   if (insert) {
-   ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
-   if (ret) {
-   if (reserve) {
-   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
- &inode->runtime_flags);
-   btrfs_orphan_release_metadata(inode);
-   }
-   /*
-* btrfs_orphan_commit_root may race with us and set
-* ->orphan_block_rsv to zero, in order to avoid that,
-* decrease ->orphan_inodes after everything is done.
-*/
-   atomic_dec(&root->orphan_inodes);
-   if (ret != -EEXIST) {
-   clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
- &inode->runtime_flags);
-   btrfs_abort_transaction(trans, ret);
-   return ret;
-   }
+   ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
+   if (ret) {
+   if (reserve) {
+   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
+ &inode->runtime_flags);
+   btrfs_orphan_release_metadata(inode);
+   }
+   /*
+* btrfs_orphan_commit_root may race with us and set
+* ->orphan_block_rsv to zero, in order to avoid that,
+* decrease ->orphan_inodes after everything is done.
+*/
+   atomic_dec(&root->orphan_inodes);
+   if (ret != -EEXIST) {
+   btrfs_abort_transaction(trans, ret);
+   return ret;
}
-   ret = 0;
}
 
return 0;
@@ -3441,14 +3427,9 @@ static int btrfs_orphan_del(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
 {
struct btrfs_root *root = inode->root;
-   int delete_item = 0;
int ret = 0;
 
-   if (test_and_clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
-  &inode->runtime_flags))
-   delete_item = 1;
-
-   if (delete_item && trans)
+ 

[PATCH v4 03/12] Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_free_extent() can fail because of ENOMEM. There's no reason to
panic here, we can just abort the transaction.

Fixes: f4b9aa8d3b87 ("btrfs_truncate")
Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bfa0e094a60e..fa1da1991001 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4655,7 +4655,10 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
extent_num_bytes, 0,
btrfs_header_owner(leaf),
ino, extent_offset);
-   BUG_ON(ret);
+   if (ret) {
+   btrfs_abort_transaction(trans, ret);
+   break;
+   }
if (btrfs_should_throttle_delayed_refs(trans, fs_info))
btrfs_async_run_delayed_refs(fs_info,
trans->delayed_ref_updates * 2,
-- 
2.17.0



[PATCH v4 00/12] Btrfs: orphan and truncate fixes

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This is the fourth (and hopefully final) version of the orphan item
early ENOSPC and related fixes.

Changes since v3:

- Changed another stale comment in patch 1
- Moved BTRFS_INODE_ORPHAN_META_RESERVED flag removal to patch 10
  instead of patch 9
- Moved inode runtime flag renumbering to a separate patch (patch 11)
- Added some more reviewed-bys

Changes since v2:

- Add patch 5 to get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
- Move patch 10 to patch 6
- Got rid of patch 5; the bug goes away in the process of removing code
  for patches 9 and 10
- Rename patch 10 back to what it was called in v1

Changes since v1:

- Added two extra cleanups, patches 10 and 11
- Added a forgotten clear of the orphan bit in patch 8
- Reworded titles of patches 6 and 9
- Added people's reviewed-bys

Cover letter from v1:

At Facebook we hit an early ENOSPC issue which we tracked down to the
reservations for orphan items of deleted-but-still-open files. The
primary function of this series is to fix that bug, but I ended up
uncovering a pile of other issues in the process, most notably that the
orphan items we create for truncate are useless.

I've also posted an xfstest that reproduces this bug.

Thanks!

Omar Sandoval (12):
  Btrfs: update stale comments referencing vmtruncate()
  Btrfs: fix error handling in btrfs_truncate_inode_items()
  Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()
  Btrfs: stop creating orphan items for truncate
  Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
  Btrfs: delete dead code in btrfs_orphan_commit_root()
  Btrfs: don't return ino to ino cache if inode item removal fails
  Btrfs: refactor btrfs_evict_inode() reserve refill dance
  Btrfs: fix ENOSPC caused by orphan items reservations
  Btrfs: get rid of unused orphan infrastructure
  Btrfs: renumber BTRFS_INODE_ runtime flags
  Btrfs: reserve space for O_TMPFILE orphan item deletion

 fs/btrfs/btrfs_inode.h  |  18 +-
 fs/btrfs/ctree.h|   8 -
 fs/btrfs/disk-io.c  |   9 -
 fs/btrfs/extent-tree.c  |  38 ---
 fs/btrfs/free-space-cache.c |   6 +-
 fs/btrfs/inode.c| 580 ++--
 fs/btrfs/transaction.c  |   1 -
 7 files changed, 172 insertions(+), 488 deletions(-)

-- 
2.17.0



[PATCH v4 01/12] Btrfs: update stale comments referencing vmtruncate()

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Commit a41ad394a03b ("Btrfs: convert to the new truncate sequence")
changed btrfs_setsize() to call truncate_setsize() instead of
vmtruncate() but didn't update the comment above it. truncate_setsize()
never fails (the IS_SWAPFILE() check happens elsewhere), so remove the
comment.

Additionally, the comment above btrfs_page_mkwrite() references
vmtruncate(), but truncate_setsize() does the size write and page
locking now.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..0c644ad7e1cb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5106,7 +5106,6 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
if (ret)
return ret;
 
-   /* we don't support swapfiles, so vmtruncate shouldn't fail */
truncate_setsize(inode, newsize);
 
/* Disable nonlocked read DIO to avoid the end less truncate */
@@ -8868,8 +8867,8 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
  *
  * We are not allowed to take the i_mutex here so we have to play games to
  * protect against truncate races as the page could now be beyond EOF.  Because
- * vmtruncate() writes the inode size before removing pages, once we have the
- * page lock we can determine safely if the page is beyond EOF. If it is not
+ * truncate_setsize() writes the inode size before removing pages, once we have
+ * the page lock we can determine safely if the page is beyond EOF. If it is not
  * beyond EOF, then the page is guaranteed safe against truncation until we
  * unlock the page.
  */
-- 
2.17.0



[PATCH v4 11/12] Btrfs: renumber BTRFS_INODE_ runtime flags

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

We got rid of BTRFS_INODE_HAS_ORPHAN_ITEM and
BTRFS_INODE_ORPHAN_META_RESERVED, so we can renumber the flags to make
them consecutive again.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/btrfs_inode.h | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 4807cde0313d..bbbe7f308d68 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -20,14 +20,14 @@
  * new data the application may have written before commit.
  */
 #define BTRFS_INODE_ORDERED_DATA_CLOSE 0
-#define BTRFS_INODE_DUMMY  2
-#define BTRFS_INODE_IN_DEFRAG  3
-#define BTRFS_INODE_HAS_ASYNC_EXTENT   5
-#define BTRFS_INODE_NEEDS_FULL_SYNC  6
-#define BTRFS_INODE_COPY_EVERYTHING  7
-#define BTRFS_INODE_IN_DELALLOC_LIST   8
-#define BTRFS_INODE_READDIO_NEED_LOCK  9
-#define BTRFS_INODE_HAS_PROPS  10
+#define BTRFS_INODE_DUMMY  1
+#define BTRFS_INODE_IN_DEFRAG  2
+#define BTRFS_INODE_HAS_ASYNC_EXTENT   3
+#define BTRFS_INODE_NEEDS_FULL_SYNC  4
+#define BTRFS_INODE_COPY_EVERYTHING  5
+#define BTRFS_INODE_IN_DELALLOC_LIST   6
+#define BTRFS_INODE_READDIO_NEED_LOCK  7
+#define BTRFS_INODE_HAS_PROPS  8
 
 /* in memory btrfs inode */
 struct btrfs_inode {
-- 
2.17.0



[PATCH v4 08/12] Btrfs: refactor btrfs_evict_inode() reserve refill dance

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

The truncate loop in btrfs_evict_inode() does two things at once:

- It refills the temporary block reserve, potentially stealing from the
  global reserve or committing
- It calls btrfs_truncate_inode_items()

The tangle of continues hides the fact that these two steps are actually
separate. Split the first step out into a separate function both for
clarity and so that we can reuse it in a later patch.
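
For readers who want the control flow without the kernel plumbing, here is
a user-space model of the refill step. It is only a sketch under stated
assumptions: struct fs_model, struct rsv_model, and every helper are
hypothetical, joining the transaction and the delayed-refs space check are
left out, and a "commit" is modelled as simply making pinned space free.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the temporary block reserve. */
struct rsv_model {
	unsigned long reserved;
};

/* Hypothetical stand-in for filesystem-wide space state. */
struct fs_model {
	unsigned long free;    /* space a normal refill can take */
	unsigned long global;  /* space in the global reserve */
	unsigned long pinned;  /* space that becomes free after a commit */
	int commits;           /* number of commits performed */
};

static int refill(struct fs_model *fs, struct rsv_model *rsv,
		  unsigned long min_size)
{
	unsigned long need;

	if (rsv->reserved >= min_size)
		return 0;
	need = min_size - rsv->reserved;
	if (fs->free < need)
		return -ENOSPC;
	fs->free -= need;
	rsv->reserved += need;
	return 0;
}

static int steal_from_global(struct fs_model *fs, struct rsv_model *rsv,
			     unsigned long min_size)
{
	if (fs->global < min_size)
		return -ENOSPC;
	fs->global -= min_size;
	rsv->reserved += min_size;
	return 0;
}

static void commit(struct fs_model *fs)
{
	fs->free += fs->pinned;
	fs->pinned = 0;
	fs->commits++;
}

/* Returns 0 once space is secured, -ENOSPC after the third failed refill. */
int refill_or_give_up(struct fs_model *fs, struct rsv_model *rsv,
		      unsigned long min_size)
{
	int failures = 0;

	assert(min_size > 0);
	for (;;) {
		int ret = refill(fs, rsv, min_size);

		if (ret && ++failures > 2)
			return -ENOSPC;
		if (!ret)
			return 0;
		/* Refill failed: try the global reserve first. */
		if (!steal_from_global(fs, rsv, min_size))
			return 0;
		/* If that fails too, commit and try again. */
		commit(fs);
	}
}
```

In this model a filesystem with no free, global, or pinned space gives up
after two commits, matching the "failures > 2" cutoff in the patch.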

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 113 ++-
 1 file changed, 42 insertions(+), 71 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ad4b7fb62f46..0a753f3a3321 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5189,13 +5189,52 @@ static void evict_inode_truncate_pages(struct inode *inode)
spin_unlock(&io_tree->lock);
 }
 
+static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
+   struct btrfs_block_rsv *rsv,
+   u64 min_size)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+   int failures = 0;
+
+   for (;;) {
+   struct btrfs_trans_handle *trans;
+   int ret;
+
+   ret = btrfs_block_rsv_refill(root, rsv, min_size,
+BTRFS_RESERVE_FLUSH_LIMIT);
+
+   if (ret && ++failures > 2) {
+   btrfs_warn(fs_info,
+  "could not allocate space for a delete; will truncate on mount");
+   return ERR_PTR(-ENOSPC);
+   }
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans) || !ret)
+   return trans;
+
+   /*
+* Try to steal from the global reserve if there is space for
+* it.
+*/
+   if (!btrfs_check_space_for_delayed_refs(trans, fs_info) &&
+   !btrfs_block_rsv_migrate(global_rsv, rsv, min_size, 0))
+   return trans;
+
+   /* If not, commit and try again. */
+   ret = btrfs_commit_transaction(trans);
+   if (ret)
+   return ERR_PTR(ret);
+   }
+}
+
 void btrfs_evict_inode(struct inode *inode)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_trans_handle *trans;
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct btrfs_block_rsv *rsv, *global_rsv;
-   int steal_from_global = 0;
+   struct btrfs_block_rsv *rsv;
u64 min_size;
int ret;
 
@@ -5248,85 +5287,17 @@ void btrfs_evict_inode(struct inode *inode)
}
rsv->size = min_size;
rsv->failfast = 1;
-   global_rsv = &fs_info->global_block_rsv;
 
btrfs_i_size_write(BTRFS_I(inode), 0);
 
-   /*
-* This is a bit simpler than btrfs_truncate since we've already
-* reserved our space for our orphan item in the unlink, so we just
-* need to reserve some slack space in case we add bytes and update
-* inode item when doing the truncate.
-*/
while (1) {
-   ret = btrfs_block_rsv_refill(root, rsv, min_size,
-BTRFS_RESERVE_FLUSH_LIMIT);
-
-   /*
-* Try and steal from the global reserve since we will
-* likely not use this space anyway, we want to try as
-* hard as possible to get this to work.
-*/
-   if (ret)
-   steal_from_global++;
-   else
-   steal_from_global = 0;
-   ret = 0;
-
-   /*
-* steal_from_global == 0: we reserved stuff, hooray!
-* steal_from_global == 1: we didn't reserve stuff, boo!
-* steal_from_global == 2: we've committed, still not a lot of
-* room but maybe we'll have room in the global reserve this
-* time.
-* steal_from_global == 3: abandon all hope!
-*/
-   if (steal_from_global > 2) {
-   btrfs_warn(fs_info,
-  "Could not get space for a delete, will truncate on mount %d",
-  ret);
-   btrfs_orphan_del(NULL, BTRFS_I(inode));
-   btrfs_free_block_rsv(fs_info, rsv);
-   goto no_delete;
-   }
-
-   trans = btrfs_join_transaction(root);
+   trans = evict_refill_and_join(root, rsv, min_size);
if (IS_ERR(trans)) {
btrfs_orphan_del(NULL, BTRFS_I(inode));
btrfs_free_blo

[PATCH v4 02/12] Btrfs: fix error handling in btrfs_truncate_inode_items()

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_truncate_inode_items() uses two variables for error handling, ret
and err. These are not handled consistently, leading to a couple of
bugs.

- Errors from btrfs_del_items() are handled but not propagated to the
  caller
- If btrfs_run_delayed_refs() fails and aborts the transaction, we
  continue running

Just use ret everywhere and simplify things a bit, fixing both of these
issues.
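
The resulting pattern can be shown in miniature. This is a hypothetical
user-space model, not the kernel function: truncate_model() and its
parameters are invented for illustration, with an array of step results
standing in for the helper return values inside the loop and a single
value standing in for the final btrfs_del_items() call.

```c
#include <assert.h>
#include <errno.h>

/*
 * One variable, ret, carries the status out of the loop. The final
 * cleanup runs only when ret is not a negative error, and a cleanup
 * failure is propagated instead of being dropped.
 */
int truncate_model(const int *step_results, int nsteps,
		   int final_del_result, int pending)
{
	int ret = 0;
	int i;

	for (i = 0; i < nsteps; i++) {
		ret = step_results[i];
		if (ret)
			break;	/* stop on the first error or special status */
	}

	if (ret >= 0 && pending) {
		int err = final_del_result;

		if (err)
			ret = err;
	}
	return ret;
}
```

A positive status (such as the NEED_TRUNCATE_BLOCK case in the diff) still
reaches the cleanup, which is why the check is ret >= 0 rather than !ret.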

Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling")
Fixes: 1262133b8d6f ("Btrfs: account for crcs in delayed ref processing")
Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 55 
 1 file changed, 28 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0c644ad7e1cb..bfa0e094a60e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4442,7 +4442,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
int pending_del_slot = 0;
int extent_type = -1;
int ret;
-   int err = 0;
u64 ino = btrfs_ino(BTRFS_I(inode));
u64 bytes_deleted = 0;
bool be_nice = false;
@@ -4494,22 +4493,19 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 * up a huge file in a single leaf.  Most of the time that
 * bytes_deleted is > 0, it will be huge by the time we get here
 */
-   if (be_nice && bytes_deleted > SZ_32M) {
-   if (btrfs_should_end_transaction(trans)) {
-   err = -EAGAIN;
-   goto error;
-   }
+   if (be_nice && bytes_deleted > SZ_32M &&
+   btrfs_should_end_transaction(trans)) {
+   ret = -EAGAIN;
+   goto out;
}
 
-
path->leave_spinning = 1;
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
-   if (ret < 0) {
-   err = ret;
+   if (ret < 0)
goto out;
-   }
 
if (ret > 0) {
+   ret = 0;
/* there are no items in the tree for us to truncate, we're
 * done
 */
@@ -4620,7 +4616,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 * We have to bail so the last_size is set to
 * just before this extent.
 */
-   err = NEED_TRUNCATE_BLOCK;
+   ret = NEED_TRUNCATE_BLOCK;
break;
}
 
@@ -4687,7 +4683,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
pending_del_nr);
if (ret) {
btrfs_abort_transaction(trans, ret);
-   goto error;
+   break;
}
pending_del_nr = 0;
}
@@ -4698,8 +4694,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
trans->delayed_ref_updates = 0;
ret = btrfs_run_delayed_refs(trans,
   updates * 2);
-   if (ret && !err)
-   err = ret;
+   if (ret)
+   break;
}
}
/*
@@ -4707,8 +4703,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 * and let the transaction restart
 */
if (should_end) {
-   err = -EAGAIN;
-   goto error;
+   ret = -EAGAIN;
+   break;
}
goto search_again;
} else {
@@ -4716,32 +4712,37 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
}
}
 out:
-   if (pending_del_nr) {
-   ret = btrfs_del_items(trans, root, path, pending_del_slot,
+   if (ret >= 0 && pending_del_nr) {
+   int err;
+
+   err = btrfs_del_items(trans, root, path, pending_del_slot,
  pending_del_nr);
-   if (ret)
-   btrfs_abort_transaction(trans, ret);
+   if (err) {
+   btrfs_abort_transaction(trans, err);
+   ret = err;
+   }
}
-error:
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
ASSERT(last_size >= new

[PATCH v4 04/12] Btrfs: stop creating orphan items for truncate

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Currently, we insert an orphan item during a truncate so that if there's
a crash, we don't leak extents past the on-disk i_size. However, since
commit 7f4f6e0a3f6d ("Btrfs: only update disk_i_size as we remove
extents"), we keep disk_i_size in sync with the extent items as we
truncate, so orphan cleanup will never have any extents to remove. Don't
bother with the superfluous orphan item.

Reviewed-by: Josef Bacik 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/free-space-cache.c |   6 +-
 fs/btrfs/inode.c| 159 +++-
 2 files changed, 51 insertions(+), 114 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index e5b569bebc73..d5f80cb300be 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -253,10 +253,8 @@ int btrfs_truncate_free_space_cache(struct btrfs_trans_handle *trans,
truncate_pagecache(inode, 0);
 
/*
-* We don't need an orphan item because truncating the free space cache
-* will never be split across transactions.
-* We don't need to check for -EAGAIN because we're a free space
-* cache inode
+* We skip the throttling logic for free space cache inodes, so we don't
+* need to check for -EAGAIN.
 */
ret = btrfs_truncate_inode_items(trans, root, inode,
 0, BTRFS_EXTENT_DATA_KEY);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fa1da1991001..110ccd40987e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3341,8 +3341,8 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle 
*trans,
 }
 
 /*
- * This creates an orphan entry for the given inode in case something goes
- * wrong in the middle of an unlink/truncate.
+ * This creates an orphan entry for the given inode in case something goes 
wrong
+ * in the middle of an unlink.
  *
  * NOTE: caller of this function should reserve 5 units of metadata for
  *  this function.
@@ -3405,7 +3405,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
}
}
 
-   /* insert an orphan item to track this unlinked/truncated file */
+   /* insert an orphan item to track this unlinked file */
if (insert) {
ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
if (ret) {
@@ -3434,8 +3434,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
 }
 
 /*
- * We have done the truncate/delete so we can go ahead and remove the orphan
- * item for this particular inode.
+ * We have done the delete so we can go ahead and remove the orphan item for
+ * this particular inode.
  */
 static int btrfs_orphan_del(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
@@ -3479,7 +3479,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
struct btrfs_trans_handle *trans;
struct inode *inode;
u64 last_objectid = 0;
-   int ret = 0, nr_unlink = 0, nr_truncate = 0;
+   int ret = 0, nr_unlink = 0;
 
if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEANUP_STARTED))
return 0;
@@ -3579,12 +3579,31 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
key.offset = found_key.objectid - 1;
continue;
}
+
}
+
/*
-* Inode is already gone but the orphan item is still there,
-* kill the orphan item.
+* If we have an inode with links, there are a couple of
+* possibilities. Old kernels (before v3.12) used to create an
+* orphan item for truncate indicating that there were possibly
+* extent items past i_size that needed to be deleted. In v3.12,
+* truncate was changed to update i_size in sync with the extent
+* items, but the (useless) orphan item was still created. Since
+* v4.18, we don't create the orphan item for truncate at all.
+*
+* So, this item could mean that we need to do a truncate, but
+* only if this filesystem was last used on a pre-v3.12 kernel
+* and was not cleanly unmounted. The odds of that are quite
+* slim, and it's a pain to do the truncate now, so just delete
+* the orphan item.
+*
+* It's also possible that this orphan item was supposed to be
+* deleted but wasn't. The inode number may have been reused,
+* but either way, we can delete the orphan item.
 */
-   if (ret == -ENOENT) {
+   if (ret == -ENOENT || inode->i_nlink) {
+   if (!ret)
+   iput(inode);
trans = btrfs_start_transaction(root, 1);
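
The comment added in the hunk above boils down to a small decision table. The following is only a user-space sketch of that table, not the kernel code: `classify_orphan` and the `orphan_action` enum are hypothetical names, and the two branches that do not delete the orphan item (other lookup errors, and an inode with no links) are assumptions about the parts of btrfs_orphan_cleanup() that the diff does not show.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome names, invented for this sketch. */
enum orphan_action {
	ORPHAN_DELETE_ITEM,   /* stale orphan item: just delete it */
	ORPHAN_FINISH_UNLINK, /* assumption: unlinked inode, finish the delete */
	ORPHAN_FAIL,          /* assumption: any other lookup error is returned */
};

/*
 * lookup_ret: result of looking up the inode (0 or a negative errno).
 * nlink:      link count of the inode, only meaningful when lookup_ret == 0.
 */
static enum orphan_action classify_orphan(int lookup_ret, unsigned int nlink)
{
	if (lookup_ret == -ENOENT)
		return ORPHAN_DELETE_ITEM;   /* inode already gone */
	if (lookup_ret)
		return ORPHAN_FAIL;
	if (nlink)
		return ORPHAN_DELETE_ITEM;   /* leftover truncate orphan or reused inode number */
	return ORPHAN_FINISH_UNLINK;
}
```

The first and third branches correspond to the `ret == -ENOENT || inode->i_nlink` condition in the patch; in both cases the orphan item is deleted without attempting a truncate.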
 

Re: [PATCH 1/3] fs: add initial bh_result->b_private value to __blockdev_direct_IO()

2018-05-11 Thread Al Viro
On Thu, May 10, 2018 at 11:30:10PM -0700, Omar Sandoval wrote:
>  do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
> struct block_device *bdev, struct iov_iter *iter,
> get_block_t get_block, dio_iodone_t end_io,
> -   dio_submit_t submit_io, int flags)
> +   dio_submit_t submit_io, int flags, void *private)

Oh, dear...  That's what, 9 arguments?  I agree that the hack in question
is obscene, but so is this ;-/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-11 Thread Mark Fasheh
Right now we return EINVAL if a process does not have permission to dedupe a
file. This was an oversight on my part. EPERM gives a true description of
the nature of our error, and EINVAL is already used for the case that the
filesystem does not support dedupe.

Signed-off-by: Mark Fasheh 
---
 fs/read_write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 77986a2e2a3b..8edef43a182c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
info->status = -EINVAL;
} else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
 uid_eq(current_fsuid(), dst->i_uid))) {
-   info->status = -EINVAL;
+   info->status = -EPERM;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
} else if (S_ISDIR(dst->i_mode)) {
-- 
2.15.1



[PATCH 1/2] vfs: allow dedupe of user owned read-only files

2018-05-11 Thread Mark Fasheh
The permission check in vfs_dedupe_file_range() is too coarse - We
only allow dedupe of the destination file if the user is root, or
they have the file open for write.

This effectively limits a non-root user from deduping their own
read-only files. As file data during a dedupe does not change,
this is unexpected behavior and this has caused a number of issue
reports. For an example, see:

https://github.com/markfasheh/duperemove/issues/129

So change the check so we allow dedupe on the target if:

- the root or admin is asking for it
- the owner of the file is asking for the dedupe
- the process has write access

Signed-off-by: Mark Fasheh 
---
 fs/read_write.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index c4eabbfc90df..77986a2e2a3b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2036,7 +2036,8 @@ int vfs_dedupe_file_range(struct file *file, struct 
file_dedupe_range *same)
 
if (info->reserved) {
info->status = -EINVAL;
-   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+   } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
+uid_eq(current_fsuid(), dst->i_uid))) {
info->status = -EINVAL;
} else if (file->f_path.mnt != dst_file->f_path.mnt) {
info->status = -EXDEV;
-- 
2.15.1
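
For readers who want to see the effect of the new condition in isolation, here is a user-space sketch of the old and new rule as plain predicates. It is not the kernel code: `struct dedupe_caller` and both function names are hypothetical, and the three fields simply stand in for `is_admin`, `dst_file->f_mode & FMODE_WRITE` and `current_fsuid()` from the hunk above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state the kernel check consults. */
struct dedupe_caller {
	bool is_admin;       /* is_admin in the patch */
	bool dst_open_write; /* dst_file->f_mode & FMODE_WRITE */
	unsigned int fsuid;  /* current_fsuid() */
};

/* Old rule: admin, or destination opened for write. */
static bool may_dedupe_old(const struct dedupe_caller *c, unsigned int dst_owner)
{
	(void)dst_owner; /* ownership was not considered */
	return c->is_admin || c->dst_open_write;
}

/* New rule: the owner of the destination file is allowed as well. */
static bool may_dedupe_new(const struct dedupe_caller *c, unsigned int dst_owner)
{
	return c->is_admin || c->dst_open_write || c->fsuid == dst_owner;
}
```

A non-root user with uid 1000 who opened their own file read-only is rejected by may_dedupe_old() and accepted by may_dedupe_new(); the same user is still rejected for a file owned by someone else.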



[PATCH 0/2] vfs: better dedupe permission check

2018-05-11 Thread Mark Fasheh
Hi,

The following patches fix a couple of issues with the permission
check we do in vfs_dedupe_file_range().

The first patch expands our check to allow dedupe of a readonly file
if the user owns it. Existing behavior is that we'll allow dedupe only
if:

- the user is an admin (root)
- the user has the file open for write

This makes it impossible for a user to dedupe their own file set
unless they do it as root, or ensure that all files have write
permission. There's a couple of duperemove bugs open for this:

https://github.com/markfasheh/duperemove/issues/129
https://github.com/markfasheh/duperemove/issues/86

The solution is simple - we allow dedupe of the target if the user
owns it. With that patch, a user can dedupe all of their files.

The 2nd patch fixes our return code for permission denied to be
EPERM. For some reason we're returning EINVAL - I think that's
probably my fault. At any rate, we need to be returning something
descriptive of the actual problem, otherwise callers see EINVAL and
can't really make a valid determination of what's gone wrong.

This has also popped up in duperemove, mostly in the form of cryptic
error messages. Because this is a code returned to userspace, I did
check the other users of extent-same that I could find. Both 'bees'
and 'rust-btrfs' do the same as duperemove and simply report the error
(as they should).

The patches are also available in git:

git pull https://github.com/markfasheh/linux dedupe-perms

Thanks,
  --Mark
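
To illustrate why the return code matters to callers, here is a rough sketch of how a userspace tool might classify the per-range status once EPERM is reported separately. The enum and `classify_dedupe_status` are hypothetical names made up for this example; the only facts taken from the kernel interface are that a status of 0 means the range was deduped, 1 means the contents differ, and negative values are negated errno codes.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification used by an imaginary dedupe tool. */
enum dedupe_outcome {
	DEDUPE_OK,          /* status 0: range was deduped */
	DEDUPE_DIFFERS,     /* status 1: contents differ */
	DEDUPE_NO_PERM,     /* -EPERM: caller may not dedupe into this file */
	DEDUPE_BAD_REQUEST, /* -EINVAL: invalid request or dedupe unsupported */
	DEDUPE_CROSS_MOUNT, /* -EXDEV: source and destination on different mounts */
	DEDUPE_OTHER_ERROR,
};

static enum dedupe_outcome classify_dedupe_status(int status)
{
	if (status == 0)
		return DEDUPE_OK;
	if (status == 1)
		return DEDUPE_DIFFERS;

	switch (-status) {
	case EPERM:
		return DEDUPE_NO_PERM;
	case EINVAL:
		return DEDUPE_BAD_REQUEST;
	case EXDEV:
		return DEDUPE_CROSS_MOUNT;
	default:
		return DEDUPE_OTHER_ERROR;
	}
}
```

Before the second patch, the DEDUPE_NO_PERM case could not be told apart from DEDUPE_BAD_REQUEST, which is where the cryptic error messages came from.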


Re: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

2018-05-11 Thread Timofey Titovets
Fri, May 11, 2018 at 20:32, Omar Sandoval wrote:

> On Fri, May 11, 2018 at 06:49:16PM +0200, David Sterba wrote:
> > On Fri, May 11, 2018 at 05:25:50PM +0100, Filipe Manana wrote:
> > > On Fri, May 11, 2018 at 4:57 PM, David Sterba wrote:
> > > > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> > > > arrays can be 32KiB large. To avoid allocation failures due to
> > > > fragmented memory, use the allocation with fallback to vmalloc.
> > > >
> > > > Signed-off-by: David Sterba 
> > > > ---
> > > >
> > > > This depends on the patches that remove the 16MiB restriction in the
> > > > dedupe ioctl, but contextually can be applied to the current code too.
> > > >
> > > > https://patchwork.kernel.org/patch/10374941/
> > > >
> > > >  fs/btrfs/ioctl.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > > > index b572e38b4b64..a7f517009cd7 100644
> > > > --- a/fs/btrfs/ioctl.c
> > > > +++ b/fs/btrfs/ioctl.c
> > > > @@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
> > > >  * locking. We use an array for the page pointers. Size of the array is
> > > >  * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
> > > >  */
> > > > -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> > > > -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> > > > +   cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> > > > +   cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> > >
> > > Kvzalloc should take 2 parameters and not 3.
> >
> > And the right function is kvmalloc_array.
> >
> > > Also, aren't the corresponding kvfree() calls missing?
> >
> > Yes, thanks for catching it. The updated version:
> >
> > From: David Sterba 
> > Subject: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data
> >
> > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> > arrays can be 32KiB large. To avoid allocation failures due to
> > fragmented memory, use the allocation with fallback to vmalloc.
> >
> > Signed-off-by: David Sterba 
> > ---
> >  fs/btrfs/ioctl.c | 16 +---
> >  1 file changed, 9 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index b572e38b4b64..4fcfa05ed960 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -3178,12 +3178,13 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
> >* locking. We use an array for the page pointers. Size of the array is
> >* bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
> >*/
> > - cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> > - cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> > + cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> > +GFP_KERNEL);
> > + cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> > +GFP_KERNEL);

> kcalloc() implies __GFP_ZERO, do we need that here?

AFAIK, yes, because:
btrfs_cmp_data_free():
...
pg = cmp->src_pages[i];
if (pg) {...}
..

And we will hit that path if an error happens in gather_extent_pages().

Thanks.
> >   if (!cmp.src_pages || !cmp.dst_pages) {
> > - kfree(cmp.src_pages);
> > - kfree(cmp.dst_pages);
> > - return -ENOMEM;
> > + ret = -ENOMEM;
> > + goto out_free;
> >   }
> >
> >   if (same_inode)
> > @@ -3211,8 +3212,9 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
> >   else
> >   btrfs_double_inode_unlock(src, dst);
> >
> > - kfree(cmp.src_pages);
> > - kfree(cmp.dst_pages);
> > +out_free:
> > + kvfree(cmp.src_pages);
> > + kvfree(cmp.dst_pages);
> >
> >   return ret;
> >  }
> > --
> > 2.16.2
> >


-- 
Have a nice day,
Timofey.
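
A user-space sketch of the point about zeroing, for illustration only. `gather` and `release` are hypothetical stand-ins for gather_extent_pages() and btrfs_cmp_data_free(), and malloc()/free() stand in for taking and dropping page references. The assumption being demonstrated is the one from the reply above: the cleanup loop tests each slot, so slots that were never filled must read as NULL.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Fill slots in order and fail once 'ok' slots have been filled,
 * imitating an error partway through gathering pages.
 */
static int gather(void **pages, size_t n, size_t ok)
{
	for (size_t i = 0; i < n; i++) {
		if (i == ok)
			return -1;
		pages[i] = malloc(1);
		if (!pages[i])
			return -1;
	}
	return 0;
}

/* Release only the populated slots; returns how many were released. */
static size_t release(void **pages, size_t n)
{
	size_t freed = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i]) {
			free(pages[i]);
			pages[i] = NULL;
			freed++;
		}
	}
	return freed;
}
```

With an array obtained from calloc(), a failure after three of eight slots leaves five NULL entries and release() touches exactly three. With an uninitialized array, the `if (pages[i])` test would read garbage for the unfilled slots, which is why the replacement allocation has to keep the zeroing behaviour that kcalloc() provided.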


Re: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

2018-05-11 Thread Omar Sandoval
On Fri, May 11, 2018 at 06:49:16PM +0200, David Sterba wrote:
> On Fri, May 11, 2018 at 05:25:50PM +0100, Filipe Manana wrote:
> > On Fri, May 11, 2018 at 4:57 PM, David Sterba  wrote:
> > > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> > > arrays can be 32KiB large. To avoid allocation failures due to
> > > fragmented memory, use the allocation with fallback to vmalloc.
> > >
> > > Signed-off-by: David Sterba 
> > > ---
> > >
> > > This depends on the patches that remove the 16MiB restriction in the
> > > dedupe ioctl, but contextually can be applied to the current code too.
> > >
> > > https://patchwork.kernel.org/patch/10374941/
> > >
> > >  fs/btrfs/ioctl.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > > index b572e38b4b64..a7f517009cd7 100644
> > > --- a/fs/btrfs/ioctl.c
> > > +++ b/fs/btrfs/ioctl.c
> > > @@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode *src, u64 
> > > loff, u64 olen,
> > >  * locking. We use an array for the page pointers. Size of the 
> > > array is
> > >  * bounded by len, which is in turn bounded by 
> > > BTRFS_MAX_DEDUPE_LEN.
> > >  */
> > > -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), 
> > > GFP_KERNEL);
> > > -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), 
> > > GFP_KERNEL);
> > > +   cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *), 
> > > GFP_KERNEL);
> > > +   cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *), 
> > > GFP_KERNEL);
> > 
> > Kvzalloc should take 2 parameters and not 3.
> 
> And the right function is kvmalloc_array.
> 
> > Also, aren't the corresponding kvfree() calls missing?
> 
> Yes, thanks for catching it. The updated version:
> 
> From: David Sterba 
> Subject: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data
> 
> The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> arrays can be 32KiB large. To avoid allocation failures due to
> fragmented memory, use the allocation with fallback to vmalloc.
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ioctl.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index b572e38b4b64..4fcfa05ed960 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3178,12 +3178,13 @@ static int btrfs_extent_same(struct inode *src, u64 
> loff, u64 olen,
>* locking. We use an array for the page pointers. Size of the array is
>* bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
>*/
> - cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> - cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> + cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> +GFP_KERNEL);
> + cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> +GFP_KERNEL);

kcalloc() implies __GFP_ZERO, do we need that here?

>   if (!cmp.src_pages || !cmp.dst_pages) {
> - kfree(cmp.src_pages);
> - kfree(cmp.dst_pages);
> - return -ENOMEM;
> + ret = -ENOMEM;
> + goto out_free;
>   }
>  
>   if (same_inode)
> @@ -3211,8 +3212,9 @@ static int btrfs_extent_same(struct inode *src, u64 
> loff, u64 olen,
>   else
>   btrfs_double_inode_unlock(src, dst);
>  
> - kfree(cmp.src_pages);
> - kfree(cmp.dst_pages);
> +out_free:
> + kvfree(cmp.src_pages);
> + kvfree(cmp.dst_pages);
>  
>   return ret;
>  }
> -- 
> 2.16.2
> 


Re: [PATCH 0/3] Btrfs: stop abusing current->journal_info for direct I/O

2018-05-11 Thread Omar Sandoval
On Fri, May 11, 2018 at 12:53:36PM +0300, Nikolay Borisov wrote:
> 
> 
> On 11.05.2018 09:30, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Hi, everyone,
> > 
> > Btrfs currently abuses current->journal_info in btrfs_direct_IO() in
> > order to pass around some state to get_block() and submit_io(). This
> > hack is ugly and unnecessary because the data we pass around is only
> > used in one call frame. Robbie Ko also pointed out [1] that it could
> > potentially cause a crash if we happen to end up in start_transaction()
> > (e.g., from memory reclaim calling into btrfs_evict_inode(), which can
> > start a transaction). I'm not convinced that Robbie's case can happen in
> > practice since we are using GFP_NOFS for allocations during direct I/O,
> > but either way it's fragile and nasty.
> 
> When I worked initially on btrfs-over-swap I managed to hit a case where
> ext4 stacked on top of btrfs would crash since btrfs will overwrite
> journal_info which was set by ext4. So this change is indeed welcome :)

Yup, that's what I originally made these patches for. Although my latest
idea for swap is to do something along the lines of Darrick's
iomap_swap_activate(): https://patchwork.kernel.org/patch/10376435/,
I'll be getting back to that soon.


Re: [PATCH v3 01/11] Btrfs: remove stale comment referencing vmtruncate()

2018-05-11 Thread Omar Sandoval
On Fri, May 11, 2018 at 12:19:43PM +0200, David Sterba wrote:
> On Fri, May 11, 2018 at 12:56:06AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Commit a41ad394a03b ("Btrfs: convert to the new truncate sequence")
> > changed vmtruncate() to truncate_setsize() but didn't update the comment
> > above it. truncate_setsize() never fails (the IS_SWAPFILE() check
> > happens elsewhere), so remove the comment.
> 
> There's one more mention of vmtruncate at btrfs_page_mkwrite, can you
> please remove it and review that the comment is not stale in other
> respects? Thanks.

Yup, I'll take a look at that.


Re: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

2018-05-11 Thread Filipe Manana
On Fri, May 11, 2018 at 5:49 PM, David Sterba  wrote:
> On Fri, May 11, 2018 at 05:25:50PM +0100, Filipe Manana wrote:
>> On Fri, May 11, 2018 at 4:57 PM, David Sterba  wrote:
>> > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
>> > arrays can be 32KiB large. To avoid allocation failures due to
>> > fragmented memory, use the allocation with fallback to vmalloc.
>> >
>> > Signed-off-by: David Sterba 
>> > ---
>> >
>> > This depends on the patches that remove the 16MiB restriction in the
>> > dedupe ioctl, but contextually can be applied to the current code too.
>> >
>> > https://patchwork.kernel.org/patch/10374941/
>> >
>> >  fs/btrfs/ioctl.c | 4 ++--
>> >  1 file changed, 2 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> > index b572e38b4b64..a7f517009cd7 100644
>> > --- a/fs/btrfs/ioctl.c
>> > +++ b/fs/btrfs/ioctl.c
>> > @@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode *src, u64 
>> > loff, u64 olen,
>> >  * locking. We use an array for the page pointers. Size of the 
>> > array is
>> >  * bounded by len, which is in turn bounded by 
>> > BTRFS_MAX_DEDUPE_LEN.
>> >  */
>> > -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), 
>> > GFP_KERNEL);
>> > -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), 
>> > GFP_KERNEL);
>> > +   cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *), 
>> > GFP_KERNEL);
>> > +   cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *), 
>> > GFP_KERNEL);
>>
>> Kvzalloc should take 2 parameters and not 3.
>
> And the right function is kvmalloc_array.
>
>> Also, aren't the corresponding kvfree() calls missing?
>
> Yes, thanks for catching it. The updated version:
>
> From: David Sterba 
> Subject: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data
>
> The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> arrays can be 32KiB large. To avoid allocation failures due to
> fragmented memory, use the allocation with fallback to vmalloc.
>
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/ioctl.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index b572e38b4b64..4fcfa05ed960 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3178,12 +3178,13 @@ static int btrfs_extent_same(struct inode *src, u64 
> loff, u64 olen,
>  * locking. We use an array for the page pointers. Size of the array 
> is
>  * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
>  */
> -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> +   cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> +  GFP_KERNEL);
> +   cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *),
> +  GFP_KERNEL);
> if (!cmp.src_pages || !cmp.dst_pages) {
> -   kfree(cmp.src_pages);
> -   kfree(cmp.dst_pages);
> -   return -ENOMEM;
> +   ret = -ENOMEM;
> +   goto out_free;
> }
>
> if (same_inode)
> @@ -3211,8 +3212,9 @@ static int btrfs_extent_same(struct inode *src, u64 
> loff, u64 olen,
> else
> btrfs_double_inode_unlock(src, dst);
>
> -   kfree(cmp.src_pages);
> -   kfree(cmp.dst_pages);
> +out_free:
> +   kvfree(cmp.src_pages);
> +   kvfree(cmp.dst_pages);

kvfree() missing at btrfs_cmp_data_free() too.

>
> return ret;
>  }
> --
> 2.16.2
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v3 05/11] Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM

2018-05-11 Thread Omar Sandoval
On Fri, May 11, 2018 at 06:51:30PM +0200, David Sterba wrote:
> On Fri, May 11, 2018 at 12:10:38PM -0400, Josef Bacik wrote:
> > I told him to do this, these flags aren't exposed anywhere are they?
> > They are in-kernel specific stuff, please tell me we aren't exposing
> > these via sysfs?
> 
> No worries, they're completely internal, just that shifting the number
> sequence does not need to be in this patch and should be made
> separately.

Sure, I can split out the renumbering into a separate patch.


Re: [PATCH] btrfs: qgroup: Search commit root for rescan to avoid missing extent

2018-05-11 Thread Jeff Mahoney
On 5/3/18 3:20 AM, Qu Wenruo wrote:
> When doing qgroup rescan using the following script (modified from
> btrfs/017 test case), we can sometimes hit qgroup corruption.
> 
> --
> umount $dev &> /dev/null
> umount $mnt &> /dev/null
> 
> mkfs.btrfs -f -n 64k $dev
> mount $dev $mnt
> 
> extent_size=8192
> 
> xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
> btrfs subvolume snapshot $mnt $mnt/snap
> 
> xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
> xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
> xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/null
> btrfs quota enable $mnt
> 
>  # -W is the new option to only wait for a running rescan without starting a new one
> btrfs quota rescan -W $mnt
> btrfs qgroup show -prce $mnt
> 
>  # Need to patch btrfs-progs to report qgroup mismatch as error
> btrfs check $dev || _fail
> --
> 
> On a fast machine, we can hit corruption where tree blocks are missed
> in the accounting:
> --
> qgroupid         rfer         excl     max_rfer     max_excl parent  child
> --------         ----         ----     --------     -------- ------  -----
> 0/5           8.00KiB        0.00B         none         none ---     ---
> 0/257         8.00KiB        0.00B         none         none ---     ---
> --
> 
> This is due to the fact that we're always searching commit root for
> btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
> from current transaction, not commit root.
> 
> And if our tree blocks get modified in current transaction, we won't
> find any owner in commit root, thus causing the corruption.
> 
> Fix it by searching commit root for extent tree for
> qgroup_rescan_leaf().
> 
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 
> ---
> 
> Please keep in mind that it is possible to hit another type of race
> that double-accounts tree blocks:
> --
> qgroupid         rfer         excl     max_rfer     max_excl parent  child
> --------         ----         ----     --------     -------- ------  -----
> 0/5         136.00KiB    128.00KiB         none         none ---     ---
> 0/257       136.00KiB    128.00KiB         none         none ---     ---
> --
> For this type of corruption, this patch could reduce the possibility,
> but the root cause is race between transaction commit and qgroup rescan,
> which needs to be addressed in another patch.
> ---
>  fs/btrfs/qgroup.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 4baa4ba2d630..829e8fe5c97e 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2681,6 +2681,11 @@ static void btrfs_qgroup_rescan_worker(struct 
> btrfs_work *work)
>   path = btrfs_alloc_path();
>   if (!path)
>   goto out;
> + /*
> +  * Rescan should only search for commit root, and any later difference
> +  * should be recorded by qgroup
> +  */
> + path->search_commit_root = 1;
>  
>   err = 0;
>   while (!err && !btrfs_fs_closing(fs_info)) {
> 

If we're searching the commit root here, do we need the tree mod
sequence number dance in qgroup_rescan_leaf anymore?

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH v3 05/11] Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 12:10:38PM -0400, Josef Bacik wrote:
> I told him to do this, these flags aren't exposed anywhere are they?
> They are in-kernel specific stuff, please tell me we aren't exposing
> these via sysfs?

No worries, they're completely internal, just that shifting the number
sequence does not need to be in this patch and should be made
separately.
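
As a rough illustration of why renumbering is harmless for purely internal flags, here is a user-space sketch. The enum values mirror the new numbering quoted in the patch, but `struct fake_inode` and the two helpers are hypothetical; they only imitate the way the kernel uses these values as bit indexes into an in-memory flags word. The assumption is the one stated above: the numbers never leave the kernel and are never written to disk.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit indexes, numbered as in the patch after the removal. */
enum {
	INODE_ORPHAN_META_RESERVED = 1,
	INODE_DUMMY                = 2,
	INODE_IN_DEFRAG            = 3,
	INODE_HAS_ASYNC_EXTENT     = 4, /* was 5 before the removal */
	INODE_NEEDS_FULL_SYNC      = 5, /* was 6 */
};

/* Hypothetical in-memory inode: flags live only in RAM. */
struct fake_inode {
	unsigned long runtime_flags;
};

static void inode_set_flag(struct fake_inode *inode, int bit)
{
	inode->runtime_flags |= 1UL << bit;
}

static bool inode_test_flag(const struct fake_inode *inode, int bit)
{
	return (inode->runtime_flags >> bit) & 1UL;
}
```

Every user refers to a flag by name, so shifting the numeric values changes nothing observable as long as all code is rebuilt together. That is also why the renumbering can go into its own patch without affecting the functional change.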


Re: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 05:25:50PM +0100, Filipe Manana wrote:
> On Fri, May 11, 2018 at 4:57 PM, David Sterba  wrote:
> > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> > arrays can be 32KiB large. To avoid allocation failures due to
> > fragmented memory, use the allocation with fallback to vmalloc.
> >
> > Signed-off-by: David Sterba 
> > ---
> >
> > This depends on the patches that remove the 16MiB restriction in the
> > dedupe ioctl, but contextually can be applied to the current code too.
> >
> > https://patchwork.kernel.org/patch/10374941/
> >
> >  fs/btrfs/ioctl.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index b572e38b4b64..a7f517009cd7 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode *src, u64 
> > loff, u64 olen,
> >  * locking. We use an array for the page pointers. Size of the 
> > array is
> >  * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
> >  */
> > -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), 
> > GFP_KERNEL);
> > -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), 
> > GFP_KERNEL);
> > +   cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *), 
> > GFP_KERNEL);
> > +   cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *), 
> > GFP_KERNEL);
> 
> Kvzalloc should take 2 parameters and not 3.

And the right function is kvmalloc_array.

> Also, aren't the corresponding kvfree() calls missing?

Yes, thanks for catching it. The updated version:

From: David Sterba 
Subject: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
arrays can be 32KiB large. To avoid allocation failures due to
fragmented memory, use the allocation with fallback to vmalloc.

Signed-off-by: David Sterba 
---
 fs/btrfs/ioctl.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b572e38b4b64..4fcfa05ed960 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3178,12 +3178,13 @@ static int btrfs_extent_same(struct inode *src, u64 
loff, u64 olen,
 * locking. We use an array for the page pointers. Size of the array is
 * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
 */
-   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
+   cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *),
+  GFP_KERNEL);
+   cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *),
+  GFP_KERNEL);
if (!cmp.src_pages || !cmp.dst_pages) {
-   kfree(cmp.src_pages);
-   kfree(cmp.dst_pages);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out_free;
}
 
if (same_inode)
@@ -3211,8 +3212,9 @@ static int btrfs_extent_same(struct inode *src, u64 loff, 
u64 olen,
else
btrfs_double_inode_unlock(src, dst);
 
-   kfree(cmp.src_pages);
-   kfree(cmp.dst_pages);
+out_free:
+   kvfree(cmp.src_pages);
+   kvfree(cmp.dst_pages);
 
return ret;
 }
-- 
2.16.2
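
The 32KiB figure in the changelog follows from simple arithmetic, sketched here as a small helper. `page_array_bytes` is a hypothetical name used only for this example; the inputs are the values quoted in the changelog (16 MiB range, 4KiB pages, 8 byte pointers).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Size in bytes of an array holding one pointer per page of a range.
 * The page count is rounded up so a partial trailing page is counted.
 */
static size_t page_array_bytes(size_t range_bytes, size_t page_size,
			       size_t ptr_size)
{
	size_t num_pages = (range_bytes + page_size - 1) / page_size;

	return num_pages * ptr_size;
}
```

For a 16 MiB range this gives 4096 pages times 8 bytes, i.e. 32768 bytes per array, and the function allocates two such arrays (source and destination). A 32KiB kmalloc needs eight physically contiguous 4KiB pages, which is the kind of request that can fail on fragmented memory and that the vmalloc fallback avoids.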



Re: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

2018-05-11 Thread Filipe Manana
On Fri, May 11, 2018 at 4:57 PM, David Sterba  wrote:
> The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
> arrays can be 32KiB large. To avoid allocation failures due to
> fragmented memory, use the allocation with fallback to vmalloc.
>
> Signed-off-by: David Sterba 
> ---
>
> This depends on the patches that remove the 16MiB restriction in the
> dedupe ioctl, but contextually can be applied to the current code too.
>
> https://patchwork.kernel.org/patch/10374941/
>
>  fs/btrfs/ioctl.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index b572e38b4b64..a7f517009cd7 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>  * locking. We use an array for the page pointers. Size of the array is
>  * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
>  */
> -   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> -   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> +   cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
> +   cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL);

kvzalloc() takes 2 parameters, not 3.
Also, aren't the corresponding kvfree() calls missing?
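
For reference, a minimal userspace sketch of the allocate/cleanup pairing being asked about. This is not the kernel code: calloc() and free() stand in for an array allocator and its matching free routine (in the kernel, kvmalloc_array() and kvfree()), and ex_cmp_pages and ex_prepare_pages are hypothetical names used only for illustration. Both the error path and the normal path leave through one label, so the free routine always matches the allocator.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two page-pointer arrays. */
struct ex_cmp_pages {
	void **src_pages;
	void **dst_pages;
};

/*
 * Allocate both arrays, do the work, and release them through a single
 * exit label. calloc() checks num_pages * size for overflow, which is
 * the property an array allocator provides over a plain size argument.
 */
static int ex_prepare_pages(size_t num_pages)
{
	struct ex_cmp_pages cmp = { NULL, NULL };
	int ret = 0;

	cmp.src_pages = calloc(num_pages, sizeof(void *));
	cmp.dst_pages = calloc(num_pages, sizeof(void *));
	if (!cmp.src_pages || !cmp.dst_pages) {
		ret = -ENOMEM;
		goto out_free;
	}

	/* ... the comparison work would go here ... */

out_free:
	/* free(NULL) is a no-op, so a partial allocation is handled too. */
	free(cmp.src_pages);
	free(cmp.dst_pages);
	return ret;
}
```

With this shape, switching the allocator only requires touching the two allocation lines and the two lines under the label.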

> if (!cmp.src_pages || !cmp.dst_pages) {
> kfree(cmp.src_pages);
> kfree(cmp.dst_pages);
> --
> 2.16.2
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v3 05/11] Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM

2018-05-11 Thread Josef Bacik
I told him to do this, these flags aren't exposed anywhere are they?
They are in-kernel specific stuff, please tell me we aren't exposing
these via sysfs?

Josef

On Fri, May 11, 2018 at 6:06 AM, David Sterba  wrote:
> On Fri, May 11, 2018 at 12:56:10AM -0700, Omar Sandoval wrote:
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -23,13 +23,12 @@
>>  #define BTRFS_INODE_ORPHAN_META_RESERVED 1
>>  #define BTRFS_INODE_DUMMY                2
>>  #define BTRFS_INODE_IN_DEFRAG            3
>> -#define BTRFS_INODE_HAS_ORPHAN_ITEM      4
>> -#define BTRFS_INODE_HAS_ASYNC_EXTENT     5
>> -#define BTRFS_INODE_NEEDS_FULL_SYNC      6
>> -#define BTRFS_INODE_COPY_EVERYTHING      7
>> -#define BTRFS_INODE_IN_DELALLOC_LIST     8
>> -#define BTRFS_INODE_READDIO_NEED_LOCK    9
>> -#define BTRFS_INODE_HAS_PROPS            10
>> +#define BTRFS_INODE_HAS_ASYNC_EXTENT     4
>> +#define BTRFS_INODE_NEEDS_FULL_SYNC      5
>> +#define BTRFS_INODE_COPY_EVERYTHING      6
>> +#define BTRFS_INODE_IN_DELALLOC_LIST     7
>> +#define BTRFS_INODE_READDIO_NEED_LOCK    8
>> +#define BTRFS_INODE_HAS_PROPS            9
>
> Please keep such changes minimal and only relevant to the purpose of the
> patch, in this case just remove BTRFS_INODE_HAS_ORPHAN_ITEM.
>
> There will be a hole left in the sequence but this is not a problem and
> we're going to convert the defines to enums. The defines are prone to
> error if the numbers get accidentally duplicated, like it happened not so
> long ago with the fsinfo::fs_flags.
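
A small sketch of why an enum avoids the duplicated-number problem mentioned above: the compiler assigns consecutive values, so removing or adding an entry cannot leave two flags sharing a bit. The EX_ names and the two helpers are hypothetical and only mirror the shape of the defines quoted in this thread; they are not the kernel's definitions.

```c
/* Hypothetical flag names, for illustration only. */
enum ex_inode_flag {
	EX_INODE_ORPHAN_META_RESERVED = 1,
	EX_INODE_DUMMY,			/* 2, assigned by the compiler */
	EX_INODE_IN_DEFRAG,		/* 3 */
	EX_INODE_HAS_ASYNC_EXTENT,	/* 4 */
	EX_INODE_NEEDS_FULL_SYNC,	/* 5 */
	EX_INODE_NR_FLAGS		/* one past the last flag */
};

/* Set one flag bit in a flags word and return the new word. */
static unsigned long ex_set_flag(unsigned long flags, enum ex_inode_flag bit)
{
	return flags | (1UL << bit);
}

/* Return 1 if the flag bit is set, 0 otherwise. */
static int ex_test_flag(unsigned long flags, enum ex_inode_flag bit)
{
	return (int)((flags >> bit) & 1UL);
}
```

Dropping an entry from the middle of such an enum renumbers the later ones automatically, which is only safe because these values are in-memory bit numbers and not part of any on-disk or user-visible format.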


[PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data

2018-05-11 Thread David Sterba
The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the
arrays can be 32KiB large. To avoid allocation failures due to
fragmented memory, use the allocation with fallback to vmalloc.

Signed-off-by: David Sterba 
---

This depends on the patches that remove the 16MiB restriction in the
dedupe ioctl, but contextually can be applied to the current code too.

https://patchwork.kernel.org/patch/10374941/

 fs/btrfs/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b572e38b4b64..a7f517009cd7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 * locking. We use an array for the page pointers. Size of the array is
 * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN.
 */
-   cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
-   cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
+   cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
+   cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL);
if (!cmp.src_pages || !cmp.dst_pages) {
kfree(cmp.src_pages);
kfree(cmp.dst_pages);
-- 
2.16.2



Re: Strange behavior (possible bugs) in btrfs

2018-05-11 Thread Filipe Manana
On Mon, Apr 30, 2018 at 5:04 PM, Vijay Chidambaram  wrote:
> Hi,
>
> We found two more cases where the btrfs behavior is a little strange.
> In one case, an fsync-ed file goes missing after a crash. In the
> other, a renamed file shows up in both directories after a crash.
>
> Workload 1:
>
> mkdir A
> mkdir B
> mkdir A/C
> creat B/foo
> fsync B/foo
> link B/foo A/C/foo
> fsync A
> -- crash --
>
> Expected state after recovery:
> B B/foo A A/C exist
>
> What we find:
> Only B B/foo exist
>
> A is lost even after explicit fsync to A.
>
> Workload 2:
>
> mkdir A
> mkdir A/C
> rename A/C B
> touch B/bar
> fsync B/bar
> rename B/bar A/bar
> rename A B (replacing B with A at this point)
> fsync B/bar
> -- crash --
>
> Expected contents after recovery:
> A/bar
>
> What we find after recovery:
> A/bar
> B/bar
>
> We think this breaks rename's atomicity guarantee. bar should be
> present in either A or B, but now it is present in both.

I'll take a look at these, and all the other potential issues you
reported in other threads, next week and let you know.
Thanks.

>
> Thanks,
> Vijay



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[PATCH] fstests: generic test for fsync of file with xattrs

2018-05-11 Thread fdmanana
From: Filipe Manana 

Test that xattrs are not lost after calling fsync multiple times with a
filesystem commit in between the fsync calls.

This test is motivated by a bug found in btrfs which is fixed by a patch
for the linux kernel titled:

  Btrfs: fix xattr loss after power failure

Signed-off-by: Filipe Manana 
---
 tests/generic/487 | 86 +++
 tests/generic/487.out | 11 +++
 tests/generic/group   |  1 +
 3 files changed, 98 insertions(+)
 create mode 100755 tests/generic/487
 create mode 100644 tests/generic/487.out

diff --git a/tests/generic/487 b/tests/generic/487
new file mode 100755
index ..328b5378
--- /dev/null
+++ b/tests/generic/487
@@ -0,0 +1,86 @@
+#! /bin/bash
+# FSQA Test No. 487
+#
+# Test that xattrs are not lost after calling fsync multiple times with a
+# filesystem commit in between the fsync calls.
+#
+#---
+#
+# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+. ./common/attr
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+_require_attrs
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+_mount_flakey
+
+touch $SCRATCH_MNT/foobar
+$SETFATTR_PROG -n user.xa1 -v qwerty $SCRATCH_MNT/foobar
+$SETFATTR_PROG -n user.xa2 -v 'hello world' $SCRATCH_MNT/foobar
+$SETFATTR_PROG -n user.xa3 -v test $SCRATCH_MNT/foobar
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
+
+# Call sync to commit all filesystem metadata.
+sync
+
+$XFS_IO_PROG -c "pwrite -S 0xea 0 64K" \
+-c "fsync" \
+$SCRATCH_MNT/foobar >>$seqres.full
+
+# Simulate a power failure and mount the filesystem to check that the xattrs
+# were not lost and neither was the data we wrote.
+_flakey_drop_and_remount
+echo "File xattrs after power failure:"
+$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foobar | _filter_scratch
+echo "File data after power failure:"
+od -t x1 $SCRATCH_MNT/foobar
+
+_unmount_flakey
+_cleanup_flakey
+
+status=0
+exit
diff --git a/tests/generic/487.out b/tests/generic/487.out
new file mode 100644
index ..44a119f8
--- /dev/null
+++ b/tests/generic/487.out
@@ -0,0 +1,11 @@
+QA output created by 487
+File xattrs after power failure:
+# file: SCRATCH_MNT/foobar
+user.xa1="qwerty"
+user.xa2="hello world"
+user.xa3="test"
+
+File data after power failure:
+0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea
+*
+0200000
diff --git a/tests/generic/group b/tests/generic/group
index 505383f7..c8f51ec2 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -489,3 +489,4 @@
 484 auto quick
 485 auto quick insert
 486 auto quick attr
+487 auto quick attr
-- 
2.11.0



[PATCH] Btrfs: fix xattr loss after power failure

2018-05-11 Thread fdmanana
From: Filipe Manana 

If a file has xattrs and we fsync it (which ensures the flags
BTRFS_INODE_NEEDS_FULL_SYNC and BTRFS_INODE_COPY_EVERYTHING are cleared
from its inode), then the current transaction commits and we fsync it
again (without either of those bits being set in its inode), we end up
not logging all its xattrs. This results in deleting all xattrs when
replaying the log after a power failure.

Trivial reproducer:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/foobar
  $ setfattr -n user.xa -v qwerty /mnt/foobar
  $ xfs_io -c "fsync" /mnt/foobar

  $ sync

  $ xfs_io -c "pwrite -S 0xab 0 64K" /mnt/foobar
  $ xfs_io -c "fsync" /mnt/foobar
  <power failure>

  $ mount /dev/sdb /mnt
  $ getfattr --absolute-names --dump /mnt/foobar
  <empty output>
  $

So fix this by making sure all xattrs are logged if we log a file's inode
item and neither the flags BTRFS_INODE_NEEDS_FULL_SYNC nor
BTRFS_INODE_COPY_EVERYTHING were set in the inode.

Fixes: 36283bf777d9 ("Btrfs: fix fsync xattr loss in the fast fsync path")
Cc: 
Signed-off-by: Filipe Manana 
---
 fs/btrfs/tree-log.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 25b888df00c9..d656de8bec52 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4916,6 +4916,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
struct extent_map_tree *em_tree = &inode->extent_tree;
u64 logged_isize = 0;
bool need_log_inode_item = true;
+   bool xattrs_logged = false;
 
path = btrfs_alloc_path();
if (!path)
@@ -5217,6 +5218,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path);
if (err)
goto out_unlock;
+   xattrs_logged = true;
if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) {
btrfs_release_path(path);
btrfs_release_path(dst_path);
@@ -5229,6 +5231,11 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
btrfs_release_path(dst_path);
if (need_log_inode_item) {
err = log_inode_item(trans, log, dst_path, inode);
+   if (!err && !xattrs_logged) {
+   err = btrfs_log_all_xattrs(trans, root, inode, path,
+  dst_path);
+   btrfs_release_path(path);
+   }
if (err)
goto out_unlock;
}
-- 
2.11.0



Any chance to get snapshot-aware defragmentation?

2018-05-11 Thread Niccolò Belli

Hi,
I have been waiting for this feature for years, and initially it seemed like
something that would be worked on sooner or later.
A long time has passed without any progress on this, so I would like to
know if there is any technical limitation preventing it, or if it's
something which could possibly land in the near future.


Thanks,
Niccolò


Re: [PATCH V3 0/3] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction

2018-05-11 Thread David Sterba
On Wed, May 02, 2018 at 08:15:35AM +0300, Timofey Titovets wrote:
> Currently btrfs_dedupe_file_range() is restricted to a 16MiB range to
> limit locking time and memory requirements for the dedupe ioctl().
> 
> For an input range that is too big, the code silently clamps it to 16MiB.
> 
> Let's remove that restriction by iterating over the dedupe range.
> That's backward compatible and will not change anything for requests
> smaller than 16MiB.
> 
> Changes:
>   v1 -> v2:
> - Refactor btrfs_cmp_data_prepare and btrfs_extent_same
> - Store memory of pages array between iterations
> - Lock inodes once, not on each iteration
> - Small inplace cleanups
>   v2 -> v3:
> - Split to several patches
> 
> Timofey Titovets (3):
>   Btrfs: split btrfs_extent_same() for simplification
>   Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
>   Btrfs: btrfs_extent_same() reuse cmp workspace

Looks good to me, thanks. I'll edit the changelogs a bit and add the
patches to the 4.18 queue.

In the original code there's a kcalloc for the array holding the page
pointers. This can grow up to 32KiB if the full 16MiB range is used, so
I'll add a patch that'll use kvmalloc (the vmalloc fallback) in case
there's no 32KiB of contiguous memory.

IIRC the 16M limit is mentioned in manual pages, so this would need to
be fixed and documented how this is going to behave.
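
The 32KiB figure follows directly from the numbers given in this thread (16MiB range, 4KiB pages, 8-byte pointers). A short worked sketch of that arithmetic, with hypothetical helper names and the page and pointer sizes passed in as assumptions instead of taken from any kernel header:

```c
#include <stdint.h>

/* Number of pages needed to cover len bytes, rounding up. */
static uint64_t ex_num_pages(uint64_t len, uint64_t page_size)
{
	return (len + page_size - 1) / page_size;
}

/* Size in bytes of one array holding a pointer per page. */
static uint64_t ex_page_array_bytes(uint64_t len, uint64_t page_size,
				    uint64_t ptr_size)
{
	return ex_num_pages(len, page_size) * ptr_size;
}
```

For len = 16MiB, page_size = 4096 and ptr_size = 8 this gives 4096 pages and 32768 bytes per array, and the code allocates two such arrays (source and destination).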


Re: [PATCH] btrfs: qgroup: Finish rescan when hit the last leaf of extent tree

2018-05-11 Thread Jeff Mahoney
On 5/4/18 1:56 AM, Qu Wenruo wrote:
> Under the following case, qgroup rescan can double account cowed tree
> blocks:
> 
> In this case, extent tree only has one tree block.
> 
> -
> | transid=5 last committed=4
> | btrfs_qgroup_rescan_worker()
> | |- btrfs_start_transaction()
> | |  transid = 5
> | |- qgroup_rescan_leaf()
> ||- btrfs_search_slot_for_read() on extent tree
> |   Get the only extent tree block from commit root (transid = 4).
> |   Scan it, set qgroup_rescan_progress to the last
> |   EXTENT/META_ITEM + 1
> |   now qgroup_rescan_progress = A + 1.
> |
> | fs tree get CoWed, new tree block is at A + 16K
> | transid 5 get committed
> -
> | transid=6 last committed=5
> | btrfs_qgroup_rescan_worker()
> | |- btrfs_start_transaction()
> | |  transid = 6
> | |- qgroup_rescan_leaf()
> ||- btrfs_search_slot_for_read() on extent tree
> |   Get the only extent tree block from commit root (transid = 5).
> |   scan it using qgroup_rescan_progress (A + 1).
> |   found new tree block beyond A, and it's fs tree block,
> |   account it to increase qgroup numbers.
> -
> 
> In the above case, tree block A and tree block A + 16K get accounted twice,
> while qgroup rescan should stop once it has reached the last leaf,
> rather than continuing with its qgroup_rescan_progress.
> 
> Such a case can happen by just looping btrfs/017; with some
> probability it hits this double qgroup accounting problem.
> 
> Fix it by checking the path to determine if we should finish qgroup
> rescan, rather than relying on the next loop to exit.
> 
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/qgroup.c | 48 +--
>  1 file changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 829e8fe5c97e..2ee2d21d43ab 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2579,6 +2579,21 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
>   spin_unlock(&fs_info->qgroup_lock);
>  }
>  
> +/*
> + * Check if the leaf is the last leaf. Which means all node pointers
> + * are at their last position.
> + */
> +static bool is_last_leaf(struct btrfs_path *path)
> +{
> + int i;
> +
> + for (i = 1; i < BTRFS_MAX_LEVEL && path->nodes[i]; i++) {
> + if (path->slots[i] != btrfs_header_nritems(path->nodes[i]) - 1)
> + return false;
> + }
> + return true;
> +}
> +
>  /*
>   * returns < 0 on error, 0 when more leafs are to be scanned.
>   * returns 1 when done.
> @@ -2592,6 +2607,7 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
>   struct ulist *roots = NULL;
>   struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
>   u64 num_bytes;
> + bool done;
>   int slot;
>   int ret;
>  
> @@ -2606,20 +2622,9 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
>   fs_info->qgroup_rescan_progress.type,
>   fs_info->qgroup_rescan_progress.offset, ret);
>  
> - if (ret) {
> - /*
> -  * The rescan is about to end, we will not be scanning any
> -  * further blocks. We cannot unset the RESCAN flag here, because
> -  * we want to commit the transaction if everything went well.
> -  * To make the live accounting work in this phase, we set our
> -  * scan progress pointer such that every real extent objectid
> -  * will be smaller.
> -  */
> - fs_info->qgroup_rescan_progress.objectid = (u64)-1;
> - btrfs_release_path(path);
> - mutex_unlock(&fs_info->qgroup_rescan_lock);
> - return ret;
> - }
> + done = is_last_leaf(path);
> + if (ret)
> + goto finish;
>  
>   btrfs_item_key_to_cpu(path->nodes[0], &found,
> btrfs_header_nritems(path->nodes[0]) - 1);
> @@ -2665,8 +2670,23 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
>   free_extent_buffer(scratch_leaf);
>   }
>   btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem);
> + if (done && !ret)
> + goto finish;

This causes a double unlock.  The lock was released prior to iterating
the leaf.  Otherwise, looks good.

-Jeff
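
As a reading aid for the quoted is_last_leaf() helper, here is a simplified userspace sketch of the same check. The ex_node and ex_path types, EX_MAX_LEVEL, and the ex_two_level_is_last() helper are hypothetical stand-ins, not the btrfs structures; only the loop mirrors the quoted logic: a leaf is the last one when, at every level above it, the path sits in the last slot of its node.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_MAX_LEVEL 8	/* assumed maximum tree height for this sketch */

/* Hypothetical stand-in for a tree node: only the item count matters. */
struct ex_node {
	int nritems;
};

/* Hypothetical stand-in for a path: one node and slot per level. */
struct ex_path {
	const struct ex_node *nodes[EX_MAX_LEVEL];
	int slots[EX_MAX_LEVEL];
};

/* Level 0 is the leaf itself, so the walk starts at level 1. */
static bool ex_is_last_leaf(const struct ex_path *path)
{
	for (int i = 1; i < EX_MAX_LEVEL && path->nodes[i]; i++) {
		if (path->slots[i] != path->nodes[i]->nritems - 1)
			return false;
	}
	return true;
}

/* Build a path with two levels above the leaf and run the check. */
static bool ex_two_level_is_last(int slot1, int nritems1,
				 int slot2, int nritems2)
{
	struct ex_node n1 = { nritems1 };
	struct ex_node n2 = { nritems2 };
	struct ex_path p = {
		.nodes = { NULL, &n1, &n2 },
		.slots = { 0, slot1, slot2 },
	};

	return ex_is_last_leaf(&p);
}
```

A tree that consists of a single leaf has no nodes above level 0, so the loop body never runs and the check reports the last leaf, which matches the one-block extent tree in the scenario described in the patch.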

>  
>   return ret;
> +finish:
> + /*
> +  * The rescan is about to end, we will not be scanning any
> +  * further blocks. We cannot unset the RESCAN flag here, because
> +  * we want to commit the transaction if everything went well.
> +  * To make the live accounting work in this phase, we set our
> +  * scan progress pointer such that every real extent objectid
> +  * will be smaller.
> +  */
> + fs_info->qgroup_rescan_progress.objectid = (u64)-1;
> + btrfs_release_path(path);
> + mutex_unlock(&fs_info->

Re: [PATCH 1/2 V2] hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 09:36:09AM -0500, Eric Sandeen wrote:
> On 5/11/18 9:32 AM, Chris Mason wrote:
> > On 11 May 2018, at 10:10, David Sterba wrote:
> > 
> >> On Thu, May 10, 2018 at 08:16:09PM +0100, Al Viro wrote:
> >>> On Thu, May 10, 2018 at 01:13:57PM -0500, Eric Sandeen wrote:
> >>>> Move the btrfs label ioctls up to the vfs for general use.
> >>>>
> >>>> This retains 256 chars as the maximum size through the interface, which
> >>>> is the btrfs limit and AFAIK exceeds any other filesystem's maximum
> >>>> label size.
> >>>>
> >>>> Signed-off-by: Eric Sandeen 
> >>>> Reviewed-by: Andreas Dilger 
> >>>> Reviewed-by: David Sterba 
> >>>
> >>> No objections (and it obviously ought to go through btrfs tree).
> >>
> >> I can take it through my tree, but Eric mentioned that there's a patch
> >> for xfs that depends on it. In this case it would make sense to take
> >> both patches at once via the xfs tree. There are no pending conflicting
> >> changes in btrfs.
> > 
> > Probably easiest to just have a separate pull dedicated just for this 
> > series.  That way it doesn't really matter which tree it goes through.
> 
> Actually, I just realized that the changes to include/uapi/linux/fs.h are
> completely independent of any btrfs changes, right - there's nothing wrong
> w/ redefining the common ioctl under a different name in btrfs.  So the
> fs.h patch could go first, through the xfs tree since it'll be using it.
> 
> Once the common ioctl definition goes in, then btrfs can change to define
> its ioctls to the common ioctls, or act on them directly as my patch did,
> etc.  Would that be a better plan?  IOWs there's no urgent need to
> coordinate a btrfs change.

Agreed, I like that plan.


Re: [PATCH 1/2 V2] hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs

2018-05-11 Thread Eric Sandeen


On 5/11/18 9:32 AM, Chris Mason wrote:
> On 11 May 2018, at 10:10, David Sterba wrote:
> 
>> On Thu, May 10, 2018 at 08:16:09PM +0100, Al Viro wrote:
>>> On Thu, May 10, 2018 at 01:13:57PM -0500, Eric Sandeen wrote:
>>>> Move the btrfs label ioctls up to the vfs for general use.
>>>>
>>>> This retains 256 chars as the maximum size through the interface, which
>>>> is the btrfs limit and AFAIK exceeds any other filesystem's maximum
>>>> label size.
>>>>
>>>> Signed-off-by: Eric Sandeen 
>>>> Reviewed-by: Andreas Dilger 
>>>> Reviewed-by: David Sterba 
>>>
>>> No objections (and it obviously ought to go through btrfs tree).
>>
>> I can take it through my tree, but Eric mentioned that there's a patch
>> for xfs that depends on it. In this case it would make sense to take
>> both patches at once via the xfs tree. There are no pending conflicting
>> changes in btrfs.
> 
> Probably easiest to just have a separate pull dedicated just for this
> series.  That way it doesn't really matter which tree it goes through.

Actually, I just realized that the changes to include/uapi/linux/fs.h are
completely independent of any btrfs changes, right - there's nothing wrong
w/ redefining the common ioctl under a different name in btrfs.  So the
fs.h patch could go first, through the xfs tree since it'll be using it.

Once the common ioctl definition goes in, then btrfs can change to define
its ioctls to the common ioctls, or act on them directly as my patch did,
etc.  Would that be a better plan?  IOWs there's no urgent need to
coordinate a btrfs change.

-Eric


Re: [PATCH 1/2 V2] hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs

2018-05-11 Thread Chris Mason

On 11 May 2018, at 10:10, David Sterba wrote:

> On Thu, May 10, 2018 at 08:16:09PM +0100, Al Viro wrote:
>> On Thu, May 10, 2018 at 01:13:57PM -0500, Eric Sandeen wrote:
>>> Move the btrfs label ioctls up to the vfs for general use.
>>>
>>> This retains 256 chars as the maximum size through the interface, which
>>> is the btrfs limit and AFAIK exceeds any other filesystem's maximum
>>> label size.
>>>
>>> Signed-off-by: Eric Sandeen 
>>> Reviewed-by: Andreas Dilger 
>>> Reviewed-by: David Sterba 
>>
>> No objections (and it obviously ought to go through btrfs tree).
>
> I can take it through my tree, but Eric mentioned that there's a patch
> for xfs that depends on it. In this case it would make sense to take
> both patches at once via the xfs tree. There are no pending conflicting
> changes in btrfs.

Probably easiest to just have a separate pull dedicated just for this
series.  That way it doesn't really matter which tree it goes through.


-chris


Re: [PATCH v3 07/11] Btrfs: don't return ino to ino cache if inode item removal fails

2018-05-11 Thread Josef Bacik
On Fri, May 11, 2018 at 12:56:12AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
> item will still be in the tree but we still return the ino to the ino
> cache. That will blow up later when someone tries to allocate that ino,
> so don't return it to the cache.
> 
> Fixes: 581bb050941b ("Btrfs: Cache free inode numbers in memory")
> Signed-off-by: Omar Sandoval 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH v3 06/11] Btrfs: delete dead code in btrfs_orphan_commit_root()

2018-05-11 Thread Josef Bacik
On Fri, May 11, 2018 at 12:56:11AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> btrfs_orphan_commit_root() tries to delete an orphan item for a
> subvolume in the tree root, but we don't actually insert that item in
> the first place. See commit 0a0d4415e338 ("Btrfs: delete dead code in
> btrfs_orphan_add()"). We can get rid of it.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH v3 04/11] Btrfs: stop creating orphan items for truncate

2018-05-11 Thread Josef Bacik
On Fri, May 11, 2018 at 12:56:09AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Currently, we insert an orphan item during a truncate so that if there's
> a crash, we don't leak extents past the on-disk i_size. However, since
> commit 7f4f6e0a3f6d ("Btrfs: only update disk_i_size as we remove
> extents"), we keep disk_i_size in sync with the extent items as we
> truncate, so orphan cleanup will never have any extents to remove. Don't
> bother with the superfluous orphan item.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH 1/2 V2] fs: hoist BTRFS_IOC_[SG]ET_FSLABEL to vfs

2018-05-11 Thread David Sterba
On Thu, May 10, 2018 at 08:16:09PM +0100, Al Viro wrote:
> On Thu, May 10, 2018 at 01:13:57PM -0500, Eric Sandeen wrote:
> > Move the btrfs label ioctls up to the vfs for general use.
> > 
> > This retains 256 chars as the maximum size through the interface, which
> > is the btrfs limit and AFAIK exceeds any other filesystem's maximum
> > label size.
> > 
> > Signed-off-by: Eric Sandeen 
> > Reviewed-by: Andreas Dilger 
> > Reviewed-by: David Sterba 
> 
> No objections (and it obviously ought to go through btrfs tree).

I can take it through my tree, but Eric mentioned that there's a patch
for xfs that depends on it. In this case it would make sense to take
both patches at once via the xfs tree. There are no pending conflicting
changes in btrfs.


Re: [RFC v2 4/4] btrfs: verify symlinks with append/immutable flags

2018-05-11 Thread David Sterba
On Thu, May 10, 2018 at 04:13:59PM -0700, Luis R. Rodriguez wrote:
> The Linux VFS does not allow a way to set append/immuttable
                                                       ^^

Typo, in all 3 patches.

> attributes to symlinks, this is just not possible. If this is
> detected inform the user as the filesystem must be corrupted.
> 
> Signed-off-by: Luis R. Rodriguez 
> ---
>  fs/btrfs/inode.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index c4bdb597b323..d9c786be408c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3933,6 +3933,15 @@ static int btrfs_read_locked_inode(struct inode *inode)
>   inode->i_op = &btrfs_dir_inode_operations;
>   break;
>   case S_IFLNK:
> + /* VFS does not allow setting these so must be corruption */
> + if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) {
> + ret = -EUCLEAN;
> + btrfs_crit(fs_info,
> +   "corrupt symlink with append/immutable ino=%llu root=%llu\n",

no "\n" and please un-indent the string so it fits 80 columns.

> +   btrfs_ino(BTRFS_I(inode)),
> +   root->root_key.objectid);
> + goto make_bad;

I found some error handling issues, before the switch, there's
btrfs_free_path and there's one more at the make_bad label.

To fix that, please set path = NULL after the first btrfs_free_path, it
can handle a NULL when it's called again.

Next thing I'm not sure about are the ACLs that get initialized in some
cases. There's cache_no_acl() that only resets the inode::i_acl and
inode::i_default_acl, so I think this should be called too. Thanks.


Re: [PATCH 00/17] Freespace tree big fs_info cleanup

2018-05-11 Thread David Sterba
On Thu, May 10, 2018 at 03:44:39PM +0300, Nikolay Borisov wrote:
> Here is a series which cleans _all_ freespace tree functions from a redundant
> fs_info argument since they already take either a transaction or a
> block_group_cache structure. Both of those structures contain a reference to
> fs_info and can be used instead of an additional parameter. This is needed
> since I will be pulling some of the freespace tree code into btrfs-progs in
> implementing check/rebuild functionality for the freespace tree. So better
> have this sooner rather than later.
> 
> This series should bring no functional changes but just in case it passed the
> btrfs' selftests as well as a full xfstest run.
> 
> Nikolay Borisov (17):
>   btrfs: Make btrfs_init_dummy_trans initialize trans' fs_info field
>   btrfs: Remove fs_info argument from add_block_group_free_space
>   btrfs: Remove fs_info argument from __add_block_group_free_space
>   btrfs: Remove fs_info argument from __add_to_free_space_tree
>   btrfs: Remove fs_info parameter from add_new_free_space_info
>   btrfs: Remove fs_info argument from add_new_free_space
>   btrfs: Remove fs_info parameter from remove_block_group_free_space
>   btrfs: Remove fs_info argument from convert_free_space_to_bitmaps
>   btrfs: Remove fs_info parameter from convert_free_space_to_extents
>   btrfs: Remove fs_info argument from update_free_space_extent_count
>   btrfs: Remove fs_info argument from modify_free_space_bitmap
>   btrfs: Remove fs_info argument from add_free_space_extent
>   btrfs: Remove fs_info argument from remove_free_space_extent
>   btrfs: Remove fs_info argument from __remove_from_free_space_tree
>   btrfs: Remove fs_info argument from remove_from_free_space_tree
>   btrfs: Remove fs_info argument from add_to_free_space_tree
>   btrfs: Remove fs_info argument from populate_free_space_tree

All
Reviewed-by: David Sterba 

and added to misc-next, thanks.


Re: [PATCH v3 3/3] btrfs: Do super block verification before writing it to disk

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 01:35:27PM +0800, Qu Wenruo wrote:
> +/*
> + * Check the validation of super block at write time.
> + * Some checks like bytenr check will be skipped as their values will be
> + * overwritten soon.
> + * Extra checks like csum type and incompact flags will be executed here.
                                             ^

I almost missed it, it's 'incompat', short for 'incompatibility'

> + if (btrfs_super_incompat_flags(sb) & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
> + ret = -EUCLEAN;
> + btrfs_err(fs_info,
> + "invalid incompact flags, has 0x%llu valid mask 0x%llu",
                    ^

Also fixed, as it's in a user visible string.

> +   btrfs_super_incompat_flags(sb),
> +   BTRFS_FEATURE_INCOMPAT_SUPP);


Re: [PATCH v3 0/3] btrfs: Add write time super block validation

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 01:35:24PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux/tree/write_time_sb_check
> 
> We have 2 reports about corrupted btrfs super block, which has some garbage
> in its super block, but otherwise it's completely fine and its csum even
> matches.
> 
> This means we develop memory corruption during btrfs mount time.
> It's not clear whether it's caused by btrfs or some other kernel module,
> but at least let's do write time verification to catch such corruption
> early.
> 
> Current design is to do 2 different checks at mount time and super write
> time.
> And for write time check, it only checks the template super block
> (fs_info->super_to_commit) other than each super blocks to be written to
> disk, mostly to avoid duplicated checks.
> 
> Changelog:
> v2:
>   Rename btrfs_check_super_valid() to btrfs_validate_super() suggested
>   by Nikolay and David.
> v3:
>   Add a new patch to move btrfs_check_super_valid() to avoid forward
>   declaration.
>   Refactor btrfs_check_super_valid() to provide better naming and
>   function reusability.
>   Code style and comment update.
>   Use 2 different functions, btrfs_validate_mount_super() and
>   btrfs_validate_write_super(), for mount and write time super check.

Added as topic branch to next, I'm still targeting 4.18 with this
patchset so it'll end up in misc-next after some testing. Thanks.


Re: [PATCH v3 1/3] btrfs: Move btrfs_check_super_valid() to avoid forward declaration

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 11:36:54AM +0200, David Sterba wrote:
> On Fri, May 11, 2018 at 01:35:25PM +0800, Qu Wenruo wrote:
> > Just move btrfs_check_super_valid() before its single caller to avoid
> > forward declaration.
> 
> Please don't move functions just to get rid of the forward declarations.
> 
> Moving functions to make them static or if they're in a wrong .c is OK,
> but the extra forward declaration is not that bad and moving code
> without any change just pollutes the git history. I'll drop the patch,
> sorry.

Hm, OK I now see why you did it. Fixing up the order of the related
static functions would need new forward declarations, so I'll apply the
patch after all.
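
A minimal userspace sketch of the trade-off discussed here, with hypothetical
function names (this is not the btrfs code): a static helper that is called
before its definition needs a forward declaration, so reordering related
static functions either adds such declarations or requires moving code.

```c
#include <errno.h>

/*
 * Forward declaration: needed only because validate_super() is defined
 * below its caller. Moving the definition above the caller would remove
 * the need for this line. All names here are hypothetical.
 */
static int validate_super(unsigned int csum_type);

/* Caller placed first in the file, so it relies on the declaration above. */
static int validate_mount_super(unsigned int csum_type)
{
	return validate_super(csum_type);
}

/* Definition placed after its caller; 0 stands in for the only valid type. */
static int validate_super(unsigned int csum_type)
{
	return csum_type == 0 ? 0 : -EINVAL;
}
```

Dropping the forward declaration without also moving validate_super() above
its caller would make the call an implicit declaration, which modern
compilers reject.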


Re: [PATCH 0/3] Btrfs: stop abusing current->journal_info for direct I/O

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 12:53:36PM +0300, Nikolay Borisov wrote:
> 
> 
> On 11.05.2018 09:30, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Hi, everyone,
> > 
> > Btrfs currently abuses current->journal_info in btrfs_direct_IO() in
> > order to pass around some state to get_block() and submit_io(). This
> > hack is ugly and unnecessary because the data we pass around is only
> > used in one call frame. Robbie Ko also pointed out [1] that it could
> > potentially cause a crash if we happen to end up in start_transaction()
> > (e.g., from memory reclaim calling into btrfs_evict_inode(), which can
> > start a transaction). I'm not convinced that Robbie's case can happen in
> > practice since we are using GFP_NOFS for allocations during direct I/O,
> > but either way it's fragile and nasty.
> 
> When I worked initially on btrfs-over-swap I managed to hit a case where
> ext4 stacked on top of btrfs would crash since btrfs will overwrite
> journal_info which was set by ext4. So this change is indeed welcome :)

And also this, https://lkml.org/lkml/2017/12/14/165.


Re: [PATCH v3 01/11] Btrfs: remove stale comment referencing vmtruncate()

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 12:56:06AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Commit a41ad394a03b ("Btrfs: convert to the new truncate sequence")
> changed vmtruncate() to truncate_setsize() but didn't update the comment
> above it. truncate_setsize() never fails (the IS_SWAPFILE() check
> happens elsewhere), so remove the comment.

There's one more mention of vmtruncate at btrfs_page_mkwrite, can you
please remove it and review that the comment is not stale in other
respects? Thanks.


Re: [PATCH v3 05/11] Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 12:56:10AM -0700, Omar Sandoval wrote:
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -23,13 +23,12 @@
>  #define BTRFS_INODE_ORPHAN_META_RESERVED 1
>  #define BTRFS_INODE_DUMMY 2
>  #define BTRFS_INODE_IN_DEFRAG 3
> -#define BTRFS_INODE_HAS_ORPHAN_ITEM  4
> -#define BTRFS_INODE_HAS_ASYNC_EXTENT 5
> -#define BTRFS_INODE_NEEDS_FULL_SYNC  6
> -#define BTRFS_INODE_COPY_EVERYTHING  7
> -#define BTRFS_INODE_IN_DELALLOC_LIST 8
> -#define BTRFS_INODE_READDIO_NEED_LOCK 9
> -#define BTRFS_INODE_HAS_PROPS 10
> +#define BTRFS_INODE_HAS_ASYNC_EXTENT 4
> +#define BTRFS_INODE_NEEDS_FULL_SYNC  5
> +#define BTRFS_INODE_COPY_EVERYTHING  6
> +#define BTRFS_INODE_IN_DELALLOC_LIST 7
> +#define BTRFS_INODE_READDIO_NEED_LOCK 8
> +#define BTRFS_INODE_HAS_PROPS 9

Please keep such changes minimal and only relevant to the purpose of the
patch, in this case just remove the BTRFS_INODE_HAS_ORPHAN_ITEM .

There will be a hole left in the sequence but this is not a problem and
we're going to convert the defines to enums. The defines are prone to
error if the numbers get accidentally duplicated, like it happened not so
long ago with the fsinfo::fs_flags.
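
A rough sketch of what an enum-based version could look like, assuming the
names simply mirror the quoted defines with the BTRFS_ prefix dropped (the
helper functions are hypothetical and this is not the actual conversion).
The compiler assigns the values, so two flags cannot end up with the same
number by accident.

```c
#include <stdbool.h>

/*
 * Illustrative only: names mirror the defines quoted above with the
 * BTRFS_ prefix dropped; this is not the actual kernel conversion.
 * The compiler numbers the entries, so two flags cannot collide.
 */
enum inode_runtime_flag {
	INODE_ORDERED_DATA_CLOSE,
	INODE_ORPHAN_META_RESERVED,
	INODE_DUMMY,
	INODE_IN_DEFRAG,
	INODE_HAS_ASYNC_EXTENT,
	INODE_NEEDS_FULL_SYNC,
	INODE_COPY_EVERYTHING,
	INODE_IN_DELALLOC_LIST,
	INODE_READDIO_NEED_LOCK,
	INODE_HAS_PROPS,
	INODE_FLAG_COUNT	/* number of flags, always last */
};

/* Set one flag bit in a runtime flags word (hypothetical helper). */
static void flag_set(unsigned long *flags, enum inode_runtime_flag bit)
{
	*flags |= 1UL << bit;
}

/* Test one flag bit in a runtime flags word (hypothetical helper). */
static bool flag_test(unsigned long flags, enum inode_runtime_flag bit)
{
	return (flags & (1UL << bit)) != 0;
}
```

With an enum, removing an entry renumbers the following ones automatically,
so a patch that drops one flag touches a single line.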


Re: [PATCH v2] btrfs: incremental send, fix BUG when invalid memory access

2018-05-11 Thread Filipe Manana
On Fri, May 11, 2018 at 7:34 AM, robbieko  wrote:
> From: Robbie Ko 
>
> [BUG]
> btrfs incremental send BUG happens when creating a snapshot of snapshot
> that is being used by send.
>
> [REASON]
> The problem can happen if while we are doing a send one of the snapshots
> used (parent or send) is snapshotted, because snapshoting implies COWing
> the root of the source subvolume/snaphot.

snaphot -> snapshot

>
> 1. When send with the parent, the send process will get the commit roots
>from parent and send, and add references by extent_buffer_get.

When doing an incremental send, the send process will get the commit
roots from the parent and send snapshots,
and add references to them through extent_buffer_get().

>
> 2. When the snapshots(parent or send) is snapshotted, the committed root
>of the snapshot will be modified, because snapshoting implies COWing
>the root of the source subvolume/snaphot.

When a snapshot/subvolume is snapshotted, its root node is COWed
(transaction.c:create_pending_snapshot()).

>
> 3. When COWing, we will allocate new space to submit root and release
>the old space.
>
> Assume that A is the old commit root.
> __btrfs_cow_block()
> --btrfs_free_tree_block()
> btrfs_add_free_space(bytenr of A)


3. COWing releases the space used by the node immediately, through:

 __btrfs_cow_block()
 --btrfs_free_tree_block()
 btrfs_add_free_space(bytenr of node)

>
> 4. Therefore, the old commit_root space can be used when other processes
>need to allocate new treeblocks.
>However, alloc_extent_buffer is created by the bytenr.
>It will first find out if there is an existing extent_buffer through
>find_extent_buffer and cause the original extent_buffer to be modified.
>
> btrfs_alloc_tree_block
> --btrfs_reserve_extent
> find_free_extent (get bytenr of A)
> --btrfs_init_new_buffer (use bytenr of A)
> btrfs_find_create_tree_block
> --alloc_extent_buffer
> find_extent_buffer (get A)


4. Because send doesn't hold a transaction open, it's possible that
the transaction used to create
the snapshot commits, switches the commit root and the old space used
by the previous root node
gets assigned to some other node allocation. Allocation of a new node
will use the existing extent buffer
found in memory, to which we previously got a reference through
extent_buffer_get(), and allow the
extent buffer's content (pages) to be modified:

 btrfs_alloc_tree_block
 --btrfs_reserve_extent
 find_free_extent (get bytenr of A)
 --btrfs_init_new_buffer (use bytenr of A)
 btrfs_find_create_tree_block
 --alloc_extent_buffer
 find_extent_buffer (get A)

>
> 5. Eventually causing send process to access illegal memory.

5. So send can access invalid memory content and have unpredictable behaviour.

>
> Thus extent_buffer_get can only prevent extent_buffer from being released,
> but it cannot prevent extent_buffer from being used by others.
>
> [FIX]
> So we fixed the problem by copy commit_root to avoid accessing illegal
> memory.

So we fix the problem by copying the commit roots of the send and
parent snapshots and using those copies.
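
The difference between holding a reference and holding a copy can be shown
with a small userspace toy. Everything below is hypothetical (struct buf,
buf_get(), buf_clone() and buf_reuse() are made-up stand-ins, not the
extent_buffer API): a reference keeps the object alive, but an allocator
that finds the same cached object by its bytenr can still rewrite its
content, while a private copy is unaffected.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a cached tree block; not the kernel's extent_buffer. */
struct buf {
	unsigned long bytenr;
	int refs;
	char data[16];
};

/* Taking a reference only prevents the buffer from being freed. */
static struct buf *buf_get(struct buf *b)
{
	b->refs++;
	return b;
}

/* A private copy has its own content and its own reference count. */
static struct buf *buf_clone(const struct buf *b)
{
	struct buf *copy = malloc(sizeof(*copy));

	if (!copy)
		return NULL;
	*copy = *b;
	copy->refs = 1;
	return copy;
}

/*
 * Models a new allocation at the same bytenr: the cached buffer is found
 * and its content overwritten, regardless of who else holds a reference.
 */
static void buf_reuse(struct buf *b, const char *new_content)
{
	strncpy(b->data, new_content, sizeof(b->data) - 1);
	b->data[sizeof(b->data) - 1] = '\0';
}
```

A reader going through the shared reference sees the new content after
buf_reuse(), whereas a reader of the clone still sees the old root.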

>
> CallTrace looks like this:
>  [ cut here ]
>  kernel BUG at fs/btrfs/ctree.c:1861!
>  invalid opcode:  [#1] SMP
>  CPU: 6 PID: 24235 Comm: btrfs Tainted: P   O 3.10.105 #23721
>  88046652d680 ti: 88041b72 task.ti: 88041b72
>  RIP: 0010:[] read_node_slot+0x108/0x110 [btrfs]
>  RSP: 0018:88041b723b68  EFLAGS: 00010246
>  RAX: 88043ca6b000 RBX: 88041b723c50 RCX: 8800
>  RDX: 004c RSI: 880314b133f8 RDI: 880458b24000
>  RBP:  R08: 0001 R09: 88041b723c66
>  R10: 0001 R11: 1000 R12: 8803f3e48890
>  R13: 8803f3e48880 R14: 880466351800 R15: 0001
>  FS:  7f8c321dc8c0() GS:88047fcc()
>  CS:  0010 DS:  ES:  CR0: 80050033
>  R2: 7efd1006d000 CR3: 000213a24000 CR4: 003407e0
>  DR0:  DR1:  DR2: 
>  DR3:  DR6: fffe0ff0 DR7: 0400
>  Stack:
>  88041b723c50 8803f3e48880 8803f3e48890 8803f3e48880
>  880466351800 0001 a08dd9d7 88041b723c50
>  8803f3e48880 88041b723c66 a08dde85 a9ff88042d2c4400
>  Call Trace:
>  [] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
>  [] ? tree_advance+0xb5/0xc0 [btrfs]
>  [] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
>  [] ? finish_inode_if_needed+0x870/0x870 [btrfs]
>  [] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
>  [] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
>  [] ? handle_pte_fault+0x373/0x990
>  [] ? atomic_notifier_call_chain+0x16/0x20
>  [] ? set_task_cpu+0xb6/0x1d0
>  [] ? handle_mm_fault+0x143/0x2a0
>  [] ? __do_page_fault+0x1d0/0x500
>  [] ? check_preempt_curr+0x57/0x90
>  [] ? do_vfs_ioctl+0x4aa/0x990
>  [] ? do_fork+0x113/0x3b0
>  [] ?

Re: [PATCH 0/3] Btrfs: stop abusing current->journal_info for direct I/O

2018-05-11 Thread Nikolay Borisov


On 11.05.2018 09:30, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Hi, everyone,
> 
> Btrfs currently abuses current->journal_info in btrfs_direct_IO() in
> order to pass around some state to get_block() and submit_io(). This
> hack is ugly and unnecessary because the data we pass around is only
> used in one call frame. Robbie Ko also pointed out [1] that it could
> potentially cause a crash if we happen to end up in start_transaction()
> (e.g., from memory reclaim calling into btrfs_evict_inode(), which can
> start a transaction). I'm not convinced that Robbie's case can happen in
> practice since we are using GFP_NOFS for allocations during direct I/O,
> but either way it's fragile and nasty.

When I worked initially on btrfs-over-swap I managed to hit a case where
ext4 stacked on top of btrfs would crash since btrfs will overwrite
journal_info which was set by ext4. So this change is indeed welcome :)
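
A toy userspace model of that clobbering, assuming a made-up struct task
with a single journal_info slot (none of this is the kernel code): when the
inner layer stashes its per-call state in the shared slot, whatever the
outer layer stored there is lost, whereas passing the state as an argument
leaves the slot alone.

```c
#include <stddef.h>

/* Stand-in for the per-task slot; the real one lives in task_struct. */
struct task {
	void *journal_info;
};

/*
 * Old style: stash per-call state in the shared slot. Whatever an outer
 * layer stored there (e.g. its own journal handle) is lost.
 */
static void dio_via_shared_slot(struct task *cur, int *dio_state)
{
	cur->journal_info = dio_state;
	*(int *)cur->journal_info += 1;	/* callee fetches state from the slot */
	cur->journal_info = NULL;
}

/* New style: the state travels as an explicit argument; slot untouched. */
static void dio_via_argument(int *dio_state)
{
	*dio_state += 1;
}
```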

> 
> This series stops using current->journal_info and instead adds some
> extra arguments to the generic direct I/O code so that we can pass
> things around like sane people.
> 
> Based on Linus' master.
> 
> Thanks!
> 
> 1: https://patchwork.kernel.org/patch/10389077/
> 
> Omar Sandoval (3):
>   fs: add initial bh_result->b_private value to __blockdev_direct_IO()
>   fs: add private argument to dio_submit_t
>   Btrfs: stop abusing current->journal_info in btrfs_direct_IO()
> 
>  fs/btrfs/inode.c   | 39 ++-
>  fs/direct-io.c | 12 +++-
>  fs/ext4/inode.c|  6 +++---
>  fs/gfs2/aops.c |  2 +-
>  fs/ocfs2/aops.c|  5 ++---
>  include/linux/fs.h | 10 +-
>  6 files changed, 28 insertions(+), 46 deletions(-)
> 


Re: [PATCH v3 09/11] Btrfs: fix ENOSPC caused by orphan items reservations

2018-05-11 Thread Nikolay Borisov


On 11.05.2018 10:56, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Currently, we keep space reserved for all inode orphan items until the
> inode is evicted (i.e., all references to it are dropped). We hit an
> issue where an application would keep a bunch of deleted files open (by
> design) and thus keep a large amount of space reserved, causing ENOSPC
> errors when other operations tried to reserve space. This long-standing
> reservation isn't absolutely necessary for a couple of reasons:
> 
> - We can almost always make the reservation we need or steal from the
>   global reserve for the orphan item
> - If we can't, it's not the end of the world if we drop the orphan item
>   on the floor and let the next mount clean it up
> 
> So, get rid of persistent reservation and just reserve space in
> btrfs_evict_inode().
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/btrfs_inode.h |  17 +++--
>  fs/btrfs/inode.c   | 158 ++---
>  2 files changed, 46 insertions(+), 129 deletions(-)
> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index a81112706cd5..bbbe7f308d68 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -20,15 +20,14 @@
>   * new data the application may have written before commit.
>   */
>  #define BTRFS_INODE_ORDERED_DATA_CLOSE   0
> -#define BTRFS_INODE_ORPHAN_META_RESERVED 1
> -#define BTRFS_INODE_DUMMY 2
> -#define BTRFS_INODE_IN_DEFRAG 3
> -#define BTRFS_INODE_HAS_ASYNC_EXTENT 4
> -#define BTRFS_INODE_NEEDS_FULL_SYNC  5
> -#define BTRFS_INODE_COPY_EVERYTHING  6
> -#define BTRFS_INODE_IN_DELALLOC_LIST 7
> -#define BTRFS_INODE_READDIO_NEED_LOCK 8
> -#define BTRFS_INODE_HAS_PROPS 9
> +#define BTRFS_INODE_DUMMY 1
> +#define BTRFS_INODE_IN_DEFRAG 2
> +#define BTRFS_INODE_HAS_ASYNC_EXTENT 3
> +#define BTRFS_INODE_NEEDS_FULL_SYNC  4
> +#define BTRFS_INODE_COPY_EVERYTHING  5
> +#define BTRFS_INODE_IN_DELALLOC_LIST 6
> +#define BTRFS_INODE_READDIO_NEED_LOCK 7
> +#define BTRFS_INODE_HAS_PROPS 8
>  
>  /* in memory btrfs inode */
>  struct btrfs_inode {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 7ca55af8aa17..b64c4189e2c0 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3331,77 +3331,16 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
>  /*
>   * This creates an orphan entry for the given inode in case something goes wrong
>   * in the middle of an unlink.
> - *
> - * NOTE: caller of this function should reserve 5 units of metadata for
> - *this function.
>   */
>  int btrfs_orphan_add(struct btrfs_trans_handle *trans,
> - struct btrfs_inode *inode)
> +  struct btrfs_inode *inode)
>  {
> - struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
> - struct btrfs_root *root = inode->root;
> - struct btrfs_block_rsv *block_rsv = NULL;
> - int reserve = 0;
>   int ret;
>  
> - if (!root->orphan_block_rsv) {
> - block_rsv = btrfs_alloc_block_rsv(fs_info,
> -   BTRFS_BLOCK_RSV_TEMP);
> - if (!block_rsv)
> - return -ENOMEM;
> - }
> -
> - if (!test_and_set_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
> -   &inode->runtime_flags))
> - reserve = 1;
> -
> - spin_lock(&root->orphan_lock);
> - /* If someone has created ->orphan_block_rsv, be happy to use it. */
> - if (!root->orphan_block_rsv) {
> - root->orphan_block_rsv = block_rsv;
> - } else if (block_rsv) {
> - btrfs_free_block_rsv(fs_info, block_rsv);
> - block_rsv = NULL;
> - }
> -
> - atomic_inc(&root->orphan_inodes);
> - spin_unlock(&root->orphan_lock);
> -
> - /* grab metadata reservation from transaction handle */
> - if (reserve) {
> - ret = btrfs_orphan_reserve_metadata(trans, inode);
> - ASSERT(!ret);
> - if (ret) {
> - /*
> -  * dec doesn't need spin_lock as ->orphan_block_rsv
> -  * would be released only if ->orphan_inodes is
> -  * zero.
> -  */
> - atomic_dec(&root->orphan_inodes);
> - clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
> -   &inode->runtime_flags);
> - return ret;
> - }
> - }
> -
> - /* insert an orphan item to track this unlinked file */
> - ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
> - if (ret) {
> - if (reserve) {
> - clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
> - 

Re: [PATCH v3 2/3] btrfs: Refactor btrfs_check_super_valid()

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 01:35:26PM +0800, Qu Wenruo wrote:
> Refactor btrfs_check_super_valid() in the following ways:
> 
> 1) Rename it to btrfs_validate_mount_super()
>Now it's more obvious when the function should be called.
> 
> 2) Extract core check routine into __validate_super()
>So later write time check can reuse it, and if needed, we could also
>use __validate_super() to check each super block.

As there's no validate_super without the underscores, I'd rather drop
them. Otherwise ok.

Reviewed-by: David Sterba 


Re: [PATCH v3 3/3] btrfs: Do super block verification before writing it to disk

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 01:35:27PM +0800, Qu Wenruo wrote:
> There are already 2 reports about strangely corrupted super blocks,
> where csum still matches but extra garbage gets slipped into super block.
> 
> The corruption would look like:
> --
> superblock: bytenr=65536, device=/dev/sdc1
> -
> csum_type   41700 (INVALID)
> csum0x3b252d3a [match]
> bytenr  65536
> flags   0x1
> ( WRITTEN )
> magic   _BHRfS_M [match]
> ...
> incompat_flags  0x5b224169
> ( MIXED_BACKREF |
>   COMPRESS_LZO |
>   BIG_METADATA |
>   EXTENDED_IREF |
>   SKINNY_METADATA |
>   unknown flag: 0x5b224000 )
> ...
> --
> Or
> --
> superblock: bytenr=65536, device=/dev/mapper/x
> -
> csum_type  35355 (INVALID)
> csum_size  32
> csum   0xf0dbeddd [match]
> bytenr 65536
> flags  0x1
>( WRITTEN )
> magic  _BHRfS_M [match]
> ...
> incompat_flags 0x176d2169
>( MIXED_BACKREF |
>  COMPRESS_LZO |
>  BIG_METADATA |
>  EXTENDED_IREF |
>  SKINNY_METADATA |
>  unknown flag: 0x176d2000 )
> --
> 
> Obviously, csum_type and incompat_flags get some garbage, but its csum
> still matches, which means kernel calculates the csum based on corrupted
> super block memory.
> And after manually fixing these values, the filesystem is completely
> healthy without any problem exposed by btrfs check.
> 
> Although the cause is still unknown, at least detect it and prevent further
> corruption.
> 
> Reported-by: Ken Swenson 
> Reported-by: Ben Parsons <9parso...@gmail.com>
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/disk-io.c | 39 +++
>  1 file changed, 39 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b981ecc4b6f9..985695074c51 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2610,6 +2610,41 @@ static int btrfs_validate_mount_super(struct btrfs_fs_info *fs_info)
>   return __validate_super(fs_info, fs_info->super_copy, 0);
>  }
>  
> +/*
> + * Check the validation of super block at write time.
> + * Some checks like bytenr check will be skipped as their values will be
> + * overwritten soon.
> + * Extra checks like csum type and incompact flags will be executed here.
> + */
> +static int btrfs_validate_write_super(struct btrfs_fs_info *fs_info,
> +   struct btrfs_super_block *sb)
> +{
> + int ret;
> +
> + ret = __validate_super(fs_info, sb, -1);
> + if (ret < 0)
> + goto out;
> + if (btrfs_super_csum_type(sb) != BTRFS_CSUM_TYPE_CRC32) {
> + ret = -EUCLEAN;
> + btrfs_err(fs_info, "invalid csum type, has %u want %u",
> +   btrfs_super_csum_type(sb), BTRFS_CSUM_TYPE_CRC32);
> + goto out;
> + }
> + if (btrfs_super_incompat_flags(sb) & ~BTRFS_FEATURE_INCOMPAT_SUPP) {
> + ret = -EUCLEAN;
> + btrfs_err(fs_info,
> + "invalid incompact flags, has 0x%llu valid mask 0x%llu",
> +   btrfs_super_incompat_flags(sb),
> +   BTRFS_FEATURE_INCOMPAT_SUPP);

I think you need (unsigned long long) here as
BTRFS_FEATURE_INCOMPAT_SUPP does not have a type. I'll fix that.
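
A minimal userspace sketch of the check and of the cast, under a few
assumptions: EXAMPLE_INCOMPAT_SUPP is a made-up mask covering bits 0 to 9
(it is not the kernel's BTRFS_FEATURE_INCOMPAT_SUPP definition), both
helper names are hypothetical, and the message is printed in hex with %llx,
which is a choice of this example. The explicit (unsigned long long) casts
make the arguments match the %ll conversions regardless of how the mask
constant is typed.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical supported-feature mask for this example: bits 0..9. */
#define EXAMPLE_INCOMPAT_SUPP 0x3ffULL

/* Return the bits of flags that are not covered by the supported mask. */
static uint64_t unsupported_incompat_bits(uint64_t flags, uint64_t supported)
{
	return flags & ~supported;
}

/*
 * Format an error message for unsupported flags. The casts guarantee the
 * argument types match the %llx conversions on every platform.
 */
static int format_incompat_error(char *buf, size_t len, uint64_t flags,
				 uint64_t supported)
{
	return snprintf(buf, len,
			"invalid incompat flags, has 0x%llx valid mask 0x%llx",
			(unsigned long long)flags,
			(unsigned long long)supported);
}
```

Fed the corrupted value 0x5b224169 from the first report above, the helper
leaves 0x5b224000, the same value the dump lists as "unknown flag".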

Reviewed-by: David Sterba 


Re: [PATCH v3 1/3] btrfs: Move btrfs_check_super_valid() to avoid forward declaration

2018-05-11 Thread David Sterba
On Fri, May 11, 2018 at 01:35:25PM +0800, Qu Wenruo wrote:
> Just move btrfs_check_super_valid() before its single caller to avoid
> forward declaration.

Please don't move functions just to get rid of the forward declarations.

Moving functions to make them static or if they're in a wrong .c is OK,
but the extra forward declaration is not that bad and moving code
without any change just pollutes the git history. I'll drop the patch,
sorry.


Re: [PATCH v3 10/11] Btrfs: get rid of unused orphan infrastructure

2018-05-11 Thread Nikolay Borisov


On 11.05.2018 10:56, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Now that we don't keep long-standing reservations for orphan items,
> root->orphan_block_rsv isn't used. We can get rid of it, along with
> root->orphan_lock, which was used to protect it, root->orphan_inodes,
> which was used as a refcount for it, and btrfs_orphan_commit_root(),
> which was the last user of all of these.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/ctree.h   |  8 
>  fs/btrfs/disk-io.c |  9 -
>  fs/btrfs/extent-tree.c | 38 -
>  fs/btrfs/inode.c   | 43 +-
>  fs/btrfs/transaction.c |  1 -
>  5 files changed, 1 insertion(+), 98 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2771cc56a622..51408de11af2 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1219,9 +1219,6 @@ struct btrfs_root {
>   spinlock_t log_extents_lock[2];
>   struct list_head logged_list[2];
>  
> - spinlock_t orphan_lock;
> - atomic_t orphan_inodes;
> - struct btrfs_block_rsv *orphan_block_rsv;
>   int orphan_cleanup_state;
>  
>   spinlock_t inode_lock;
> @@ -2764,9 +2761,6 @@ void btrfs_delalloc_release_space(struct inode *inode,
>  void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
>   u64 len);
>  void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
> -int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
> -   struct btrfs_inode *inode);
> -void btrfs_orphan_release_metadata(struct btrfs_inode *inode);
>  int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
>struct btrfs_block_rsv *rsv,
>int nitems,
> @@ -3238,8 +3232,6 @@ int btrfs_update_inode_fallback(struct btrfs_trans_handle *trans,
>  int btrfs_orphan_add(struct btrfs_trans_handle *trans,
>   struct btrfs_inode *inode);
>  int btrfs_orphan_cleanup(struct btrfs_root *root);
> -void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
> -   struct btrfs_root *root);
>  int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
>  void btrfs_invalidate_inodes(struct btrfs_root *root);
>  void btrfs_add_delayed_iput(struct inode *inode);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 60caa68c3618..4a40bfdddabc 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1185,7 +1185,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>   root->inode_tree = RB_ROOT;
>   INIT_RADIX_TREE(&root->delayed_nodes_tree, GFP_ATOMIC);
>   root->block_rsv = NULL;
> - root->orphan_block_rsv = NULL;
>  
>   INIT_LIST_HEAD(&root->dirty_list);
>   INIT_LIST_HEAD(&root->root_list);
> @@ -1195,7 +1194,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>   INIT_LIST_HEAD(&root->ordered_root);
>   INIT_LIST_HEAD(&root->logged_list[0]);
>   INIT_LIST_HEAD(&root->logged_list[1]);
> - spin_lock_init(&root->orphan_lock);
>   spin_lock_init(&root->inode_lock);
>   spin_lock_init(&root->delalloc_lock);
>   spin_lock_init(&root->ordered_extent_lock);
> @@ -1216,7 +1214,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>   atomic_set(&root->log_commit[1], 0);
>   atomic_set(&root->log_writers, 0);
>   atomic_set(&root->log_batch, 0);
> - atomic_set(&root->orphan_inodes, 0);
>   refcount_set(&root->refs, 1);
>   atomic_set(&root->will_be_snapshotted, 0);
>   root->log_transid = 0;
> @@ -3674,8 +3671,6 @@ static void free_fs_root(struct btrfs_root *root)
>  {
>   iput(root->ino_cache_inode);
>   WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
> - btrfs_free_block_rsv(root->fs_info, root->orphan_block_rsv);
> - root->orphan_block_rsv = NULL;
>   if (root->anon_dev)
>   free_anon_bdev(root->anon_dev);
>   if (root->subv_writers)
> @@ -3766,7 +3761,6 @@ int btrfs_commit_super(struct btrfs_fs_info *fs_info)
>  
>  void close_ctree(struct btrfs_fs_info *fs_info)
>  {
> - struct btrfs_root *root = fs_info->tree_root;
>   int ret;
>  
>   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> @@ -3861,9 +3855,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>   btrfs_free_stripe_hash_table(fs_info);
>   btrfs_free_ref_cache(fs_info);
>  
> - __btrfs_free_block_rsv(root->orphan_block_rsv);
> - root->orphan_block_rsv = NULL;
> -
>   while (!list_empty(&fs_info->pinned_chunks)) {
>   struct extent_map *em;
>  
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 51b5e2da708c..3f2e026bc206 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> 

Re: [PATCH 0/3] Btrfs: stop abusing current->journal_info for direct I/O

2018-05-11 Thread David Sterba
On Thu, May 10, 2018 at 11:30:09PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Hi, everyone,
> 
> Btrfs currently abuses current->journal_info in btrfs_direct_IO() in
> order to pass around some state to get_block() and submit_io(). This
> hack is ugly and unnecessary because the data we pass around is only
> used in one call frame.

I'd very much like to get rid of the journal_info hack. The changes to
other filesystems are minimal.

The 3 patches look good to me, you can add my reviewed-by for btrfs
and ack for the rest.  I'm going to do a fstests round too.


Re: [PATCH v3 05/11] Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM

2018-05-11 Thread Nikolay Borisov


On 11.05.2018 10:56, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Now that we don't add orphan items for truncate, there can't be races on
> adding or deleting an orphan item, so this bit is unnecessary.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/btrfs_inode.h | 13 
>  fs/btrfs/inode.c   | 76 +++---
>  2 files changed, 26 insertions(+), 63 deletions(-)

Very nice,

Reviewed-by: Nikolay Borisov 

> 
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 234bae55b85d..a81112706cd5 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -23,13 +23,12 @@
>  #define BTRFS_INODE_ORPHAN_META_RESERVED 1
>  #define BTRFS_INODE_DUMMY 2
>  #define BTRFS_INODE_IN_DEFRAG 3
> -#define BTRFS_INODE_HAS_ORPHAN_ITEM  4
> -#define BTRFS_INODE_HAS_ASYNC_EXTENT 5
> -#define BTRFS_INODE_NEEDS_FULL_SYNC  6
> -#define BTRFS_INODE_COPY_EVERYTHING  7
> -#define BTRFS_INODE_IN_DELALLOC_LIST 8
> -#define BTRFS_INODE_READDIO_NEED_LOCK 9
> -#define BTRFS_INODE_HAS_PROPS 10
> +#define BTRFS_INODE_HAS_ASYNC_EXTENT 4
> +#define BTRFS_INODE_NEEDS_FULL_SYNC  5
> +#define BTRFS_INODE_COPY_EVERYTHING  6
> +#define BTRFS_INODE_IN_DELALLOC_LIST 7
> +#define BTRFS_INODE_READDIO_NEED_LOCK 8
> +#define BTRFS_INODE_HAS_PROPS 9
>  
>  /* in memory btrfs inode */
>  struct btrfs_inode {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1460823951d7..e22f8c9f6459 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3354,7 +3354,6 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
>   struct btrfs_root *root = inode->root;
>   struct btrfs_block_rsv *block_rsv = NULL;
>   int reserve = 0;
> - bool insert = false;
>   int ret;
>  
>   if (!root->orphan_block_rsv) {
> @@ -3364,10 +3363,6 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
>   return -ENOMEM;
>   }
>  
> - if (!test_and_set_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
> -   &inode->runtime_flags))
> - insert = true;
> -
>   if (!test_and_set_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
> &inode->runtime_flags))
>   reserve = 1;
> @@ -3381,8 +3376,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
>   block_rsv = NULL;
>   }
>  
> - if (insert)
> - atomic_inc(&root->orphan_inodes);
> + atomic_inc(&root->orphan_inodes);
>   spin_unlock(&root->orphan_lock);
>  
>   /* grab metadata reservation from transaction handle */
> @@ -3398,36 +3392,28 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
>   atomic_dec(&root->orphan_inodes);
>   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
> &inode->runtime_flags);
> - if (insert)
> - clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
> -   &inode->runtime_flags);
>   return ret;
>   }
>   }
>  
>   /* insert an orphan item to track this unlinked file */
> - if (insert) {
> - ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
> - if (ret) {
> - if (reserve) {
> - clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
> -   &inode->runtime_flags);
> - btrfs_orphan_release_metadata(inode);
> - }
> - /*
> -  * btrfs_orphan_commit_root may race with us and set
> -  * ->orphan_block_rsv to zero, in order to avoid that,
> -  * decrease ->orphan_inodes after everything is done.
> -  */
> - atomic_dec(&root->orphan_inodes);
> - if (ret != -EEXIST) {
> - clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
> -   &inode->runtime_flags);
> - btrfs_abort_transaction(trans, ret);
> - return ret;
> - }
> + ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
> + if (ret) {
> + if (reserve) {
> + clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
> +   &inode->runtime_flags);
> + btrfs_orphan_release_metadata(inode);
> + }
> + /*
> +  * btrfs_orphan_commit_root may race with us and set
> +  * ->orphan_block_rsv to zero, in order to avoid that,
> +  * decrease ->orphan_inodes after everything is done.
> +  */
> +   

Re: [PATCH 2/5] btrfs: Split btrfs_del_delalloc_inode into 2 functions

2018-05-11 Thread Nikolay Borisov


On 11.05.2018 08:44, Anand Jain wrote:
> 
> 
> On 04/27/2018 05:21 PM, Nikolay Borisov wrote:
>> This is in preparation of fixing delalloc inodes leakage on transaction
>> abort. Also export the new function.
>>
>> Signed-off-by: Nikolay Borisov 
> 
>  nit: I think we are reserving function prefix __ for some special
>  functions. I am not sure if the function name should prefix with __
>  here.

Generally, the __ prefix marks an internal function. In this case the
gist of the function (with no locking) is in the __-prefixed function,
whereas the non-__ version adds the necessary locking. I think this is
a fairly well-established pattern in the kernel.
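
For illustration only, here is a minimal user-space sketch of that
locked/unlocked pairing. The names (demo_root, demo_del_inode, and so
on) are hypothetical and are not the functions from the patch; a
pthread mutex stands in for the kernel spinlock. Note that identifiers
starting with __ are reserved in user-space C, so this mirrors the
kernel convention rather than recommending it for applications.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a root with a lock-protected inode count. */
struct demo_root {
	pthread_mutex_t lock;
	int nr_delalloc;	/* protected by lock */
};

/* Core work only: the caller must already hold root->lock. */
static void __demo_del_inode(struct demo_root *root)
{
	assert(root != NULL);
	if (root->nr_delalloc > 0)
		root->nr_delalloc--;
}

/* Public wrapper: takes the lock, then delegates to the __ variant. */
static void demo_del_inode(struct demo_root *root)
{
	pthread_mutex_lock(&root->lock);
	__demo_del_inode(root);
	pthread_mutex_unlock(&root->lock);
}

/* A caller that holds the lock across several removals uses the __ variant. */
static void demo_del_all(struct demo_root *root)
{
	pthread_mutex_lock(&root->lock);
	while (root->nr_delalloc > 0)
		__demo_del_inode(root);
	pthread_mutex_unlock(&root->lock);
}
```

The split lets a caller that already holds the lock reuse the core
logic without taking the lock a second time.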
> 
> Reviewed-by: Anand Jain 
> 
> Thanks, Anand
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 09/11] Btrfs: fix ENOSPC caused by orphan items reservations

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Currently, we keep space reserved for all inode orphan items until the
inode is evicted (i.e., all references to it are dropped). We hit an
issue where an application would keep a bunch of deleted files open (by
design) and thus keep a large amount of space reserved, causing ENOSPC
errors when other operations tried to reserve space. This long-standing
reservation isn't absolutely necessary for a couple of reasons:

- We can almost always make the reservation we need or steal from the
  global reserve for the orphan item
- If we can't, it's not the end of the world if we drop the orphan item
  on the floor and let the next mount clean it up

So, get rid of persistent reservation and just reserve space in
btrfs_evict_inode().

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/btrfs_inode.h |  17 +++--
 fs/btrfs/inode.c   | 158 ++---
 2 files changed, 46 insertions(+), 129 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index a81112706cd5..bbbe7f308d68 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -20,15 +20,14 @@
  * new data the application may have written before commit.
  */
 #define BTRFS_INODE_ORDERED_DATA_CLOSE 0
-#define BTRFS_INODE_ORPHAN_META_RESERVED   1
-#define BTRFS_INODE_DUMMY  2
-#define BTRFS_INODE_IN_DEFRAG  3
-#define BTRFS_INODE_HAS_ASYNC_EXTENT   4
-#define BTRFS_INODE_NEEDS_FULL_SYNC5
-#define BTRFS_INODE_COPY_EVERYTHING6
-#define BTRFS_INODE_IN_DELALLOC_LIST   7
-#define BTRFS_INODE_READDIO_NEED_LOCK  8
-#define BTRFS_INODE_HAS_PROPS  9
+#define BTRFS_INODE_DUMMY  1
+#define BTRFS_INODE_IN_DEFRAG  2
+#define BTRFS_INODE_HAS_ASYNC_EXTENT   3
+#define BTRFS_INODE_NEEDS_FULL_SYNC4
+#define BTRFS_INODE_COPY_EVERYTHING5
+#define BTRFS_INODE_IN_DELALLOC_LIST   6
+#define BTRFS_INODE_READDIO_NEED_LOCK  7
+#define BTRFS_INODE_HAS_PROPS  8
 
 /* in memory btrfs inode */
 struct btrfs_inode {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7ca55af8aa17..b64c4189e2c0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3331,77 +3331,16 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 /*
  * This creates an orphan entry for the given inode in case something goes wrong
  * in the middle of an unlink.
- *
- * NOTE: caller of this function should reserve 5 units of metadata for
- *  this function.
  */
 int btrfs_orphan_add(struct btrfs_trans_handle *trans,
-   struct btrfs_inode *inode)
+struct btrfs_inode *inode)
 {
-   struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
-   struct btrfs_root *root = inode->root;
-   struct btrfs_block_rsv *block_rsv = NULL;
-   int reserve = 0;
int ret;
 
-   if (!root->orphan_block_rsv) {
-   block_rsv = btrfs_alloc_block_rsv(fs_info,
- BTRFS_BLOCK_RSV_TEMP);
-   if (!block_rsv)
-   return -ENOMEM;
-   }
-
-   if (!test_and_set_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
- &inode->runtime_flags))
-   reserve = 1;
-
-   spin_lock(&root->orphan_lock);
-   /* If someone has created ->orphan_block_rsv, be happy to use it. */
-   if (!root->orphan_block_rsv) {
-   root->orphan_block_rsv = block_rsv;
-   } else if (block_rsv) {
-   btrfs_free_block_rsv(fs_info, block_rsv);
-   block_rsv = NULL;
-   }
-
-   atomic_inc(&root->orphan_inodes);
-   spin_unlock(&root->orphan_lock);
-
-   /* grab metadata reservation from transaction handle */
-   if (reserve) {
-   ret = btrfs_orphan_reserve_metadata(trans, inode);
-   ASSERT(!ret);
-   if (ret) {
-   /*
-* dec doesn't need spin_lock as ->orphan_block_rsv
-* would be released only if ->orphan_inodes is
-* zero.
-*/
-   atomic_dec(&root->orphan_inodes);
-   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
- &inode->runtime_flags);
-   return ret;
-   }
-   }
-
-   /* insert an orphan item to track this unlinked file */
-   ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
-   if (ret) {
-   if (reserve) {
-   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
- &inode->runtime_flags);
-   btrfs_orphan_release_metadata(inode);
-   }
-   /*
-* btrfs_orphan_commit_root may race with us and set
-* ->orpha

[PATCH v3 04/11] Btrfs: stop creating orphan items for truncate

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Currently, we insert an orphan item during a truncate so that if there's
a crash, we don't leak extents past the on-disk i_size. However, since
commit 7f4f6e0a3f6d ("Btrfs: only update disk_i_size as we remove
extents"), we keep disk_i_size in sync with the extent items as we
truncate, so orphan cleanup will never have any extents to remove. Don't
bother with the superfluous orphan item.
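
To see why the orphan item has nothing left to clean up, consider this
toy model (hypothetical names, fixed-size extents, and an extent-aligned
target size are all simplifying assumptions). Each step removes the last
extent and updates disk_i_size together, so stopping at any step, as a
crash would, leaves no extent past disk_i_size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EXTENT_SIZE 4096u

/* Hypothetical file: extents cover [0, nr_extents * DEMO_EXTENT_SIZE). */
struct demo_file {
	size_t nr_extents;
	uint64_t disk_i_size;
};

static uint64_t demo_extent_end(const struct demo_file *f)
{
	return (uint64_t)f->nr_extents * DEMO_EXTENT_SIZE;
}

/*
 * Remove tail extents until the file ends at new_size (assumed to be a
 * multiple of DEMO_EXTENT_SIZE). At most max_steps extents are removed,
 * which simulates a crash part way through the truncate.
 */
static void demo_truncate(struct demo_file *f, uint64_t new_size,
			  size_t max_steps)
{
	assert(f != NULL);
	while (max_steps > 0 && demo_extent_end(f) > new_size) {
		f->nr_extents--;
		/* disk_i_size moves together with the extent removal */
		f->disk_i_size = demo_extent_end(f);
		max_steps--;
	}
}

/* What orphan cleanup would look for after a crash. */
static bool demo_has_extents_past_i_size(const struct demo_file *f)
{
	return demo_extent_end(f) > f->disk_i_size;
}
```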

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/free-space-cache.c |   6 +-
 fs/btrfs/inode.c| 159 +++-
 2 files changed, 51 insertions(+), 114 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index e5b569bebc73..d5f80cb300be 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -253,10 +253,8 @@ int btrfs_truncate_free_space_cache(struct btrfs_trans_handle *trans,
truncate_pagecache(inode, 0);
 
/*
-* We don't need an orphan item because truncating the free space cache
-* will never be split across transactions.
-* We don't need to check for -EAGAIN because we're a free space
-* cache inode
+* We skip the throttling logic for free space cache inodes, so we don't
+* need to check for -EAGAIN.
 */
ret = btrfs_truncate_inode_items(trans, root, inode,
 0, BTRFS_EXTENT_DATA_KEY);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bd4975476f0e..1460823951d7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3341,8 +3341,8 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 }
 
 /*
- * This creates an orphan entry for the given inode in case something goes
- * wrong in the middle of an unlink/truncate.
+ * This creates an orphan entry for the given inode in case something goes wrong
+ * in the middle of an unlink.
  *
  * NOTE: caller of this function should reserve 5 units of metadata for
  *  this function.
@@ -3405,7 +3405,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
}
}
 
-   /* insert an orphan item to track this unlinked/truncated file */
+   /* insert an orphan item to track this unlinked file */
if (insert) {
ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
if (ret) {
@@ -3434,8 +3434,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
 }
 
 /*
- * We have done the truncate/delete so we can go ahead and remove the orphan
- * item for this particular inode.
+ * We have done the delete so we can go ahead and remove the orphan item for
+ * this particular inode.
  */
 static int btrfs_orphan_del(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
@@ -3479,7 +3479,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
struct btrfs_trans_handle *trans;
struct inode *inode;
u64 last_objectid = 0;
-   int ret = 0, nr_unlink = 0, nr_truncate = 0;
+   int ret = 0, nr_unlink = 0;
 
if (cmpxchg(&root->orphan_cleanup_state, 0, ORPHAN_CLEANUP_STARTED))
return 0;
@@ -3579,12 +3579,31 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
key.offset = found_key.objectid - 1;
continue;
}
+
}
+
/*
-* Inode is already gone but the orphan item is still there,
-* kill the orphan item.
+* If we have an inode with links, there are a couple of
+* possibilities. Old kernels (before v3.12) used to create an
+* orphan item for truncate indicating that there were possibly
+* extent items past i_size that needed to be deleted. In v3.12,
+* truncate was changed to update i_size in sync with the extent
+* items, but the (useless) orphan item was still created. Since
+* v4.18, we don't create the orphan item for truncate at all.
+*
+* So, this item could mean that we need to do a truncate, but
+* only if this filesystem was last used on a pre-v3.12 kernel
+* and was not cleanly unmounted. The odds of that are quite
+* slim, and it's a pain to do the truncate now, so just delete
+* the orphan item.
+*
+* It's also possible that this orphan item was supposed to be
+* deleted but wasn't. The inode number may have been reused,
+* but either way, we can delete the orphan item.
 */
-   if (ret == -ENOENT) {
+   if (ret == -ENOENT || inode->i_nlink) {
+   if (!ret)
+   iput(inode);
trans = btrfs_start_transaction(root, 1);
if (IS_ERR(trans)) 

[PATCH v3 06/11] Btrfs: delete dead code in btrfs_orphan_commit_root()

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_orphan_commit_root() tries to delete an orphan item for a
subvolume in the tree root, but we don't actually insert that item in
the first place. See commit 0a0d4415e338 ("Btrfs: delete dead code in
btrfs_orphan_add()"). We can get rid of it.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e22f8c9f6459..6110387f0218 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3302,7 +3302,6 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 {
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *block_rsv;
-   int ret;
 
if (atomic_read(&root->orphan_inodes) ||
root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
@@ -3323,17 +3322,6 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
root->orphan_block_rsv = NULL;
spin_unlock(&root->orphan_lock);
 
-   if (test_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED, &root->state) &&
-   btrfs_root_refs(&root->root_item) > 0) {
-   ret = btrfs_del_orphan_item(trans, fs_info->tree_root,
-   root->root_key.objectid);
-   if (ret)
-   btrfs_abort_transaction(trans, ret);
-   else
-   clear_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED,
- &root->state);
-   }
-
if (block_rsv) {
WARN_ON(block_rsv->size > 0);
btrfs_free_block_rsv(fs_info, block_rsv);
-- 
2.17.0



[PATCH v3 10/11] Btrfs: get rid of unused orphan infrastructure

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can get rid of it, along with
root->orphan_lock, which was used to protect it, root->orphan_inodes,
which was used as a refcount for it, and btrfs_orphan_commit_root(),
which was the last user of all of these.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ctree.h   |  8 
 fs/btrfs/disk-io.c |  9 -
 fs/btrfs/extent-tree.c | 38 -
 fs/btrfs/inode.c   | 43 +-
 fs/btrfs/transaction.c |  1 -
 5 files changed, 1 insertion(+), 98 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2771cc56a622..51408de11af2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1219,9 +1219,6 @@ struct btrfs_root {
spinlock_t log_extents_lock[2];
struct list_head logged_list[2];
 
-   spinlock_t orphan_lock;
-   atomic_t orphan_inodes;
-   struct btrfs_block_rsv *orphan_block_rsv;
int orphan_cleanup_state;
 
spinlock_t inode_lock;
@@ -2764,9 +2761,6 @@ void btrfs_delalloc_release_space(struct inode *inode,
 void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
u64 len);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
-int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
- struct btrfs_inode *inode);
-void btrfs_orphan_release_metadata(struct btrfs_inode *inode);
 int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 struct btrfs_block_rsv *rsv,
 int nitems,
@@ -3238,8 +3232,6 @@ int btrfs_update_inode_fallback(struct btrfs_trans_handle *trans,
 int btrfs_orphan_add(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode);
 int btrfs_orphan_cleanup(struct btrfs_root *root);
-void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
- struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 60caa68c3618..4a40bfdddabc 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1185,7 +1185,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
root->inode_tree = RB_ROOT;
INIT_RADIX_TREE(&root->delayed_nodes_tree, GFP_ATOMIC);
root->block_rsv = NULL;
-   root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
INIT_LIST_HEAD(&root->root_list);
@@ -1195,7 +1194,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&root->ordered_root);
INIT_LIST_HEAD(&root->logged_list[0]);
INIT_LIST_HEAD(&root->logged_list[1]);
-   spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
spin_lock_init(&root->delalloc_lock);
spin_lock_init(&root->ordered_extent_lock);
@@ -1216,7 +1214,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
atomic_set(&root->log_batch, 0);
-   atomic_set(&root->orphan_inodes, 0);
refcount_set(&root->refs, 1);
atomic_set(&root->will_be_snapshotted, 0);
root->log_transid = 0;
@@ -3674,8 +3671,6 @@ static void free_fs_root(struct btrfs_root *root)
 {
iput(root->ino_cache_inode);
WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
-   btrfs_free_block_rsv(root->fs_info, root->orphan_block_rsv);
-   root->orphan_block_rsv = NULL;
if (root->anon_dev)
free_anon_bdev(root->anon_dev);
if (root->subv_writers)
@@ -3766,7 +3761,6 @@ int btrfs_commit_super(struct btrfs_fs_info *fs_info)
 
 void close_ctree(struct btrfs_fs_info *fs_info)
 {
-   struct btrfs_root *root = fs_info->tree_root;
int ret;
 
set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
@@ -3861,9 +3855,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
btrfs_free_stripe_hash_table(fs_info);
btrfs_free_ref_cache(fs_info);
 
-   __btrfs_free_block_rsv(root->orphan_block_rsv);
-   root->orphan_block_rsv = NULL;
-
while (!list_empty(&fs_info->pinned_chunks)) {
struct extent_map *em;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 51b5e2da708c..3f2e026bc206 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5949,44 +5949,6 @@ void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans)
trans->chunk_bytes_reserved = 0;
 }
 
-/* Can only return 0 or -ENOSPC */
-int btrfs_orphan_reserve

[PATCH v3 07/11] Btrfs: don't return ino to ino cache if inode item removal fails

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
item will still be in the tree but we still return the ino to the ino
cache. That will blow up later when someone tries to allocate that ino,
so don't return it to the cache.
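
A minimal sketch of the rule, assuming a hypothetical fixed-size cache
(demo_ino_cache and demo_evict() are illustrative names, not btrfs
functions): the inode number goes back to the cache only when the
truncate succeeded, because otherwise the inode item is still in the
tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_SLOTS 4

/* Hypothetical cache of inode numbers that are free for reuse. */
struct demo_ino_cache {
	uint64_t free_inos[DEMO_CACHE_SLOTS];
	size_t nr_free;
};

/*
 * Finish evicting inode number ino. truncate_ret is the result of removing
 * the inode's items. Returns true only if ino was returned to the cache.
 */
static bool demo_evict(struct demo_ino_cache *cache, uint64_t ino,
		       int truncate_ret)
{
	assert(cache != NULL);

	/* Item removal failed: the inode item still exists, so do not recycle. */
	if (truncate_ret)
		return false;

	if (cache->nr_free == DEMO_CACHE_SLOTS)
		return false;

	cache->free_inos[cache->nr_free++] = ino;
	return true;
}
```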

Fixes: 581bb050941b ("Btrfs: Cache free inode numbers in memory")
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6110387f0218..73bc66d153ef 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5330,13 +5330,18 @@ void btrfs_evict_inode(struct inode *inode)
trans->block_rsv = rsv;
 
ret = btrfs_truncate_inode_items(trans, root, inode, 0, 0);
-   if (ret != -ENOSPC && ret != -EAGAIN)
+   if (ret) {
+   trans->block_rsv = &fs_info->trans_block_rsv;
+   btrfs_end_transaction(trans);
+   btrfs_btree_balance_dirty(fs_info);
+   if (ret != -ENOSPC && ret != -EAGAIN) {
+   btrfs_orphan_del(NULL, BTRFS_I(inode));
+   btrfs_free_block_rsv(fs_info, rsv);
+   goto no_delete;
+   }
+   } else {
break;
-
-   trans->block_rsv = &fs_info->trans_block_rsv;
-   btrfs_end_transaction(trans);
-   trans = NULL;
-   btrfs_btree_balance_dirty(fs_info);
+   }
}
 
btrfs_free_block_rsv(fs_info, rsv);
@@ -5345,12 +5350,8 @@ void btrfs_evict_inode(struct inode *inode)
 * Errors here aren't a big deal, it just means we leave orphan items
 * in the tree.  They will be cleaned up on the next mount.
 */
-   if (ret == 0) {
-   trans->block_rsv = root->orphan_block_rsv;
-   btrfs_orphan_del(trans, BTRFS_I(inode));
-   } else {
-   btrfs_orphan_del(NULL, BTRFS_I(inode));
-   }
+   trans->block_rsv = root->orphan_block_rsv;
+   btrfs_orphan_del(trans, BTRFS_I(inode));
 
trans->block_rsv = &fs_info->trans_block_rsv;
if (!(root == fs_info->tree_root ||
-- 
2.17.0



[PATCH v3 08/11] Btrfs: refactor btrfs_evict_inode() reserve refill dance

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

The truncate loop in btrfs_evict_inode() does two things at once:

- It refills the temporary block reserve, potentially stealing from the
  global reserve or committing
- It calls btrfs_truncate_inode_items()

The tangle of continues hides the fact that these two steps are actually
separate. Split the first step out into a separate function both for
clarity and so that we can reuse it in a later patch.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 113 ++-
 1 file changed, 42 insertions(+), 71 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 73bc66d153ef..7ca55af8aa17 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5189,13 +5189,52 @@ static void evict_inode_truncate_pages(struct inode *inode)
spin_unlock(&io_tree->lock);
 }
 
+static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
+   struct btrfs_block_rsv *rsv,
+   u64 min_size)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+   int failures = 0;
+
+   for (;;) {
+   struct btrfs_trans_handle *trans;
+   int ret;
+
+   ret = btrfs_block_rsv_refill(root, rsv, min_size,
+BTRFS_RESERVE_FLUSH_LIMIT);
+
+   if (ret && ++failures > 2) {
+   btrfs_warn(fs_info,
+  "could not allocate space for a delete; will truncate on mount");
+   return ERR_PTR(-ENOSPC);
+   }
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans) || !ret)
+   return trans;
+
+   /*
+* Try to steal from the global reserve if there is space for
+* it.
+*/
+   if (!btrfs_check_space_for_delayed_refs(trans, fs_info) &&
+   !btrfs_block_rsv_migrate(global_rsv, rsv, min_size, 0))
+   return trans;
+
+   /* If not, commit and try again. */
+   ret = btrfs_commit_transaction(trans);
+   if (ret)
+   return ERR_PTR(ret);
+   }
+}
+
 void btrfs_evict_inode(struct inode *inode)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_trans_handle *trans;
struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct btrfs_block_rsv *rsv, *global_rsv;
-   int steal_from_global = 0;
+   struct btrfs_block_rsv *rsv;
u64 min_size;
int ret;
 
@@ -5248,85 +5287,17 @@ void btrfs_evict_inode(struct inode *inode)
}
rsv->size = min_size;
rsv->failfast = 1;
-   global_rsv = &fs_info->global_block_rsv;
 
btrfs_i_size_write(BTRFS_I(inode), 0);
 
-   /*
-* This is a bit simpler than btrfs_truncate since we've already
-* reserved our space for our orphan item in the unlink, so we just
-* need to reserve some slack space in case we add bytes and update
-* inode item when doing the truncate.
-*/
while (1) {
-   ret = btrfs_block_rsv_refill(root, rsv, min_size,
-BTRFS_RESERVE_FLUSH_LIMIT);
-
-   /*
-* Try and steal from the global reserve since we will
-* likely not use this space anyway, we want to try as
-* hard as possible to get this to work.
-*/
-   if (ret)
-   steal_from_global++;
-   else
-   steal_from_global = 0;
-   ret = 0;
-
-   /*
-* steal_from_global == 0: we reserved stuff, hooray!
-* steal_from_global == 1: we didn't reserve stuff, boo!
-* steal_from_global == 2: we've committed, still not a lot of
-* room but maybe we'll have room in the global reserve this
-* time.
-* steal_from_global == 3: abandon all hope!
-*/
-   if (steal_from_global > 2) {
-   btrfs_warn(fs_info,
-  "Could not get space for a delete, will truncate on mount %d",
-  ret);
-   btrfs_orphan_del(NULL, BTRFS_I(inode));
-   btrfs_free_block_rsv(fs_info, rsv);
-   goto no_delete;
-   }
-
-   trans = btrfs_join_transaction(root);
+   trans = evict_refill_and_join(root, rsv, min_size);
if (IS_ERR(trans)) {
btrfs_orphan_del(NULL, BTRFS_I(inode));
btrfs_free_blo

[PATCH v3 11/11] Btrfs: reserve space for O_TMPFILE orphan item deletion

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_link() calls btrfs_orphan_del() if it's linking an O_TMPFILE but
it doesn't reserve space to do so. Even before the removal of the
orphan_block_rsv it wasn't using it.
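
The item count requested from the transaction can be spelled out as
follows. The enum names and demo_link_nr_items() are hypothetical; the
breakdown follows the comment in btrfs_link() shown in the diff, with
one extra item when the inode has no links yet (the O_TMPFILE case).

```c
#include <assert.h>

/* Items touched when linking an inode, per the comment in btrfs_link(). */
enum {
	DEMO_ITEMS_INODE_AND_REF = 2,	/* inode and inode ref */
	DEMO_ITEMS_DIR = 2,		/* dir items */
	DEMO_ITEMS_PARENT_INODE = 1,	/* parent inode */
	DEMO_ITEMS_ORPHAN_DEL = 1,	/* orphan item deletion if O_TMPFILE */
};

/* Number of items to reserve; i_nlink == 0 means an O_TMPFILE inode. */
static unsigned int demo_link_nr_items(unsigned int i_nlink)
{
	unsigned int nr = DEMO_ITEMS_INODE_AND_REF + DEMO_ITEMS_DIR +
			  DEMO_ITEMS_PARENT_INODE;

	if (i_nlink == 0)
		nr += DEMO_ITEMS_ORPHAN_DEL;

	assert(nr == 5 || nr == 6);
	return nr;
}
```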

Fixes: ef3b9af50bfa ("Btrfs: implement inode_operations callback tmpfile")
Reviewed-by: Filipe Manana 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1edb4148ec74..98cf08944552 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6465,8 +6465,9 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 * 2 items for inode and inode ref
 * 2 items for dir items
 * 1 item for parent inode
+* 1 item for orphan item deletion if O_TMPFILE
 */
-   trans = btrfs_start_transaction(root, 5);
+   trans = btrfs_start_transaction(root, inode->i_nlink ? 5 : 6);
if (IS_ERR(trans)) {
err = PTR_ERR(trans);
trans = NULL;
-- 
2.17.0



[PATCH v3 05/11] Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Now that we don't add orphan items for truncate, there can't be races on
adding or deleting an orphan item, so this bit is unnecessary.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/btrfs_inode.h | 13 
 fs/btrfs/inode.c   | 76 +++---
 2 files changed, 26 insertions(+), 63 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 234bae55b85d..a81112706cd5 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -23,13 +23,12 @@
 #define BTRFS_INODE_ORPHAN_META_RESERVED   1
 #define BTRFS_INODE_DUMMY  2
 #define BTRFS_INODE_IN_DEFRAG  3
-#define BTRFS_INODE_HAS_ORPHAN_ITEM4
-#define BTRFS_INODE_HAS_ASYNC_EXTENT   5
-#define BTRFS_INODE_NEEDS_FULL_SYNC6
-#define BTRFS_INODE_COPY_EVERYTHING7
-#define BTRFS_INODE_IN_DELALLOC_LIST   8
-#define BTRFS_INODE_READDIO_NEED_LOCK  9
-#define BTRFS_INODE_HAS_PROPS  10
+#define BTRFS_INODE_HAS_ASYNC_EXTENT   4
+#define BTRFS_INODE_NEEDS_FULL_SYNC5
+#define BTRFS_INODE_COPY_EVERYTHING6
+#define BTRFS_INODE_IN_DELALLOC_LIST   7
+#define BTRFS_INODE_READDIO_NEED_LOCK  8
+#define BTRFS_INODE_HAS_PROPS  9
 
 /* in memory btrfs inode */
 struct btrfs_inode {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1460823951d7..e22f8c9f6459 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3354,7 +3354,6 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
struct btrfs_root *root = inode->root;
struct btrfs_block_rsv *block_rsv = NULL;
int reserve = 0;
-   bool insert = false;
int ret;
 
if (!root->orphan_block_rsv) {
@@ -3364,10 +3363,6 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
return -ENOMEM;
}
 
-   if (!test_and_set_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
- &inode->runtime_flags))
-   insert = true;
-
if (!test_and_set_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
  &inode->runtime_flags))
reserve = 1;
@@ -3381,8 +3376,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
block_rsv = NULL;
}
 
-   if (insert)
-   atomic_inc(&root->orphan_inodes);
+   atomic_inc(&root->orphan_inodes);
spin_unlock(&root->orphan_lock);
 
/* grab metadata reservation from transaction handle */
@@ -3398,36 +3392,28 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
atomic_dec(&root->orphan_inodes);
clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
  &inode->runtime_flags);
-   if (insert)
-   clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
- &inode->runtime_flags);
return ret;
}
}
 
/* insert an orphan item to track this unlinked file */
-   if (insert) {
-   ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
-   if (ret) {
-   if (reserve) {
-   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
- &inode->runtime_flags);
-   btrfs_orphan_release_metadata(inode);
-   }
-   /*
-* btrfs_orphan_commit_root may race with us and set
-* ->orphan_block_rsv to zero, in order to avoid that,
-* decrease ->orphan_inodes after everything is done.
-*/
-   atomic_dec(&root->orphan_inodes);
-   if (ret != -EEXIST) {
-   clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
- &inode->runtime_flags);
-   btrfs_abort_transaction(trans, ret);
-   return ret;
-   }
+   ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
+   if (ret) {
+   if (reserve) {
+   clear_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
+ &inode->runtime_flags);
+   btrfs_orphan_release_metadata(inode);
+   }
+   /*
+* btrfs_orphan_commit_root may race with us and set
+* ->orphan_block_rsv to zero, in order to avoid that,
+* decrease ->orphan_inodes after everything is done.
+*/
+   atomic_dec(&root->orphan_inodes);
+   if (ret != -EEXIST) {
+   btrfs_abort_transaction(trans, ret);
+   return ret;
}
-

[PATCH v3 03/11] Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_free_extent() can fail because of ENOMEM. There's no reason to
panic here, we can just abort the transaction.
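
As a rough user-space illustration of aborting instead of panicking
(all names are hypothetical; demo_abort_transaction() only records the
first error, which is an assumption about the model, not a description
of btrfs_abort_transaction()):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical transaction: remembers the first error that aborted it. */
struct demo_trans {
	int aborted;
};

static void demo_abort_transaction(struct demo_trans *trans, int err)
{
	if (!trans->aborted)
		trans->aborted = err;
}

/*
 * Free nr extents; free_rets[i] is the simulated result of freeing extent i.
 * On failure the transaction is aborted and the loop stops, so the caller
 * sees the error instead of the whole system going down.
 */
static int demo_truncate_items(struct demo_trans *trans, const int *free_rets,
			       size_t nr, size_t *nr_freed)
{
	int ret = 0;
	size_t i;

	assert(trans != NULL && nr_freed != NULL);
	*nr_freed = 0;
	for (i = 0; i < nr; i++) {
		ret = free_rets[i];
		if (ret) {
			demo_abort_transaction(trans, ret);
			break;
		}
		(*nr_freed)++;
	}
	return ret;
}
```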

Fixes: f4b9aa8d3b87 ("btrfs_truncate")
Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 79d1da01a90d..bd4975476f0e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4655,7 +4655,10 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
extent_num_bytes, 0,
btrfs_header_owner(leaf),
ino, extent_offset);
-   BUG_ON(ret);
+   if (ret) {
+   btrfs_abort_transaction(trans, ret);
+   break;
+   }
if (btrfs_should_throttle_delayed_refs(trans, fs_info))
btrfs_async_run_delayed_refs(fs_info,
trans->delayed_ref_updates * 2,
-- 
2.17.0



[PATCH v3 02/11] Btrfs: fix error handling in btrfs_truncate_inode_items()

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

btrfs_truncate_inode_items() uses two variables for error handling, ret
and err. These are not handled consistently, leading to a couple of
bugs.

- Errors from btrfs_del_items() are handled but not propagated to the
  caller
- If btrfs_run_delayed_refs() fails and aborts the transaction, we
  continue running

Just use ret everywhere and simplify things a bit, fixing both of these
issues.
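
The single-variable pattern can be sketched like this. It is a toy
model with hypothetical names: loop_ret stands in for whatever the main
loop ended with (0, a positive status such as NEED_TRUNCATE_BLOCK, or a
negative errno), and flush_ret for the result of the final batched
delete.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Returns the value the caller should see. The final flush only runs when
 * the loop did not fail, and its error replaces ret only if it fails itself,
 * so no error is silently dropped and no earlier error is overwritten.
 */
static int demo_truncate_ret(int loop_ret, int flush_ret, int *nr_flushes)
{
	int ret = loop_ret;

	assert(nr_flushes != NULL);
	if (ret >= 0) {
		int err = flush_ret;	/* result of the final batched delete */

		(*nr_flushes)++;
		if (err)
			ret = err;
	}
	return ret;
}
```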

Fixes: 79787eaab461 ("btrfs: replace many BUG_ONs with proper error handling")
Fixes: 1262133b8d6f ("Btrfs: account for crcs in delayed ref processing")
Reviewed-by: Nikolay Borisov 
Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 55 
 1 file changed, 28 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fef8dbb6a93f..79d1da01a90d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4442,7 +4442,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
int pending_del_slot = 0;
int extent_type = -1;
int ret;
-   int err = 0;
u64 ino = btrfs_ino(BTRFS_I(inode));
u64 bytes_deleted = 0;
bool be_nice = false;
@@ -4494,22 +4493,19 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 * up a huge file in a single leaf.  Most of the time that
 * bytes_deleted is > 0, it will be huge by the time we get here
 */
-   if (be_nice && bytes_deleted > SZ_32M) {
-   if (btrfs_should_end_transaction(trans)) {
-   err = -EAGAIN;
-   goto error;
-   }
+   if (be_nice && bytes_deleted > SZ_32M &&
+   btrfs_should_end_transaction(trans)) {
+   ret = -EAGAIN;
+   goto out;
}
 
-
path->leave_spinning = 1;
ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
-   if (ret < 0) {
-   err = ret;
+   if (ret < 0)
goto out;
-   }
 
if (ret > 0) {
+   ret = 0;
/* there are no items in the tree for us to truncate, we're
 * done
 */
@@ -4620,7 +4616,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 * We have to bail so the last_size is set to
 * just before this extent.
 */
-   err = NEED_TRUNCATE_BLOCK;
+   ret = NEED_TRUNCATE_BLOCK;
break;
}
 
@@ -4687,7 +4683,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
pending_del_nr);
if (ret) {
btrfs_abort_transaction(trans, ret);
-   goto error;
+   break;
}
pending_del_nr = 0;
}
@@ -4698,8 +4694,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
trans->delayed_ref_updates = 0;
ret = btrfs_run_delayed_refs(trans,
   updates * 2);
-   if (ret && !err)
-   err = ret;
+   if (ret)
+   break;
}
}
/*
@@ -4707,8 +4703,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 * and let the transaction restart
 */
if (should_end) {
-   err = -EAGAIN;
-   goto error;
+   ret = -EAGAIN;
+   break;
}
goto search_again;
} else {
@@ -4716,32 +4712,37 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
}
}
 out:
-   if (pending_del_nr) {
-   ret = btrfs_del_items(trans, root, path, pending_del_slot,
+   if (ret >= 0 && pending_del_nr) {
+   int err;
+
+   err = btrfs_del_items(trans, root, path, pending_del_slot,
  pending_del_nr);
-   if (ret)
-   btrfs_abort_transaction(trans, ret);
+   if (err) {
+   btrfs_abort_transaction(trans, err);
+   ret = err;
+   }
}
-error:
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
ASSERT(last_size >= new

[PATCH v3 01/11] Btrfs: remove stale comment referencing vmtruncate()

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Commit a41ad394a03b ("Btrfs: convert to the new truncate sequence")
changed vmtruncate() to truncate_setsize() but didn't update the comment
above it. truncate_setsize() never fails (the IS_SWAPFILE() check
happens elsewhere), so remove the comment.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d241285a0d2a..fef8dbb6a93f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5106,7 +5106,6 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
if (ret)
return ret;
 
-   /* we don't support swapfiles, so vmtruncate shouldn't fail */
truncate_setsize(inode, newsize);
 
/* Disable nonlocked read DIO to avoid the end less truncate */
-- 
2.17.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 00/11] Btrfs: orphan and truncate fixes

2018-05-11 Thread Omar Sandoval
From: Omar Sandoval 

Hi,

This is v3 of the fixes for the orphan item early ENOSPC issue we hit at
Facebook. The big change is that I now also got rid of
BTRFS_INODE_HAS_ORPHAN_ITEM (thanks, Nikolay) and shuffled the patches
around so there is less churn.

Changes since v2:

- Add patch 5 to get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
- Move patch 10 to patch 6
- Got rid of patch 5; the bug goes away in the process of removing code
  for patches 9 and 10
- Rename patch 10 back to what it was called in v1

Changes since v1:

- Added two extra cleanups, patches 10 and 11
- Added a forgotten clear of the orphan bit in patch 8
- Reworded titles of patches 6 and 9
- Added people's reviewed-bys

Cover letter from v1:

At Facebook we hit an early ENOSPC issue which we tracked down to the
reservations for orphan items of deleted-but-still-open files. The
primary function of this series is to fix that bug, but I ended up
uncovering a pile of other issues in the process, most notably that the
orphan items we create for truncate are useless.
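
For readers unfamiliar with the "deleted-but-still-open" state mentioned above, here is a minimal userspace sketch of how a file gets into it. It is only an illustration, not part of the series: the helper name unlinked_but_open_nlink() and the /tmp template are made up for this example, and it says nothing about how btrfs accounts for the inode internally.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it while keeping the descriptor open,
 * and return the link count reported by fstat(). Returns -1 on error.
 * On success *fd_ret holds the still-open descriptor; the caller closes it.
 */
static long unlinked_but_open_nlink(int *fd_ret)
{
	char path[] = "/tmp/orphan-demo-XXXXXX";	/* hypothetical location */
	struct stat st;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;

	/* The name goes away here, but the inode lives until the last close(). */
	if (unlink(path) < 0 || fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}

	*fd_ret = fd;
	return (long)st.st_nlink;
}
```

A caller would typically check the returned link count, keep writing through the descriptor as long as needed, and close() it at the end; only then can the filesystem release the inode.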

I've also posted an xfstest that reproduces this bug.

Thanks!

Omar Sandoval (11):
  Btrfs: remove stale comment referencing vmtruncate()
  Btrfs: fix error handling in btrfs_truncate_inode_items()
  Btrfs: don't BUG_ON() in btrfs_truncate_inode_items()
  Btrfs: stop creating orphan items for truncate
  Btrfs: get rid of BTRFS_INODE_HAS_ORPHAN_ITEM
  Btrfs: delete dead code in btrfs_orphan_commit_root()
  Btrfs: don't return ino to ino cache if inode item removal fails
  Btrfs: refactor btrfs_evict_inode() reserve refill dance
  Btrfs: fix ENOSPC caused by orphan items reservations
  Btrfs: get rid of unused orphan infrastructure
  Btrfs: reserve space for O_TMPFILE orphan item deletion

 fs/btrfs/btrfs_inode.h      |  18 +-
 fs/btrfs/ctree.h            |   8 -
 fs/btrfs/disk-io.c          |   9 -
 fs/btrfs/extent-tree.c      |  38 ---
 fs/btrfs/free-space-cache.c |   6 +-
 fs/btrfs/inode.c            | 576 ++--
 fs/btrfs/transaction.c      |   1 -
 7 files changed, 170 insertions(+), 486 deletions(-)

-- 
2.17.0



[PATCH 00/11] btrfs-progs: Rework of "subvolume list/show" and relax the root privileges of them

2018-05-11 Thread Tomohiro Misono
Hello,

This series is an updated version of
  [RFC PATCH v3 0/7] btrfs-progs: Allow normal user to call "subvolume list/show" [1]
and requires new ioctls, which can be found on the mailing list as
  [PATCH v4 0/3] btrfs: Add three new unprivileged ioctls to allow normal users to call "sub list/show" etc.

Or, code can be found at:
  kernel ... https://github.com/t-msn/linux/tree/add-user-subvol-ioctl-misc
  progs  ... https://github.com/t-msn/btrfs-progs/tree/rework-sub-list

Since libbtrfsutil has been merged, I completely rewrote the logic using
libbtrfsutil and reset the version number.

The aim of this series is to relax the root privileges of "sub list/show"
while keeping the output as consistent as possible between root and
non-privileged users. For "subvolume list", the default output differs from
current btrfs-progs (see below).

* Behavior summary of new "sub list/show"
 - "subvolume list"
   - The default behavior is changed to output only the subvolumes that
     exist below the specified path (including the specified path itself;
     subvolumes mounted below the specified path are not considered yet).
   - If the kernel supports the new ioctls, the path to a non-subvolume
     directory can be specified.
   - If the kernel supports the new ioctls, a non-privileged user can also
     call it. Subvolumes which cannot be accessed will be skipped.

  Note that root user can get all the subvolume information in the fs
  by using -a option just as before.

 - "subvolume show"
   - No change for root.
   - If the kernel supports the new ioctls, a non-privileged user can also
     call it. In that case, the path shown is relative to the mount point and
     the snapshots field lists only snapshots that exist under the mount point.
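
To make the new "only below the specified path" rule for "subvolume list" concrete, here is a small sketch of the path relation it implies. This is an assumption-laden illustration: the series itself walks the tree through the libbtrfsutil iterator rather than comparing strings, and path_is_at_or_below() is a hypothetical helper that exists only in this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if @path names @dir itself or something below it.
 * Both are slash-separated paths relative to the mount point, without a
 * trailing slash; an empty @dir stands for the mount point and matches all.
 */
static bool path_is_at_or_below(const char *dir, const char *path)
{
	size_t len = strlen(dir);

	if (len == 0)
		return true;
	if (strncmp(dir, path, len) != 0)
		return false;
	/* Require a component boundary so "sub1" does not match "sub10". */
	return path[len] == '\0' || path[len] == '/';
}
```

With the layout used by the cli-test in patch 11, listing "sub1" would keep "sub1" and "sub1/subsub1" and drop everything under "sub2".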

* Patch structure
 1st-5th patches update libbtrfsutil to use the new ioctls:
   - Relax the privileges of the following functions if the kernel supports
     the new ioctls and @top/@id is zero (i.e. the given path/fd is used
     instead of an arbitrary subvolume id):
     - util_subvolume_info()
     - subvolume iterator related ones (util_subvolume_iterator_next() etc.)
   - For the subvolume iterator, if the kernel supports the new ioctls and
     @top is zero, a non-subvolume directory can be specified as the starting
     point. Also, a subvolume which cannot be accessed (either because of a
     permission error or because it is not found, which may happen if another
     volume is mounted in the path) will be skipped.

 6th patch updates "sub list" to use libbtrfsutil (no behavior change).
   This is a copy of the following non-merged patch originally written
   by Omar Sandoval:
     btrfs-progs: use libbtrfsutil for subvolume list [2]
   except that this commit keeps the libbtrfs implementation which the above
   commit tries to remove.

   (I suspect that part of the reason the original patch has not been
   merged is that it removes libbtrfs, and this commit changes that. But
   I'm completely fine with the original patch instead of this.)

 7th-9th patches update the behavior of "sub list/show".

 10th-11th patches add a cli-test for the new behavior of "sub list".

* Future todo:
If this approach is ok, I'd like to update the output of "sub list" further:
  - Consider subvolumes mounted below the specified path and list them as well
  - Remove obsolete fields (i.e. top-level) from the output

Any comments are welcome.
Thanks,
Tomohiro Misono

[1] https://www.spinics.net/lists/linux-btrfs/msg76008.html
[2] https://www.spinics.net/lists/linux-btrfs/msg74917.html 


Tomohiro Misono (11):
  btrfs-progs: ioctl/libbtrfsutil: Add 3 definitions of new unprivileged
    ioctl
  btrfs-progs: libbtrfsutil: Factor out btrfs_util_subvolume_info_fd()
  btrfs-progs: libbtrfsutil: Relax the privileges of
    util_subvolume_info()
  btrfs-progs: libbtrfsutil: Factor out
    btrfs_util_subvolume_iterator_next()
  btrfs-progs: libbtrfsutil: Update the behavior of subvolume iterator
    and relax the privileges
  btrfs-progs: sub list: Use libbtrfsutil for subvolume list
  btrfs-progs: sub list: Change the default behavior of "subvolume list"
    and allow non-privileged user to call it
  btrfs-progs: utils: Fallback to open without O_NOATIME flag in
    find_mount_root():
  btrfs-progs: sub show: Allow non-privileged user to call "subvolume
    show"
  btrfs-progs: test: Add helper function to check if test user exists
  btrfs-progs: test: Add cli-test/009 to check subvolume list for both
    root and normal user

 Documentation/btrfs-subvolume.asciidoc     |    2 +
 cmds-subvolume.c                           | 1123 +++-
 ioctl.h                                    |   86 +++
 libbtrfsutil/btrfs.h                       |   84 +++
 libbtrfsutil/btrfsutil.h                   |   26 +-
 libbtrfsutil/errors.c                      |    8 +
 libbtrfsutil/subvolume.c                   |  429 +--
 tests/cli-tests/009-subvolume-list/test.sh |  136
 tests/common                               |   10 +
 utils.c                                    |    3 +
 10 files changed, 1819 insertions(+), 88

[PATCH 11/11] btrfs-progs: test: Add cli-test/009 to check subvolume list for both root and normal user

2018-05-11 Thread Tomohiro Misono
Signed-off-by: Tomohiro Misono 
---
 tests/cli-tests/009-subvolume-list/test.sh | 136 +
 1 file changed, 136 insertions(+)
 create mode 100755 tests/cli-tests/009-subvolume-list/test.sh

diff --git a/tests/cli-tests/009-subvolume-list/test.sh b/tests/cli-tests/009-subvolume-list/test.sh
new file mode 100755
index ..5f7b7919
--- /dev/null
+++ b/tests/cli-tests/009-subvolume-list/test.sh
@@ -0,0 +1,136 @@
+#!/bin/bash
+# test for "subvolume list" both for root and normal user
+
+source "$TEST_TOP/common"
+
+check_testuser
+check_prereq mkfs.btrfs
+check_prereq btrfs
+
+setup_root_helper
+prepare_test_dev
+
+
+# test if the ids returned by "sub list" match expected ids
+# $1  ... indicate run as root or test user
+# $2  ... PATH to be specified by sub list command
+# $3~ ... expected return ids
+test_list()
+{
+   local SUDO
+   if [ $1 -eq 1 ]; then
+   SUDO=$SUDO_HELPER
+   else
+   SUDO="sudo -u progs-test"
+   fi
+
+   result=$(run_check_stdout $SUDO "$TOP/btrfs" subvolume list "$2" | \
+   awk '{print $2}' | xargs | sort -n)
+
+   shift
+   shift
+   expected=($(echo "$@" | tr " " "\n" | sort -n))
+   expected=$(IFS=" "; echo "${expected[*]}")
+
+   if [ "$result" != "$expected" ]; then
+   echo "result  : $result"
+   echo "expected: $expected"
+   _fail "ids returned by sub list does not match expected ids"
+   fi
+}
+
+run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f "$TEST_DEV"
+run_check_mount_test_dev
+cd "$TEST_MNT"
+
+# create subvolumes and directories and make some non-readable
+# by user 'progs-test'
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub1
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub1/subsub1
+run_check $SUDO_HELPER mkdir sub1/dir
+
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub2
+run_check $SUDO_HELPER mkdir -p sub2/dir/dirdir
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub2/dir/subsub2
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub2/dir/dirdir/subsubX
+
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub3
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub3/subsub3
+run_check $SUDO_HELPER mkdir sub3/dir
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub3/dir/subsubY
+run_check $SUDO_HELPER chmod o-r sub3
+
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub4
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub4/subsub4
+run_check $SUDO_HELPER mkdir sub4/dir
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create sub4/dir/subsubZ
+run_check $SUDO_HELPER setfacl -m u:progs-test:- sub4/dir
+
+run_check $SUDO_HELPER touch "file"
+
+# expected result for root at mount point:
+#
+# ID 256 gen 8 top level 5 path sub1
+# ID 258 gen 7 top level 256 path sub1/subsub1
+# ID 259 gen 10 top level 5 path sub2
+# ID 260 gen 9 top level 259 path sub2/dir/subsub2
+# ID 261 gen 10 top level 259 path sub2/dir/dirdir/subsubX
+# ID 262 gen 14 top level 5 path sub3
+# ID 263 gen 12 top level 262 path sub3/subsub3
+# ID 264 gen 13 top level 262 path sub3/dir/subsubY
+# ID 265 gen 17 top level 5 path sub4
+# ID 266 gen 15 top level 265 path sub4/subsub4
+# ID 267 gen 16 top level 265 path sub4/dir/subsubZ
+
+# check for root for both absolute/relative path
+# always returns all subvolumes
+all=(256 258 259 260 261 262 263 264 265 266 267)
+test_list 1 "$TEST_MNT" "${all[@]}"
+test_list 1 "$TEST_MNT/sub1" "256 258"
+test_list 1 "$TEST_MNT/sub1/dir" ""
+test_list 1 "$TEST_MNT/sub2" "259 260 261"
+test_list 1 "$TEST_MNT/sub2/dir" "260 261"
+test_list 1 "$TEST_MNT/sub3" "262 263 264"
+test_list 1 "$TEST_MNT/sub4" "265 266 267"
+run_mustfail "should fail for file" \
+   $SUDO_HELPER "$TOP/btrfs" subvolume list "$TEST_MNT/file"
+
+test_list 1 "." "${all[@]}"
+test_list 1 "sub1" "256 258"
+test_list 1 "sub1/dir" ""
+test_list 1 "sub2" "259 260 261"
+test_list 1 "sub2/dir" "260 261"
+test_list 1 "sub3" "262 263 264"
+test_list 1 "sub4" "265 266 267"
+run_mustfail "should fail for file" \
+   $SUDO_HELPER "$TOP/btrfs" subvolume list "file"
+
+# check for normal user for both absolute/relative path
+# only returns subvolumes under specified path
+test_list 0 "$TEST_MNT" "256 258 259 260 261 265 266"
+test_list 0 "$TEST_MNT/sub1" "256 258"
+test_list 0 "$TEST_MNT/sub1/dir" ""
+test_list 0 "$TEST_MNT/sub2" "259 260 261"
+test_list 0 "$TEST_MNT/sub2/dir" "260 261"
+run_mustfail "should raise permission error" \
+   sudo -u progs-test "$TOP/btrfs" subvolume list "$TEST_MNT/sub3"
+test_list 0 "$TEST_MNT/sub4" "265 266"
+run_mustfail "should raise permission error" \
+   sudo -u progs-test "$TOP/btrfs" subvolume list "$TEST_MNT/sub4/dir"
+run_mustfail "should fail for file" \
+   sudo -u progs-test "$TOP/btrfs" subvolume list "$TEST_MNT/file"
+
+test_list 0 "." "256 258 259 260 261 265 266"
+test_list 0 "sub1/dir" ""
+test_list 0 "sub2" "259 260 261"
+test_list 0 "sub2/dir" "

[PATCH 03/11] btrfs-progs: libbtrfsutil: Relax the privileges of util_subvolume_info()

2018-05-11 Thread Tomohiro Misono
By using the new ioctl (BTRFS_IOC_GET_SUBVOL_INFO), this commit allows a
non-privileged user to call util_subvolume_info() as long as @id is zero
(the user can only get information about a subvolume that they can open).
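
The decision this patch adds at the top of btrfs_util_subvolume_info_fd() can be summarized in a few lines. The sketch below is not the patch code; the enum and pick_info_route() are hypothetical names used only to show which path a caller ends up on, based on the diff that follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome labels, used only in this sketch. */
enum info_route {
	ROUTE_TREE_SEARCH,	/* existing privileged search path, any @id */
	ROUTE_SUBVOL_INFO,	/* BTRFS_IOC_GET_SUBVOL_INFO on the given fd */
	ROUTE_REJECT,		/* non-root caller passed a non-zero @id */
};

static enum info_route pick_info_route(bool is_root, uint64_t id)
{
	if (is_root)
		return ROUTE_TREE_SEARCH;
	/* A non-root caller may only ask about the subvolume behind the fd. */
	return id == 0 ? ROUTE_SUBVOL_INFO : ROUTE_REJECT;
}
```

In the actual patch the rejected case is reported as BTRFS_UTIL_ERROR_INVALID_ARGUMENT_FOR_USER.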

Signed-off-by: Tomohiro Misono 
---
 libbtrfsutil/btrfsutil.h |  7 +-
 libbtrfsutil/errors.c    |  4
 libbtrfsutil/subvolume.c | 58
 3 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h
index 6d655f49..5fe798c5 100644
--- a/libbtrfsutil/btrfsutil.h
+++ b/libbtrfsutil/btrfsutil.h
@@ -63,6 +63,8 @@ enum btrfs_util_error {
BTRFS_UTIL_ERROR_SYNC_FAILED,
BTRFS_UTIL_ERROR_START_SYNC_FAILED,
BTRFS_UTIL_ERROR_WAIT_SYNC_FAILED,
+   BTRFS_UTIL_ERROR_INVALID_ARGUMENT_FOR_USER,
+   BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED,
 };
 
 /**
@@ -266,7 +268,10 @@ struct btrfs_util_subvolume_info {
  * to check whether the subvolume exists; %BTRFS_UTIL_ERROR_SUBVOLUME_NOT_FOUND
  * will be returned if it does not.
  *
- * This requires appropriate privilege (CAP_SYS_ADMIN).
+ * This requires appropriate privilege (CAP_SYS_ADMIN) on older kernels.
+ * On newer kernels which support BTRFS_IOC_GET_SUBVOL_INFO, a
+ * non-privileged user with appropriate permission for @path can use this too
+ * (in that case @id must be zero).
  *
  * Return: %BTRFS_UTIL_OK on success, non-zero error code on failure.
  */
diff --git a/libbtrfsutil/errors.c b/libbtrfsutil/errors.c
index 634edc65..f196fa71 100644
--- a/libbtrfsutil/errors.c
+++ b/libbtrfsutil/errors.c
@@ -45,6 +45,10 @@ static const char * const error_messages[] = {
[BTRFS_UTIL_ERROR_SYNC_FAILED] = "Could not sync filesystem",
[BTRFS_UTIL_ERROR_START_SYNC_FAILED] = "Could not start filesystem sync",
[BTRFS_UTIL_ERROR_WAIT_SYNC_FAILED] = "Could not wait for filesystem sync",
+   [BTRFS_UTIL_ERROR_INVALID_ARGUMENT_FOR_USER] =
+   "Non-root user cannot specify subvolume id",
+   [BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED] =
+   "Could not get subvolume information by BTRFS_IOC_GET_SUBVOL_INFO",
 };
 
 PUBLIC const char *btrfs_util_strerror(enum btrfs_util_error err)
diff --git a/libbtrfsutil/subvolume.c b/libbtrfsutil/subvolume.c
index 0d7ef5bf..3ce6e0a6 100644
--- a/libbtrfsutil/subvolume.c
+++ b/libbtrfsutil/subvolume.c
@@ -31,6 +31,14 @@
 
 #include "btrfsutil_internal.h"
 
+static bool is_root(void)
+{
+   uid_t uid;
+
+   uid = geteuid();
+   return (uid == 0);
+}
+
 /*
  * This intentionally duplicates btrfs_util_is_subvolume_fd() instead of opening
  * a file descriptor and calling it, because fstat() and fstatfs() don't accept
@@ -383,11 +391,61 @@ static enum btrfs_util_error get_subvolume_info_root(int fd, uint64_t id,
return BTRFS_UTIL_OK;
 }
 
+static enum btrfs_util_error get_subvolume_info_user(int fd,
+struct btrfs_util_subvolume_info *subvol)
+{
+   struct btrfs_ioctl_get_subvol_info_args info;
+   int ret;
+
+   ret = ioctl(fd, BTRFS_IOC_GET_SUBVOL_INFO, &info);
+   if (ret < 0)
+   return BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED;
+
+   subvol->id = info.id;
+   subvol->parent_id = info.parent_id;
+   subvol->dir_id = info.dirid;
+   subvol->flags = info.flags;
+   subvol->generation = info.generation;
+
+   memcpy(subvol->uuid, info.uuid, sizeof(subvol->uuid));
+   memcpy(subvol->parent_uuid, info.parent_uuid,
+   sizeof(subvol->parent_uuid));
+   memcpy(subvol->received_uuid, info.received_uuid,
+   sizeof(subvol->received_uuid));
+
+   subvol->ctransid = info.ctransid;
+   subvol->otransid = info.otransid;
+   subvol->stransid = info.stransid;
+   subvol->rtransid = info.rtransid;
+
+   subvol->ctime.tv_sec  = info.ctime.sec;
+   subvol->ctime.tv_nsec = info.ctime.nsec;
+   subvol->otime.tv_sec  = info.otime.sec;
+   subvol->otime.tv_nsec = info.otime.nsec;
+   subvol->stime.tv_sec  = info.stime.sec;
+   subvol->stime.tv_nsec = info.stime.nsec;
+   subvol->rtime.tv_sec  = info.rtime.sec;
+   subvol->rtime.tv_nsec = info.rtime.nsec;
+
+   return BTRFS_UTIL_OK;
+}
+
 PUBLIC enum btrfs_util_error btrfs_util_subvolume_info_fd(int fd, uint64_t id,
  struct btrfs_util_subvolume_info *subvol)
 {
enum btrfs_util_error err;
 
+   if (!is_root()) {
+   if (id != 0)
+   return BTRFS_UTIL_ERROR_INVALID_ARGUMENT_FOR_USER;
+
+   err = btrfs_util_is_subvolume_fd(fd);
+   if (err)
+   return err;
+
+   return get_subvolume_info_user(fd, subvol);
+   }
+
if (id == 0) {
err = btrfs_util_is_subvolume_fd(fd);
if (err)
-- 
2.14.3



[PATCH 02/11] btrfs-progs: libbtrfsutil: Factor out btrfs_util_subvolume_info_fd()

2018-05-11 Thread Tomohiro Misono
Factor out the main logic of btrfs_util_subvolume_info_fd().
This is preparatory work for relaxing the root privilege requirement of this
function.

No functional change.
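
After the split, the public wrapper keeps the argument checks and the new static helper does the search. The id range check that stays in the wrapper can be restated as a standalone predicate; the sketch below is only that restatement, with a hypothetical function name, and with the three constants written out from the btrfs headers as I remember them (please treat the values as an assumption and check them against ctree.h).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values of the corresponding BTRFS_*_OBJECTID constants. */
#define FS_TREE_OBJECTID	5ULL
#define FIRST_FREE_OBJECTID	256ULL
#define LAST_FREE_OBJECTID	(-256ULL)

/*
 * True if @id can name a subvolume: either the top-level fs tree or an id
 * inside the free objectid range. This is the logical negation of the
 * condition under which the wrapper returns SUBVOLUME_NOT_FOUND.
 */
static bool subvolume_id_in_range(uint64_t id)
{
	if (id == FS_TREE_OBJECTID)
		return true;
	return id >= FIRST_FREE_OBJECTID && id <= LAST_FREE_OBJECTID;
}
```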

Signed-off-by: Tomohiro Misono 
---
 libbtrfsutil/subvolume.c | 45 ++---
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/libbtrfsutil/subvolume.c b/libbtrfsutil/subvolume.c
index 867b3e10..0d7ef5bf 100644
--- a/libbtrfsutil/subvolume.c
+++ b/libbtrfsutil/subvolume.c
@@ -295,8 +295,8 @@ PUBLIC enum btrfs_util_error btrfs_util_subvolume_info(const char *path,
return err;
 }
 
-PUBLIC enum btrfs_util_error btrfs_util_subvolume_info_fd(int fd, uint64_t id,
- struct btrfs_util_subvolume_info *subvol)
+static enum btrfs_util_error get_subvolume_info_root(int fd, uint64_t id,
+struct btrfs_util_subvolume_info *subvol)
 {
struct btrfs_ioctl_search_args search = {
.key = {
@@ -310,27 +310,10 @@ PUBLIC enum btrfs_util_error btrfs_util_subvolume_info_fd(int fd, uint64_t id,
.nr_items = 0,
},
};
-   enum btrfs_util_error err;
size_t items_pos = 0, buf_off = 0;
bool need_root_item = true, need_root_backref = true;
int ret;
 
-   if (id == 0) {
-   err = btrfs_util_is_subvolume_fd(fd);
-   if (err)
-   return err;
-
-   err = btrfs_util_subvolume_id_fd(fd, &id);
-   if (err)
-   return err;
-   }
-
-   if ((id < BTRFS_FIRST_FREE_OBJECTID && id != BTRFS_FS_TREE_OBJECTID) ||
-   id > BTRFS_LAST_FREE_OBJECTID) {
-   errno = ENOENT;
-   return BTRFS_UTIL_ERROR_SUBVOLUME_NOT_FOUND;
-   }
-
search.key.min_objectid = search.key.max_objectid = id;
 
if (subvol) {
@@ -400,6 +383,30 @@ PUBLIC enum btrfs_util_error btrfs_util_subvolume_info_fd(int fd, uint64_t id,
return BTRFS_UTIL_OK;
 }
 
+PUBLIC enum btrfs_util_error btrfs_util_subvolume_info_fd(int fd, uint64_t id,
+ struct btrfs_util_subvolume_info *subvol)
+{
+   enum btrfs_util_error err;
+
+   if (id == 0) {
+   err = btrfs_util_is_subvolume_fd(fd);
+   if (err)
+   return err;
+
+   err = btrfs_util_subvolume_id_fd(fd, &id);
+   if (err)
+   return err;
+   }
+
+   if ((id < BTRFS_FIRST_FREE_OBJECTID && id != BTRFS_FS_TREE_OBJECTID) ||
+   id > BTRFS_LAST_FREE_OBJECTID) {
+   errno = ENOENT;
+   return BTRFS_UTIL_ERROR_SUBVOLUME_NOT_FOUND;
+   }
+
+   return get_subvolume_info_root(fd, id, subvol);
+}
+
 PUBLIC enum btrfs_util_error btrfs_util_get_subvolume_read_only_fd(int fd,
   bool *read_only_ret)
 {
-- 
2.14.3




[PATCH 10/11] btrfs-progs: test: Add helper function to check if test user exists

2018-05-11 Thread Tomohiro Misono
Test user 'progs-test' will be used to test the behavior of a normal user.

In order to pass this check, add the user with "useradd -M progs-test".
Note that progs-test should not have root privileges.
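
For comparison with the shell helper below, which relies on "id -u", the same existence check can be written against the passwd database from C. This is only a sketch for readers who want the equivalent in a test program; user_exists() is a hypothetical name and is not part of the test suite.

```c
#include <assert.h>
#include <pwd.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if @name resolves to an account via the passwd database.
 * getpwnam() also returns NULL on lookup errors, so a false result means
 * "not found or could not be looked up".
 */
static bool user_exists(const char *name)
{
	return getpwnam(name) != NULL;
}
```

A test harness would call user_exists("progs-test") and skip the run when it returns false, which mirrors what check_testuser() does with _not_run.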

Signed-off-by: Tomohiro Misono 
---
 tests/common | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/tests/common b/tests/common
index 4b266c5b..76006efa 100644
--- a/tests/common
+++ b/tests/common
@@ -314,6 +314,16 @@ check_global_prereq()
fi
 }
 
+check_testuser()
+{
+   id -u progs-test > /dev/null 2>&1
+   if [ $? -ne 0 ]; then
+   _not_run "Need to add user \"progs-test\""
+   fi
+   # Note that progs-test should not have root privileges
+   # otherwise test may not run as expected
+}
+
 check_image()
 {
local image
-- 
2.14.3




[PATCH 04/11] btrfs-progs: libbtrfsutil: Factor out btrfs_util_subvolume_iterator_next()

2018-05-11 Thread Tomohiro Misono
Factor out the main logic of btrfs_util_subvolume_iterator_next().
This is preparatory work for updating the behavior of this function
and relaxing the required root privilege.

No functional change happens.

Signed-off-by: Tomohiro Misono 
---
 libbtrfsutil/subvolume.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/subvolume.c b/libbtrfsutil/subvolume.c
index 3ce6e0a6..08bbeca2 100644
--- a/libbtrfsutil/subvolume.c
+++ b/libbtrfsutil/subvolume.c
@@ -1255,7 +1255,7 @@ static enum btrfs_util_error build_subvol_path(struct btrfs_util_subvolume_itera
return BTRFS_UTIL_OK;
 }
 
-PUBLIC enum btrfs_util_error btrfs_util_subvolume_iterator_next(struct btrfs_util_subvolume_iterator *iter,
+static enum btrfs_util_error subvolume_iterator_next_root(struct btrfs_util_subvolume_iterator *iter,
char **path_ret,
uint64_t *id_ret)
 {
@@ -1331,6 +1331,13 @@ out:
return BTRFS_UTIL_OK;
 }
 
+PUBLIC enum btrfs_util_error btrfs_util_subvolume_iterator_next(struct btrfs_util_subvolume_iterator *iter,
+   char **path_ret,
+   uint64_t *id_ret)
+{
+   return subvolume_iterator_next_root(iter, path_ret, id_ret);
+}
+
 PUBLIC enum btrfs_util_error btrfs_util_subvolume_iterator_next_info(struct btrfs_util_subvolume_iterator *iter,
  char **path_ret,
  struct btrfs_util_subvolume_info *subvol)
-- 
2.14.3




[PATCH 05/11] btrfs-progs: libbtrfsutil: Update the behavior of subvolume iterator and relax the privileges

2018-05-11 Thread Tomohiro Misono
By using the new ioctls (BTRFS_IOC_GET_SUBVOL_ROOTREF and
BTRFS_IOC_INO_LOOKUP_USER), this commit updates the subvolume iterator when it
is created by btrfs_util_create_subvolume_iterator() with @top zero
(i.e. if the iterator is created from a given path/fd).

In that case,
 - an iterator can be created from a non-subvolume directory
   and will skip a subvolume if
   - it does not exist or has a different id from the subvolume found
     by INO_LOOKUP_USER (may happen if a dir in the path is being mounted)
   - it cannot be opened due to a permission error

Since the above ioctls do not require root privileges, a non-privileged user
can also use the iterator. If @top is specified, the behavior is the same
as before (and thus a non-privileged user cannot use it).
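
The skip rules listed above can be read as a small predicate. The sketch below is an interpretation, not the code from this patch: should_skip_entry() is a hypothetical helper, and treating exactly ENOENT and EACCES as the "skip" errors is an assumption, since the relevant hunk is not quoted in full here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether an iterator entry should be silently skipped.
 * @open_errno:  errno from trying to open the resolved path, 0 on success.
 * @expected_id: subvolume id reported by the rootref lookup.
 * @found_id:    id of whatever is actually at the resolved path
 *               (only meaningful when @open_errno is 0).
 */
static bool should_skip_entry(int open_errno, uint64_t expected_id,
			      uint64_t found_id)
{
	if (open_errno == ENOENT || open_errno == EACCES)
		return true;	/* missing, or not accessible to this user */
	if (open_errno != 0)
		return false;	/* unexpected failure: report it, do not hide it */
	return found_id != expected_id;	/* something else is mounted there */
}
```

The point of separating the cases is that only "not there" and "not allowed" are hidden from the caller; any other failure still surfaces as an error.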

Signed-off-by: Tomohiro Misono 
---
 libbtrfsutil/btrfsutil.h |  19 ++-
 libbtrfsutil/errors.c    |   4 +
 libbtrfsutil/subvolume.c | 319 +++
 3 files changed, 315 insertions(+), 27 deletions(-)

diff --git a/libbtrfsutil/btrfsutil.h b/libbtrfsutil/btrfsutil.h
index 5fe798c5..b90dc93e 100644
--- a/libbtrfsutil/btrfsutil.h
+++ b/libbtrfsutil/btrfsutil.h
@@ -65,6 +65,8 @@ enum btrfs_util_error {
BTRFS_UTIL_ERROR_WAIT_SYNC_FAILED,
BTRFS_UTIL_ERROR_INVALID_ARGUMENT_FOR_USER,
BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED,
+   BTRFS_UTIL_ERROR_GET_SUBVOL_ROOTREF_FAILED,
+   BTRFS_UTIL_ERROR_INO_LOOKUP_USER_FAILED,
 };
 
 /**
@@ -510,6 +512,11 @@ struct btrfs_util_subvolume_iterator;
  * @flags: Bitmask of BTRFS_UTIL_SUBVOLUME_ITERATOR_* flags.
  * @ret: Returned iterator.
  *
+ * For newer kernels which support BTRFS_IOC_GET_SUBVOL_ROOTREF and
+ * BTRFS_IOC_INO_LOOKUP_USER, @path does not have to refer to a subvolume when
+ * @top is zero. In that case, only subvolumes below the specified path will
+ * be returned.
+ *
  * The returned iterator must be freed with
  * btrfs_util_destroy_subvolume_iterator().
  *
@@ -558,7 +565,11 @@ int btrfs_util_subvolume_iterator_fd(const struct btrfs_util_subvolume_iterator
  * Must be freed with free().
  * @id_ret: Returned subvolume ID. May be %NULL.
  *
- * This requires appropriate privilege (CAP_SYS_ADMIN).
+ * This requires appropriate privilege (CAP_SYS_ADMIN) on older kernels.
+ * On newer kernels which support BTRFS_IOC_GET_SUBVOL_ROOTREF and
+ * BTRFS_IOC_INO_LOOKUP_USER, a non-privileged user can also use this.
+ * In that case, subvolumes which cannot be accessed by the user will be
+ * skipped.
  *
  * Return: %BTRFS_UTIL_OK on success, %BTRFS_UTIL_ERROR_STOP_ITERATION if there
  * are no more subvolumes, non-zero error code on failure.
@@ -577,7 +588,11 @@ enum btrfs_util_error btrfs_util_subvolume_iterator_next(struct btrfs_util_subvo
  * This convenience function basically combines
  * btrfs_util_subvolume_iterator_next() and btrfs_util_subvolume_info().
  *
- * This requires appropriate privilege (CAP_SYS_ADMIN).
+ * This requires appropriate privilege (CAP_SYS_ADMIN) on older kernels.
+ * On newer kernels which support BTRFS_IOC_GET_SUBVOL_INFO,
+ * BTRFS_IOC_GET_SUBVOL_ROOTREF and BTRFS_IOC_INO_LOOKUP_USER, a
+ * non-privileged user can also use this. In that case, subvolumes which
+ * cannot be accessed by the user will be skipped.
  *
  * Return: See btrfs_util_subvolume_iterator_next().
  */
diff --git a/libbtrfsutil/errors.c b/libbtrfsutil/errors.c
index f196fa71..21bbc7b2 100644
--- a/libbtrfsutil/errors.c
+++ b/libbtrfsutil/errors.c
@@ -49,6 +49,10 @@ static const char * const error_messages[] = {
"Non-root user cannot specify subvolume id",
[BTRFS_UTIL_ERROR_GET_SUBVOL_INFO_FAILED] =
"Could not get subvolume information by BTRFS_IOC_GET_SUBVOL_INFO",
+   [BTRFS_UTIL_ERROR_GET_SUBVOL_ROOTREF_FAILED] =
+   "Could not get rootref information by BTRFS_IOC_GET_SUBVOL_ROOTREF",
+   [BTRFS_UTIL_ERROR_INO_LOOKUP_USER_FAILED] =
+   "Could not resolve subvolume path by BTRFS_IOC_INO_LOOKUP_USER",
 };
 
 PUBLIC const char *btrfs_util_strerror(enum btrfs_util_error err)
diff --git a/libbtrfsutil/subvolume.c b/libbtrfsutil/subvolume.c
index 08bbeca2..036af546 100644
--- a/libbtrfsutil/subvolume.c
+++ b/libbtrfsutil/subvolume.c
@@ -39,6 +39,24 @@ static bool is_root(void)
return (uid == 0);
 }
 
+/*
+ * We need both BTRFS_IOC_GET_SUBVOL_ROOTREF and BTRFS_IOC_INO_LOOKUP_USER
+ * but only checks BTRFS_IOC_GET_SUBVOL_ROOTREF for brevity.
+ */
+static bool check_support_rootref_ioctl(int fd)
+{
+   struct btrfs_ioctl_get_subvol_rootref_args args;
+   int ret;
+
+   memset(&args, 0, sizeof(args));
+   ret = ioctl(fd, BTRFS_IOC_GET_SUBVOL_ROOTREF, &args);
+
+   if (ret < 0 && errno == ENOTTY)
+   return false;
+
+   return true;
+}
+
 /*
  * This intentionally duplicates btrfs_util_is_subvolume_fd() instead of opening
  * a file descriptor and calling it, because fstat() and fstatfs() don't accept
@@ -760,12 +778,18 @@ PUBLIC enum btrfs_util_error 
btrfs_util_create_subvolu

[PATCH 06/11] btrfs-progs: sub list: Use libbtrfsutil for subvolume list

2018-05-11 Thread Tomohiro Misono
This is a copy of the following non-merged patch originally written
by Omar Sandoval:
  btrfs-progs: use libbtrfsutil for subvolume list
except that this commit keeps the libbtrfs implementation which the above
commit tries to remove (therefore this adds the suffix _v2 to structs/functions).

Original Author: Omar Sandoval 
Signed-off-by: Tomohiro Misono 
---
 cmds-subvolume.c | 961 +--
 1 file changed, 934 insertions(+), 27 deletions(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index 45363a5a..06686943 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -404,6 +404,913 @@ keep_fd:
return ret;
 }
 
+#define BTRFS_LIST_NFILTERS_INCREASE   (2 * BTRFS_LIST_FILTER_MAX)
+#define BTRFS_LIST_NCOMPS_INCREASE (2 * BTRFS_LIST_COMP_MAX)
+
+struct listed_subvol {
+   struct btrfs_util_subvolume_info info;
+   char *path;
+};
+
+struct subvol_list {
+   size_t num;
+   struct listed_subvol subvols[];
+};
+
+typedef int (*btrfs_list_filter_func_v2)(struct listed_subvol *, uint64_t);
+typedef int (*btrfs_list_comp_func_v2)(const struct listed_subvol *,
+   const struct listed_subvol *,
+   int);
+
+struct btrfs_list_filter_v2 {
+   btrfs_list_filter_func_v2 filter_func;
+   u64 data;
+};
+
+struct btrfs_list_comparer_v2 {
+   btrfs_list_comp_func_v2 comp_func;
+   int is_descending;
+};
+
+struct btrfs_list_filter_set_v2 {
+   int total;
+   int nfilters;
+   int only_deleted;
+   struct btrfs_list_filter_v2 filters[0];
+};
+
+struct btrfs_list_comparer_set_v2 {
+   int total;
+   int ncomps;
+   struct btrfs_list_comparer_v2 comps[0];
+};
+
+static struct {
+   char*name;
+   char*column_name;
+   int need_print;
+} btrfs_list_columns[] = {
+   {
+   .name   = "ID",
+   .column_name= "ID",
+   .need_print = 0,
+   },
+   {
+   .name   = "gen",
+   .column_name= "Gen",
+   .need_print = 0,
+   },
+   {
+   .name   = "cgen",
+   .column_name= "CGen",
+   .need_print = 0,
+   },
+   {
+   .name   = "parent",
+   .column_name= "Parent",
+   .need_print = 0,
+   },
+   {
+   .name   = "top level",
+   .column_name= "Top Level",
+   .need_print = 0,
+   },
+   {
+   .name   = "otime",
+   .column_name= "OTime",
+   .need_print = 0,
+   },
+   {
+   .name   = "parent_uuid",
+   .column_name= "Parent UUID",
+   .need_print = 0,
+   },
+   {
+   .name   = "received_uuid",
+   .column_name= "Received UUID",
+   .need_print = 0,
+   },
+   {
+   .name   = "uuid",
+   .column_name= "UUID",
+   .need_print = 0,
+   },
+   {
+   .name   = "path",
+   .column_name= "Path",
+   .need_print = 0,
+   },
+   {
+   .name   = NULL,
+   .column_name= NULL,
+   .need_print = 0,
+   },
+};
+
+static btrfs_list_filter_func_v2 all_filter_funcs[];
+static btrfs_list_comp_func_v2 all_comp_funcs[];
+
+static void btrfs_list_setup_print_column_v2(enum btrfs_list_column_enum column)
+{
+   int i;
+
+   ASSERT(0 <= column && column <= BTRFS_LIST_ALL);
+
+   if (column < BTRFS_LIST_ALL) {
+   btrfs_list_columns[column].need_print = 1;
+   return;
+   }
+
+   for (i = 0; i < BTRFS_LIST_ALL; i++)
+   btrfs_list_columns[i].need_print = 1;
+}
+
+static int comp_entry_with_rootid_v2(const struct listed_subvol *entry1,
+ const struct listed_subvol *entry2,
+ int is_descending)
+{
+   int ret;
+
+   if (entry1->info.id > entry2->info.id)
+   ret = 1;
+   else if (entry1->info.id < entry2->info.id)
+   ret = -1;
+   else
+   ret = 0;
+
+   return is_descending ? -ret : ret;
+}
+
+static int comp_entry_with_gen_v2(const struct listed_subvol *entry1,
+  const struct listed_subvol *entry2,
+  int is_descending)
+{
+   int ret;
+
+   if (entry1->info.generation > entry2->info.generation)
+   ret = 1;
+   else if (entry1->info.generation < entry2->info.generation)
+   ret = -1;
+   else
+   ret = 0;
+
+   return is_descending ? -ret : ret;
+}
+
+static int comp_entry_with_ogen_v2(const struct listed_subvol *entry1,
+

[PATCH 08/11] btrfs-progs: utils: Fallback to open without O_NOATIME flag in find_mount_root():

2018-05-11 Thread Tomohiro Misono
The O_NOATIME flag requires that the effective UID of the process matches the
file's owner or that the process has the CAP_FOWNER capability. Fall back to
open without O_NOATIME so that a non-privileged user can also call
find_mount_root().
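
The same fallback can be factored into a small standalone helper, which may be useful outside find_mount_root(). The sketch below follows the logic of the hunk in this patch; the helper name open_rdonly_prefer_noatime() is made up for illustration, and the O_NOATIME guard is only there so the example still builds where the flag is not exposed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* exposes O_NOATIME in <fcntl.h> on Linux */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_NOATIME
#define O_NOATIME 0	/* flag not available here: degrade to a plain open */
#endif

/*
 * Open @path read-only, preferring O_NOATIME but retrying without it when
 * the kernel refuses the flag with EPERM. Returns an fd or -errno.
 */
static int open_rdonly_prefer_noatime(const char *path)
{
	int fd = open(path, O_RDONLY | O_NOATIME);

	if (fd < 0 && errno == EPERM)
		fd = open(path, O_RDONLY);

	return fd < 0 ? -errno : fd;
}
```

Only EPERM triggers the retry, so other failures such as ENOENT are still reported from the first attempt.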

This is preparatory work for allowing a non-privileged user to call
"subvolume show".

Signed-off-by: Tomohiro Misono 
---
 utils.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/utils.c b/utils.c
index d81d4980..84b81311 100644
--- a/utils.c
+++ b/utils.c
@@ -2048,6 +2048,9 @@ int find_mount_root(const char *path, char **mount_root)
char *longest_match = NULL;
 
fd = open(path, O_RDONLY | O_NOATIME);
+   if (fd < 0 && errno == EPERM)
+   fd = open(path, O_RDONLY);
+
if (fd < 0)
return -errno;
close(fd);
-- 
2.14.3




[PATCH 07/11] btrfs-progs: sub list: Change the default behavior of "subvolume list" and allow non-privileged user to call it

2018-05-11 Thread Tomohiro Misono
Change the default behavior of "subvolume list" and allow non-privileged
users to call it as well.

From this commit on, by default it lists only the subvolumes under the
specified path (including the path itself, except for the top-level
subvolume). Also, if the kernel supports the new ioctls
(BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF/
BTRFS_IOC_INO_LOOKUP_USER),
  - the specified path can be a non-subvolume directory.
  - non-privileged users can also call it.

Note that the root user can still list all the subvolumes in the filesystem
with the -a option (the same behavior as before).

Signed-off-by: Tomohiro Misono 
---
 Documentation/btrfs-subvolume.asciidoc |  2 +
 cmds-subvolume.c   | 90 +-
 2 files changed, 81 insertions(+), 11 deletions(-)

diff --git a/Documentation/btrfs-subvolume.asciidoc b/Documentation/btrfs-subvolume.asciidoc
index a8c4af4b..e03d4a6e 100644
--- a/Documentation/btrfs-subvolume.asciidoc
+++ b/Documentation/btrfs-subvolume.asciidoc
@@ -92,6 +92,7 @@ The output format is similar to *subvolume list* command.
 
 *list* [options] [-G [\+|-]] [-C [+|-]] 
[--sort=rootid,gen,ogen,path] ::
 List the subvolumes present in the filesystem .
+By default, this lists the subvolume under the specified path.
 +
 For every subvolume the following information is shown by default. +
 ID  top level  path  +
@@ -109,6 +110,7 @@ print parent ID.
 -a
 print all the subvolumes in the filesystem and distinguish between
 absolute and relative path with respect to the given .
+This requires root privileges.
 -c
 print the ogeneration of the subvolume, aliases: ogen or origin generation.
 -g
diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index 06686943..c3952172 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -1126,6 +1126,7 @@ out:
 }
 
 static struct subvol_list *btrfs_list_subvols(int fd,
+ int is_list_all,
   struct btrfs_list_filter_set_v2 *filter_set)
 {
struct subvol_list *subvols;
@@ -1133,6 +1134,7 @@ static struct subvol_list *btrfs_list_subvols(int fd,
struct btrfs_util_subvolume_iterator *iter;
enum btrfs_util_error err;
int ret = -1;
+   int tree_id = 0;
 
subvols = malloc(sizeof(*subvols));
if (!subvols) {
@@ -1141,8 +1143,11 @@ static struct subvol_list *btrfs_list_subvols(int fd,
}
subvols->num = 0;
 
+   if (is_list_all)
+   tree_id = BTRFS_FS_TREE_OBJECTID;
+
err = btrfs_util_create_subvolume_iterator_fd(fd,
- BTRFS_FS_TREE_OBJECTID, 0,
+ tree_id, 0,
  &iter);
if (err) {
iter = NULL;
@@ -1189,6 +1194,60 @@ static struct subvol_list *btrfs_list_subvols(int fd,
subvols->num++;
}
 
+   /*
+* Subvolume iterator does not include the information of the
+* specified path/fd. So, add it here.
+*/
+   if (!is_list_all) {
+   uint64_t id;
+   struct listed_subvol subvol;
+
+   err = btrfs_util_is_subvolume_fd(fd);
+   if (err != BTRFS_UTIL_OK) {
+   if (err == BTRFS_UTIL_ERROR_NOT_SUBVOLUME)
+   ret = 0;
+   goto out;
+   }
+   err = btrfs_util_subvolume_id_fd(fd, &id);
+   if (err)
+   goto out;
+   if (id == BTRFS_FS_TREE_OBJECTID) {
+   /* Skip top level subvolume */
+   ret = 0;
+   goto out;
+   }
+
+   err = btrfs_util_subvolume_info_fd(fd, 0, &subvol.info);
+   if (err)
+   goto out;
+
+   subvol.path = strdup(".");
+   if (!filters_match(&subvol, filter_set)) {
+   free(subvol.path);
+   } else {
+   if (subvols->num >= capacity) {
+   struct subvol_list *new_subvols;
+   size_t new_capacity =
+   max_t(size_t, 1, capacity * 2);
+
+   new_subvols = realloc(subvols,
+   sizeof(*new_subvols) +
+   new_capacity *
+   sizeof(new_subvols->subvols[0]));
+   if (!new_subvols) {
+   error("out of memory");
+   goto out;
+   }
+
+   subvols = new_subvols;
+   capacity = new_capacity;
+   }
+
+   subvols->subvols[subvols->num] = subvol;
+

[PATCH 09/11] btrfs-progs: sub show: Allow non-privileged user to call "subvolume show"

2018-05-11 Thread Tomohiro Misono
Allow non-privileged users to call "subvolume show" (without the -r or -u
options) if the new ioctls (BTRFS_IOC_GET_SUBVOL_INFO etc.) are available.
The behavior for the root user is the same as before.

There are some output differences between root and a non-privileged user:
  root ... the subvolume path is relative to the top-level subvolume;
           all snapshots in the filesystem are listed (including
           non-accessible ones)
  user ... the subvolume path is relative to the mount point;
           only snapshots under the mount point are listed
           (those to which the user has appropriate privileges)
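
For the non-root case, the path shown is the part of the resolved path that lies below the mount point. A minimal sketch of that computation, assuming both inputs are already canonical absolute paths; path_below_mount() is a hypothetical helper and not the function used in the patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated string holding the part of
 * 'fullpath' below 'mount_point', or "./" when both name the same directory.
 * Returns NULL if 'fullpath' is not under 'mount_point' or on allocation
 * failure. The caller frees the result.
 */
char *path_below_mount(const char *fullpath, const char *mount_point)
{
	size_t mlen, rlen;
	const char *rest;
	char *out;

	assert(fullpath != NULL && mount_point != NULL);

	mlen = strlen(mount_point);
	if (strncmp(fullpath, mount_point, mlen) != 0)
		return NULL;

	rest = fullpath + mlen;
	/* Reject "/mnt2/a" when the mount point is "/mnt". */
	if (mlen > 0 && mount_point[mlen - 1] != '/' &&
	    *rest != '\0' && *rest != '/')
		return NULL;

	/* Skip the separator between the mount point and the remainder. */
	while (*rest == '/')
		rest++;
	if (*rest == '\0')
		rest = "./";

	rlen = strlen(rest) + 1;
	out = malloc(rlen);
	if (out)
		memcpy(out, rest, rlen);
	return out;
}
```

With a mount point of "/mnt", "/mnt/a/b" yields "a/b" and "/mnt" itself yields "./", matching the "./" convention the patch uses for the subvolume at the mount point.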

Signed-off-by: Tomohiro Misono 
---
 cmds-subvolume.c | 90 
 1 file changed, 78 insertions(+), 12 deletions(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index c3952172..d88d5d76 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -1883,8 +1883,8 @@ static int cmd_subvol_find_new(int argc, char **argv)
 static const char * const cmd_subvol_show_usage[] = {
"btrfs subvolume show [options] |",
"Show more information about the subvolume",
-   "-r|--rootid   rootid of the subvolume",
-   "-u|--uuid uuid of the subvolume",
+   "-r|--rootid   rootid of the subvolume (require root privileges)",
+   "-u|--uuid uuid of the subvolume   (require root privileges)",
"",
"If no option is specified,  will be shown, otherwise",
"the rootid or uuid are resolved relative to the  path.",
@@ -1897,8 +1897,10 @@ static int cmd_subvol_show(int argc, char **argv)
char uuidparse[BTRFS_UUID_UNPARSED_SIZE];
char *fullpath = NULL;
int fd = -1;
+   int fd_mnt = -1;
int ret = 1;
DIR *dirstream1 = NULL;
+   DIR *dirstream_mnt = NULL;
int by_rootid = 0;
int by_uuid = 0;
u64 rootid_arg = 0;
@@ -1906,6 +1908,8 @@ static int cmd_subvol_show(int argc, char **argv)
struct btrfs_util_subvolume_iterator *iter;
struct btrfs_util_subvolume_info subvol;
char *subvol_path = NULL;
+   char *subvol_name = NULL;
+   char *mount_point = NULL;
enum btrfs_util_error err;
 
while (1) {
@@ -1943,6 +1947,11 @@ static int cmd_subvol_show(int argc, char **argv)
usage(cmd_subvol_show_usage);
}
 
+   if (!is_root() && (by_rootid || by_uuid)) {
+   error("Only root can use -r or -u options");
+   return -1;
+   }
+
fullpath = realpath(argv[optind], NULL);
if (!fullpath) {
error("cannot find real path for '%s': %m", argv[optind]);
@@ -1997,19 +2006,65 @@ static int cmd_subvol_show(int argc, char **argv)
goto out;
}
 
-   err = btrfs_util_subvolume_path_fd(fd, subvol.id, &subvol_path);
-   if (err) {
-   error_btrfs_util(err);
-   goto out;
+   if (is_root()) {
+   /* Construct path from top-level subvolume */
+   err = btrfs_util_subvolume_path_fd(fd, subvol.id,
+   &subvol_path);
+   if (err) {
+   error_btrfs_util(err);
+   goto out;
+   }
+   subvol_name = strdup(basename(subvol_path));
+   } else {
+   /* Construct path from mount point */
+   ret = find_mount_root(fullpath, &mount_point);
+   if (ret < 0) {
+   error("cannot get mount point");
+   goto out;
+   }
+
+   fd_mnt = open_file_or_dir(mount_point, &dirstream_mnt);
+   if (fd_mnt < 0) {
+   error("cannot open mount point");
+   goto out;
+   }
+
+   if (strlen(fullpath) == strlen(mount_point)) {
+   /* Get real name at mount point */
+   struct btrfs_ioctl_get_subvol_info_args arg;
+
+   ret = ioctl(fd_mnt, BTRFS_IOC_GET_SUBVOL_INFO,
+   &arg);
+   if (ret < 0) {
+   error("cannot get subvolume info");
+   goto out;
+   }
+   subvol_path = strdup("./");
+   subvol_name = strdup(arg.name);
+   } else {
+   subvol_path = malloc(strlen(fullpath) -
+   strlen(mount_point) + 1);
+   if (!subvol_path) {
+   error("not enough memory");
+   ret = 1;
+ 

[PATCH 01/11] btrfs-progs: ioctl/libbtrfsutil: Add 3 definitions of new unprivileged ioctl

2018-05-11 Thread Tomohiro Misono
Add definitions of three new unprivileged ioctls (BTRFS_IOC_GET_SUBVOL_INFO,
BTRFS_IOC_GET_SUBVOL_ROOTREF and BTRFS_IOC_INO_LOOKUP_USER). They will
be used to implement the unprivileged version of "btrfs subvolume list" etc.
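
Two of the new argument structures are sized to exactly 4096 bytes, which the patch enforces with BUILD_ASSERT. The arithmetic can be checked with a standalone sketch; the struct and macro names below are shortened stand-ins using <stdint.h> types rather than the real header definitions, and VOL_NAME_MAX mirrors the value 255 of BTRFS_VOL_NAME_MAX.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for BTRFS_VOL_NAME_MAX and the two macros added by the patch. */
#define VOL_NAME_MAX              255
#define INO_LOOKUP_USER_PATH_MAX  (4080 - VOL_NAME_MAX - 1)
#define MAX_ROOTREF_BUFFER_NUM    255

/* 8 + 8 + 256 + 3824 = 4096 bytes */
struct ino_lookup_user_args {
	uint64_t dirid;
	uint64_t subvolid;
	char name[VOL_NAME_MAX + 1];
	char path[INO_LOOKUP_USER_PATH_MAX];
};

/* 8 + 255 * 16 + 1 + 7 = 4096 bytes */
struct get_subvol_rootref_args {
	uint64_t min_id;
	struct {
		uint64_t subvolid;
		uint64_t dirid;
	} rootref[MAX_ROOTREF_BUFFER_NUM];
	uint8_t num_items;
	uint8_t align[7];
};

static_assert(sizeof(struct ino_lookup_user_args) == 4096,
	      "ino_lookup_user args must fill one 4 KiB page");
static_assert(sizeof(struct get_subvol_rootref_args) == 4096,
	      "get_subvol_rootref args must fill one 4 KiB page");
```

The explicit align[7] member pads num_items so that the total stays a multiple of 8 without relying on compiler-inserted tail padding.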

Signed-off-by: Tomohiro Misono 
---
 ioctl.h  | 86 
 libbtrfsutil/btrfs.h | 84 ++
 2 files changed, 170 insertions(+)

diff --git a/ioctl.h b/ioctl.h
index 709e996f..c6624352 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -320,6 +320,22 @@ struct btrfs_ioctl_ino_lookup_args {
 };
 BUILD_ASSERT(sizeof(struct btrfs_ioctl_ino_lookup_args) == 4096);
 
+#define BTRFS_INO_LOOKUP_USER_PATH_MAX (4080-BTRFS_VOL_NAME_MAX-1)
+struct btrfs_ioctl_ino_lookup_user_args {
+   /* in, inode number containing the subvolume of 'subvolid' */
+   __u64 dirid;
+   /* in */
+   __u64 subvolid;
+   /* out, name of the subvolume of 'subvolid' */
+   char name[BTRFS_VOL_NAME_MAX + 1];
+   /*
+* out, constructed path from the directory with which
+* the ioctl is called to dirid
+*/
+   char path[BTRFS_INO_LOOKUP_USER_PATH_MAX];
+};
+BUILD_ASSERT(sizeof(struct btrfs_ioctl_ino_lookup_user_args) == 4096);
+
 struct btrfs_ioctl_search_key {
/* which root are we searching.  0 is the tree of tree roots */
__u64 tree_id;
@@ -672,6 +688,70 @@ BUILD_ASSERT(sizeof(struct btrfs_ioctl_send_args_64) == 72);
 
 #define BTRFS_IOC_SEND_64_COMPAT_DEFINED 1
 
+struct btrfs_ioctl_get_subvol_info_args {
+   /* All fields are out */
+   /* Id of this subvolume */
+   __u64 id;
+   /* Name of this subvolume, used to get the real name at mount point */
+   char name[BTRFS_VOL_NAME_MAX + 1];
+   /*
+* Id of the subvolume which contains this subvolume.
+* Zero for top-level subvolume or deleted subvolume
+*/
+   __u64 parent_id;
+   /*
+* Inode number of the directory which contains this subvolume.
+* Zero for top-level subvolume or deleted subvolume
+*/
+   __u64 dirid;
+
+   /* Latest transaction id of this subvolume */
+   __u64 generation;
+   /* Flags of this subvolume */
+   __u64 flags;
+
+   /* uuid of this subvolume */
+   __u8 uuid[BTRFS_UUID_SIZE];
+   /*
+* uuid of the subvolume of which this subvolume is a snapshot.
+* All zero for non-snapshot subvolume
+*/
+   __u8 parent_uuid[BTRFS_UUID_SIZE];
+   /*
+* uuid of the subvolume from which this subvolume is received.
+* All zero for non-received subvolume
+*/
+   __u8 received_uuid[BTRFS_UUID_SIZE];
+
+   /* Transaction id indicates when change/create/send/receive happens */
+   __u64 ctransid;
+   __u64 otransid;
+   __u64 stransid;
+   __u64 rtransid;
+   /* Time corresponds to c/o/s/rtransid */
+   struct btrfs_ioctl_timespec ctime;
+   struct btrfs_ioctl_timespec otime;
+   struct btrfs_ioctl_timespec stime;
+   struct btrfs_ioctl_timespec rtime;
+
+   __u64 reserved[8];
+};
+
+#define BTRFS_MAX_ROOTREF_BUFFER_NUM 255
+struct btrfs_ioctl_get_subvol_rootref_args {
+   /* in/out, min id of rootref's subvolid to be searched */
+   __u64 min_id;
+   /* out */
+   struct {
+   __u64 subvolid;
+   __u64 dirid;
+   } rootref[BTRFS_MAX_ROOTREF_BUFFER_NUM];
+   /* out, number of found items */
+   __u8 num_items;
+   __u8 align[7];
+};
+BUILD_ASSERT(sizeof(struct btrfs_ioctl_get_subvol_rootref_args) == 4096);
+
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
notused,
@@ -828,6 +908,12 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
   struct btrfs_ioctl_feature_flags[3])
 #define BTRFS_IOC_RM_DEV_V2 _IOW(BTRFS_IOCTL_MAGIC, 58, \
   struct btrfs_ioctl_vol_args_v2)
+#define BTRFS_IOC_GET_SUBVOL_INFO _IOR(BTRFS_IOCTL_MAGIC, 60, \
+   struct btrfs_ioctl_get_subvol_info_args)
+#define BTRFS_IOC_GET_SUBVOL_ROOTREF _IOWR(BTRFS_IOCTL_MAGIC, 61, \
+   struct btrfs_ioctl_get_subvol_rootref_args)
+#define BTRFS_IOC_INO_LOOKUP_USER _IOWR(BTRFS_IOCTL_MAGIC, 62, \
+   struct btrfs_ioctl_ino_lookup_user_args)
 #ifdef __cplusplus
 }
 #endif
diff --git a/libbtrfsutil/btrfs.h b/libbtrfsutil/btrfs.h
index c293f6bf..451e227c 100644
--- a/libbtrfsutil/btrfs.h
+++ b/libbtrfsutil/btrfs.h
@@ -421,6 +421,21 @@ struct btrfs_ioctl_ino_lookup_args {
char name[BTRFS_INO_LOOKUP_PATH_MAX];
 };
 
+#define BTRFS_INO_LOOKUP_USER_PATH_MAX (4080-BTRFS_VOL_NAME_MAX-1)
+struct btrfs_ioctl_ino_lookup_user_args {
+   /* in, inode number containing the subvolume of 'subvolid' */
+   __u64 dirid;
+   /*

[PATCH v4 2/3] btrfs: Add unprivileged ioctl which returns subvolume's ROOT_REF

2018-05-11 Thread Tomohiro Misono
Add an unprivileged ioctl, BTRFS_IOC_GET_SUBVOL_ROOTREF, which returns the
ROOT_REF information of the subvolume containing this inode, except for the
subvolume name (which is omitted to prevent a potential name leak). The
subvolume name can be obtained with the user version of the ino_lookup
ioctl (BTRFS_IOC_INO_LOOKUP_USER), which also performs a permission check.

The minimum id of the root refs' subvolumes to be searched is specified by
@min_id in struct btrfs_ioctl_get_subvol_rootref_args. After the search
ends, @min_id is set to the last searched root ref's subvolid + 1. Also,
if there are more root refs than BTRFS_MAX_ROOTREF_BUFFER_NUM, -EOVERFLOW
is returned. Therefore the caller can simply call this ioctl again, without
changing the argument, to continue the search.
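
The continuation protocol described above can be illustrated without a btrfs filesystem. In the sketch below, mock_get_subvol_rootref() is a hypothetical stand-in for the ioctl that scans a sorted in-memory array of subvolume ids; only the min_id / num_items / -EOVERFLOW behavior is modelled, and the dirid value is a placeholder.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ROOTREF_BUFFER_NUM 255 /* mirrors BTRFS_MAX_ROOTREF_BUFFER_NUM */

struct rootref_args {
	uint64_t min_id; /* in/out */
	struct {
		uint64_t subvolid;
		uint64_t dirid;
	} rootref[MAX_ROOTREF_BUFFER_NUM]; /* out */
	uint8_t num_items; /* out */
	uint8_t align[7];
};

/* Hypothetical stand-in for the ioctl: 'ids' must be sorted ascending. */
int mock_get_subvol_rootref(const uint64_t *ids, size_t n,
			    struct rootref_args *args)
{
	uint8_t found = 0;
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		if (ids[i] < args->min_id)
			continue;
		if (found == MAX_ROOTREF_BUFFER_NUM) {
			ret = -EOVERFLOW; /* buffer full, more refs remain */
			break;
		}
		args->rootref[found].subvolid = ids[i];
		args->rootref[found].dirid = 256; /* placeholder value */
		found++;
	}

	args->num_items = found;
	if (found) /* update min_id for the next call */
		args->min_id = args->rootref[found - 1].subvolid + 1;
	return ret;
}

/* Caller side: re-issue the call, argument untouched, while it overflows. */
long count_all_rootrefs(const uint64_t *ids, size_t n)
{
	struct rootref_args args = { .min_id = 0 };
	long total = 0;
	int ret;

	do {
		ret = mock_get_subvol_rootref(ids, n, &args);
		if (ret < 0 && ret != -EOVERFLOW)
			return ret;
		total += args.num_items;
	} while (ret == -EOVERFLOW);

	assert(ret == 0);
	return total;
}
```

Because the callee advances min_id itself, the caller keeps no cursor of its own and only consumes num_items entries per round.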

Signed-off-by: Tomohiro Misono 
---
 fs/btrfs/ioctl.c   | 102 +
 include/uapi/linux/btrfs.h |  16 +++
 2 files changed, 118 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 64b23e22852f..7988d328aed5 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2369,6 +2369,106 @@ static noinline int btrfs_ioctl_get_subvol_info(struct file *file,
return ret;
 }
 
+/*
+ * Return ROOT_REF information of the subvolume containing this inode
+ * except the subvolume name.
+ */
+static noinline int btrfs_ioctl_get_subvol_rootref(struct file *file,
+  void __user *argp)
+{
+   struct btrfs_ioctl_get_subvol_rootref_args *rootrefs;
+   struct btrfs_root_ref *rref;
+   struct btrfs_root *root;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+
+   struct extent_buffer *l;
+   int slot;
+
+   struct inode *inode;
+   int i, nritems;
+   int ret;
+   u64 objectid;
+   u8 found;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   rootrefs = memdup_user(argp, sizeof(*rootrefs));
+   if (IS_ERR(rootrefs)) {
+   btrfs_free_path(path);
+   return PTR_ERR(rootrefs);
+   }
+
+   inode = file_inode(file);
+   root = BTRFS_I(inode)->root->fs_info->tree_root;
+   objectid = BTRFS_I(inode)->root->root_key.objectid;
+
+   key.objectid = objectid;
+   key.type = BTRFS_ROOT_REF_KEY;
+   key.offset = rootrefs->min_id;
+   found = 0;
+   while (1) {
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret < 0) {
+   goto out;
+   } else if (path->slots[0] >=
+   btrfs_header_nritems(path->nodes[0])) {
+   ret = btrfs_next_leaf(root, path);
+   if (ret < 0)
+   goto out;
+   }
+
+   l = path->nodes[0];
+   slot = path->slots[0];
+   nritems = btrfs_header_nritems(l);
+   if (nritems - slot == 0) {
+   ret = 0;
+   goto out;
+   }
+
+   for (i = slot; i < nritems; i++) {
+   btrfs_item_key_to_cpu(l, &key, i);
+   if (key.objectid != objectid ||
+   key.type != BTRFS_ROOT_REF_KEY) {
+   ret = 0;
+   goto out;
+   }
+
+   if (found == BTRFS_MAX_ROOTREF_BUFFER_NUM) {
+   ret = -EOVERFLOW;
+   goto out;
+   }
+
+   rref = btrfs_item_ptr(l, i, struct btrfs_root_ref);
+   rootrefs->rootref[found].subvolid = key.offset;
+   rootrefs->rootref[found].dirid =
+ btrfs_root_ref_dirid(l, rref);
+   found++;
+   }
+
+   btrfs_release_path(path);
+   key.offset++;
+   }
+
+out:
+   if (!ret || ret == -EOVERFLOW) {
+   rootrefs->num_items = found;
+   /* update min_id for next search */
+   if (found)
+   rootrefs->min_id =
+   rootrefs->rootref[found - 1].subvolid + 1;
+   if (copy_to_user(argp, rootrefs, sizeof(*rootrefs)))
+   ret = -EFAULT;
+   }
+
+   btrfs_free_path(path);
+   kfree(rootrefs);
+
+   return ret;
+}
+
 static noinline int btrfs_ioctl_snap_destroy(struct file *file,
 void __user *arg)
 {
@@ -5503,6 +5603,8 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_set_features(file, argp);
case BTRFS_IOC_GET_SUBVOL_INFO:
return btrfs_ioctl_get_subvol_info(file, argp);
+   case BTRFS_IOC_GET_SUBVOL_ROOTREF:
+   return btrfs_ioctl_get_subvol_rootref(file, argp);
}
 
return -ENOTTY;
diff --git a/includ

[PATCH v4 1/3] btrfs: Add unprivileged ioctl which returns subvolume information

2018-05-11 Thread Tomohiro Misono
Add a new unprivileged ioctl, BTRFS_IOC_GET_SUBVOL_INFO, which returns
information about the subvolume containing this inode
(i.e. the information in ROOT_ITEM and ROOT_BACKREF).

Signed-off-by: Tomohiro Misono 
---
 fs/btrfs/ioctl.c   | 129 +
 include/uapi/linux/btrfs.h |  51 ++
 2 files changed, 180 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 48e2ddff32bd..64b23e22852f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2242,6 +2242,133 @@ static noinline int btrfs_ioctl_ino_lookup(struct file *file,
return ret;
 }
 
+/* Get the subvolume information in BTRFS_ROOT_ITEM and BTRFS_ROOT_BACKREF */
+static noinline int btrfs_ioctl_get_subvol_info(struct file *file,
+  void __user *argp)
+{
+   struct btrfs_ioctl_get_subvol_info_args *subvol_info;
+   struct btrfs_root *root;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+
+   struct btrfs_root_item root_item;
+   struct btrfs_root_ref *rref;
+   struct extent_buffer *l;
+   int slot;
+
+   unsigned long item_off;
+   unsigned long item_len;
+
+   struct inode *inode;
+   int ret;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   subvol_info = kzalloc(sizeof(*subvol_info), GFP_KERNEL);
+   if (!subvol_info) {
+   btrfs_free_path(path);
+   return -ENOMEM;
+   }
+   inode = file_inode(file);
+
+   root = BTRFS_I(inode)->root->fs_info->tree_root;
+   key.objectid = BTRFS_I(inode)->root->root_key.objectid;
+   key.type = BTRFS_ROOT_ITEM_KEY;
+   key.offset = 0;
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret < 0) {
+   goto out;
+   } else if (ret > 0) {
+   u64 objectid = key.objectid;
+
+   if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+   ret = btrfs_next_leaf(root, path);
+   if (ret < 0)
+   goto out;
+   }
+
+   /* If the subvolume is a snapshot, offset is not zero */
+   btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+   if (key.objectid != objectid ||
+   key.type != BTRFS_ROOT_ITEM_KEY) {
+   ret = -ENOENT;
+   goto out;
+   }
+   }
+
+   l = path->nodes[0];
+   slot = path->slots[0];
+   item_off = btrfs_item_ptr_offset(l, slot);
+   item_len = btrfs_item_size_nr(l, slot);
+   read_extent_buffer(l, &root_item, item_off, item_len);
+
+   subvol_info->id = key.objectid;
+
+   subvol_info->generation = btrfs_root_generation(&root_item);
+   subvol_info->flags = btrfs_root_flags(&root_item);
+
+   memcpy(subvol_info->uuid, root_item.uuid, BTRFS_UUID_SIZE);
+   memcpy(subvol_info->parent_uuid, root_item.parent_uuid,
+   BTRFS_UUID_SIZE);
+   memcpy(subvol_info->received_uuid, root_item.received_uuid,
+   BTRFS_UUID_SIZE);
+
+   subvol_info->ctransid = btrfs_root_ctransid(&root_item);
+   subvol_info->ctime.sec = btrfs_stack_timespec_sec(&root_item.ctime);
+   subvol_info->ctime.nsec = btrfs_stack_timespec_nsec(&root_item.ctime);
+
+   subvol_info->otransid = btrfs_root_otransid(&root_item);
+   subvol_info->otime.sec = btrfs_stack_timespec_sec(&root_item.otime);
+   subvol_info->otime.nsec = btrfs_stack_timespec_nsec(&root_item.otime);
+
+   subvol_info->stransid = btrfs_root_stransid(&root_item);
+   subvol_info->stime.sec = btrfs_stack_timespec_sec(&root_item.stime);
+   subvol_info->stime.nsec = btrfs_stack_timespec_nsec(&root_item.stime);
+
+   subvol_info->rtransid = btrfs_root_rtransid(&root_item);
+   subvol_info->rtime.sec = btrfs_stack_timespec_sec(&root_item.rtime);
+   subvol_info->rtime.nsec = btrfs_stack_timespec_nsec(&root_item.rtime);
+
+   btrfs_release_path(path);
+   key.type = BTRFS_ROOT_BACKREF_KEY;
+   key.offset = 0;
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret < 0) {
+   goto out;
+   } else if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+   ret = btrfs_next_leaf(root, path);
+   if (ret < 0)
+   goto out;
+   }
+
+   l = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(l, &key, slot);
+   if (key.objectid == subvol_info->id &&
+   key.type == BTRFS_ROOT_BACKREF_KEY){
+   subvol_info->parent_id = key.offset;
+
+   rref = btrfs_item_ptr(l, slot, struct btrfs_root_ref);
+   subvol_info->dirid = btrfs_root_ref_dirid(l, rref);
+
+   item_off

[PATCH v4 3/3] btrfs: Add unprivileged version of ino_lookup ioctl

2018-05-11 Thread Tomohiro Misono
Add an unprivileged version of the ino_lookup ioctl, BTRFS_IOC_INO_LOOKUP_USER,
to allow normal users to call "btrfs subvolume list/show" etc. in
combination with BTRFS_IOC_GET_SUBVOL_INFO/BTRFS_IOC_GET_SUBVOL_ROOTREF.

This can be used like BTRFS_IOC_INO_LOOKUP, but the argument is
different. This is because it always searches the fs/file tree
corresponding to the fd with which this ioctl is called, and it also
returns the name of the bottom subvolume.

The main differences from the original ino_lookup ioctl are:
  1. Read + Exec permission will be checked using inode_permission()
     during path construction. -EACCES will be returned in case
     of failure.
  2. Path construction will be stopped at the inode number which
     corresponds to the fd with which this ioctl is called. If the
     constructed path does not exist under the fd's inode, -EACCES
     will be returned.
  3. The name of the bottom subvolume is also searched for and filled in.

Note that the maximum length of the path is 256 (BTRFS_VOL_NAME_MAX+1)
bytes shorter than with the ino_lookup ioctl, because that space is used
for the subvolume's name.
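
The path construction mentioned in points 1 and 2 walks from the deepest directory up towards the fd's inode, so each name is prepended at the tail of a fixed-size buffer. A simplified userspace sketch of that bottom-up technique follows; build_path_bottom_up() is a hypothetical helper, the directory names are supplied by the caller instead of being read from INODE_REF items, and no permission check is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: build "a/b/c/" at the tail of 'buf' by prepending
 * names leaf-first. names[0] is the deepest directory, names[n - 1] the one
 * nearest the top. On success, *out points at the start of the path inside
 * 'buf' and 0 is returned; otherwise a negative errno value is returned.
 */
int build_path_bottom_up(char *buf, size_t bufsize,
			 const char *const *names, size_t n,
			 const char **out)
{
	char *ptr;

	assert(buf != NULL && out != NULL);
	if (bufsize == 0)
		return -EINVAL;

	/* Start at the last byte and keep it as the terminating NUL. */
	ptr = &buf[bufsize - 1];
	*ptr = '\0';

	for (size_t i = 0; i < n; i++) {
		size_t len = strlen(names[i]);

		/* Check the remaining room before moving the pointer. */
		if ((size_t)(ptr - buf) < len + 1)
			return -ENAMETOOLONG;

		ptr -= len + 1;
		memcpy(ptr, names[i], len);
		ptr[len] = '/';
	}

	*out = ptr;
	return 0;
}
```

Unlike the kernel code, which decrements the pointer first and then compares it against the start of the buffer, this sketch tests the remaining space beforehand so that the pointer never leaves the array.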

Signed-off-by: Tomohiro Misono 
---
 fs/btrfs/ioctl.c   | 204 +
 include/uapi/linux/btrfs.h |  17 
 2 files changed, 221 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7988d328aed5..e326a85134f4 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2200,6 +2200,166 @@ static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info,
return ret;
 }
 
+static noinline int btrfs_search_path_in_tree_user(struct inode *inode,
+   struct btrfs_ioctl_ino_lookup_user_args *args)
+{
+   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+   struct super_block *sb = inode->i_sb;
+   struct btrfs_key upper_limit = BTRFS_I(inode)->location;
+   u64 treeid = BTRFS_I(inode)->root->root_key.objectid;
+   u64 dirid = args->dirid;
+
+   unsigned long item_off;
+   unsigned long item_len;
+   struct btrfs_inode_ref *iref;
+   struct btrfs_root_ref *rref;
+   struct btrfs_root *root;
+   struct btrfs_path *path;
+   struct btrfs_key key, key2;
+   struct extent_buffer *l;
+   struct inode *temp_inode;
+   char *ptr;
+   int slot;
+   int len;
+   int total_len = 0;
+   int ret = -1;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   /*
+* If the bottom subvolume does not exist directly under upper_limit,
+* construct the path in a bottom-up manner.
+*/
+   if (dirid != upper_limit.objectid) {
+   ptr = &args->path[BTRFS_INO_LOOKUP_USER_PATH_MAX - 1];
+
+   key.objectid = treeid;
+   key.type = BTRFS_ROOT_ITEM_KEY;
+   key.offset = (u64)-1;
+   root = btrfs_read_fs_root_no_name(fs_info, &key);
+   if (IS_ERR(root)) {
+   ret = -ENOENT;
+   goto out;
+   }
+
+   key.objectid = dirid;
+   key.type = BTRFS_INODE_REF_KEY;
+   key.offset = (u64)-1;
+   while (1) {
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret < 0) {
+   goto out;
+   } else if (ret > 0) {
+   ret = btrfs_previous_item(root, path, dirid,
+ BTRFS_INODE_REF_KEY);
+   if (ret < 0) {
+   goto out;
+   } else if (ret > 0) {
+   ret = -ENOENT;
+   goto out;
+   }
+   }
+
+   l = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(l, &key, slot);
+
+   iref = btrfs_item_ptr(l, slot, struct btrfs_inode_ref);
+   len = btrfs_inode_ref_name_len(l, iref);
+   ptr -= len + 1;
+   total_len += len + 1;
+   if (ptr < args->path) {
+   ret = -ENAMETOOLONG;
+   goto out;
+   }
+
+   *(ptr + len) = '/';
+   read_extent_buffer(l, ptr,
+   (unsigned long)(iref + 1), len);
+
+   /* Check the read+exec permission of this directory */
+   ret = btrfs_previous_item(root, path, dirid,
+ BTRFS_INODE_ITEM_KEY);
+   if (ret < 0) {
+   goto out;
+   } else if (ret > 0) {
+   ret = -ENOENT;
+   
