Re: [PATCH] Btrfs: remove unnecessary code of chunk_root assignment in btrfs_read_chunk_tree.

2016-09-08 Thread Sean Fu
On Thu, Sep 08, 2016 at 11:25:48PM -0400, Jeff Mahoney wrote:
> On 9/8/16 11:08 PM, Sean Fu wrote:
> > On Tue, Sep 06, 2016 at 11:12:20AM -0400, Jeff Mahoney wrote:
> >> On 9/6/16 5:58 AM, David Sterba wrote:
> >>> On Mon, Sep 05, 2016 at 11:13:40PM -0400, Jeff Mahoney wrote:
> >>>>>> Since root is only used to get fs_info->chunk_root, why not use fs_info
> >>>>>> directly?
> >>>>>
> >>>>> Weird.  Exactly this was a part of my fs_info patchset.  I guess I need
> >>>>> to go back and check what else is missing.
> >>>>
> >>>> Actually, most of this didn't land.  Pretty much anything that's a root
> >>>> ->fs_info conversion is in there.
> >>>
> >>> Only half of the patchset has been merged so far because it did not pass
> >>> testing, so I bisected to some point. I was about to let you know once
> >>> most of 4.9 patches are prepared so there are less merge conflicts.
> >>
> >> Ok, thanks.  I was going to start the rebase today but I'll hold off
> >> until you're set for 4.9.
> >>
> > Hi Jeff, Could you please share your patch? Where can i get it?
> > I wanna have a look at it.
> 
> Sure, it's the whole series that starts with this commit:
> commit 160ceedfd40085cfb1e08305917fcc24cefdad93
> Author: Jeff Mahoney 
> Date:   Wed Aug 31 23:55:33 2016 -0400
> 
> btrfs: add dynamic debug support
> 
> ... I still need to clean up some commits that need merging.
> 
> https://git.kernel.org/cgit/linux/kernel/git/jeffm/linux-btrfs.git/log/?h=btrfs-testing/kdave/misc-4.9/root-fsinfo-cleanup
>
Nice work.
Thanks

> -Jeff
> 
> 
> > Thanks
> >> -Jeff
> >>
> >> -- 
> >> Jeff Mahoney
> >> SUSE Labs
> >>
> > 
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
> -- 
> Jeff Mahoney
> SUSE Labs
> 



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: remove unnecessary code of chunk_root assignment in btrfs_read_chunk_tree.

2016-09-08 Thread Jeff Mahoney
On 9/8/16 11:08 PM, Sean Fu wrote:
> On Tue, Sep 06, 2016 at 11:12:20AM -0400, Jeff Mahoney wrote:
>> On 9/6/16 5:58 AM, David Sterba wrote:
>>> On Mon, Sep 05, 2016 at 11:13:40PM -0400, Jeff Mahoney wrote:
>>>>>> Since root is only used to get fs_info->chunk_root, why not use fs_info
>>>>>> directly?
>>>>>
>>>>> Weird.  Exactly this was a part of my fs_info patchset.  I guess I need
>>>>> to go back and check what else is missing.
>>>>
>>>> Actually, most of this didn't land.  Pretty much anything that's a root
>>>> ->fs_info conversion is in there.
>>>
>>> Only half of the patchset has been merged so far because it did not pass
>>> testing, so I bisected to some point. I was about to let you know once
>>> most of 4.9 patches are prepared so there are less merge conflicts.
>>
>> Ok, thanks.  I was going to start the rebase today but I'll hold off
>> until you're set for 4.9.
>>
> Hi Jeff, Could you please share your patch? Where can i get it?
> I wanna have a look at it.

Sure, it's the whole series that starts with this commit:
commit 160ceedfd40085cfb1e08305917fcc24cefdad93
Author: Jeff Mahoney 
Date:   Wed Aug 31 23:55:33 2016 -0400

btrfs: add dynamic debug support

... I still need to clean up some commits that need merging.

https://git.kernel.org/cgit/linux/kernel/git/jeffm/linux-btrfs.git/log/?h=btrfs-testing/kdave/misc-4.9/root-fsinfo-cleanup

-Jeff


> Thanks
>> -Jeff
>>
>> -- 
>> Jeff Mahoney
>> SUSE Labs
>>
> 
> 
> 
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [PATCH] Btrfs: remove unnecessary code of chunk_root assignment in btrfs_read_chunk_tree.

2016-09-08 Thread Sean Fu
On Tue, Sep 06, 2016 at 11:12:20AM -0400, Jeff Mahoney wrote:
> On 9/6/16 5:58 AM, David Sterba wrote:
> > On Mon, Sep 05, 2016 at 11:13:40PM -0400, Jeff Mahoney wrote:
> >>>> Since root is only used to get fs_info->chunk_root, why not use fs_info
> >>>> directly?
> >>>
> >>> Weird.  Exactly this was a part of my fs_info patchset.  I guess I need
> >>> to go back and check what else is missing.
> >>
> >> Actually, most of this didn't land.  Pretty much anything that's a root
> >> ->fs_info conversion is in there.
> > 
> > Only half of the patchset has been merged so far because it did not pass
> > testing, so I bisected to some point. I was about to let you know once
> > most of 4.9 patches are prepared so there are less merge conflicts.
> 
> Ok, thanks.  I was going to start the rebase today but I'll hold off
> until you're set for 4.9.
> 
Hi Jeff, Could you please share your patch? Where can i get it?
I wanna have a look at it.

Thanks
> -Jeff
> 
> -- 
> Jeff Mahoney
> SUSE Labs
> 





Re: lockdep warning in btrfs in 4.8-rc3

2016-09-08 Thread Dave Jones
On Thu, Sep 08, 2016 at 08:58:48AM -0400, Chris Mason wrote:
 > On 09/08/2016 07:50 AM, Christian Borntraeger wrote:
 > > On 09/08/2016 01:48 PM, Christian Borntraeger wrote:
 > >> Chris,
 > >>
 > >> with 4.8-rc3 I get the following on an s390 box:
 > >
 > > Sorry for the noise, just saw the fix in your pull request.
 > >
 > 
 > The lockdep splat is still there, we'll need to annotate this one a little.

Here's another one (unrelated?) that I've not seen before today:

WARNING: CPU: 1 PID: 10664 at kernel/locking/lockdep.c:704 register_lock_class+0x33f/0x510
CPU: 1 PID: 10664 Comm: kworker/u8:5 Not tainted 4.8.0-rc5-think+ #2 
Workqueue: writeback wb_workfn (flush-btrfs-1)
 0097 b97fbad3 88013b8c3770 a63d3ab1
   a6bf1792 a60df22f
 88013b8c37b0 a60897a0 02c0b97fbad3 a6bf1792
Call Trace:
 [] dump_stack+0x6c/0x9b
 [] ? register_lock_class+0x33f/0x510
 [] __warn+0x110/0x130
 [] warn_slowpath_null+0x2c/0x40
 [] register_lock_class+0x33f/0x510
 [] ? bio_add_page+0x7e/0x120
 [] __lock_acquire.isra.32+0x5b/0x8c0
 [] lock_acquire+0x58/0x70
 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] _raw_write_lock+0x38/0x70
 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] lock_extent_buffer_for_io+0x28/0x2e0 [btrfs]
 [] btree_write_cache_pages+0x231/0x550 [btrfs]
 [] ? btree_set_page_dirty+0x20/0x20 [btrfs]
 [] btree_writepages+0x74/0x90 [btrfs]
 [] do_writepages+0x3e/0x80
 [] __writeback_single_inode+0x42/0x220
 [] writeback_sb_inodes+0x351/0x730
 [] ? __wb_update_bandwidth+0x1c1/0x2b0
 [] wb_writeback+0x138/0x2a0
 [] wb_workfn+0x10e/0x340
 [] ? __lock_acquire.isra.32+0x1cf/0x8c0
 [] process_one_work+0x24f/0x5d0
 [] ? process_one_work+0x1e0/0x5d0
 [] worker_thread+0x53/0x5b0
 [] ? process_one_work+0x5d0/0x5d0
 [] kthread+0x120/0x140
 [] ? finish_task_switch+0x6a/0x200
 [] ret_from_fork+0x1f/0x40
 [] ? kthread_create_on_node+0x270/0x270
---[ end trace 7b39395c07435bf1 ]---


 700 /*
 701  * Huh! same key, different name? Did someone trample
 702  * on some memory? We're most confused.
 703  */
 704 WARN_ON_ONCE(class->name != lock->name);

That seems kinda scary. There was a trinity run going on at the same time,
so this _might_ be a random scribble from something unrelated to btrfs,
but just in case..

IWBNI that code printed out both cases so I could see if this was
corruption or two unrelated keys. I'll make it do that in case it
happens again.

Dave


Re: Another 4.8-rc locked splat: btrfs_close_devices()

2016-09-08 Thread Anand Jain


Thanks for the report, Ilya.

Yep. I have seen similar issues during the hotspare fixes as well,
where the VFS call to btrfs_show_devname() and its device_list_mutex
lock conflict. One of them is fixed here:

--
779bf3fefa835cb52a07457c8acac6f2f66f2493
btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex

--

I was kind of expecting this here as well when I wrote 142388194191.
However, I couldn't reproduce it.

To fix this permanently, I see the following choices.

Chris/David,

 1. Do you think device_list_mutex is needed in btrfs_show_devname(),
 or should RCU suffice?

 2. To me, the role of fs_info->volume_mutex can be taken over by
 device_list_mutex. Any idea if I am missing something?

Thanks, Anand


On 09/08/2016 10:34 PM, Ilya Dryomov wrote:

Hello,

This one seems to have appeared after Anand's commit
142388194191 ("btrfs: do not background blkdev_put()") got merged into
4.8-rc4.

Thanks,

Ilya




Re: [PATCH 3/3] ioctl_xfs_ioc_getfsmap.2: document XFS_IOC_GETFSMAP ioctl

2016-09-08 Thread Dave Chinner
On Tue, Aug 30, 2016 at 12:09:49PM -0700, Darrick J. Wong wrote:
> > I recall for FIEMAP that some filesystems may not have files aligned
> > to sector offsets, and we just used byte offsets.  Storage like
> > NVDIMMs are cacheline granular, so I don't think it makes sense to
> > tie this to old disk sector sizes.  Alternately, the units could be
> > in terms of fs blocks as returned by statvfs.st_bsize, but mixing
> > units for fmv_block, fmv_offset, fmv_length is unneeded complexity.
> 
> Ugh.  I'd rather just change the units to bytes rather than force all
> the users to multiply things. :)

Yup, units need to be either in disk addresses (i.e. 512 byte units)
or bytes. If people can't handle disk addresses (seems to be the
case), then bytes it should be.

> I'd much rather just add more special owner codes for any other
> filesystem that has distinguishable metadata types that are not
> covered by the existing OWN_ codes.  We /do/ have 2^64 possible
> values, so it's not like we're going to run out.

This is diagnostic information as much as anything, just like
fiemap is diagnostic information. So if we have specific type
information, it needs to be reported accurately to be useful.

Hence I really don't care if the users and developers of other fs
types don't understand what the special owner codes that a specific
filesystem returns mean. i.e. it's not useful user information -
only a tool that groks the specific filesystem is going to be able
to do anything useful with special owner codes. So, IMO, there's little
point trying to make them generic or even trying to define and
explain them in the man page.

> > It seems like there are several fields in the structure that are used for
> > only input or only output?  Does it make more sense to have one structure
> > used only for the input request, and then the array of values returned be
> > in a different structure?  I'm not necessarily requesting that it be changed,
> > but it definitely is something I noticed a few times while reading this doc.
> 
> I've been thinking about rearranging this a bit, since the flags
> handling is very awkward with the current array structure.  Each
> rmap has its own flags; we may someday want to pass operation flags
> into the ioctl; and we currently have one operation flag to pass back
> to userspace.  Each of those flags can be a separate field.  I think
> people will get confused about FMV_OF_* and FMV_HOF_* being referenced
> in oflags, and iflags has no meaning for returned records.

Yup, that's what I initially noticed when I glanced at this. The XFS
getbmap interface is just plain nasty, and we shouldn't be copying
that API pattern if we can help it.

> So, this instead?
> 
> struct getfsmap_rec {
>   u32 device; /* device id */
>   u32 flags;  /* mapping flags */
>   u64 block;  /* physical addr, bytes */
>   u64 owner;  /* inode or special owner code */
>   u64 offset; /* file offset of mapping, bytes */
>   u64 length; /* length of segment, bytes */
>   u64 reserved;   /* will be set to zero */
> }; /* 48 bytes */
> 
> struct getfsmap_head {
>   u32 iflags; /* none defined yet */
>   u32 oflags; /* FMV_HOF_DEV_T */
>   u32 count;  /* # entries in recs array */
>   u32 entries;/* # entries filled in (output) */
>   u64 reserved[2];/* must be zero */
> 
>   struct getfsmap_rec keys[2]; /* low and high keys for the mapping search */
>   struct getfsmap_rec recs[0];
> }; /* 32 bytes + 2*48 = 128 bytes */
> 
> #define XFS_IOC_GETFSMAP  _IOWR('X', 59, struct getfsmap_head)
> 
> This also means that userspace can set up for the next ioctl
> invocation with memcpy(&head->keys[0], &head->recs[head->entries - 1]).
> 
> Yes, I think I like this better.  Everyone else, please chime in. :)

That's pretty much the structure I was going to suggest - it matches
the fiemap pattern. i.e control parameters are separated from record
data. I'd dump a bit more reserved space in the structure, though;
we've got heaps of flag space for future expansion, but if we need
to pass new parameters into/out of the kernel we'll quickly use the
reserved space.
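
For reference, the size annotations in the proposed layout (48 bytes per
record, 128 bytes for the header) can be checked with a minimal sketch. It
mirrors the quoted getfsmap_rec and getfsmap_head definitions using Python's
ctypes. The class names and the next_low_key() helper are illustrative
assumptions, not part of any kernel or userspace API, and the trailing
recs[0] flexible array is left out because it adds no size.

```python
import ctypes


class GetfsmapRec(ctypes.Structure):
    """Mirror of the proposed struct getfsmap_rec (hypothetical Python name)."""
    _fields_ = [
        ("device", ctypes.c_uint32),    # device id
        ("flags", ctypes.c_uint32),     # mapping flags
        ("block", ctypes.c_uint64),     # physical addr, bytes
        ("owner", ctypes.c_uint64),     # inode or special owner code
        ("offset", ctypes.c_uint64),    # file offset of mapping, bytes
        ("length", ctypes.c_uint64),    # length of segment, bytes
        ("reserved", ctypes.c_uint64),  # will be set to zero
    ]


class GetfsmapHead(ctypes.Structure):
    """Mirror of the proposed struct getfsmap_head, without the recs[] tail."""
    _fields_ = [
        ("iflags", ctypes.c_uint32),
        ("oflags", ctypes.c_uint32),
        ("count", ctypes.c_uint32),
        ("entries", ctypes.c_uint32),
        ("reserved", ctypes.c_uint64 * 2),
        ("keys", GetfsmapRec * 2),      # low and high keys for the search
    ]


REC_SIZE = ctypes.sizeof(GetfsmapRec)
HEAD_SIZE = ctypes.sizeof(GetfsmapHead)


def next_low_key(head, last_rec):
    """Copy the last returned record into keys[0] for the next call."""
    head.keys[0] = last_rec
    return head


if __name__ == "__main__":
    print("record size:", REC_SIZE)
    print("header size:", HEAD_SIZE)
```

next_low_key() corresponds to the memcpy() idea in the quoted proposal: the
last record returned becomes the low key of the next invocation.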

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: BTRFS constantly reports "No space left on device" even with a huge unallocated space

2016-09-08 Thread Jeff Mahoney
On 9/8/16 2:49 PM, Jeff Mahoney wrote:
> On 9/8/16 2:24 PM, Ronan Arraes Jardim Chagas wrote:
>> Hi all!
>>
>> On Mon, 2016-09-05 at 16:49 +0800, Qu Wenruo wrote:
>>> Just like what Wang has mentioned, would you please paste all the
>>> output 
>>> of the contents of /sys/fs/btrfs//allocation?
>>>
>>> It's recommended to use "grep . -IR " to get all the data as
>>> it 
>>> will show the file name.
>>
>> So, one more time, I see the problem. This time I was just using
>> Firefox and I cannot recover using `btrfs balance`. I think that, one
>> more time, I will need to reboot this machine. This problem is really
>> causing me a lot of troubles :(
> 
> I have a hunch the list is about to be flooded with similar reports if
> we don't find this one before 4.8.
> 
> commit d555b6c380c644af63dbdaa7cc14bba041a4e4dd
> Author: Josef Bacik 
> Date:   Fri Mar 25 13:25:51 2016 -0400
> 
> Btrfs: warn_on for unaccounted spaces
> 
> This commit isn't the source of the bug, but it's making it a lot more
> noisy.  I spent a few hours last night trying to track down why xfstests
> was throwing these warnings and I was able to reproduce them at least as
> far back as 4.4-vanilla with -oenospc_debug enabled.
> 
> Speaking of which, can you turn on mounting with -oenospc_debug if you
> haven't already?
> 
> In my case, space_info->bytes_may_use was getting accounted incorrectly.
> 
> I am able to reproduce that even with the following commit:
> commit 18513091af9483ba84328d42092bd4d42a3c958f
> Author: Wang Xiaoguang 
> Date:   Mon Jul 25 15:51:40 2016 +0800
> 
> btrfs: update btrfs_space_info's bytes_may_use timely

And the btrfs_free_reserved_data_space_noquota WARN_ON I was seeing is
fixed by:

commit ed7a6948394305b810d0c6203268648715e5006f
Author: Wang Xiaoguang 
Date:   Fri Aug 26 11:33:14 2016 +0800

btrfs: do not decrease bytes_may_use when replaying extents

... which shouldn't change anything for your issue, unfortunately.

I still see these:
WARNING: CPU: 2 PID: 8166 at ../fs/btrfs/extent-tree.c:9582
btrfs_free_block_groups+0x2a8/0x400 [btrfs]()
Modules linked in: loop dm_flakey af_packet iscsi_ibft iscsi_boot_sysfs
msr ext4 crc16 mbcache jbd2 ipmi_ssif dm_mod igb ptp pps_core
acpi_cpufreq tpm_infineon kvm_amd ipmi_si kvm dca pcspkr ipmi_msghandler
8250_fintek sp5100_tco fjes irqbypass i2c_piix4 shpchp processor button
amd64_edac_mod edac_mce_amd edac_core k10temp btrfs xor raid6_pq sd_mod
ata_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
ohci_pci sysimgblt ehci_pci serio_raw ohci_hcd fb_sys_fops pata_atiixp
ehci_hcd ttm ahci libahci drm usbcore libata usb_common sg scsi_mod autofs4
CPU: 2 PID: 8166 Comm: umount Tainted: GW
4.4.19-11.g81405db-vanilla #1
Hardware name: HP ProLiant DL165 G7, BIOS O37 10/17/2012
  880230317d10 813170ec 
 a0472528 880230317d48 8107d816 
 88009ab03600 8800ba106288 8800ab75a000 8800ba106200
Call Trace:
 [] dump_stack+0x63/0x87
 [] warn_slowpath_common+0x86/0xc0
 [] warn_slowpath_null+0x1a/0x20
 [] btrfs_free_block_groups+0x2a8/0x400 [btrfs]
 [] close_ctree+0x15b/0x330 [btrfs]
 [] btrfs_put_super+0x19/0x20 [btrfs]
 [] generic_shutdown_super+0x6f/0x100
 [] kill_anon_super+0x12/0x20
 [] btrfs_kill_super+0x18/0x120 [btrfs]
 [] deactivate_locked_super+0x43/0x70
 [] deactivate_super+0x46/0x60
 [] cleanup_mnt+0x3f/0x80
 [] __cleanup_mnt+0x12/0x20
 [] task_work_run+0x86/0xb0
 [] exit_to_usermode_loop+0x73/0xa2
 [] syscall_return_slowpath+0x8d/0xa0
 [] int_ret_from_sys_call+0x25/0x8f
---[ end trace 09a0cc2892b6305c ]---
BTRFS: space_info 1 has 7946240 free, is not full
BTRFS: space_info total=8388608, used=442368, pinned=0, reserved=0,
may_use=4096, readonly=0

... where the value of may_use varies.
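
The numbers in the dump above are at least self-consistent. A small sketch,
assuming the "free" figure is total minus used, pinned, reserved and readonly
(this is inferred from the logged values, not checked against the kernel
source), shows that may_use is the only reservation counter left nonzero at
unmount. The function names are made up for illustration.

```python
import re

# The space_info dump exactly as it appears in the log above.
LOG = (
    "BTRFS: space_info total=8388608, used=442368, pinned=0, reserved=0,\n"
    "may_use=4096, readonly=0"
)


def parse_space_info(log):
    """Extract the key=value counters from a space_info dump line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", log)}


def free_bytes(counters):
    """Assumed formula: may_use is not subtracted from the free figure."""
    return (counters["total"] - counters["used"] - counters["pinned"]
            - counters["reserved"] - counters["readonly"])


def leftover_reservations(counters):
    """Counters that would be expected to be zero once the fs is unmounted."""
    return [name for name in ("pinned", "reserved", "may_use")
            if counters[name] > 0]


if __name__ == "__main__":
    counters = parse_space_info(LOG)
    print("free:", free_bytes(counters))
    print("leftover:", leftover_reservations(counters))
```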

-Jeff

> 
>> grep . -IR /sys/fs/btrfs/e9efaa0c-d477-4249-830f-ee5956768b29/allocation
>> allocation/data/flags:1
>> allocation/data/bytes_pinned:0
>> allocation/data/bytes_may_use:0
>> allocation/data/total_bytes_pinned:202973265920
> 
> That adds up to ~ 189 GB.  total_bytes is only about 42 GB.
> 
>> allocation/data/bytes_reserved:0
>> allocation/data/bytes_used:45623730176
>> allocation/data/single/used_bytes:45623730176
>> allocation/data/single/total_bytes:46179287040
>> allocation/data/total_bytes:46179287040
>> allocation/data/disk_total:46179287040
>> allocation/data/disk_used:45623730176
>> allocation/metadata/dup/used_bytes:1120698368
>> allocation/metadata/dup/total_bytes:6979321856
>> allocation/metadata/flags:4
>> allocation/metadata/bytes_pinned:0
>> allocation/metadata/bytes_may_use:88521768960
>> allocation/metadata/total_bytes_pinned:-44285952
> 
> ... well that's certainly interesting.  It looks like we'll need to see
> how that happened.  It seems like we've messed up at least that portion
> of accounting.
> 
> -Jeff
> 
>> allocation/metadata/bytes_reserved:0
>> 

Re: BTRFS constantly reports "No space left on device" even with a huge unallocated space

2016-09-08 Thread Jeff Mahoney
On 9/8/16 2:24 PM, Ronan Arraes Jardim Chagas wrote:
> Hi all!
> 
> On Mon, 2016-09-05 at 16:49 +0800, Qu Wenruo wrote:
>> Just like what Wang has mentioned, would you please paste all the
>> output 
>> of the contents of /sys/fs/btrfs//allocation?
>>
>> It's recommended to use "grep . -IR " to get all the data as
>> it 
>> will show the file name.
> 
> So, one more time, I see the problem. This time I was just using
> Firefox and I cannot recover using `btrfs balance`. I think that, one
> more time, I will need to reboot this machine. This problem is really
> causing me a lot of troubles :(

I have a hunch the list is about to be flooded with similar reports if
we don't find this one before 4.8.

commit d555b6c380c644af63dbdaa7cc14bba041a4e4dd
Author: Josef Bacik 
Date:   Fri Mar 25 13:25:51 2016 -0400

Btrfs: warn_on for unaccounted spaces

This commit isn't the source of the bug, but it's making it a lot more
noisy.  I spent a few hours last night trying to track down why xfstests
was throwing these warnings and I was able to reproduce them at least as
far back as 4.4-vanilla with -oenospc_debug enabled.

Speaking of which, can you turn on mounting with -oenospc_debug if you
haven't already?

In my case, space_info->bytes_may_use was getting accounted incorrectly.

I am able to reproduce that even with the following commit:
commit 18513091af9483ba84328d42092bd4d42a3c958f
Author: Wang Xiaoguang 
Date:   Mon Jul 25 15:51:40 2016 +0800

btrfs: update btrfs_space_info's bytes_may_use timely


> grep . -IR /sys/fs/btrfs/e9efaa0c-d477-4249-830f-ee5956768b29/allocation
> allocation/data/flags:1
> allocation/data/bytes_pinned:0
> allocation/data/bytes_may_use:0
> allocation/data/total_bytes_pinned:202973265920

That adds up to ~ 189 GB.  total_bytes is only about 42 GB.

> allocation/data/bytes_reserved:0
> allocation/data/bytes_used:45623730176
> allocation/data/single/used_bytes:45623730176
> allocation/data/single/total_bytes:46179287040
> allocation/data/total_bytes:46179287040
> allocation/data/disk_total:46179287040
> allocation/data/disk_used:45623730176
> allocation/metadata/dup/used_bytes:1120698368
> allocation/metadata/dup/total_bytes:6979321856
> allocation/metadata/flags:4
> allocation/metadata/bytes_pinned:0
> allocation/metadata/bytes_may_use:88521768960
> allocation/metadata/total_bytes_pinned:-44285952

... well that's certainly interesting.  It looks like we'll need to see
how that happened.  It seems like we've messed up at least that portion
of accounting.

-Jeff

> allocation/metadata/bytes_reserved:0
> allocation/metadata/bytes_used:1120698368
> allocation/metadata/total_bytes:6979321856
> allocation/metadata/disk_total:13958643712
> allocation/metadata/disk_used:2241396736
> allocation/global_rsv_size:385875968
> allocation/global_rsv_reserved:385875968
> allocation/system/dup/used_bytes:16384
> allocation/system/dup/total_bytes:33554432
> allocation/system/flags:2
> allocation/system/bytes_pinned:0
> allocation/system/bytes_may_use:0
> allocation/system/total_bytes_pinned:0
> allocation/system/bytes_reserved:0
> allocation/system/bytes_used:16384
> allocation/system/total_bytes:33554432
> allocation/system/disk_total:67108864
> allocation/system/disk_used:32768
> 
> Additional information:
> 
> btrfs fi usage /
> Overall:
> Device size: 1.26TiB
> Device allocated:   56.07GiB
> Device unallocated:  1.20TiB
> Device missing:0.00B
> Used:   44.58GiB
> Free (estimated):1.20TiB  (min: 616.41GiB)
> Data ratio: 1.00
> Metadata ratio: 2.00
> Global reserve:368.00MiB  (used: 0.00B)
> 
> Data,single: Size:43.01GiB, Used:42.49GiB
>/dev/sda643.01GiB
> 
> Metadata,DUP: Size:6.50GiB, Used:1.04GiB
>/dev/sda613.00GiB
> 
> System,DUP: Size:32.00MiB, Used:16.00KiB
>/dev/sda664.00MiB
> 
> Unallocated:
>/dev/sda6 1.20TiB
> 
> Can anyone help me?
> 
> Best regards,
> Ronan Arraes
> 


-- 
Jeff Mahoney
SUSE Labs





Re: BTRFS constantly reports "No space left on device" even with a huge unallocated space

2016-09-08 Thread Ronan Arraes Jardim Chagas
Hi all!

On Mon, 2016-09-05 at 16:49 +0800, Qu Wenruo wrote:
> Just like what Wang has mentioned, would you please paste all the
> output 
> of the contents of /sys/fs/btrfs//allocation?
> 
> It's recommended to use "grep . -IR " to get all the data as
> it 
> will show the file name.

So, one more time, I see the problem. This time I was just using
Firefox and I cannot recover using `btrfs balance`. I think that, one
more time, I will need to reboot this machine. This problem is really
causing me a lot of troubles :(

I have disabled the quotas and the first error message after the
problem was:

[ 2444.592255] ------------[ cut here ]------------
[ 2444.592314] WARNING: CPU: 4 PID: 289 at ../fs/btrfs/extent-tree.c:4303 btrfs_free_reserved_data_space_noquota+0xfe/0x110 [btrfs]
[ 2444.592317] Modules linked in: fuse nf_log_ipv6 xt_pkttype
nf_log_ipv4 nf_log_common xt_LOG xt_limit af_packet iscsi_ibft
iscsi_boot_sysfs msr ip6t_REJECT nf_reject_ipv6 xt_tcpudp
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw nvidia_drm(PO) ipt_REJECT
nf_reject_ipv4 snd_hda_codec_hdmi nvidia_modeset(PO) intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp nvidia(PO) coretemp
snd_hda_codec_realtek iTCO_wdt snd_hda_codec_generic iptable_raw
drm_kms_helper snd_hda_intel drm xt_CT snd_hda_codec snd_hda_core
snd_hwdep kvm_intel snd_pcm snd_timer joydev mei_wdt fb_sys_fops
iTCO_vendor_support i2c_i801 lpc_ich kvm syscopyarea snd sysfillrect
irqbypass mei_me hp_wmi sysimgblt iptable_filter crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper
[ 2444.592386]  cryptd soundcore mei sparse_keymap rfkill e1000e shpchp
pcspkr ioatdma mfd_core tpm_infineon tpm_tis dca tpm fjes ptp pps_core
ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast
nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack
ip6table_filter ip6_tables x_tables btrfs xor raid6_pq hid_generic
usbhid crc32c_intel serio_raw xhci_pci ehci_pci xhci_hcd ehci_hcd
firewire_ohci sr_mod firewire_core cdrom crc_itu_t usbcore isci
usb_common libsas ata_generic mpt3sas raid_class scsi_transport_sas wmi
button sg
[ 2444.592447] CPU: 4 PID: 289 Comm: kworker/u65:7 Tainted:
PW  O4.7.1-1-default #1
[ 2444.592450] Hardware name: Hewlett-Packard HP Z820 Workstation/158B,
BIOS J63 v03.65 12/19/2013
[ 2444.592458] Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 2444.592462]   81393104 

[ 2444.592468]  8107ca1e 88080de6d800 9000
88080c437a00
[ 2444.592472]  880634b379ac 9000 88080dcfb73c
a02af98e
[ 2444.592477] Call Trace:
[ 2444.592499]  [] dump_trace+0x5e/0x320
[ 2444.592507]  [] show_stack_log_lvl+0x10c/0x180
[ 2444.592514]  [] show_stack+0x21/0x40
[ 2444.592523]  [] dump_stack+0x5c/0x78
[ 2444.592531]  [] __warn+0xbe/0xe0
[ 2444.592561]  [] btrfs_free_reserved_data_space_noquota+0xfe/0x110 [btrfs]
[ 2444.592602]  [] btrfs_clear_bit_hook+0x296/0x380 [btrfs]
[ 2444.592642]  [] clear_state_bit+0x55/0x1d0 [btrfs]
[ 2444.592676]  [] __clear_extent_bit+0x13d/0x3f0 [btrfs]
[ 2444.592707]  [] extent_clear_unlock_delalloc+0x62/0x280 [btrfs]
[ 2444.592739]  [] cow_file_range+0x299/0x440 [btrfs]
[ 2444.592768]  [] run_delalloc_range+0x392/0x3b0 [btrfs]
[ 2444.592801]  [] writepage_delalloc.isra.40+0x100/0x170 [btrfs]
[ 2444.592834]  [] __extent_writepage+0xc3/0x340 [btrfs]
[ 2444.592864]  [] extent_write_cache_pages.isra.36.constprop.53+0x23b/0x350 [btrfs]
[ 2444.592894]  [] extent_writepages+0x4e/0x60 [btrfs]
[ 2444.592900]  [] __writeback_single_inode+0x3d/0x3b0
[ 2444.592907]  [] writeback_sb_inodes+0x20a/0x440
[ 2444.592914]  [] __writeback_inodes_wb+0x87/0xb0
[ 2444.592921]  [] wb_writeback+0x28d/0x330
[ 2444.592927]  [] wb_workfn+0x222/0x3f0
[ 2444.592934]  [] process_one_work+0x1ed/0x4e0
[ 2444.592942]  [] worker_thread+0x47/0x4c0
[ 2444.592947]  [] kthread+0xbd/0xe0
[ 2444.592954]  [] ret_from_fork+0x1f/0x40
[ 2444.596679] DWARF2 unwinder stuck at ret_from_fork+0x1f/0x40

[ 2444.596683] Leftover inexact backtrace:

[ 2444.596689]  [] ? kthread_worker_fn+0x170/0x170

I will also provide the information requested by Qu:

grep . -IR /sys/fs/btrfs/e9efaa0c-d477-4249-830f-ee5956768b29/allocation
allocation/data/flags:1
allocation/data/bytes_pinned:0
allocation/data/bytes_may_use:0
allocation/data/total_bytes_pinned:202973265920
allocation/data/bytes_reserved:0
allocation/data/bytes_used:45623730176
allocation/data/single/used_bytes:45623730176
allocation/data/single/total_bytes:46179287040
allocation/data/total_bytes:46179287040
allocation/data/disk_total:46179287040
allocation/data/disk_used:45623730176
allocation/metadata/dup/used_bytes:1120698368
allocation/metadata/dup/total_bytes:6979321856
allocation/metadata/flags:4
allocation/metadata/bytes_pinned:0
allocation/metadata/bytes_may_use:88521768960
allocation/metadata/total_bytes_pinned:-44285952
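
To put the counters above in perspective, here is a rough sketch that parses
`grep . -IR` output of this form, converts values to GiB, and flags entries
that are negative or larger than the total_bytes of their space type. The
function names and the choice of which counters to compare against
total_bytes are assumptions made for illustration. The sample values are the
ones posted in this thread (metadata total_bytes is taken from the fuller
listing quoted elsewhere in the thread).

```python
GIB = 1024 ** 3

# A subset of the sysfs values posted in this thread.
SAMPLE = """\
allocation/data/bytes_may_use:0
allocation/data/total_bytes_pinned:202973265920
allocation/data/bytes_used:45623730176
allocation/data/total_bytes:46179287040
allocation/metadata/bytes_may_use:88521768960
allocation/metadata/total_bytes_pinned:-44285952
allocation/metadata/total_bytes:6979321856
"""


def parse_allocation(text):
    """Turn 'path:value' lines into a dict of path -> int."""
    values = {}
    for line in text.splitlines():
        path, sep, raw = line.strip().rpartition(":")
        if not sep:
            continue
        try:
            values[path] = int(raw)
        except ValueError:
            continue  # skip anything that is not a plain integer
    return values


def suspicious(values):
    """Return (path, value) pairs that are negative or exceed total_bytes."""
    flagged = []
    for path, value in sorted(values.items()):
        group = path.rsplit("/", 1)[0]
        total = values.get(group + "/total_bytes")
        checked = path.endswith(("bytes_may_use", "total_bytes_pinned"))
        if value < 0:
            flagged.append((path, value))
        elif checked and total is not None and value > total:
            flagged.append((path, value))
    return flagged


if __name__ == "__main__":
    parsed = parse_allocation(SAMPLE)
    for path, value in suspicious(parsed):
        print(f"{path}: {value} ({value / GIB:.2f} GiB)")
```

On the sample this flags data/total_bytes_pinned, metadata/bytes_may_use and
the negative metadata/total_bytes_pinned, which are the same counters singled
out in the replies.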

Re: Finding only non-snapshots via btrfs subvol list

2016-09-08 Thread Holger Hoffstätte
On 07/21/16 16:55, Holger Hoffstätte wrote:
> I'm trying to find non-snapshots, i.e. 'top-level' subvolumes in a
> filesystem and this seems harder than it IMHO should be.
> 
> The fs is just like:
> 
> /mnt/stuff
>  subvolA
>  subvolA-date1
>  subvolA-date2
>  subvolB
>  subvolB-date1
>  subvolB-date2
> ..
> 
> All I want are the subvol{A,B} *without* the snapshots, but so
> far I haven't been able to accomplish this easily with "subvol list"
> and its options. -s lists only snapshots, but what I want is the
> exact opposite.

This question received a deafening lack of feedback, so I just took
a swing at this and apparently hit something.

When you have a set of subvols and snapshots like so:

$./btrfs subvolume list /t/btrfs
ID 257 gen 13 top level 5 path a
ID 258 gen 16 top level 5 path b
ID 259 gen 15 top level 5 path c
ID 260 gen 11 top level 5 path a1
ID 261 gen 12 top level 5 path a2
ID 263 gen 14 top level 5 path b1
ID 264 gen 15 top level 5 path c1
ID 265 gen 16 top level 5 path b2

where a,b,c are subvolumes and ?{1,2,3} are snapshots, you can now do:

$./btrfs subvolume list -P /t/btrfs
ID 257 gen 13 top level 5 path a
ID 258 gen 16 top level 5 path b
ID 259 gen 15 top level 5 path c

Is this of interest? I find it useful to iterate over all parent
subvols (though you'll still need cut or awk to get only the name)
without accidentally hitting the snapshots, or relying on fragile inhouse
naming conventions.

The -P was the only meaningful letter left (P for Parent). I first used
-S (for grown-up -s ;) but that was already used for matching getopt
on --sort. If -S is deemed better I can reroute that to -Z or something,
since it's unused in short form.

The patch is surprisingly small and was quite easy to write. Nice!
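
Without such an option, a similar result can be approximated in userspace by
subtracting the snapshot list from the full list. The sketch below assumes
the output of `btrfs subvolume list` and `btrfs subvolume list -s` has been
captured as text and that every line starts with "ID <n>" and ends with
"path <name>". The function names are made up, and the snapshot sample is
abbreviated (real -s output may carry extra columns, which the parser
ignores).

```python
# Full listing, copied from the example above.
ALL_SUBVOLS = """\
ID 257 gen 13 top level 5 path a
ID 258 gen 16 top level 5 path b
ID 259 gen 15 top level 5 path c
ID 260 gen 11 top level 5 path a1
ID 261 gen 12 top level 5 path a2
ID 263 gen 14 top level 5 path b1
ID 264 gen 15 top level 5 path c1
ID 265 gen 16 top level 5 path b2
"""

# Abbreviated stand-in for `btrfs subvolume list -s` output (assumption).
SNAPSHOTS = """\
ID 260 gen 11 top level 5 path a1
ID 261 gen 12 top level 5 path a2
ID 263 gen 14 top level 5 path b1
ID 264 gen 15 top level 5 path c1
ID 265 gen 16 top level 5 path b2
"""


def parse_subvol_list(text):
    """Map subvolume ID -> path for each 'ID <n> ... path <name>' line."""
    subvols = {}
    for line in text.splitlines():
        line = line.strip()
        fields = line.split()
        if len(fields) < 2 or fields[0] != "ID" or " path " not in line:
            continue
        subvols[int(fields[1])] = line.split(" path ", 1)[1]
    return subvols


def non_snapshots(all_output, snapshot_output):
    """Paths of subvolumes whose ID does not appear in the snapshot list."""
    all_subvols = parse_subvol_list(all_output)
    snapshot_ids = set(parse_subvol_list(snapshot_output))
    return [path for subvol_id, path in sorted(all_subvols.items())
            if subvol_id not in snapshot_ids]


if __name__ == "__main__":
    for path in non_snapshots(ALL_SUBVOLS, SNAPSHOTS):
        print(path)
```

Matching on the numeric ID rather than on the name avoids depending on naming
conventions, which is the same concern raised above.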

cheers
Holger



Re: [PATCH 7/7] Btrfs: kill BUG_ON()'s in btrfs_mark_extent_written

2016-09-08 Thread David Sterba
On Fri, Sep 02, 2016 at 03:40:06PM -0400, Josef Bacik wrote:
> No reason to bug on in here, fs corruption could easily cause these things to
> happen.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: David Sterba 

Also on the way to 4.9.


Re: [PULL] Btrfs fixes for 4.8-rc

2016-09-08 Thread Chris Mason

On 09/08/2016 12:57 PM, David Sterba wrote:

Hi,

here are two fixups for the new space handling code introduced in 4.8.
Please pull.


The following changes since commit cb887083d084d74421ae7bb18acca40568da791f:

  Merge tag 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.8 (2016-09-01 17:29:34 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris

for you to fetch changes up to ce129655c9d9aaa7b3bcc46529db1b36693575ed:

  btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress (2016-09-06 16:31:43 +0200)


Wang Xiaoguang (2):
  btrfs: do not decrease bytes_may_use when replaying extents
  btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress


Thanks Dave, I pulled these yesterday along with my list_del_init for 
the logging code.  Testing is looking good so far, so I expect to send 
to Linus on Friday.


-chris



Re: [PATCH 4/7] Btrfs: kill the start argument to read_extent_buffer_pages

2016-09-08 Thread David Sterba
On Fri, Sep 02, 2016 at 03:40:03PM -0400, Josef Bacik wrote:
> Nobody uses this, it makes no sense to do partial reads of extent buffers.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: David Sterba 

Same here, picked to 4.9.


Re: [PATCH 3/7] Btrfs: add a flags field to btrfs_fs_info

2016-09-08 Thread David Sterba
On Fri, Sep 02, 2016 at 03:40:02PM -0400, Josef Bacik wrote:
> We have a lot of random ints in btrfs_fs_info that can be put into flags.
> This is mostly equivalent with the exception of how we deal with quota going
> on or off, now instead we set a flag when we are turning it on or off and deal
> with that appropriately, rather than just having a pending state that the
> current quota_enabled gets set to.  Thanks,
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: David Sterba 

I'm picking this patch independently to 4.9, but feel free to include it
in the patch series if you send more revisions.
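
For readers skimming the thread, here is a minimal userspace sketch of the
idea in the quoted commit message: several separate int fields become bits in
one flags word, and a quota on/off request is recorded as its own bit and
applied later. Every name below (demo_fs_info, DEMO_FS_QUOTA_*, the demo_*
helpers) is a hypothetical stand-in, not an identifier from the patch, and the
plain bit arithmetic is an assumption standing in for the kernel's atomic
set_bit()/clear_bit()/test_bit() helpers.

```c
#include <stdbool.h>

/* Hypothetical bit numbers, one per former int field. */
enum demo_fs_flag {
	DEMO_FS_QUOTA_ENABLED,   /* quota is currently active */
	DEMO_FS_QUOTA_ENABLING,  /* a request to turn quota on is pending */
	DEMO_FS_QUOTA_DISABLING, /* a request to turn quota off is pending */
};

struct demo_fs_info {
	unsigned long flags; /* replaces several separate ints */
};

static void demo_set_flag(struct demo_fs_info *info, enum demo_fs_flag bit)
{
	info->flags |= 1UL << bit;
}

static void demo_clear_flag(struct demo_fs_info *info, enum demo_fs_flag bit)
{
	info->flags &= ~(1UL << bit);
}

static bool demo_test_flag(const struct demo_fs_info *info,
			   enum demo_fs_flag bit)
{
	return (info->flags >> bit) & 1UL;
}

/* Apply a pending on/off request (assumed to run at some later safe point). */
static void demo_apply_quota_state(struct demo_fs_info *info)
{
	if (demo_test_flag(info, DEMO_FS_QUOTA_ENABLING)) {
		demo_clear_flag(info, DEMO_FS_QUOTA_ENABLING);
		demo_set_flag(info, DEMO_FS_QUOTA_ENABLED);
	}
	if (demo_test_flag(info, DEMO_FS_QUOTA_DISABLING)) {
		demo_clear_flag(info, DEMO_FS_QUOTA_DISABLING);
		demo_clear_flag(info, DEMO_FS_QUOTA_ENABLED);
	}
}
```

The point of the sketch is only that "enabling" and "disabling" are explicit
states of their own rather than a pending value copied into quota_enabled.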


[PULL] Btrfs fixes for 4.8-rc

2016-09-08 Thread David Sterba
Hi,

here are two fixups for the new space handling code introduced in 4.8.
Please pull.


The following changes since commit cb887083d084d74421ae7bb18acca40568da791f:

  Merge tag 'for-chris' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.8 
(2016-09-01 17:29:34 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris

for you to fetch changes up to ce129655c9d9aaa7b3bcc46529db1b36693575ed:

  btrfs: introduce tickets_id to determine whether asynchronous metadata 
reclaim work makes progress (2016-09-06 16:31:43 +0200)


Wang Xiaoguang (2):
  btrfs: do not decrease bytes_may_use when replaying extents
  btrfs: introduce tickets_id to determine whether asynchronous metadata 
reclaim work makes progress

 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 23 +++
 2 files changed, 16 insertions(+), 8 deletions(-)


Another 4.8-rc locked splat: btrfs_close_devices()

2016-09-08 Thread Ilya Dryomov
Hello,

This one seems to have appeared after Anand's commit
142388194191 ("btrfs: do not background blkdev_put()") got merged into
4.8-rc4.

Thanks,

Ilya
[  983.284212] ==
[  983.290401] [ INFO: possible circular locking dependency detected ]
[  983.296677] 4.8.0-rc5-ceph-00023-g1b39cec2 #1 Not tainted
[  983.302081] ---
[  983.308357] umount/21720 is trying to acquire lock:
[  983.313243]  (>bd_mutex){+.+.+.}, at: [] 
blkdev_put+0x31/0x150
[  983.321264] 
[  983.321264] but task is already holding lock:
[  983.327101]  (_devs->device_list_mutex){+.+...}, at: [] 
__btrfs_close_devices+0x46/0x200 [btrfs]
[  983.337839] 
[  983.337839] which lock already depends on the new lock.
[  983.337839] 
[  983.346024] 
[  983.346024] the existing dependency chain (in reverse order) is:
[  983.353512] 
-> #4 (_devs->device_list_mutex){+.+...}:
[  983.359096][] lock_acquire+0x1bc/0x1f0
[  983.365143][] mutex_lock_nested+0x65/0x350
[  983.371521][] btrfs_show_devname+0x36/0x1f0 [btrfs]
[  983.378710][] show_vfsmnt+0x4e/0x150
[  983.384593][] m_show+0x17/0x20
[  983.389957][] seq_read+0x2b5/0x3b0
[  983.395669][] __vfs_read+0x28/0x100
[  983.401464][] vfs_read+0xab/0x150
[  983.407080][] SyS_read+0x52/0xb0
[  983.412609][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.419617] 
-> #3 (namespace_sem){++}:
[  983.424024][] lock_acquire+0x1bc/0x1f0
[  983.430074][] down_write+0x49/0x80
[  983.435785][] lock_mount+0x67/0x1c0
[  983.441582][] do_add_mount+0x32/0xf0
[  983.447458][] finish_automount+0x5a/0xc0
[  983.453682][] follow_managed+0x1b3/0x2a0
[  983.459912][] lookup_fast+0x300/0x350
[  983.465875][] path_openat+0x3a7/0xaa0
[  983.471846][] do_filp_open+0x85/0xe0
[  983.477731][] do_sys_open+0x14c/0x1f0
[  983.483702][] SyS_open+0x1e/0x20
[  983.489240][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.496254] 
-> #2 (>s_type->i_mutex_key#3){+.+.+.}:
[  983.501798][] lock_acquire+0x1bc/0x1f0
[  983.507855][] down_write+0x49/0x80
[  983.513558][] start_creating+0x87/0x100
[  983.519703][] debugfs_create_dir+0x17/0x100
[  983.526195][] bdi_register+0x93/0x210
[  983.532165][] bdi_register_owner+0x43/0x70
[  983.538570][] device_add_disk+0x1fb/0x450
[  983.544888][] loop_add+0x1e6/0x290
[  983.550596][] loop_init+0x10b/0x14f
[  983.556394][] do_one_initcall+0xa7/0x180
[  983.562618][] kernel_init_freeable+0x1cc/0x266
[  983.569370][] kernel_init+0xe/0x100
[  983.575166][] ret_from_fork+0x1f/0x40
[  983.581131] 
-> #1 (loop_index_mutex){+.+.+.}:
[  983.585801][] lock_acquire+0x1bc/0x1f0
[  983.591858][] mutex_lock_nested+0x65/0x350
[  983.598256][] lo_open+0x1f/0x60
[  983.603704][] __blkdev_get+0x123/0x400
[  983.609757][] blkdev_get+0x34a/0x350
[  983.615639][] blkdev_open+0x64/0x80
[  983.621428][] do_dentry_open+0x1c6/0x2d0
[  983.627651][] vfs_open+0x69/0x80
[  983.633181][] path_openat+0x834/0xaa0
[  983.639152][] do_filp_open+0x85/0xe0
[  983.645035][] do_sys_open+0x14c/0x1f0
[  983.650999][] SyS_open+0x1e/0x20
[  983.656535][] entry_SYSCALL_64_fastpath+0x23/0xc1
[  983.663541] 
-> #0 (>bd_mutex){+.+.+.}:
[  983.668107][] __lock_acquire+0x1003/0x17b0
[  983.674510][] lock_acquire+0x1bc/0x1f0
[  983.680561][] mutex_lock_nested+0x65/0x350
[  983.686967][] blkdev_put+0x31/0x150
[  983.692761][] btrfs_close_bdev+0x4f/0x60 [btrfs]
[  983.699699][] __btrfs_close_devices+0xcb/0x200 
[btrfs]
[  983.707178][] btrfs_close_devices+0x2b/0xa0 [btrfs]
[  983.714380][] close_ctree+0x265/0x340 [btrfs]
[  983.721061][] btrfs_put_super+0x19/0x20 [btrfs]
[  983.727908][] generic_shutdown_super+0x6f/0x100
[  983.734744][] kill_anon_super+0x16/0x30
[  983.740888][] btrfs_kill_super+0x1e/0x130 [btrfs]
[  983.747909][] deactivate_locked_super+0x49/0x80
[  983.754745][] deactivate_super+0x5d/0x70
[  983.760977][] cleanup_mnt+0x5c/0x80
[  983.766773][] __cleanup_mnt+0x12/0x20
[  983.772738][] task_work_run+0x7e/0xc0
[  983.778708][] exit_to_usermode_loop+0x7e/0xb4
[  983.785373][] syscall_return_slowpath+0xbb/0xd0
[  983.792212][] entry_SYSCALL_64_fastpath+0xbf/0xc1
[  983.799225] 
[  983.799225] other info that might help us debug this:
[  983.799225] 
[  983.807291] Chain exists of:
  >bd_mutex --> namespace_sem --> _devs->device_list_mutex

[  983.816521]  Possible unsafe locking scenario:
[  983.816521] 
[  983.822489]CPU0CPU1
[  983.827043]   

Re: [PATCH 6/7] Btrfs: kill the btree_inode

2016-09-08 Thread Josef Bacik

On 09/08/2016 01:17 AM, Chandan Rajendra wrote:

On Friday, September 02, 2016 03:40:05 PM Josef Bacik wrote:

Please find my comment inlined below,


In order to more efficiently support sub-page blocksizes we need to stop
allocating pages from pagecache for our metadata.  Instead switch to using the
account_metadata* counters for making sure we are keeping the system aware of
how much dirty metadata we have, and use the ->free_cached_objects super
operation in order to handle freeing up extent buffers.  This greatly simplifies
how we deal with extent buffers as now we no longer have to tie the page cache
reclamation stuff to the extent buffer stuff.  This will also allow us to
simply kmalloc() our data for sub-page blocksizes.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/btrfs_inode.h |   3 +-
 fs/btrfs/ctree.c   |  10 +-
 fs/btrfs/ctree.h   |  13 +-
 fs/btrfs/disk-io.c | 389 --
 fs/btrfs/extent_io.c   | 913 ++---
 fs/btrfs/extent_io.h   |  49 +-
 fs/btrfs/inode.c   |   6 +-
 fs/btrfs/root-tree.c   |   2 +-
 fs/btrfs/super.c   |  29 +-
 fs/btrfs/tests/btrfs-tests.c   |  37 +-
 fs/btrfs/tests/extent-io-tests.c   |   4 +-
 fs/btrfs/tests/free-space-tree-tests.c |   4 +-
 fs/btrfs/tests/qgroup-tests.c  |   4 +-
 fs/btrfs/transaction.c |  11 +-
 14 files changed, 726 insertions(+), 748 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1a8fa46..ad7b185 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -229,10 +229,9 @@ static inline u64 btrfs_ino(struct inode *inode)
u64 ino = BTRFS_I(inode)->location.objectid;

/*
-* !ino: btree_inode
 * type == BTRFS_ROOT_ITEM_KEY: subvol dir
 */
-   if (!ino || BTRFS_I(inode)->location.type == BTRFS_ROOT_ITEM_KEY)
+   if (BTRFS_I(inode)->location.type == BTRFS_ROOT_ITEM_KEY)
ino = inode->i_ino;
return ino;
 }
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d1c56c9..b267053 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1373,8 +1373,8 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct 
btrfs_path *path,

if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
BUG_ON(tm->slot != 0);
-   eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start,
-   eb->len);
+   eb_rewin = alloc_dummy_extent_buffer(fs_info->eb_info,
+eb->start, eb->len);
if (!eb_rewin) {
btrfs_tree_read_unlock_blocking(eb);
free_extent_buffer(eb);
@@ -1455,8 +1455,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
} else if (old_root) {
btrfs_tree_read_unlock(eb_root);
free_extent_buffer(eb_root);
-   eb = alloc_dummy_extent_buffer(root->fs_info, logical,
-   root->nodesize);
+   eb = alloc_dummy_extent_buffer(root->fs_info->eb_info, logical,
+  root->nodesize);
} else {
btrfs_set_lock_blocking_rw(eb_root, BTRFS_READ_LOCK);
eb = btrfs_clone_extent_buffer(eb_root);
@@ -1772,7 +1772,7 @@ static noinline int generic_bin_search(struct 
extent_buffer *eb,
int err;

if (low > high) {
-   btrfs_err(eb->fs_info,
+   btrfs_err(eb->eb_info->fs_info,
 "%s: low (%d) > high (%d) eb %llu owner %llu level %d",
  __func__, low, high, eb->start,
  btrfs_header_owner(eb), btrfs_header_level(eb));
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 282a031..ee6956c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -675,6 +676,7 @@ struct btrfs_device;
 struct btrfs_fs_devices;
 struct btrfs_balance_control;
 struct btrfs_delayed_root;
+struct btrfs_eb_info;

 #define BTRFS_FS_BARRIER   1
 #define BTRFS_FS_CLOSING_START 2
@@ -797,7 +799,7 @@ struct btrfs_fs_info {
struct btrfs_super_block *super_for_commit;
struct block_device *__bdev;
struct super_block *sb;
-   struct inode *btree_inode;
+   struct btrfs_eb_info *eb_info;
struct backing_dev_info bdi;
struct mutex tree_log_mutex;
struct mutex transaction_kthread_mutex;
@@ -1042,10 +1044,6 @@ struct btrfs_fs_info {
/* readahead works cnt */
atomic_t reada_works_cnt;

-   /* Extent buffer radix tree */
-   spinlock_t buffer_lock;
-   

Re: lockdep warning in btrfs in 4.8-rc3

2016-09-08 Thread Chris Mason

On 09/08/2016 07:50 AM, Christian Borntraeger wrote:

On 09/08/2016 01:48 PM, Christian Borntraeger wrote:

Chris,

with 4.8-rc3 I get the following on an s390 box:


Sorry for the noise, just saw the fix in your pull request.



The lockdep splat is still there, we'll need to annotate this one a little.

-chris


Re: lockdep warning in btrfs in 4.8-rc3

2016-09-08 Thread Christian Borntraeger
On 09/08/2016 01:48 PM, Christian Borntraeger wrote:
> Chris,
> 
> with 4.8-rc3 I get the following on an s390 box:

Sorry for the noise, just saw the fix in your pull request.



lockdep warning in btrfs in 4.8-rc3

2016-09-08 Thread Christian Borntraeger
Chris,

with 4.8-rc3 I get the following on an s390 box:


[ 1094.009172] =
[ 1094.009174] [ INFO: possible recursive locking detected ]
[ 1094.009177] 4.8.0-rc3 #126 Tainted: GW  
[ 1094.009179] -
[ 1094.009180] vim/12891 is trying to acquire lock:
[ 1094.009182]  (>log_mutex){+.+...}, at: [<03ff817e83c6>] 
btrfs_log_inode+0x126/0x1010 [btrfs]
[ 1094.009256] 
   but task is already holding lock:
[ 1094.009258]  (>log_mutex){+.+...}, at: [<03ff817e83c6>] 
btrfs_log_inode+0x126/0x1010 [btrfs]
[ 1094.009276] 
   other info that might help us debug this:
[ 1094.009278]  Possible unsafe locking scenario:

[ 1094.009280]CPU0
[ 1094.009281]
[ 1094.009282]   lock(>log_mutex);
[ 1094.009284]   lock(>log_mutex);
[ 1094.009286] 
*** DEADLOCK ***

[ 1094.009288]  May be due to missing lock nesting notation

[ 1094.009290] 3 locks held by vim/12891:
[ 1094.009291]  #0:  (>s_type->i_mutex_key#15){+.+.+.}, at: 
[<03ff817afbd6>] btrfs_sync_file+0x1de/0x5e8 [btrfs]
[ 1094.009311]  #1:  (sb_internal#2){.+.+..}, at: [<0035e0ba>] 
__sb_start_write+0x122/0x138
[ 1094.009320]  #2:  (>log_mutex){+.+...}, at: [<03ff817e83c6>] 
btrfs_log_inode+0x126/0x1010 [btrfs]
[ 1094.009370] 
   stack backtrace:
[ 1094.009375] CPU: 14 PID: 12891 Comm: vim Tainted: GW   4.8.0-rc3 
#126
[ 1094.009377] Hardware name: IBM  2964 NC9  704
  (LPAR)
[ 1094.009380]00f061367608 00f061367698 0002 
 
  00f061367738 00f0613676b0 00f0613676b0 
001133ec 
    00f7000a 
00f7000a 
  00f0613676f8 00f061367698  
 
  040001d821c8 001133ec 00f061367698 
00f0613676e8 
[ 1094.009396] Call Trace:
[ 1094.009401] ([<00113334>] show_trace+0xec/0xf0)
[ 1094.009403] ([<0011339a>] show_stack+0x62/0xe8)
[ 1094.009406] ([<0055211c>] dump_stack+0x9c/0xe0)
[ 1094.009411] ([<001d9930>] validate_chain.isra.22+0xc00/0xd70)
[ 1094.009413] ([<001dad9c>] __lock_acquire+0x39c/0x7d8)
[ 1094.009414] ([<001db8d0>] lock_acquire+0x108/0x320)
[ 1094.009420] ([<008845c6>] mutex_lock_nested+0x86/0x3f8)
[ 1094.009440] ([<03ff817e83c6>] btrfs_log_inode+0x126/0x1010 [btrfs])
[ 1094.009457] ([<03ff817e8fb2>] btrfs_log_inode+0xd12/0x1010 [btrfs])
[ 1094.009474] ([<03ff817e95b4>] btrfs_log_inode_parent+0x244/0x980 [btrfs])
[ 1094.009490] ([<03ff817eafea>] btrfs_log_dentry_safe+0x7a/0xa0 [btrfs])
[ 1094.009506] ([<03ff817afe1a>] btrfs_sync_file+0x422/0x5e8 [btrfs])
[ 1094.009512] ([<0039e64e>] do_fsync+0x5e/0x90)
[ 1094.009514] ([<0039e9e2>] SyS_fsync+0x32/0x40)
[ 1094.009517] ([<0088a336>] system_call+0xd6/0x270)
[ 1094.009518] INFO: lockdep is turned off.



Re: Security implications of btrfs receive?

2016-09-08 Thread Austin S. Hemmelgarn

On 2016-09-07 15:34, Chris Murphy wrote:

On Wed, Sep 7, 2016 at 1:08 PM, Austin S. Hemmelgarn
 wrote:


I think I covered it already in the last thread on this, but the best way I
see to fix the whole auto-assembly issue is:
1. Stop the damn auto-scanning of new devices on hot-plug.  The scanning
should be done on mount or invoking something like btrfs dev scan, not on
hot-plug.  This is the biggest current issue, and is in theory the easiest
thing to fix.  The problem here is that it's udev sources we need to change,
not our own.
2. Get rid of the tracking in the kernel.  If a filesystem isn't mounted or
requested to be mounted, then the kernel has no business worrying about what
what devices it's on.  If the filesystem is mounted, then the only way to
associate new devices should be from userspace.
3. When mounting, the mount helper should be doing the checking to verify
that the UUID's and everything else are correct.  Ideally, the mount(2) call
should require a list of devices to use, and mount should be doing the
discovery.  This is at odds with how systemd handles BTRFS mounts, but
they're being stupid with that too (the only way to tell for certain if a FS
will mount is to try to mount it, if the mount(2) call succeeds, then the
filesystem was ready, regardless of whether or not userspace thinks the
device is).
4. The kernel should be doing a better job of validating filesystems. It
should be checking that all the devices agree on how many devices there
should be, as well as checking that they all have correct UUID's. This is
technically not necessary if item 3 is implemented, but is still good
practice from a hardening perspective.


It'd be nice to leverage WWN when available, cross referencing all WWN
with volume UUID and device UUID. There are all sorts of policies that
can make use of this, not least of which is "fail on mount whenever
all three expected IDs are not present" but also this crazy hunt for
the actual physical device to replace when all we know from btrfs fi
show is that a device is missing, and inferring what devid it is from
the devids that are still listed. I'd like to see a WWN for what's
missing. devid is useless, devuuid is useless, a serial number is
better than nothing but WWN fits the use case by design.
I like the idea of matching WWN as part of the check, with a couple of 
caveats:
1. We need to keep in mind that in some environments, this can be 
spoofed (Virtualization for example, although doing so would require 
source level modifications to most hypervisors).
2. There needs to be a way to forcibly mount in the case of a mismatch, 
as well as a way to update the filesystem to match the current WWNs of 
all of its disks.  I also specifically think that these should be 
separate options, the first is useful for debugging a filesystem using 
image files, while the second is useful for external clones of disks.
3. single device filesystems should store the WWN, and ideally keep it 
up-to-date, but not check it.  They have no need to check it, and single 
device is the primary use case for a traditional user, so it should be 
as simple as possible.
4. We should be matching on more than just fsuuid, devuuid, and WWN, 
because just matching those would allow a second partition on the same 
device to cause issues.


But yeah I think we kinda need some other ducks in a row, the
mechanisms of discovery, and the continuum of faultiness (occasionally
flaky, to flat out vanished).

It is also kinda important to see things like udisks and storaged as
user agents, ensuring they have a way to communicate with the helper
so things are mounted and umounted correctly as most DE's now expect
to just automount everything. I still get weird behaviors on GNOME
with udisks2 and multiple device Btrfs volumes with current upstream
GNOME stuff.
DE's expect the ability to automount things as a regular user, not 
necessarily that it has to happen.  I'm not all that worried personally 
about automounting of multi-device filesystems, largely because the type 
of person who automounting in the desktop primarily caters to is not 
likely to have a multi-device filesystem to begin with.  For that 
matter, the primary (only realistic?) use for multi-device filesystems 
on removable media is backups, and the few people who are going to set 
things up to automatically run backups when the disks get plugged in 
will be smart enough to get things working correctly themselves, while 
anyone else is going to be running the backup manually and can mount the 
FS by hand if they aren't using something like autofs.



Re: [PATCH v2 2/3] btrfs-progs: receive: Introduce option to exam and dump send stream

2016-09-08 Thread David Sterba
On Thu, Sep 08, 2016 at 11:42:29AM +0200, David Sterba wrote:
> On Wed, Sep 07, 2016 at 08:29:34AM +0800, Qu Wenruo wrote:
> > @@ -1265,19 +1274,37 @@ int cmd_receive(int argc, char **argv)
> > }
> > }
> >  
> > -   ret = do_receive(, tomnt, realmnt, receive_fd, max_errors);
> > +   if (dump) {
> > +   struct btrfs_dump_send_args dump_args;
> > +
> > +   dump_args.root_path = malloc(PATH_MAX);
> > +   dump_args.root_path[0] = '.';
> > +   dump_args.root_path[1] = '\0';
> > +   dump_args.full_subvol_path = malloc(PATH_MAX);
> 
> Please always check malloc return values. I'm fixing this for now.

Uh and the buffers are not freed either. Anyway, I'm switching it to an
array, there's no reason to allocate the memory dynamically.
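
A minimal sketch of what "switching it to an array" could look like, assuming
a struct with the two path members seen in the quoted hunk. The struct and
function names are hypothetical (the real btrfs_dump_send_args may differ);
the point is only that fixed-size members need neither a malloc() check nor a
matching free().

```c
#include <limits.h>
#include <stdio.h>

#ifndef PATH_MAX
#define PATH_MAX 4096 /* fallback for systems whose <limits.h> lacks it */
#endif

/* Hypothetical stand-in for the dump arguments in the quoted patch. */
struct demo_dump_args {
	char root_path[PATH_MAX];
	char full_subvol_path[PATH_MAX];
};

/* Set both paths to "."; there is no allocation to check or free. */
static void demo_dump_args_init(struct demo_dump_args *args)
{
	snprintf(args->root_path, sizeof(args->root_path), ".");
	snprintf(args->full_subvol_path, sizeof(args->full_subvol_path), ".");
}
```

A caller would declare "struct demo_dump_args args;" on the stack and pass
"&args" along, so the buffers go away with the enclosing scope.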


Re: [PATCH v2 2/3] btrfs-progs: receive: Introduce option to exam and dump send stream

2016-09-08 Thread David Sterba
On Wed, Sep 07, 2016 at 08:29:34AM +0800, Qu Wenruo wrote:
> @@ -1265,19 +1274,37 @@ int cmd_receive(int argc, char **argv)
>   }
>   }
>  
> - ret = do_receive(, tomnt, realmnt, receive_fd, max_errors);
> + if (dump) {
> + struct btrfs_dump_send_args dump_args;
> +
> + dump_args.root_path = malloc(PATH_MAX);
> + dump_args.root_path[0] = '.';
> + dump_args.root_path[1] = '\0';
> + dump_args.full_subvol_path = malloc(PATH_MAX);

Please always check malloc return values. I'm fixing this for now.

> + dump_args.full_subvol_path[0] = '.';
> + dump_args.full_subvol_path[1] = '\0';
> + ret = btrfs_read_and_process_send_stream(receive_fd,
> + _print_send_ops, _args, 0, 0);
> + if (ret < 0)
> + error("failed to dump the send stream: %s",
> +   strerror(-ret));
> + } else {
> + ret = do_receive(, tomnt, realmnt, receive_fd, max_errors);
> + }
> +
>   if (receive_fd != fileno(stdin))
>   close(receive_fd);


[PATCH v13 03/15] btrfs: dedupe: Introduce function to initialize dedupe info

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

Add generic function to initialize dedupe info.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/Makefile  |   2 +-
 fs/btrfs/dedupe.c  | 185 +
 fs/btrfs/dedupe.h  |  13 +++-
 include/uapi/linux/btrfs.h |   4 +-
 4 files changed, 200 insertions(+), 4 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..1b8c627 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o hash.o free-space-tree.o
+  uuid-tree.o props.o hash.o free-space-tree.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index 000..b14166a
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,185 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+   struct rb_node hash_node;
+   struct rb_node bytenr_node;
+   struct list_head lru_list;
+
+   u64 bytenr;
+   u32 num_bytes;
+
+   u8 hash[];
+};
+
+static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
+   struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+   if (!dedupe_info)
+   return -ENOMEM;
+
+   dedupe_info->hash_algo = dargs->hash_algo;
+   dedupe_info->backend = dargs->backend;
+   dedupe_info->blocksize = dargs->blocksize;
+   dedupe_info->limit_nr = dargs->limit_nr;
+
+   /* only support SHA256 yet */
+   dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+   if (IS_ERR(dedupe_info->dedupe_driver)) {
+   int ret;
+
+   ret = PTR_ERR(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return ret;
+   }
+
+   dedupe_info->hash_root = RB_ROOT;
+   dedupe_info->bytenr_root = RB_ROOT;
+   dedupe_info->current_nr = 0;
+   INIT_LIST_HEAD(&dedupe_info->lru_list);
+   mutex_init(&dedupe_info->lock);
+
+   *ret_info = dedupe_info;
+   return 0;
+}
+
+/*
+ * Helper to check if parameters are valid.
+ * The first invalid field will be set to (-1), to inform the user which
+ * parameter is invalid.
+ * Except dargs->limit_nr or dargs->limit_mem: in that case 0 will be returned
+ * to inform the user, since the user may set any limit value except 0.
+ */
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
+ struct btrfs_ioctl_dedupe_args *dargs)
+{
+   u64 blocksize = dargs->blocksize;
+   u64 limit_nr = dargs->limit_nr;
+   u64 limit_mem = dargs->limit_mem;
+   u16 hash_algo = dargs->hash_algo;
+   u8 backend = dargs->backend;
+
+   /*
+* Set all reserved fields to -1, allow user to detect
+* unsupported optional parameters.
+*/
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+   if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+   blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+   blocksize < fs_info->tree_root->sectorsize ||
+   !is_power_of_2(blocksize) ||
+   blocksize < PAGE_SIZE) {
+   dargs->blocksize = (u64)-1;
+   return -EINVAL;
+   }
+   if (hash_algo >= ARRAY_SIZE(btrfs_hash_sizes)) {
+   dargs->hash_algo = (u16)-1;
+   return -EINVAL;
+   }
+   if (backend >= BTRFS_DEDUPE_BACKEND_COUNT) {
+   dargs->backend = (u8)-1;
+   return -EINVAL;
+   }
+
+   /* Backend specific check */
+   if 

[PATCH v13 11/15] btrfs: dedupe: Add ioctl for inband dedupelication

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

Add ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

And a pseudo RO compat flag, to imply that btrfs now supports inband
dedup.
However we don't add any ondisk format change, it's just a pseudo RO
compat flag.

All these ioctl interfaces are stateless, which means the caller doesn't
need to care about the previous dedupe state before calling them, only
about the final desired state.

For example, if the user wants to enable dedupe with a specified block size
and limit, just fill the ioctl structure and call the enable ioctl.
No need to check if dedupe is already running.

These ioctls will handle things like re-configure or disable quite well.

Also, for invalid parameters, the enable ioctl will set the field of the
first encountered invalid parameter to (-1) to inform the caller.
For limit_nr/limit_mem, the value will be (0) instead.
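
The "mark the first invalid field with -1" convention described above can be
sketched in userspace as follows. This is an illustration under stated
assumptions, not the patch's code: the struct, the DEMO_* limits and the
function names are all hypothetical, and the limit_nr/limit_mem handling is
left out because that part of the check is not shown in the posted hunk.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed bounds for illustration only; the real limits live in dedupe.h. */
#define DEMO_BLOCKSIZE_MIN   (16u * 1024)
#define DEMO_BLOCKSIZE_MAX   (8u * 1024 * 1024)
#define DEMO_HASH_ALGO_COUNT 1
#define DEMO_BACKEND_COUNT   1

/* Hypothetical stand-in for the ioctl argument structure. */
struct demo_dedupe_args {
	uint64_t blocksize;
	uint16_t hash_algo;
	uint8_t backend;
};

static int demo_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Validate the arguments in a fixed order. On the first failure, overwrite
 * the offending field with all-ones (-1) and return -EINVAL, so the caller
 * can tell which parameter was rejected. Later fields are left untouched.
 */
static int demo_check_args(struct demo_dedupe_args *args)
{
	if (args->blocksize < DEMO_BLOCKSIZE_MIN ||
	    args->blocksize > DEMO_BLOCKSIZE_MAX ||
	    !demo_is_power_of_2(args->blocksize)) {
		args->blocksize = (uint64_t)-1;
		return -EINVAL;
	}
	if (args->hash_algo >= DEMO_HASH_ALGO_COUNT) {
		args->hash_algo = (uint16_t)-1;
		return -EINVAL;
	}
	if (args->backend >= DEMO_BACKEND_COUNT) {
		args->backend = (uint8_t)-1;
		return -EINVAL;
	}
	return 0;
}
```

After a failed call, userspace would compare each field against its all-ones
value to find the rejected parameter; only the first invalid one is marked.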

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c  | 50 ++
 fs/btrfs/dedupe.h  | 17 
 fs/btrfs/disk-io.c |  3 ++
 fs/btrfs/ioctl.c   | 68 ++
 fs/btrfs/sysfs.c   |  2 ++
 include/uapi/linux/btrfs.h | 12 +++-
 6 files changed, 146 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index d0d2f8a..37b5a05 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,35 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled || !dedupe_info) {
+   dargs->status = 0;
+   dargs->blocksize = 0;
+   dargs->backend = 0;
+   dargs->hash_algo = 0;
+   dargs->limit_nr = 0;
+   dargs->current_nr = 0;
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+   return;
+   }
+   mutex_lock(&dedupe_info->lock);
+   dargs->status = 1;
+   dargs->blocksize = dedupe_info->blocksize;
+   dargs->backend = dedupe_info->backend;
+   dargs->hash_algo = dedupe_info->hash_algo;
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_hash_sizes[dedupe_info->hash_algo]);
+   dargs->current_nr = dedupe_info->current_nr;
+   mutex_unlock(&dedupe_info->lock);
+   memset(dargs->__unused, -1, sizeof(dargs->__unused));
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -420,6 +449,27 @@ static void unblock_all_writers(struct btrfs_fs_info 
*fs_info)
percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
 }
 
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+
+   fs_info->dedupe_enabled = 0;
+   /* same as disable */
+   smp_wmb();
+   dedupe_info = fs_info->dedupe_info;
+   fs_info->dedupe_info = NULL;
+
+   if (!dedupe_info)
+   return 0;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+   return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 8311ee1..c3d50bc 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -91,6 +91,15 @@ static inline struct btrfs_dedupe_hash 
*btrfs_dedupe_alloc_hash(u16 algo)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
struct btrfs_ioctl_dedupe_args *dargs);
 
+
+ /*
+ * Get inband dedupe info
+ * Since it needs to access different backends' hash size, which
+ * is not exported, we need such simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -102,12 +111,10 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
 /*
- * Get current dedupe status.
- * Return 0 for success
- * No possible error yet
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
  */
-void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
-struct btrfs_ioctl_dedupe_args *dargs);
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
 
 /*
  * Calculate hash for dedupe.
diff --git 

[PATCH v13 10/15] btrfs: dedupe: Inband in-memory only de-duplication implement

2016-09-08 Thread Qu Wenruo
Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses the dedupe hash to do inband de-duplication at extent level.

The work flow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedupe_bs
3) For hash match(duplicated) case, just increase source extent ref
   and insert file extent.
   For hash mismatch case, go through the normal cow_file_range()
   fallback, and add hash into dedupe_tree.
   Compress for hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory rb-tree,
with LRU behavior to control the limit.
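
The hit/miss flow above can be sketched as a toy userspace cache. This is
not the patch's code and makes several simplifying assumptions: a 64-bit
FNV-1a digest stands in for SHA-256, a small fixed array with a logical clock
stands in for the rb-trees and LRU list, and all demo_* names are
hypothetical. A real implementation would compare full digests and, on a hit,
take a reference on the existing extent.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_LIMIT_NR 2 /* maximum number of cached hashes */

struct demo_hash_entry {
	uint64_t hash;      /* toy digest of one dedupe block */
	uint64_t bytenr;    /* where that block's data already lives */
	uint64_t last_used; /* logical clock value, for LRU eviction */
	int in_use;
};

struct demo_dedupe {
	struct demo_hash_entry entries[DEMO_LIMIT_NR];
	uint64_t clock;
};

/* FNV-1a, used only to keep the sketch self-contained. */
static uint64_t demo_hash_block(const unsigned char *data, size_t len)
{
	uint64_t h = 14695981039346656037ULL;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 1099511628211ULL;
	}
	return h;
}

/*
 * Hit: return 1 and report the existing location in *found_bytenr.
 * Miss: record hash -> new_bytenr, evicting the least recently used entry
 * if the cache is full, and return 0 (the caller writes the data normally).
 */
static int demo_dedupe_write(struct demo_dedupe *d, const unsigned char *data,
			     size_t len, uint64_t new_bytenr,
			     uint64_t *found_bytenr)
{
	uint64_t hash = demo_hash_block(data, len);
	struct demo_hash_entry *victim = &d->entries[0];

	d->clock++;
	for (size_t i = 0; i < DEMO_LIMIT_NR; i++) {
		struct demo_hash_entry *e = &d->entries[i];

		if (e->in_use && e->hash == hash) {
			e->last_used = d->clock;
			*found_bytenr = e->bytenr;
			return 1;
		}
		if (!victim->in_use)
			continue; /* a free slot is already selected */
		if (!e->in_use || e->last_used < victim->last_used)
			victim = e;
	}

	victim->hash = hash;
	victim->bytenr = new_bytenr;
	victim->last_used = d->clock;
	victim->in_use = 1;
	return 0;
}
```

With DEMO_LIMIT_NR set to 2, writing blocks A, B, A, C in that order evicts B
rather than A, because the second write of A refreshed its position.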

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c |  20 
 fs/btrfs/inode.c   | 256 ++---
 fs/btrfs/relocation.c  |  16 
 3 files changed, 260 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9a7258e..a9a0855 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2401,6 +2402,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 
if (btrfs_delayed_ref_is_head(node)) {
struct btrfs_delayed_ref_head *head;
+   struct btrfs_fs_info *fs_info = root->fs_info;
+
/*
 * we've hit the end of the chain and we were supposed
 * to insert this extent into the tree.  But, it got
@@ -2416,6 +2419,18 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
btrfs_pin_extent(root, node->bytenr,
 node->num_bytes, 1);
if (head->is_data) {
+   /*
+* If insert_reserved is given, it means
+* a new extent is reserved, then deleted
+* in one transaction, and inc/dec get merged to 0.
+*
+* In this case, we need to remove its dedupe
+* hash.
+*/
+   ret = btrfs_dedupe_del(trans, fs_info,
+  node->bytenr);
+   if (ret < 0)
+   return ret;
ret = btrfs_del_csums(trans, root,
  node->bytenr,
  node->num_bytes);
@@ -7096,6 +7111,11 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
btrfs_release_path(path);
 
if (is_data) {
+   ret = btrfs_dedupe_del(trans, info, bytenr);
+   if (ret < 0) {
+   btrfs_abort_transaction(trans, ret);
+   goto out;
+   }
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
btrfs_abort_transaction(trans, ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85c5b30..ed37204 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -337,6 +337,7 @@ struct async_extent {
struct page **pages;
unsigned long nr_pages;
int compress_type;
+   struct btrfs_dedupe_hash *hash;
struct list_head list;
 };
 
@@ -355,7 +356,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 u64 compressed_size,
 struct page **pages,
 unsigned long nr_pages,
-int compress_type)
+int compress_type,
+struct btrfs_dedupe_hash *hash)
 {
struct async_extent *async_extent;
 
@@ -367,6 +369,7 @@ static noinline int add_async_extent(struct async_cow *cow,
async_extent->pages = pages;
async_extent->nr_pages = nr_pages;
async_extent->compress_type = compress_type;
+   async_extent->hash = hash;
list_add_tail(&async_extent->list, &cow->extents);
return 0;
 }
@@ -599,7 +602,7 @@ cont:
 */
add_async_extent(async_cow, start, num_bytes,
total_compressed, pages, nr_pages_ret,
-   compress_type);
+   compress_type, NULL);
 
if (start + num_bytes < end) {
start += num_bytes;
@@ 

[PATCH v13 04/15] btrfs: dedupe: Introduce function to add hash into in-memory tree

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_add() to add hash into in-memory tree.
And now we can implement the btrfs_dedupe_add() interface.
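
The insert-then-evict behavior of inmem_add() can be sketched in
userspace C. This is a simplified illustration, not the kernel code: the
names lru_cache, lru_add() and friends are made up, a linear list search
replaces the two rb-trees, and the duplicate check is done before
allocating instead of after.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct lru_entry {
	uint64_t bytenr;
	uint64_t hash;
	struct lru_entry *prev, *next;
};

struct lru_cache {
	/* Sentinel: head.next is the newest entry, head.prev the oldest. */
	struct lru_entry head;
	uint64_t current_nr;
	uint64_t limit_nr;
};

void lru_init(struct lru_cache *c, uint64_t limit_nr)
{
	c->head.prev = c->head.next = &c->head;
	c->current_nr = 0;
	c->limit_nr = limit_nr;
}

/* Find an entry that conflicts on either key (bytenr or hash). */
struct lru_entry *lru_find(struct lru_cache *c, uint64_t bytenr,
			   uint64_t hash)
{
	struct lru_entry *e;

	for (e = c->head.next; e != &c->head; e = e->next)
		if (e->bytenr == bytenr || e->hash == hash)
			return e;
	return NULL;
}

void lru_del(struct lru_cache *c, struct lru_entry *e)
{
	assert(c->current_nr > 0);
	e->prev->next = e->next;
	e->next->prev = e->prev;
	c->current_nr--;
	free(e);
}

/*
 * Add one (bytenr, hash) pair.
 * A conflicting entry means nothing is inserted, to save memory.
 * Returns 0 on success (including the conflict case), -1 on ENOMEM.
 */
int lru_add(struct lru_cache *c, uint64_t bytenr, uint64_t hash)
{
	struct lru_entry *e;

	if (lru_find(c, bytenr, hash))
		return 0;

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->bytenr = bytenr;
	e->hash = hash;

	/* Newest entries go to the list head. */
	e->next = c->head.next;
	e->prev = &c->head;
	c->head.next->prev = e;
	c->head.next = e;
	c->current_nr++;

	/* Drop the oldest entries while we exceed the limit. */
	while (c->current_nr > c->limit_nr)
		lru_del(c, c->head.prev);
	return 0;
}
```

With a limit of 2, adding a third distinct entry evicts the oldest one,
while adding an entry whose hash or bytenr is already present leaves the
cache unchanged.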

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 151 ++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index b14166a..e51412b 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -32,6 +32,14 @@ struct inmem_hash {
u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
+{
+   if (WARN_ON(algo >= ARRAY_SIZE(btrfs_hash_sizes)))
+   return NULL;
+   return kzalloc(sizeof(struct inmem_hash) + btrfs_hash_sizes[algo],
+   GFP_NOFS);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -183,3 +191,146 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
/* Place holder for bisect, will be implemented in later patches */
return 0;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+struct inmem_hash *hash, int hash_len)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+   if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+   p = &(*p)->rb_left;
+   else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->hash_node, parent, p);
+   rb_insert_color(&hash->hash_node, root);
+   return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+  struct inmem_hash *hash)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+   if (hash->bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (hash->bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return 1;
+   }
+   rb_link_node(&hash->bytenr_node, parent, p);
+   rb_insert_color(&hash->bytenr_node, root);
+   return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+   struct inmem_hash *hash)
+{
+   list_del(&hash->lru_list);
+   rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+   rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+   if (!WARN_ON(dedupe_info->current_nr == 0))
+   dedupe_info->current_nr--;
+
+   kfree(hash);
+}
+
+/*
+ * Insert a hash into the in-memory dedupe tree.
+ * Will remove the least recently used hashes if the limit is exceeded.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory.
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+struct btrfs_dedupe_hash *hash)
+{
+   int ret = 0;
+   u16 algo = dedupe_info->hash_algo;
+   struct inmem_hash *ihash;
+
+   ihash = inmem_alloc_hash(algo);
+
+   if (!ihash)
+   return -ENOMEM;
+
+   /* Copy the data out */
+   ihash->bytenr = hash->bytenr;
+   ihash->num_bytes = hash->num_bytes;
+   memcpy(ihash->hash, hash->hash, btrfs_hash_sizes[algo]);
+
+   mutex_lock(&dedupe_info->lock);
+
+   ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+   if (ret > 0) {
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+   btrfs_hash_sizes[algo]);
+   if (ret > 0) {
+   /*
+* We only keep one hash in tree to save memory, so if
+* hash conflicts, free the one to insert.
+*/
+   rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+   kfree(ihash);
+   ret = 0;
+   goto out;
+   }
+
+   list_add(&ihash->lru_list, &dedupe_info->lru_list);
+   dedupe_info->current_nr++;
+
+   /* Remove the last dedupe hash if we exceed limit */
+   while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+   struct inmem_hash *last;
+
+   last = list_entry(dedupe_info->lru_list.prev,
+ struct inmem_hash, lru_list);
+   __inmem_del(dedupe_info, last);
+   }
+out:
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+int btrfs_dedupe_add(struct 

[PATCH v13 14/15] btrfs: dedupe: fix false ENOSPC

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

When testing in-band dedupe, we sometimes got ENOSPC errors even though
the fs still had plenty of free space. After some debugging, we found that
btrfs_delalloc_reserve_metadata() sometimes tries to reserve a large
amount of metadata space, even for a very small data range.

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated from the difference between outstanding_extents
and reserved_extents. The following case shows how ENOSPC occurs:

  1, Buffered write 128MB of data in units of 1MB, so finally the inode
has 1 outstanding extent and 128 reserved_extents.
Note it's btrfs_merge_extent_hook() that merges these 1MB units into
one big outstanding extent, but does not change reserved_extents.

  2, When writing dirty pages, for in-band dedupe, cow_file_range() will
split the above big extent in units of 16KB (assuming our in-band dedupe
blocksize is 16KB). When the first split operation finishes, we'll have 2
outstanding extents and 128 reserved extents. If the just-generated
ordered extent happens to be dispatched to run and complete right then,
btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) will be
called to release metadata, after which we will have 1 outstanding extent
and 1 reserved extent (also see the logic in drop_outstanding_extent()).
Later cow_file_range() continues to handle the remaining data range
[16KB, 128MB), and if no other ordered extent is dispatched to run, there
will be 8191 outstanding extents and 1 reserved extent.

  3, Now if another buffered write for this file enters, then
btrfs_delalloc_reserve_metadata() will try to reserve metadata for at
least 8191 outstanding extents. For a 64K node size, that is
8191*65536*16, about 8GB of metadata, so obviously it'll return ENOSPC.
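
The arithmetic in the case above can be checked with a few lines of C.
The constants come from the description (128MB range, 16KB dedupe
blocksize, 64K node size, and the factor of 16 quoted in the text); the
helper names are made up for this sketch and are not btrfs functions.

```c
#include <assert.h>
#include <stdint.h>

/* Number of extents a range is split into at a given max extent size. */
uint64_t count_extents(uint64_t range_bytes, uint64_t max_extent_size)
{
	assert(max_extent_size != 0);
	return (range_bytes + max_extent_size - 1) / max_extent_size;
}

/* Metadata bytes per the formula quoted above: extents * nodesize * 16. */
uint64_t metadata_bytes(uint64_t nr_extents, uint64_t nodesize)
{
	return nr_extents * nodesize * 16;
}
```

With a 128MB max extent size the 128MB range is a single extent, while
with a 16KB dedupe blocksize it becomes 8192 extents, of which 8191 are
still outstanding after the first one completes. The quoted formula then
gives 8191 * 65536 * 16 = 8588886016 bytes, just under 8GiB.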

But when a file goes through in-band dedupe, its max extent size is no
longer BTRFS_MAX_EXTENT_SIZE (128MB); it is limited by the in-band dedupe
blocksize. So the current metadata reservation method in btrfs is not
appropriate for this case. Here we introduce btrfs_max_extent_size(),
which returns the max extent size for the corresponding files that go
through in-band dedupe. We use this value for metadata reservation and for
the extent_io merge, split, and clear operations, which ensures that the
difference between outstanding_extents and reserved_extents will not be
so big.

Currently only buffered write will go through in-band dedupe if in-band
dedupe is enabled.

Reported-by: Satoru Takeuchi 
Cc: Josef Bacik 
Cc: Mark Fasheh 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h |  14 ++--
 fs/btrfs/dedupe.h|  35 ++
 fs/btrfs/extent-tree.c   |  62 +
 fs/btrfs/extent_io.c |  63 -
 fs/btrfs/extent_io.h |  15 +++-
 fs/btrfs/file.c  |  26 ---
 fs/btrfs/free-space-cache.c  |   5 +-
 fs/btrfs/inode-map.c |   4 +-
 fs/btrfs/inode.c | 147 +++
 fs/btrfs/ioctl.c |   6 +-
 fs/btrfs/ordered-data.h  |   1 +
 fs/btrfs/relocation.c|   6 +-
 fs/btrfs/tests/extent-io-tests.c |   6 +-
 13 files changed, 284 insertions(+), 106 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 891a583..fde4d25 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2681,10 +2681,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 void btrfs_subvolume_release_metadata(struct btrfs_root *root,
  struct btrfs_block_rsv *rsv,
  u64 qgroup_reserved);
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
-void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes,
+   u32 max_extent_size);
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes,
+u32 max_extent_size);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
+u32 max_extent_size);
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
+ u32 max_extent_size);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
  unsigned short type);
@@ -3221,7 +3225,7 @@ int btrfs_release_file(struct inode *inode, struct file *file);
 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
   

[PATCH v13 08/15] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

Unlike the dedupe method (in-memory or on-disk), only one hash method,
SHA256, is supported so far, so implement the btrfs_dedupe_calc_hash()
interface using SHA256.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index ef4968f..d0d2f8a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -639,3 +639,49 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
}
return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash)
+{
+   int i;
+   int ret;
+   struct page *p;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+   struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+   SHASH_DESC_ON_STACK(sdesc, tfm);
+   u64 dedupe_bs;
+   u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+   if (!fs_info->dedupe_enabled || !hash)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+   dedupe_bs = dedupe_info->blocksize;
+
+   sdesc->tfm = tfm;
+   sdesc->flags = 0;
+   ret = crypto_shash_init(sdesc);
+   if (ret)
+   return ret;
+   for (i = 0; sectorsize * i < dedupe_bs; i++) {
+   char *d;
+
+   p = find_get_page(inode->i_mapping,
+ (start >> PAGE_SHIFT) + i);
+   if (WARN_ON(!p))
+   return -ENOENT;
+   d = kmap(p);
+   ret = crypto_shash_update(sdesc, d, sectorsize);
+   kunmap(p);
+   put_page(p);
+   if (ret)
+   return ret;
+   }
+   ret = crypto_shash_final(sdesc, hash->hash);
+   return ret;
+}
-- 
2.9.3





[PATCH v13 00/15] Btrfs In-band De-duplication

2016-09-08 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160907

This version is just another small update, rebased to David's
for-next-20160906 branch.

This update only includes one small fix, which was exposed by recent
commits that check space_info->bytes_may_use at umount time.
The cause was that we only freed quota reserved space at hash hit, but
didn't free space_info->bytes_may_use.

Other rebase changes are all related to recent infrastructure change,
like io_tree and quota flags change.

We ran xfstests with dedupe enabled.
We encountered several bugs, but they are unrelated to dedupe and come
from the base branch.

We'll keep digging to fix these non-dedupe bugs.

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug
  Add handle for multiple hash on same bytenr corner case to fix abort
  trans error
  Increase dedup rate by enhancing delayed ref handler for both backend.
  Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
  Increase dedup block size up limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backend.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  and wrong parameter.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.
v9:
  Re-order the patchset to completely separate pure in-memory and any
  on-disk format change.
  Fold bug fixes into its original patch.
v10:
  Adding back missing bug fix patch.
  Reduce on-disk item size.
  Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.
v11:
  Remove other backend and props support to focus on the framework and
  in-memory backend. Suggested by David.
  Better disable and buffered write race protection.
  Comprehensive fix to dedupe metadata ENOSPC problem.
v12:
  Stateful 'enable' ioctl and new 'reconf' ioctl
  New FORCE flag for enable ioctl to allow stateless ioctl
  Precise error report and extendable ioctl structure.
v12.1
  Rebase to David's for-next-20160704 branch
  Add co-ordinate patch for subpage and dedupe patchset. 
v12.2
  Rebase to David's for-next-20160715 branch
  Add co-ordinate patch for other patchset.
v13
  Rebase to David's for-next-20160906 branch
  Fix a reserved space leak bug, which only frees quota reserved space
  but not space_info->bytes_may_use.

Qu Wenruo (5):
  btrfs: expand btrfs_set_extent_delalloc() and its friends to support
in-band dedupe and subpage size patchset
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: relocation: Enhance error handling to avoid BUG_ON
  btrfs: dedupe: Introduce new reconfigure ioctl

Wang Xiaoguang (10):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: dedupe: Add ioctl for inband dedupelication
  btrfs: improve inode's outstanding_extents computation
  btrfs: dedupe: fix false ENOSPC

 fs/btrfs/Makefile|   2 +-
 fs/btrfs/ctree.h |  25 +-
 fs/btrfs/dedupe.c| 820 +++
 fs/btrfs/dedupe.h| 201 +-
 fs/btrfs/delayed-ref.c   |  30 +-
 fs/btrfs/delayed-ref.h   |   8 +
 fs/btrfs/disk-io.c   |   4 +
 fs/btrfs/extent-tree.c   |  82 +++-
 fs/btrfs/extent_io.c |  65 +++-
 fs/btrfs/extent_io.h |  17 +-
 fs/btrfs/file.c  |  26 +-
 fs/btrfs/free-space-cache.c  |   5 +-
 fs/btrfs/inode-map.c |   4 +-
 fs/btrfs/inode.c | 463 +-
 fs/btrfs/ioctl.c |  93 -
 fs/btrfs/ordered-data.c  |  46 ++-
 fs/btrfs/ordered-data.h  |  14 +
 

[PATCH v13 15/15] btrfs: dedupe: Introduce new reconfigure ioctl

2016-09-08 Thread Qu Wenruo
Introduce new reconfigure ioctl, and new FORCE flag for in-band dedupe
ioctls.

Now dedupe enable and reconfigure ioctl are stateful.


| Current state |   Ioctl    | Next state  |
--------------------------------------------
| Disabled      |  enable    | Enabled     |
| Enabled       |  enable    | Not allowed |
| Enabled       |  reconf    | Enabled     |
| Enabled       |  disable   | Disabled    |
| Disabled      |  disable   | Disabled    |
| Disabled      |  reconf    | Not allowed |

(Disable is always stateless.)

For those who prefer stateless ioctls (myself for example), a new FORCE
flag is introduced.

In FORCE mode, enable/disable is completely stateless.

| Current state |   Ioctl    | Next state  |
--------------------------------------------
| Disabled      |  enable    | Enabled     |
| Enabled       |  enable    | Enabled     |
| Enabled       |  disable   | Disabled    |
| Disabled      |  disable   | Disabled    |
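
The two tables can be folded into one small transition function. This is
a userspace sketch with made-up names (dedupe_transition() and the enums
are not part of the patch). Since the FORCE table does not list reconf,
the sketch assumes reconf behaves the same with or without FORCE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum dedupe_state { DEDUPE_DISABLED, DEDUPE_ENABLED };
enum dedupe_cmd { CMD_ENABLE, CMD_DISABLE, CMD_RECONF };

/*
 * Apply one ioctl command to *state.
 * Returns true if the transition is allowed (and *state is updated),
 * false for the "Not allowed" rows.
 */
bool dedupe_transition(enum dedupe_state *state, enum dedupe_cmd cmd,
		       bool force)
{
	assert(state != NULL);

	switch (cmd) {
	case CMD_ENABLE:
		/* Stateful enable refuses to run on an enabled fs. */
		if (*state == DEDUPE_ENABLED && !force)
			return false;
		*state = DEDUPE_ENABLED;
		return true;
	case CMD_DISABLE:
		/* Disable is always stateless. */
		*state = DEDUPE_DISABLED;
		return true;
	case CMD_RECONF:
		/* Assumption: reconf needs an enabled fs, FORCE or not. */
		return *state == DEDUPE_ENABLED;
	}
	return false;
}
```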


Also, the re-configure ioctl will only modify the specified fields,
whereas enable fills un-specified fields with their default values.

For example:
 # btrfs dedupe enable --block-size 64k /mnt
 # btrfs dedupe reconfigure --limit-hash 1m /mnt
Will lead to:
 dedupe blocksize: 64K
 dedupe hash limit nr: 1m

While for enable:
 # btrfs dedupe enable --force --block-size 64k /mnt
 # btrfs dedupe enable --force --limit-hash 1m /mnt
Will reset blocksize to default value:
 dedupe blocksize: 128K << reset
 dedupe hash limit nr: 1m
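
The difference between the two examples can be sketched as follows. This
is a userspace illustration only: struct dedupe_conf, conf_enable() and
conf_reconf() are hypothetical names, and DEFAULT_LIMIT_NR is a made-up
value. The 128K default blocksize is the one shown in the example above,
and the "unspecified" markers ((u64)-1 for blocksize, 0 for the limit)
follow the checks in get_dedupe_status() in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNSET_U64		((uint64_t)-1)	/* blocksize not specified */
#define DEFAULT_BLOCKSIZE	(128 * 1024ULL)	/* default from the example */
#define DEFAULT_LIMIT_NR	32768ULL	/* made-up default for the sketch */

struct dedupe_conf {
	uint64_t blocksize;
	uint64_t limit_nr;	/* 0 in a request means "not specified" */
};

/* enable (with FORCE): unspecified fields fall back to defaults. */
void conf_enable(struct dedupe_conf *cur, const struct dedupe_conf *req)
{
	assert(cur != NULL && req != NULL);
	cur->blocksize = (req->blocksize == UNSET_U64) ?
			 DEFAULT_BLOCKSIZE : req->blocksize;
	cur->limit_nr = (req->limit_nr == 0) ?
			DEFAULT_LIMIT_NR : req->limit_nr;
}

/* reconfigure: unspecified fields keep their current values. */
void conf_reconf(struct dedupe_conf *cur, const struct dedupe_conf *req)
{
	assert(cur != NULL && req != NULL);
	if (req->blocksize != UNSET_U64)
		cur->blocksize = req->blocksize;
	if (req->limit_nr != 0)
		cur->limit_nr = req->limit_nr;
}
```

Running enable with a 64K blocksize followed by reconfigure with only a
limit keeps the 64K blocksize, while a second forced enable with only a
limit resets the blocksize to 128K.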

Suggested-by: David Sterba 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dedupe.c  | 131 -
 fs/btrfs/dedupe.h  |  13 +
 fs/btrfs/ioctl.c   |  13 +
 include/uapi/linux/btrfs.h |  11 +++-
 4 files changed, 143 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 37b5a05..5fd4a9c 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,40 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 algo)
GFP_NOFS);
 }
 
+/*
+ * Copy from current dedupe info to fill dargs.
+ * For reconf case, only fill members which are uninitialized.
+ */
+static void get_dedupe_status(struct btrfs_dedupe_info *dedupe_info,
+ struct btrfs_ioctl_dedupe_args *dargs)
+{
+   int reconf = (dargs->cmd == BTRFS_DEDUPE_CTL_RECONF);
+
+   dargs->status = 1;
+
+   if (!reconf || (reconf && dargs->blocksize == (u64)-1))
+   dargs->blocksize = dedupe_info->blocksize;
+   if (!reconf || (reconf && dargs->backend == (u16)-1))
+   dargs->backend = dedupe_info->backend;
+   if (!reconf || (reconf && dargs->hash_algo ==(u16)-1))
+   dargs->hash_algo = dedupe_info->hash_algo;
+
+   /*
+* For the re-configure case, if the limits are not being modified,
+* they will be set to 0, unlike other fields
+*/
+   if (!reconf || !(dargs->limit_nr || dargs->limit_mem)) {
+   dargs->limit_nr = dedupe_info->limit_nr;
+   dargs->limit_mem = dedupe_info->limit_nr *
+   (sizeof(struct inmem_hash) +
+btrfs_hash_sizes[dedupe_info->hash_algo]);
+   }
+
+   /* current_nr doesn't make sense for reconfig case */
+   if (!reconf)
+   dargs->current_nr = dedupe_info->current_nr;
+}
+
 void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
 struct btrfs_ioctl_dedupe_args *dargs)
 {
@@ -57,15 +91,7 @@ void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
return;
}
mutex_lock(&dedupe_info->lock);
-   dargs->status = 1;
-   dargs->blocksize = dedupe_info->blocksize;
-   dargs->backend = dedupe_info->backend;
-   dargs->hash_algo = dedupe_info->hash_algo;
-   dargs->limit_nr = dedupe_info->limit_nr;
-   dargs->limit_mem = dedupe_info->limit_nr *
-   (sizeof(struct inmem_hash) +
-btrfs_hash_sizes[dedupe_info->hash_algo]);
-   dargs->current_nr = dedupe_info->current_nr;
+   get_dedupe_status(dedupe_info, dargs);
mutex_unlock(&dedupe_info->lock);
memset(dargs->__unused, -1, sizeof(dargs->__unused));
 }
@@ -114,17 +140,50 @@ static int init_dedupe_info(struct btrfs_dedupe_info **ret_info,
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info,
  struct btrfs_ioctl_dedupe_args *dargs)
 {
-   u64 blocksize = dargs->blocksize;
-   u64 limit_nr = dargs->limit_nr;
-   u64 limit_mem = dargs->limit_mem;
-   u16 hash_algo = dargs->hash_algo;
-   u8 backend = dargs->backend;
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   u64 blocksize;
+   u64 limit_nr;
+   u64 

[PATCH v13 07/15] btrfs: dedupe: Introduce function to search for an existing hash

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_search() to handle the job for in-memory
hash tree.

The trick is, we must ensure the delayed ref head is not being run at
the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/dedupe.c | 185 ++
 1 file changed, 185 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 14c57fa..ef4968f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
 
 struct inmem_hash {
struct rb_node hash_node;
@@ -454,3 +455,187 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
kfree(dedupe_info);
return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+   struct rb_node **p = &dedupe_info->hash_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+   u16 hash_algo = dedupe_info->hash_algo;
+   int hash_len = btrfs_hash_sizes[hash_algo];
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+   if (memcmp(hash, entry->hash, hash_len) < 0) {
+   p = &(*p)->rb_left;
+   } else if (memcmp(hash, entry->hash, hash_len) > 0) {
+   p = &(*p)->rb_right;
+   } else {
+   /* Found, need to re-add it to LRU list head */
+   list_del(&entry->lru_list);
+   list_add(&entry->lru_list, &dedupe_info->lru_list);
+   return entry;
+   }
+   }
+   return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash)
+{
+   int ret;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *head;
+   struct btrfs_delayed_ref_head *insert_head;
+   struct btrfs_delayed_data_ref *insert_dref;
+   struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+   struct inmem_hash *found_hash;
+   int free_insert = 1;
+   u64 bytenr;
+   u32 num_bytes;
+
+   insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+   if (!insert_head)
+   return -ENOMEM;
+   insert_head->extent_op = NULL;
+   insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+   if (!insert_dref) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+   return -ENOMEM;
+   }
+   if (test_bit(BTRFS_FS_QUOTA_ENABLED, &root->fs_info->flags) &&
+   is_fstree(root->root_key.objectid)) {
+   insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+   if (!insert_qrecord) {
+   kmem_cache_free(btrfs_delayed_ref_head_cachep,
+   insert_head);
+   kmem_cache_free(btrfs_delayed_data_ref_cachep,
+   insert_dref);
+   return -ENOMEM;
+   }
+   }
+
+   trans = btrfs_join_transaction(root);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   goto free_mem;
+   }
+
+again:
+   mutex_lock(&dedupe_info->lock);
+   found_hash = inmem_search_hash(dedupe_info, hash->hash);
+   /* If we don't find a duplicated extent, just return. */
+   if (!found_hash) {
+   ret = 0;
+   goto out;
+   }
+   bytenr = found_hash->bytenr;
+   num_bytes = found_hash->num_bytes;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   head = btrfs_find_delayed_ref_head(trans, bytenr);
+   if (!head) {
+   /*
+* We can safely insert a new delayed_ref as long as we
+* hold delayed_refs->lock.
+* Only need to use atomic inc_extent_ref()
+*/
+   btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+   insert_dref, insert_head, insert_qrecord,
+   bytenr, num_bytes, 0, root->root_key.objectid,
+   btrfs_ino(inode), file_pos, 0,
+   BTRFS_ADD_DELAYED_REF);
+   spin_unlock(&delayed_refs->lock);
+
+   /* 

[PATCH v13 02/15] btrfs: dedupe: Introduce dedupe framework and its header

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce the btrfs in-band (write time) de-duplication framework and
its header.

The new de-duplication framework is going to support 2 different dedupe
methods and 1 dedupe hash.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h   |   7 +++
 fs/btrfs/dedupe.h  | 137 -
 fs/btrfs/disk-io.c |   1 +
 include/uapi/linux/btrfs.h |  34 +++
 4 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ec46519..e4a7489 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1080,6 +1080,13 @@ struct btrfs_fs_info {
 
/* Used to record internally whether fs has been frozen */
int fs_frozen;
+
+   /*
+* Inband de-duplication related structures
+*/
+   unsigned long dedupe_enabled:1;
+   struct btrfs_dedupe_info *dedupe_info;
+   struct mutex dedupe_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 83ebfe2..5ecc321 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -19,6 +19,139 @@
 #ifndef __BTRFS_DEDUPE__
 #define __BTRFS_DEDUPE__
 
-/* later in-band dedupe will expand this struct */
-struct btrfs_dedupe_hash;
+#include 
+#include 
+#include 
+
+static const int btrfs_hash_sizes[] = { 32 };
+
+/*
+ * For caller outside of dedupe.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+   u64 bytenr;
+   u32 num_bytes;
+
+   /* last field is a variable length array of dedupe hash */
+   u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+   /* dedupe blocksize */
+   u64 blocksize;
+   u16 backend;
+   u16 hash_algo;
+
+   struct crypto_shash *dedupe_driver;
+
+   /*
+* Use mutex to protect both backends
+* Even for in-memory backends, the rb-tree can be quite large,
+* so mutex is better for such use case.
+*/
+   struct mutex lock;
+
+   /* following members are only used in in-memory backend */
+   struct rb_root hash_root;
+   struct rb_root bytenr_root;
+   struct list_head lru_list;
+   u64 limit_nr;
+   u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+   return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 algo);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo);
+
+/*
+ * Initialize inband dedupe info
+ * Called at dedupe enable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (from unsupported param to tree creation error for some backends)
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info,
+   struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Get current dedupe status.
+ * Return 0 for success
+ * No possible error yet
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+struct btrfs_ioctl_dedupe_args *dargs);
+
+/*
+ * Calculate hash for dedupe.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (error from hash codes)
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+  struct inode *inode, u64 start,
+  struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+   struct inode *inode, u64 file_pos,
+   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Add a dedupe hash into dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info,
+struct btrfs_dedupe_hash *hash);
+
+/*
+ * Remove a dedupe hash from dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree 

[PATCH v13 05/15] btrfs: dedupe: Introduce function to remove hash from in-memory tree

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

Introduce static function inmem_del() to remove hash from in-memory
dedupe tree.
And implement btrfs_dedupe_del() and btrfs_dedupe_disable() interfaces.

Also for btrfs_dedupe_disable(), add new functions to wait for existing
writers and block incoming writers, to eliminate all possible races.

Cc: Mark Fasheh 
Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/dedupe.c | 132 +++---
 1 file changed, 126 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index e51412b..14c57fa 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -186,12 +186,6 @@ enable:
return ret;
 }
 
-int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
-{
-   /* Place holder for bisect, will be implemented in later patches */
-   return 0;
-}
-
 static int inmem_insert_hash(struct rb_root *root,
 struct inmem_hash *hash, int hash_len)
 {
@@ -334,3 +328,129 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
return inmem_add(dedupe_info, hash);
return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+   struct rb_node *parent = NULL;
+   struct inmem_hash *entry = NULL;
+
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+   if (bytenr < entry->bytenr)
+   p = &(*p)->rb_left;
+   else if (bytenr > entry->bytenr)
+   p = &(*p)->rb_right;
+   else
+   return entry;
+   }
+
+   return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+   struct inmem_hash *hash;
+
+   mutex_lock(&dedupe_info->lock);
+   hash = inmem_search_bytenr(dedupe_info, bytenr);
+   if (!hash) {
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+   }
+
+   __inmem_del(dedupe_info, hash);
+   mutex_unlock(&dedupe_info->lock);
+   return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+   struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+   if (!fs_info->dedupe_enabled)
+   return 0;
+
+   if (WARN_ON(dedupe_info == NULL))
+   return -EINVAL;
+
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   return inmem_del(dedupe_info, bytenr);
+   return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+   struct inmem_hash *entry, *tmp;
+
+   mutex_lock(&dedupe_info->lock);
+   list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+   __inmem_del(dedupe_info, entry);
+   mutex_unlock(&dedupe_info->lock);
+}
+
+/*
+ * Helper function to wait and block all incoming writers
+ *
+ * Use rw_sem introduced for freeze to wait/block writers.
+ * So during the block time, no new write will happen, so we can
+ * do something quite safe, especially helpful for dedupe disable,
+ * as it affects buffered writes.
+ */
+static void block_all_writers(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+   down_write(&sb->s_umount);
+}
+
+static void unblock_all_writers(struct btrfs_fs_info *fs_info)
+{
+   struct super_block *sb = fs_info->sb;
+
+   up_write(&sb->s_umount);
+   percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_dedupe_info *dedupe_info;
+   int ret;
+
+   dedupe_info = fs_info->dedupe_info;
+
+   if (!dedupe_info)
+   return 0;
+
+   /* Don't allow disable status change in RO mount */
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
+   /*
+* Wait for all unfinished writers and block further writers.
+* Then sync the whole fs so all current write will go through
+* dedupe, and all later write won't go through dedupe.
+*/
+   block_all_writers(fs_info);
+   ret = sync_filesystem(fs_info->sb);
+   fs_info->dedupe_enabled = 0;
+   fs_info->dedupe_info = NULL;
+   unblock_all_writers(fs_info);
+   if (ret < 0)
+   return ret;
+
+   /* now we are OK to clean up everything */
+   if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+   inmem_destroy(dedupe_info);
+
+   crypto_free_shash(dedupe_info->dedupe_driver);
+   kfree(dedupe_info);
+  

[PATCH v13 13/15] btrfs: improve inode's outstanding_extents computation

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

This issue was revealed by changing BTRFS_MAX_EXTENT_SIZE from 128MB to 64KB.
With that change, the fsstress test often triggers these warnings from
btrfs_destroy_inode():
WARN_ON(BTRFS_I(inode)->outstanding_extents);
WARN_ON(BTRFS_I(inode)->reserved_extents);

The simple test program below reproduces this issue reliably.
Note: you need to change BTRFS_MAX_EXTENT_SIZE to 64KB to run the test,
otherwise the WARNING won't appear.
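#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(void)
{
	int fd;
	char buf[68 * 1024];

	memset(buf, 0, 68 * 1024);
	/* O_CREAT requires an explicit mode argument */
	fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
	if (fd < 0)
		return 1;
	/* buffered write of 68KB starting at file offset 64KB */
	pwrite(fd, buf, 68 * 1024, 64 * 1024);
	close(fd);
	return 0;
}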

When BTRFS_MAX_EXTENT_SIZE is 64KB, and buffered data range is:
64KB               128KB          132KB
|------------------|--------------|
            64KB + 4KB

1) for above data range, btrfs_delalloc_reserve_metadata() will reserve
metadata and set BTRFS_I(inode)->outstanding_extents to 2.
(68KB + 64KB - 1) / 64KB == 2

Outstanding_extents: 2

2) then btrfs_dirty_pages() will be called to dirty pages and set
EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called
twice.
The 1st set_bit_hook() call will set DELALLOC flag for the first 64KB.
64KB               128KB
|------------------|
   64KB DELALLOC
Outstanding_extents: 2

set_bit_hook() uses FIRST_DELALLOC flag to avoid re-increasing the
outstanding_extents counter.
So the 1st set_bit_hook() call won't modify outstanding_extents;
it's still 2.

Then FIRST_DELALLOC flag is *CLEARED*.

3) 2nd btrfs_set_bit_hook() call.
Because FIRST_DELALLOC has been cleared by the previous set_bit_hook(),
btrfs_set_bit_hook() will increase BTRFS_I(inode)->outstanding_extents by
one, so now BTRFS_I(inode)->outstanding_extents is 3.
64KB               128KB          132KB
|------------------|--------------|
   64KB DELALLOC     4KB DELALLOC
Outstanding_extents: 3

But the correct outstanding_extents number should be 2, not 3.
The 2nd btrfs_set_bit_hook() call gets this wrong, which leads to the
WARN_ON().

Normally, we could solve it by only increasing outstanding_extents in
set_bit_hook().
But the problem is that delalloc_reserve/release_metadata() only get
a 'length' parameter, and calculate an inaccurate outstanding_extents.
If we only rely on set_bit_hook(), release_metadata() will screw things up
as it will decrease by an inaccurate number.

So the fix we use is:
1) Increase *INACCURATE* outstanding_extents at delalloc_reserve_meta
   Just as a placeholder.
2) Increase *accurate* outstanding_extents at set_bit_hook()
   This is the real increaser.
3) Decrease *INACCURATE* outstanding_extents before returning
   This brings outstanding_extents to the correct value.

For 128M BTRFS_MAX_EXTENT_SIZE, due to limitation of
__btrfs_buffered_write(), each iteration will only handle about 2MB
data.
So btrfs_dirty_pages() won't need to handle cases that cross 2 extents.
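
The accounting walk-through above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions, not the
kernel code: extents_for_len(), buggy_outstanding() and
fixed_outstanding() are hypothetical names, and the model assumes one
set_bit_hook() call per MAX_EXTENT_SIZE-sized piece of the range, as in
the 68KB example.

```c
#include <stdint.h>

/* Illustrative value: the commit message lowers the limit to 64KB. */
#define MAX_EXTENT_SIZE (64ULL * 1024)

/* Hypothetical helper: extents needed to cover len bytes, rounded up. */
static uint64_t extents_for_len(uint64_t len)
{
	return (len + MAX_EXTENT_SIZE - 1) / MAX_EXTENT_SIZE;
}

/*
 * Model of the old accounting: the reserve step adds the rounded-up
 * count, the first hook call is skipped thanks to FIRST_DELALLOC, and
 * every later hook call adds one more.
 */
static uint64_t buggy_outstanding(uint64_t len)
{
	uint64_t outstanding = extents_for_len(len);	/* reserve step */
	uint64_t hooks = extents_for_len(len);		/* assumed hook calls */
	int first_delalloc = 1;
	uint64_t i;

	for (i = 0; i < hooks; i++) {
		if (first_delalloc) {
			first_delalloc = 0;	/* flag is cleared, no increase */
			continue;
		}
		outstanding++;
	}
	return outstanding;
}

/*
 * Model of the three-step fix: add a placeholder at reserve time, let
 * every hook call add an accurate increment, then drop the placeholder.
 */
static uint64_t fixed_outstanding(uint64_t len)
{
	uint64_t placeholder = extents_for_len(len);
	uint64_t hooks = extents_for_len(len);		/* assumed hook calls */
	uint64_t outstanding = placeholder;		/* step 1 */

	outstanding += hooks;				/* step 2 */
	outstanding -= placeholder;			/* step 3 */
	return outstanding;
}
```

For the 68KB write from the test program, extents_for_len() gives 2, the
buggy model ends at 3 and the fixed model ends at 2, matching the
numbers in the description.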

Cc: Mark Fasheh 
Cc: Josef Bacik 
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 68 +++-
 fs/btrfs/ioctl.c |  6 ++---
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e4a7489..891a583 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3129,6 +3129,8 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info 
*fs_info, int delay_iput,
   int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
  struct extent_state **cached_state, int dedupe);
+int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
+   struct extent_state **cached_state);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 struct btrfs_root *new_root,
 struct btrfs_root *parent_root,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ed37204..ef6abb2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1750,11 +1750,15 @@ static void btrfs_split_extent_hook(void *private_data,
 {
struct inode *inode = private_data;
u64 size;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
 
/* not delalloc, ignore it */
if (!(orig->state & EXTENT_DELALLOC))
return;
 
+   if (root == root->fs_info->tree_root)
+   return;
+
size = orig->end - orig->start + 1;
if (size > 

[PATCH v13 01/15] btrfs: expand btrfs_set_extent_delalloc() and its friends to support in-band dedupe and subpage size patchset

2016-09-08 Thread Qu Wenruo
Expand the parameters of btrfs_set_extent_delalloc() and
extent_clear_unlock_delalloc() for both the in-band dedupe and the
subpage sector size patchsets.

This should reduce conflicts between the two patchsets and the effort to
rebase them.

Cc: Chandan Rajendra 
Cc: David Sterba 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/extent_io.c |  2 +-
 fs/btrfs/extent_io.h |  2 +-
 fs/btrfs/file.c  |  2 +-
 fs/btrfs/inode.c | 40 ++--
 fs/btrfs/relocation.c|  2 +-
 fs/btrfs/tests/inode-tests.c | 12 ++--
 7 files changed, 37 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e590152..ec46519 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3121,7 +3121,7 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, 
int delay_iput);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
   int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
- struct extent_state **cached_state);
+ struct extent_state **cached_state, int dedupe);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 struct btrfs_root *new_root,
 struct btrfs_root *parent_root,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 05bb391..0764e95 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1709,7 +1709,7 @@ out_failed:
 }
 
 void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
-struct page *locked_page,
+u64 delalloc_end, struct page *locked_page,
 unsigned clear_bits,
 unsigned long page_ops)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index aa6341c..c6177a9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -431,7 +431,7 @@ int map_private_extent_buffer(struct extent_buffer *eb, 
unsigned long offset,
 void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
-struct page *locked_page,
+u64 delalloc_end, struct page *locked_page,
 unsigned bits_to_clear,
 unsigned long page_ops);
 struct bio *
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 6a9ada0..cbefdc8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -503,7 +503,7 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode 
*inode,
 
end_of_last_block = start_pos + num_bytes - 1;
err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
-   cached);
+   cached, 0);
if (err)
return err;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 353e80e..85c5b30 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -560,8 +560,9 @@ cont:
 * we don't need to create any more async work items.
 * Unlock and free up our temp pages.
 */
-   extent_clear_unlock_delalloc(inode, start, end, NULL,
-clear_flags, PAGE_UNLOCK |
+   extent_clear_unlock_delalloc(inode, start, end, end,
+NULL, clear_flags,
+PAGE_UNLOCK |
 PAGE_CLEAR_DIRTY |
 PAGE_SET_WRITEBACK |
 page_error_op |
@@ -837,6 +838,8 @@ retry:
extent_clear_unlock_delalloc(inode, async_extent->start,
async_extent->start +
async_extent->ram_size - 1,
+   async_extent->start +
+   async_extent->ram_size - 1,
NULL, EXTENT_LOCKED | EXTENT_DELALLOC,
PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
PAGE_SET_WRITEBACK);
@@ -856,7 +859,8 @@ retry:
tree->ops->writepage_end_io_hook(p, start, end,
 NULL, 0);
p->mapping = NULL;
-   extent_clear_unlock_delalloc(inode, start, end, NULL, 0,
+   extent_clear_unlock_delalloc(inode, start, end, end,
+

[PATCH v13 12/15] btrfs: relocation: Enhance error handling to avoid BUG_ON

2016-09-08 Thread Qu Wenruo
Since the introduction of the btrfs dedupe tree, it's possible that balance
can race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
ERR_PTR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased but the node is neither added to
backref_cache nor is nr_nodes decreased,
causing a BUG_ON() in backref_cache_cleanup():

[ 2611.668810] [ cut here ]
[ 2611.669946] kernel BUG at
/home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode:  [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  []
btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  []
btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [] btrfs_ioctl_balance+0x391/0x3a0
[btrfs]
[ 2611.689587]  [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [] ?
lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [] ? vma_link+0xb9/0xc0
[ 2611.693303]  [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [] SyS_ioctl+0x41/0x70
[ 2611.694758]  [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0
05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b
0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  []
relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP 

This patch calls remove_backref_node() in the error handling branch, and
catches the returned -ENOENT in relocate_tree_blocks() so that balancing
can continue.
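
The "skip -ENOENT, stop on anything else" loop described above can be
sketched in userspace C. This is an illustration only: err_ptr(),
ptr_err() and is_err() are stand-ins written here to mimic the kernel's
ERR_PTR()/PTR_ERR()/IS_ERR() helpers, and struct node, build_node() and
process_all() are hypothetical names, not the relocation code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct node {
	int id;
};

/* Hypothetical builder: fails with the errno recorded for slot i, if any. */
static struct node *build_node(struct node *slots, const int *errs, size_t i)
{
	if (errs[i])
		return err_ptr(-errs[i]);
	return &slots[i];
}

/*
 * Walk all slots. A node that fails with -ENOENT is skipped and the walk
 * continues; any other error stops the walk and is returned.
 * *done counts the nodes that were actually processed.
 */
static int process_all(struct node *slots, const int *errs, size_t n,
		       size_t *done)
{
	int err = 0;
	size_t i;

	*done = 0;
	for (i = 0; i < n; i++) {
		struct node *node = build_node(slots, errs, i);

		if (is_err(node)) {
			if (ptr_err(node) == -ENOENT)
				continue;	/* root is going away, skip it */
			err = (int)ptr_err(node);
			break;
		}
		(*done)++;
	}
	return err;
}
```

Using a for loop with the advance step in the loop header is what makes
"continue" safe here; with a while loop that advances at the bottom of
the body, "continue" would retry the same node forever.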

Reported-by: Satoru Takeuchi 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/relocation.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 6e8086a..6ae287f 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -888,6 +888,13 @@ again:
root = read_fs_root(rc->extent_root->fs_info, key.offset);
if (IS_ERR(root)) {
err = PTR_ERR(root);
+   /*
+* Don't forget to cleanup current node.
+* As it may not be added to backref_cache but nr_node
+* increased.
+* This will cause BUG_ON() in backref_cache_cleanup().
+*/
+   remove_backref_node(&rc->backref_cache, cur);
goto out;
}
 
@@ -2999,14 +3006,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
}
 
rb_node = rb_first(blocks);
-   while (rb_node) {
+   for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
block = rb_entry(rb_node, struct tree_block, rb_node);
 
node = build_backref_tree(rc, &block->key,
  block->level, block->bytenr);
if (IS_ERR(node)) {
+   /*
+* The root(dedupe tree yet) of the tree block is
+* going to be freed and can't be reached.
+* Just skip it and continue balancing.
+*/
+   if (PTR_ERR(node) == -ENOENT)
+   continue;
err = PTR_ERR(node);
-   goto out;
+   break;
}
 
ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3014,11 +3028,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle 
*trans,
if (ret < 0) {
if (ret != -EAGAIN || rb_node == rb_first(blocks))
err = ret;
-   goto out;
+   break;
}
-   rb_node = rb_next(rb_node);
}
-out:
err = finish_pending_nodes(trans, rc, path, err);
 
 out_free_path:
-- 
2.9.3





[PATCH v13 06/15] btrfs: delayed-ref: Add support for increasing data ref under spinlock

2016-09-08 Thread Qu Wenruo
For in-band dedupe, btrfs needs to increase data ref with delayed_refs
locked, so add a new function btrfs_add_delayed_data_ref_locked() to
increase extent ref with delayed_refs already locked.
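
The general shape of such a "_locked" split can be sketched in userspace
C. Everything below is hypothetical and only illustrates the pattern:
struct ref_table, add_ref_locked(), add_ref() and add_ref_if_below() are
made-up names, and a C11 atomic_flag stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

struct ref_table {
	atomic_flag lock;	/* toy spinlock protecting the fields below */
	unsigned int count;
	uint64_t total_bytes;
};

static void table_lock(struct ref_table *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void table_unlock(struct ref_table *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Core insert. The caller must already hold t->lock. */
static void add_ref_locked(struct ref_table *t, uint64_t num_bytes)
{
	t->count++;
	t->total_bytes += num_bytes;
}

/* Plain entry point: takes the lock, then reuses the locked helper. */
static void add_ref(struct ref_table *t, uint64_t num_bytes)
{
	table_lock(t);
	add_ref_locked(t, num_bytes);
	table_unlock(t);
}

/*
 * A caller that must check state and insert under one lock hold.
 * Returns 1 if the ref was added, 0 if the limit was already reached.
 */
static int add_ref_if_below(struct ref_table *t, uint64_t num_bytes,
			    unsigned int limit)
{
	int added = 0;

	table_lock(t);
	if (t->count < limit) {
		add_ref_locked(t, num_bytes);
		added = 1;
	}
	table_unlock(t);
	return added;
}
```

The second caller is the reason for the split: it can combine its own
check with the insert without dropping the lock in between, which is
presumably what the dedupe path needs from the new helper.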

Signed-off-by: Qu Wenruo 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 30 +++---
 fs/btrfs/delayed-ref.h |  8 
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index a5d81f3..93a604f 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -804,6 +804,26 @@ free_ref:
 }
 
 /*
+ * Do real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and allocate memory
+ * for dref, head_ref and record.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action)
+{
+   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+   qrecord, bytenr, num_bytes, ref_root, reserved,
+   action, 1);
+   add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+   num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -849,13 +869,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
 * insert both the head node and the new ref without dropping
 * the spin lock
 */
-   head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-   bytenr, num_bytes, ref_root, reserved,
-   action, 1);
-
-   add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-  num_bytes, parent, ref_root, owner, offset,
-  action);
+   btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+   bytenr, num_bytes, parent, ref_root, owner, offset,
+   reserved, action);
spin_unlock(&delayed_refs->lock);
 
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 43f3629..d3a4369 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct 
btrfs_delayed_ref_node *ref)
}
 }
 
+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes, u64 parent,
   u64 ref_root, int level, int action,
   struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+   struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_data_ref *dref,
+   struct btrfs_delayed_ref_head *head_ref,
+   struct btrfs_qgroup_extent_record *qrecord,
+   u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+   u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
   struct btrfs_trans_handle *trans,
   u64 bytenr, u64 num_bytes,
-- 
2.9.3





[PATCH v13 09/15] btrfs: ordered-extent: Add support for dedupe

2016-09-08 Thread Qu Wenruo
From: Wang Xiaoguang 

Add ordered-extent support for dedupe.

Note: the current ordered-extent support only handles non-compressed
source extents.
Support for compressed source extents will be added later.

Signed-off-by: Qu Wenruo 
Signed-off-by: Wang Xiaoguang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/ordered-data.c | 46 ++
 fs/btrfs/ordered-data.h | 13 +
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 3b78d38..71d05ca 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct 
btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
  u64 start, u64 len, u64 disk_len,
- int type, int dio, int compress_type)
+ int type, int dio, int compress_type,
+ struct btrfs_dedupe_hash *hash)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,33 @@ static int __btrfs_add_ordered_extent(struct inode *inode, 
u64 file_offset,
entry->inode = igrab(inode);
entry->compress_type = compress_type;
entry->truncated_len = (u64)-1;
+   entry->hash = NULL;
+   /*
+* A hash hit means we have already incremented the extents delayed
+* ref.
+* We must handle this even if another process is trying to
+* turn off dedupe, otherwise we will leak a reference.
+*/
+   if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+   struct btrfs_dedupe_info *dedupe_info;
+
+   dedupe_info = root->fs_info->dedupe_info;
+   if (WARN_ON(dedupe_info == NULL)) {
+   kmem_cache_free(btrfs_ordered_extent_cache,
+   entry);
+   return -EINVAL;
+   }
+   entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_algo);
+   if (!entry->hash) {
+   kmem_cache_free(btrfs_ordered_extent_cache, entry);
+   return -ENOMEM;
+   }
+   entry->hash->bytenr = hash->bytenr;
+   entry->hash->num_bytes = hash->num_bytes;
+   memcpy(entry->hash->hash, hash->hash,
+  btrfs_hash_sizes[dedupe_info->hash_algo]);
+   }
+
if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
set_bit(type, &entry->flags);
 
@@ -250,15 +279,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 
file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+  u64 start, u64 len, u64 disk_len, int type,
+  struct btrfs_dedupe_hash *hash)
+{
+   return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+ disk_len, type, 0,
+ BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 u64 start, u64 len, u64 disk_len, int type)
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 1,
- BTRFS_COMPRESS_NONE);
+ BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +304,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, 
u64 file_offset,
 {
return __btrfs_add_ordered_extent(inode, file_offset, start, len,
  disk_len, type, 0,
- compress_type);
+ compress_type, NULL);
 }
 
 /*
@@ -577,6 +614,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent 
*entry)
list_del(&sum->list);
kfree(sum);
}
+   kfree(entry->hash);
kmem_cache_free(btrfs_ordered_extent_cache, entry);
}
 }
diff --git a/fs/btrfs/ordered-data.h 

Re: [PATCH v2] fstests: common: Enhance _exclude_scratch_mount_option to handle multiple options

2016-09-08 Thread Eryu Guan
On Thu, Sep 08, 2016 at 10:52:21AM +0800, Qu Wenruo wrote:
> Enhance _exclude_scratch_mount_option() function to normalize mount
> options.
> Now it can understand and extract real mount option from string like
> "-o opt1,opt2 -oopt3".
> 
> And now we do word grep to handle mount options like noinode_cache and
> inode_cache.
> 
> Finally, allow it to accept multiple options at the same time.
> No need for multiple _exclude_scratch_mount_option lines now
> 
> Signed-off-by: Qu Wenruo 
> ---
> changelog:
> v2:
>Don't introduce new 'fstype' parameter, suggested by Dave and Eryu.
>Use easier grep -w method, suggested by Dave and Eryu.
> ---
>  common/rc  | 22 ++
>  tests/ext4/271 |  6 ++
>  tests/xfs/134  |  3 +--
>  3 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/common/rc b/common/rc
> index 04039a4..23c007a 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -3183,12 +3183,26 @@ _require_cloner()
>   _notrun "cloner binary not present at $CLONER_PROG"
>  }
>  
> -# skip test if MKFS_OPTIONS contains the given string
> +# Normalize mount options from global $MOUNT_OPTIONS
> +# Convert options like "-o opt1,opt2 -oopt3" to
> +# "opt1 opt2 opt3"
> +_normalize_mount_options()
> +{
> + echo $MOUNT_OPTIONS | sed -n 's/-o\s*\(\S*\)/\1/gp' |\
> + sed 's/,/ /g'
> +}
> +
> +# skip test if MOUNT_OPTIONS contains the given string

Make "string" plural? Because it accepts multiple arguments now :)

>  _exclude_scratch_mount_option()
>  {
> - if echo $MOUNT_OPTIONS | grep -q "$1"; then
> - _notrun "mount option \"$1\" not allowed in this test"
> - fi
> + mnt_opts=$(_normalize_mount_options)
> +
> + while [ $# -gt 1 ]; do

"-gt" should be "-ge" or "-gt 0", otherwise the last mount option in
arguments is not checked (no check is done if there's only one option).

I can fix them at commit time if there's no other review comments.

Thanks,
Eryu