Re: Extents for a particular subvolume

2016-08-04 Thread Austin S. Hemmelgarn

On 2016-08-03 17:55, Graham Cobb wrote:

On 03/08/16 21:37, Adam Borowski wrote:

On Wed, Aug 03, 2016 at 08:56:01PM +0100, Graham Cobb wrote:

Are there any btrfs commands (or APIs) to allow a script to create a
list of all the extents referred to within a particular (mounted)
subvolume?  And is it a reasonably efficient process (i.e. doesn't
involve backrefs and, preferably, doesn't involve following directory
trees)?


Since the size of your output is linear in the number of extents, which is
between the number of files and the sum of their sizes, I see no gain in
trying to avoid following the directory tree.


Thanks for the help, Adam.  There are a lot of files and a lot of
directories - find, "ls -R" and similar operations take a very long
time. I was hoping that I could query some sort of extent tree for the
subvolume and get the answer back in seconds instead of multiple minutes.

But I can follow the directory tree if I need to.


I am not looking to relate the extents to files/inodes/paths.  My
particular need, at the moment, is to work out how much of two snapshots
is shared data, but I can think of other uses for the information.


Thus, unlike the question you asked above, you're not interested in _all_
extents, merely those which changed.

You may want to look at "btrfs subv find-new" and "btrfs send --no-data".


Unfortunately, the subvolumes do not have an ancestor-descendant
relationship (although they do have some common ancestors), so I don't
think find-new is much help (as far as I can see).

But just looking at the size of the output  from "send -c" would work
well enough for the particular problem I am trying to solve tonight!
Although I will need to take read-only snapshots of the subvolumes to
allow send to work. Thanks for the suggestion.

FWIW, if you're not using any files in the subvolumes, you can run:
btrfs property set  ro true

to mark them read-only so you don't need the snapshots, and then run the 
same command with 'false' at the end instead of true to mark them 
writable again.
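
As a rough sketch of that combined workflow (subvolume names are
illustrative, not from this thread; both subvolumes must be read-only for
send to accept them, and send -c has to accept the pair as clone sources):

# btrfs property set /mnt/pool/subvolA ro true
# btrfs property set /mnt/pool/subvolB ro true
# btrfs send -c /mnt/pool/subvolA /mnt/pool/subvolB | wc -c
# btrfs property set /mnt/pool/subvolA ro false
# btrfs property set /mnt/pool/subvolB ro false

The byte count from wc -c is then a rough upper bound on the data in
subvolB that is not shared with subvolA.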


I would still be interested in the extent list, though.  The main
problem with find-new and send is that they don't tell me how much has
been deleted, only added.  I am thinking about using the extents to get
a much better handle on what is using up space and what I could recover
if I removed (or moved to another volume) various groups of related
subvolumes.
You may want to look into the 'btrfs filesystem usage' and 'btrfs filesystem 
du' commands.  I'm not sure if they'll cover what you need, but they can 
show info about how much is shared.
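
For example (mount point and subvolume names are illustrative; 'btrfs fi du'
needs a reasonably recent btrfs-progs, roughly 4.6 or newer):

# btrfs filesystem usage /mnt/pool
# btrfs filesystem du -s /mnt/pool/subvolA /mnt/pool/subvolB

The second command prints per-subvolume Total, Exclusive and Set shared
columns, which is close to the breakdown being asked for.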




Re: memory overflow or undeflow in free space tree / space_info?

2016-08-04 Thread Stefan Priebe - Profihost AG
Am 29.07.2016 um 23:03 schrieb Josef Bacik:
> On 07/29/2016 03:14 PM, Omar Sandoval wrote:
>> On Fri, Jul 29, 2016 at 12:11:53PM -0700, Omar Sandoval wrote:
>>> On Fri, Jul 29, 2016 at 08:40:26PM +0200, Stefan Priebe - Profihost
>>> AG wrote:
 Dear list,

I'm seeing btrfs no space messages frequently on big filesystems (>
30TB).

In all cases I'm getting a trace like this one, a space_info warning
(since commit [1]). Could someone please be so kind as to help me
debug / fix this bug? I'm using space_cache=v2 on all those
systems.
>>>
>>> Hm, so I think this indicates a bug in space accounting somewhere else
>>> rather than the free space tree itself. I haven't debugged one of these
>>> issues before, I'll see if I can reproduce it. Cc'ing Josef, too.
>>
>> I should've asked, what sort of filesystem activity triggers this?
>>
> 
> Chris just fixed this I think, try his next branch from his git tree
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git

Thanks, now running 4.4 with those patches backported. If that still
shows an error I will try the vanilla tree.

Thanks!

Stefan

> and see if it still happens.  Thanks,
> 
> Josef


Re: [PATCH] exportfs: be careful to only return expected errors.

2016-08-04 Thread Christoph Hellwig
On Thu, Aug 04, 2016 at 10:19:06AM +1000, NeilBrown wrote:
> 
> 
> When nfsd calls fh_to_dentry, it expects ESTALE or ENOMEM as errors.
> In particular it can be tempting to return ENOENT, but this is not
> handled well by nfsd.
> 
> Rather than requiring strict adherence to error codes from filesystems,
> treat all unexpected error codes the same as ESTALE.  This is safest.
> 
> Signed-off-by: NeilBrown 
> ---
> 
> I didn't add a dprintk for unexpected error messages, partly
> because dprintk isn't usable in exportfs.  I could have used pr_debug()
> but I really didn't see much value.
> 
> This has been tested together with the btrfs change, and it restores
> correct functionality.

I don't really like all this magic, which is partially historic.  I think
we should instead allow the fs to return any error from the export
operations, and forbid returning NULL entirely.  Then the actual caller
(nfsd) can sort out which errors it wants to send over the wire.


Re: [4.8] btrfs heats my room with lock contention

2016-08-04 Thread Chris Mason



On 08/04/2016 02:41 AM, Dave Chinner wrote:


Simple test. 8GB pmem device on a 16p machine:

# mkfs.btrfs /dev/pmem1
# mount /dev/pmem1 /mnt/scratch
# dbench -t 60 -D /mnt/scratch 16

And heat your room with the warm air rising from your CPUs. Top
half of the btrfs profile looks like:

  36.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
  32.29%  [kernel]  [k] native_queued_spin_lock_slowpath
   5.14%  [kernel]  [k] queued_write_lock_slowpath
   2.46%  [kernel]  [k] _raw_spin_unlock_irq
   2.15%  [kernel]  [k] queued_read_lock_slowpath
   1.54%  [kernel]  [k] _find_next_bit.part.0
   1.06%  [kernel]  [k] __crc32c_le
   0.82%  [kernel]  [k] btrfs_tree_lock
   0.79%  [kernel]  [k] steal_from_bitmap.part.29
   0.70%  [kernel]  [k] __copy_user_nocache
   0.69%  [kernel]  [k] btrfs_tree_read_lock
   0.69%  [kernel]  [k] delay_tsc
   0.64%  [kernel]  [k] btrfs_set_lock_blocking_rw
   0.63%  [kernel]  [k] copy_user_generic_string
   0.51%  [kernel]  [k] do_raw_read_unlock
   0.48%  [kernel]  [k] do_raw_spin_lock
   0.47%  [kernel]  [k] do_raw_read_lock
   0.46%  [kernel]  [k] btrfs_clear_lock_blocking_rw
   0.44%  [kernel]  [k] do_raw_write_lock
   0.41%  [kernel]  [k] __do_softirq
   0.28%  [kernel]  [k] __memcpy
   0.24%  [kernel]  [k] map_private_extent_buffer
   0.23%  [kernel]  [k] find_next_zero_bit
   0.22%  [kernel]  [k] btrfs_tree_read_unlock

Performance vs CPU usage is:

nprocs  throughput  cpu usage
1   440MB/s  50%
2   770MB/s 100%
4   880MB/s 250%
8   690MB/s 450%
16  280MB/s 950%

In comparison, at 8-16 threads ext4 is running at ~2600MB/s and
XFS is running at ~3800MB/s. Even if I throw 300-400 processes at
ext4 and XFS, they only drop to ~1500-2000MB/s as they hit internal
limits.

Yes, with dbench btrfs does much much better if you make a subvol per 
dbench dir.  The difference is pretty dramatic.  I'm working on it this 
month, but focusing more on database workloads right now.
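
For anyone wanting to try that, a rough sketch (it assumes dbench's usual
clients/clientN directory layout and that dbench tolerates pre-created
directories; paths match Dave's example above):

# mkdir /mnt/scratch/clients
# for i in $(seq 0 15); do btrfs subvolume create /mnt/scratch/clients/client$i; done
# dbench -t 60 -D /mnt/scratch 16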

Re: [PATCH 37/45] drivers: use req op accessor

2016-08-04 Thread Christoph Hellwig
On Wed, Aug 03, 2016 at 07:30:29PM -0500, Shaun Tancheff wrote:
> I think the translation in loop.c is suspicious here:
> 
> "if use DIO && not (a flush_flag or discard_flag)"
> should translate to:
> "if use DIO && not ((a flush_flag) || op == discard)"
> 
> But in the patch I read:
> "if use DIO && ((not a flush_flag) || op == discard)
> 
> Which would have DIO && discards follow the AIO path?

Indeed.  Sorry for missing out on your patch, I just sent a fix
in reply to Dave's other report earlier which is pretty similar to
yours.


Re: [PATCH 37/45] drivers: use req op accessor

2016-08-04 Thread Shaun Tancheff
On Thu, Aug 4, 2016 at 10:46 AM, Christoph Hellwig  wrote:
> On Wed, Aug 03, 2016 at 07:30:29PM -0500, Shaun Tancheff wrote:
>> I think the translation in loop.c is suspicious here:
>>
>> "if use DIO && not (a flush_flag or discard_flag)"
>> should translate to:
>> "if use DIO && not ((a flush_flag) || op == discard)"
>>
>> But in the patch I read:
>> "if use DIO && ((not a flush_flag) || op == discard)
>>
>> Which would have DIO && discards follow the AIO path?
>
> Indeed.  Sorry for missing out on your patch, I just sent a fix
> in reply to Dave's other report earlier which is pretty similar to
> yours.

No worries. I prefer your switch to an if conditional here.

-- 
Shaun Tancheff


Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem

2016-08-04 Thread Lutz Vieweg

Hi,

I was today hit by what I think is probably the same bug:
A btrfs on a close-to-4TB sized block device, only half filled
to almost exactly 2 TB, suddenly says "no space left on device"
upon any attempt to write to it. The filesystem was NOT automatically
switched to read-only by the kernel, I should mention.

Re-mounting (which is a pain as this filesystem is used for
$HOMEs of a multitude of active users who I have to kick from
the server for doing things like re-mounting) removed the symptom
for now, but from what I can read in linux-btrfs mailing list
archives, it is pretty likely the symptom will re-appear.

Here are some more details:

Software versions:

linux-4.6.1 (vanilla from kernel.org)
btrfs-progs v4.1


Info obtained while the symptom occurred (before re-mount):

> btrfs filesystem show /data3
Label: 'data3'  uuid: f4c69d29-62ac-4e15-a825-c6283c8fd74c
Total devices 1 FS bytes used 2.05TiB
devid    1 size 3.64TiB used 2.16TiB path /dev/mapper/cryptedResourceData3


(/dev/mapper/cryptedResourceData3 is a dm-crypt device,
which is based on a DRBD block device, which is based
on locally attached SATA disks on two servers - no trouble
with that setup for years, no I/O-errors or such, same
kind of block-device stack also used for another btrfs
and some XFS filesystems.)


> btrfs filesystem df /data3
Data, single: total=2.11TiB, used=2.01TiB
System, single: total=4.00MiB, used=256.00KiB
Metadata, single: total=48.01GiB, used=36.67GiB
GlobalReserve, single: total=512.00MiB, used=5.52MiB


Currently, and at the time the bug occurred, no snapshots existed
on "/data3". A snapshot is created once per night, a backup
created, then the snapshot is removed again.
There is lots of mixed I/O-activity during the day, both from interactive
users and from automatic build processes and such.
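
(Roughly, that nightly cycle looks like the following; the snapshot name and
backup command are illustrative, not the exact ones used here:

# btrfs subvolume snapshot -r /data3 /data3/.nightly
# rsync -a /data3/.nightly/ backuphost:/backup/data3/
# btrfs subvolume delete /data3/.nightly
)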

dmesg output from the time the "no space left on device"-symptom
appeared:


[5171203.601620] WARNING: CPU: 4 PID: 23208 at fs/btrfs/inode.c:9261 
btrfs_destroy_inode+0x263/0x2a0 [btrfs]
[5171203.602719] Modules linked in: dm_snapshot dm_bufio fuse btrfs xor 
raid6_pq nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT 
nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter drbd lru_cache bridge stp llc kvm_amd kvm irqbypass 
ghash_clmulni_intel amd64_edac_mod ses edac_mce_amd enclosure edac_core 
sp5100_tco pcspkr k10temp fam15h_power sg i2c_piix4 shpchp acpi_cpufreq nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c dm_crypt mgag200 
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe 
crct10dif_pclmul crc32_pclmul crc32c_intel igb ahci libahci aesni_intel 
glue_helper libata lrw gf128mul ablk_helper mdio cryptd ptp serio_raw 
i2c_algo_bit pps_core i2c_core dca sd_mod dm_mirror dm_region_hash dm_log dm_mod

...

[5171203.617358] Call Trace:
[5171203.618543]  [] dump_stack+0x4d/0x6c
[5171203.619568]  [] __warn+0xe3/0x100
[5171203.620660]  [] warn_slowpath_null+0x1d/0x20
[5171203.621779]  [] btrfs_destroy_inode+0x263/0x2a0 [btrfs]
[5171203.622716]  [] destroy_inode+0x3b/0x60
[5171203.623774]  [] evict+0x11c/0x180

...

[5171230.306037] WARNING: CPU: 18 PID: 12656 at fs/btrfs/extent-tree.c:4233 
btrfs_free_reserved_data_space_noquota+0xf3/0x100 [btrfs]
[5171230.310298] Modules linked in: dm_snapshot dm_bufio fuse btrfs xor 
raid6_pq nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT 
nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter drbd lru_cache bridge stp llc kvm_amd kvm irqbypass 
ghash_clmulni_intel amd64_edac_mod ses edac_mce_amd enclosure edac_core 
sp5100_tco pcspkr k10temp fam15h_power sg i2c_piix4 shpchp acpi_cpufreq nfsd 
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c dm_crypt mgag200 
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe 
crct10dif_pclmul crc32_pclmul crc32c_intel igb ahci libahci aesni_intel 
glue_helper libata lrw gf128mul ablk_helper mdio cryptd ptp serio_raw 
i2c_algo_bit pps_core i2c_core dca sd_mod dm_mirror dm_region_hash dm_log dm_mod

...

[5171230.341755] Call Trace:
[5171230.344119]  [] dump_stack+0x4d/0x6c
[5171230.346444]  [] __warn+0xe3/0x100
[5171230.348709]  [] warn_slowpath_null+0x1d/0x20
[5171230.350976]  [] 
btrfs_free_reserved_data_space_noquota+0xf3/0x100 [btrfs]
[5171230.353212]  [] btrfs_clear_bit_hook+0x27f/0x350 [btrfs]
[5171230.355392]  [] ? free_extent_state+0x1a/0x20 [btrfs]
[5171230.357556]  [] clear_state_bit+0x66/0x1d0 [btrfs]
[5171230.359698]  [] __clear_extent_bit+0x224/0x3a0 [btrfs]
[5171230.361810]  [] ? btrfs_update_reserved_bytes+0x45/0x130 
[btrfs]
[5171230.363960]  [] extent_clear_unlock_delalloc+0x7a/0x2d0 
[btrfs]
[5171230.366079]  [] ? kmem_cache_alloc+0x17d/0x1f0
[5171230.368204]  [] ? __btrfs_add_ordered_extent+0x43/0x310 
[btrfs]
[5171230.370350]  [] ? __btrfs_add_ordered_extent+0x1fb/0x310 
[btrfs]
[5171230.372491]  [] cow_file_range+0x28a/0x460 [btrfs]
[517

How to stress test raid6 on 122 disk array

2016-08-04 Thread Martin
Hi,

I would like to find rare raid6 bugs in btrfs, where I have the following hw:

* 2x 8 core CPU
* 128GB ram
* 70 FC disk array (56x 500GB + 14x 1TB SATA disks)
* 24 FC or 2x SAS disk array (1TB SAS disks)
* 16 FC disk array (1TB SATA disks)
* 12 SAS disk array (3TB SATA disks)

The test can run for a month or so.

I prefer CentOS/Fedora, but if someone will write a script that
configures and compiles a preferred kernel, then we can do that on any
preferred OS.

Can anyone give recommendations on how the setup should be configured
to most likely find rare raid6 bugs?

And does there exist a script that is good for testing this sort of thing?

Best regards,
Martin


Re: How to stress test raid6 on 122 disk array

2016-08-04 Thread Austin S. Hemmelgarn

On 2016-08-04 13:43, Martin wrote:

Hi,

I would like to find rare raid6 bugs in btrfs, where I have the following hw:

* 2x 8 core CPU
* 128GB ram
* 70 FC disk array (56x 500GB + 14x 1TB SATA disks)
* 24 FC or 2x SAS disk array (1TB SAS disks)
* 16 FC disk array (1TB SATA disks)
* 12 SAS disk array (3TB SATA disks)

The test can run for a month or so.

I prefer CentOS/Fedora, but if someone will write a script that
configures and compiles a preferred kernel, then we can do that on any
preferred OS.

Can anyone give recommendations on how the setup should be configured
to most likely find rare raid6 bugs?

And does there exist a script that is good for testing this sort of thing?
I'm glad to hear there are people interested in testing BTRFS for the 
purpose of finding bugs.  Sadly I can't provide much help in this 
respect (I do testing, but it's all regression testing these days).


Regarding OS, I'd avoid CentOS for testing something like BTRFS unless 
you specifically want to help their development team fix issues.  They 
have a large number of back-ported patches, and it's not all that 
practical for us to chase down bugs in such a situation, because it 
could just as easily be a bug introduced by the back-porting process or 
may be fixed in the mainline kernel anyway.  Fedora should be fine 
(they're good about staying up to date), but if possible you should 
probably use Rawhide instead of a regular release, as that will give you 
quite possibly one of the closest distribution kernels to a mainline 
Linux kernel available, and will make sure everything is as up to date 
as possible.


As far as testing goes, I don't know of any scripts for this type 
of thing, but you may want to look into dbench, fio, iozone, and similar 
tools, as well as xfstests (which is more about regression 
testing, but is still worth looking at).
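
As one rough way of gluing those tools together (device names, sizes and run
times are illustrative only):

# mkfs.btrfs -f -d raid6 -m raid6 /dev/sd[b-m]
# mount /dev/sdb /mnt/test
# fio --name=raid6-stress --directory=/mnt/test --rw=randrw --size=4G \
      --numjobs=16 --time_based --runtime=86400 --group_reporting

xfstests is driven by its own ./check script instead, with TEST_DEV and
SCRATCH_DEV_POOL in local.config pointing at a subset of the disks.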


Most of the big known issues with RAID6 in BTRFS at the moment involve 
device failures and array recovery, but most of them aren't well 
characterized and nobody's really sure why they're happening, so if you 
want to look for something specific, figuring out those issues would be 
a great place to start (even if they aren't rare bugs).



[GIT PULL] Btrfs

2016-08-04 Thread Chris Mason
Hi Linus,

This is part two of my btrfs pull, which is some cleanups and a batch of
fixes.  

Most of the code here is from Jeff Mahoney, making the pointers
we pass around internally more consistent and less confusing overall.  I
noticed a small problem right before I sent this out yesterday, so I
fixed it up and re-tested overnight.

Please pull my for-linus-4.8 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.8

There are some minor conflicts against Mike Christie's changes in
your tree.  I've put the conflict resolution I used for testing here:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.8-merged

Jeff Mahoney (14) commits (+754/-669):
btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root 
(+19/-21)
btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction 
(+1/-1)
btrfs: btrfs_test_opt and friends should take a btrfs_fs_info (+135/-130)
btrfs: cleanup, remove prototype for btrfs_find_root_ref (+0/-3)
btrfs: btrfs_abort_transaction, drop root parameter (+147/-152)
btrfs: convert nodesize macros to static inlines (+33/-15)
btrfs: tests, move initialization into tests/ (+48/-77)
btrfs: add btrfs_trans_handle->fs_info pointer (+6/-4)
btrfs: copy_to_sk drop unused root parameter (+2/-3)
btrfs: simpilify btrfs_subvol_inherit_props (+3/-3)
btrfs: prefix fsid to all trace events (+186/-158)
btrfs: tests, require fs_info for root (+103/-61)
btrfs: plumb fs_info into btrfs_work (+63/-31)
btrfs: introduce BTRFS_MAX_ITEM_SIZE (+8/-10)

Liu Bo (10) commits (+149/-49):
Btrfs: change BUG_ON()'s to ASSERT()'s in backref_cache_cleanup() (+6/-6)
Btrfs: error out if generic_bin_search get invalid arguments (+8/-0)
Btrfs: check inconsistence between chunk and block group (+16/-1)
Btrfs: fix unexpected balance crash due to BUG_ON (+24/-4)
Btrfs: fix eb memory leak due to readpage failure (+22/-3)
Btrfs: fix BUG_ON in btrfs_submit_compressed_write (+8/-2)
Btrfs: fix read_node_slot to return errors (+52/-21)
Btrfs: fix panic in balance due to EIO (+4/-0)
Btrfs: cleanup BUG_ON in merge_bio (+6/-3)
Btrfs: fix double free of fs root (+3/-9)

Nikolay Borisov (4) commits (+49/-20):
btrfs: Ratelimit "no csum found" info message (+1/-1)
btrfs: Handle uninitialised inode eviction (+8/-1)
btrfs: Add ratelimit to btrfs printing (+24/-2)
btrfs: Fix slab accounting flags (+16/-16)

Wang Xiaoguang (3) commits (+45/-13):
btrfs: expand cow_file_range() to support in-band dedup and 
subpage-blocksize (+41/-11)
btrfs: add missing bytes_readonly attribute file in sysfs (+2/-0)
btrfs: fix free space calculation in dump_space_info() (+2/-2)

Anand Jain (2) commits (+40/-36):
btrfs: make sure device is synced before return (+5/-0)
btrfs: reorg btrfs_close_one_device() (+35/-36)

David Sterba (2) commits (+4/-3):
btrfs: remove obsolete part of comment in statfs (+0/-3)
btrfs: hide test-only member under ifdef (+4/-0)

Ashish Samant (1) commits (+35/-37):
btrfs: Cleanup compress_file_range()

Chris Mason (1) commits (+3/-2):
Btrfs: fix __MAX_CSUM_ITEMS

Chandan Rajendra (1) commits (+1/-1):
Btrfs: subpage-blocksize: Rate limit scrub error message

Salah Triki (1) commits (+1/-2):
btrfs: Replace -ENOENT by -ERANGE in btrfs_get_acl()

Hans van Kranenburg (1) commits (+1/-1):
Btrfs: use the correct struct for BTRFS_IOC_LOGICAL_INO

Total: (40) commits (+1082/-833)

 fs/btrfs/acl.c |   3 +-
 fs/btrfs/async-thread.c|  31 +++-
 fs/btrfs/async-thread.h|   6 +-
 fs/btrfs/backref.c |   4 +-
 fs/btrfs/compression.c |  10 +-
 fs/btrfs/ctree.c   |  91 ++
 fs/btrfs/ctree.h   | 101 ++-
 fs/btrfs/dedupe.h  |  24 +++
 fs/btrfs/delayed-inode.c   |   4 +-
 fs/btrfs/delayed-ref.c |  17 +-
 fs/btrfs/dev-replace.c |   4 +-
 fs/btrfs/disk-io.c | 101 +--
 fs/btrfs/disk-io.h |   3 +-
 fs/btrfs/extent-tree.c | 124 --
 fs/btrfs/extent_io.c   |  30 +++-
 fs/btrfs/extent_map.c  |   2 +-
 fs/btrfs/file-item.c   |   4 +-
 fs/btrfs/file.c|  12 +-
 fs/btrfs/free-space-cache.c|   8 +-
 fs/btrfs/free-space-tree.c |  16 +-
 fs/btrfs/inode-map.c   |  16 +-
 fs/btrfs/inode.c   | 218 
 fs/btrfs/ioctl.c   |  40 ++---
 fs/btrfs/ordered-data.c|   2 +-
 fs/btrfs/props.c   |   6 +-
 fs/btrfs/qgroup.c  |  25 +--
 fs/btrfs/qgroup.h  |   9 +-
 fs/btrfs/relocation.c  |  20 ++-
 fs/btrfs

Re: How to stress test raid6 on 122 disk array

2016-08-04 Thread Chris Murphy
On Thu, Aug 4, 2016 at 1:05 PM, Austin S. Hemmelgarn
 wrote:

>Fedora should be fine (they're good about staying up to
> date), but if possible you should probably use Rawhide instead of a regular
> release, as that will give you quite possibly one of the closest
> distribution kernels to a mainline Linux kernel available, and will make
> sure everything is as up to date as possible.

Yes. It's possible to run on a release version (currently Fedora 23
and Fedora 24) and run a Rawhide kernel. This is what I often do.


> As far as testing, I don't know that there are any scripts for this type of
> thing, you may want to look into dbench, fio, iozone, and similar tools
> though, as well as xfstests (which is more about regression testing, but is
> still worth looking at).
>
> Most of the big known issues with RAID6 in BTRFS at the moment involve
> device failures and array recovery, but most of them aren't well
> characterized and nobody's really sure why they're happening, so if you want
> to look for something specific, figuring out those issues would be a great
> place to start (even if they aren't rare bugs).

Yeah it seems pretty reliable to do normal things with raid56 arrays.
The problem is when they're degraded, weird stuff seems to happen some
of the time. So it might be valid to have several raid56's that are
intentionally running in degraded mode with some tests that will
tolerate that and see when it breaks and why.
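
A minimal sketch of such a degraded run (device names are illustrative, and
wiping a member is only a crude stand-in for physically pulling a disk):

# mkfs.btrfs -f -d raid6 -m raid6 /dev/sd[b-g]
# mount /dev/sdb /mnt/test
  ... write and checksum some test data ...
# umount /mnt/test
# wipefs -a /dev/sdg
# mount -o degraded /dev/sdb /mnt/test
# btrfs device add /dev/sdh /mnt/test
# btrfs device delete missing /mnt/test

then re-verify the checksums once the rebuild finishes.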

There is also in the archives the bug where parity is being computed
wrongly when a data strip is wrong (corrupt), and Btrfs sees this,
reports the mismatch, fixes the mismatch, recomputes parity for some
reason, and the parity is then wrong. It'd be nice to know when else
this can happen, if it's possible parity is recomputed (and wrongly)
on a normal read, or a balance, or if it's really restricted to scrub.

Another test might be raid 1 or raid10 metadata vs raid56 for data.
That'd probably be more performance related, but there might be some
unexpected behaviors that crop up.



-- 
Chris Murphy


Re: [PATCH] exportfs: be careful to only return expected errors.

2016-08-04 Thread J. Bruce Fields
On Thu, Aug 04, 2016 at 05:47:19AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 04, 2016 at 10:19:06AM +1000, NeilBrown wrote:
> > 
> > 
> > When nfsd calls fh_to_dentry, it expects ESTALE or ENOMEM as errors.
> > In particular it can be tempting to return ENOENT, but this is not
> > handled well by nfsd.
> > 
> > Rather than requiring strict adherence to error codes from filesystems,
> > treat all unexpected error codes the same as ESTALE.  This is safest.
> > 
> > Signed-off-by: NeilBrown 
> > ---
> > 
> > I didn't add a dprintk for unexpected error messages, partly
> > because dprintk isn't usable in exportfs.  I could have used pr_debug()
> > but I really didn't see much value.
> > 
> > This has been tested together with the btrfs change, and it restores
> > correct functionality.
> 
> I don't really like all this magic which is partially historic.  I think
> we should instead allow the fs to return any error from the export
> operations,

What errors other than ENOENT and ENOMEM do you think are reasonable?

ENOENT is going to screw up both nfsd and open_by_fhandle_at, which are
the only callers.

> and forbid returning NULL entirely.  Then the actual caller
> (nfsd) can sort out which errors it wants to send over the wire.

The needs of those two callers don't look very different to me, and I
can't recall seeing a correct use of an error other than ESTALE or
ENOMEM, so I've been thinking of it more of a question of how to best
handle a misbehaving filesystem.

--b.


Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem

2016-08-04 Thread Chris Murphy
On Thu, Aug 4, 2016 at 10:53 AM, Lutz Vieweg  wrote:

> The amount of threads on "lost or unused free space" without resolutions
> in the btrfs mailing list archive is really frightening. If these
> symptoms commonly re-appear with no fix in sight, I'm afraid I'll have
> to either resort to using XFS (with ugly block-device based snapshots
> for backup) or try my luck with OpenZFS :-(

Keep in mind the list is rather self-selecting for problems. People
who aren't having problems are unlikely to post their non-problems to
the list.

It'll be interesting to see what other suggestions you get, but I see
it as basically three options in order of increasing risk+effort.

a. Try the clear_cache mount option (one time) and let the file system
stay mounted so the cache is recreated. If the problem happens soon
after again, try nospace_cache. This might buy you time before 4.8 is
out, which has a bunch of new enospc code in it.

b. Recreate the file system. For reasons not well understood, some
file systems just get stuck in this state with bogus enospc claims.

c. Take some risk and use 4.8 rc1 once it's out. Just make sure to
keep backups. I have no idea to what degree the new enospc code can
help well used existing systems already having enospc issues, vs the
code prevents the problem from happening in the first place. So you
may end up at b. anyway.
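
For reference, option a. boils down to something like this (device and mount
point taken from the earlier 'filesystem show' output; a sketch, not tested
here):

# umount /data3
# mount -o clear_cache /dev/mapper/cryptedResourceData3 /data3
  ... leave it mounted so the space cache is rebuilt ...

and, if the problem comes back:

# umount /data3
# mount -o nospace_cache /dev/mapper/cryptedResourceData3 /data3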


-- 
Chris Murphy


Re: How to stress test raid6 on 122 disk array

2016-08-04 Thread Martin
Thanks for the benchmark tools and tips on where the issues might be.

Is Fedora 24 rawhide preferred over ArchLinux?

If I want to compile a mainline kernel, is there anything I need to tune?

When I do the tests, how do I log the info you would like to see, if I
find a bug?



On 4 August 2016 at 22:01, Chris Murphy  wrote:
> On Thu, Aug 4, 2016 at 1:05 PM, Austin S. Hemmelgarn
>  wrote:
>
>>Fedora should be fine (they're good about staying up to
>> date), but if possible you should probably use Rawhide instead of a regular
>> release, as that will give you quite possibly one of the closest
>> distribution kernels to a mainline Linux kernel available, and will make
>> sure everything is as up to date as possible.
>
> Yes. It's possible to run on a release version (currently Fedora 23
> and Fedora 24) and run a Rawhide kernel. This is what I often do.
>
>
>> As far as testing, I don't know that there are any scripts for this type of
>> thing, you may want to look into dbench, fio, iozone, and similar tools
>> though, as well as xfstests (which is more about regression testing, but is
>> still worth looking at).
>>
>> Most of the big known issues with RAID6 in BTRFS at the moment involve
>> device failures and array recovery, but most of them aren't well
>> characterized and nobody's really sure why they're happening, so if you want
>> to look for something specific, figuring out those issues would be a great
>> place to start (even if they aren't rare bugs).
>
> Yeah it seems pretty reliable to do normal things with raid56 arrays.
> The problem is when they're degraded, weird stuff seems to happen some
> of the time. So it might be valid to have several raid56's that are
> intentionally running in degraded mode with some tests that will
> tolerate that and see when it breaks and why.
>
> There is also in the archives the bug where parity is being computed
> wrongly when a data strip is wrong (corrupt), and Btrfs sees this,
> reports the mismatch, fixes the mismatch, recomputes parity for some
> reason, and the parity is then wrong. It'd be nice to know when else
> this can happen, if it's possible parity is recomputed (and wrongly)
> on a normal read, or a balance, or if it's really restricted to scrub.
>
> Another test might be raid 1 or raid10 metadata vs raid56 for data.
> That'd probably be more performance related, but there might be some
> unexpected behaviors that crop up.
>
>
>
> --
> Chris Murphy


Re: How to stress test raid6 on 122 disk array

2016-08-04 Thread Chris Murphy
On Thu, Aug 4, 2016 at 2:51 PM, Martin  wrote:
> Thanks for the benchmark tools and tips on where the issues might be.
>
> Is Fedora 24 rawhide preferred over ArchLinux?

I'm not sure what Arch does any differently to their kernels from
kernel.org kernels. But bugzilla.kernel.org offers a Mainline and
Fedora drop down for identifying the kernel source tree.

>
> If I want to compile a mainline kernel, is there anything I need to tune?

Fedora kernels do not have these options set.

# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set

The sanity and integrity tests are both compile-time and mount-time
options, i.e. they have to be compiled in for the mount options to do
anything. I can't recall any thread where a developer asked a user to
set any of these options for testing, though.
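
If you did want to experiment with them on a self-built mainline kernel, a
minimal sketch (scripts/config ships with the kernel source):

$ scripts/config --enable CONFIG_BTRFS_DEBUG --enable CONFIG_BTRFS_ASSERT \
                 --enable CONFIG_BTRFS_FS_CHECK_INTEGRITY \
                 --enable CONFIG_BTRFS_FS_RUN_SANITY_TESTS
$ make olddefconfig && make -j$(nproc)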


> When I do the tests, how do I log the info you would like to see, if I
> find a bug?

bugzilla.kernel.org for tracking, and then reference the URL for the
bug with a summary in an email to list is how I usually do it. The
main thing is going to be the exact reproduce steps. It's also better,
I think, to have complete dmesg (or journalctl -k) attached to the bug
report because not all problems are directly related to Btrfs, they
can have contributing factors elsewhere. And various MTAs, or more
commonly MUAs, have a tendency to wrap such wide text as found in
kernel or journald messages.
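
A rough sketch of capturing that without any wrapping (file names are
illustrative):

# journalctl -k --no-pager > btrfs-bug-dmesg.txt      (or: dmesg > btrfs-bug-dmesg.txt)
# uname -a > btrfs-bug-versions.txt
# btrfs --version >> btrfs-bug-versions.txt

and attach the files to the bugzilla.kernel.org entry rather than pasting
them inline.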

And then whatever Austin says.



-- 
Chris Murphy


Re: How to stress test raid6 on 122 disk array

2016-08-04 Thread Martin
Excellent. Thanks.

In order to automate it, would it be OK if I dd some zeroes directly
to the devices to corrupt them, or do I need to physically take the
disks out while running?

The smallest disk of the 122 is 500GB. Is it possible to have btrfs
see each disk as only e.g. 10GB? That way I can corrupt and resilver
more disks over a month.








On 4 August 2016 at 23:12, Chris Murphy  wrote:
> On Thu, Aug 4, 2016 at 2:51 PM, Martin  wrote:
>> Thanks for the benchmark tools and tips on where the issues might be.
>>
>> Is Fedora 24 rawhide preferred over ArchLinux?
>
> I'm not sure what Arch does any differently to their kernels from
> kernel.org kernels. But bugzilla.kernel.org offers a Mainline and
> Fedora drop down for identifying the kernel source tree.
>
>>
>> If I want to compile a mainline kernel, is there anything I need to tune?
>
> Fedora kernels do not have these options set.
>
> # CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
> # CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
> # CONFIG_BTRFS_DEBUG is not set
> # CONFIG_BTRFS_ASSERT is not set
>
> The sanity and integrity tests are both compile-time and mount-time
> options, i.e. they have to be compiled in for the mount options to do
> anything. I can't recall any thread where a developer asked a user to
> set any of these options for testing, though.
>
>
>> When I do the tests, how do I log the info you would like to see, if I
>> find a bug?
>
> bugzilla.kernel.org for tracking, and then reference the URL for the
> bug with a summary in an email to list is how I usually do it. The
> main thing is going to be the exact reproduce steps. It's also better,
> I think, to have complete dmesg (or journalctl -k) attached to the bug
> report because not all problems are directly related to Btrfs, they
> can have contributing factors elsewhere. And various MTAs, or more
> commonly MUAs, have a tendency to wrap such wide text as found in
> kernel or journald messages.
>
> And then whatever Austin says.
>
>
>
> --
> Chris Murphy


Re: BTRFS: Transaction aborted (error -28)

2016-08-04 Thread Mordechay Kaganer
B.H.

> On Fri, Jul 29, 2016 at 8:23 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> So I'd recommend upgrading to the latest kernel 4.4 if you want to stay
>> with the stable series, or 4.6 or 4.7 if you want current, and then (less
>> important) upgrading the btrfs userspace as well.  It's possible the
>> newer kernel will handle the combined rsync and send stresses better, and
>> if not, you're on a better base to provide bug reports, etc.
>
> OK, upgraded to 4.4 (Ubuntu 16.04 stock kernel) and the fresh
> btrfs-progs 4.7. I'm assuming the error was due to some kind of bug or
> race condition and the FS is clean. Let's see how it behaves. Thanks!

Hello, I'm still getting ENOSPC errors. The latest time, the log looks like this:

Aug  4 21:55:06 yemot-4u kernel: [304090.288927] [ cut
here ]
Aug  4 21:55:06 yemot-4u kernel: [304090.288961] WARNING: CPU: 1 PID:
4531 at /build/linux-dcxD3m/linux-4.4.0/fs/btrfs/extent-tree.c:2927
btrfs_run_delayed_refs+0x26b/0x2a0 [btrfs]()
Aug  4 21:55:06 yemot-4u kernel: [304090.288965] BTRFS: error (device
md1) in btrfs_run_delayed_refs:2927: errno=-28 No space left
Aug  4 21:55:06 yemot-4u kernel: [304090.288968] BTRFS info (device
md1): forced readonly
Aug  4 21:55:06 yemot-4u kernel: [304090.288972] BTRFS: error (device
md1) in btrfs_run_delayed_refs:2927: errno=-28 No space left
Aug  4 21:55:06 yemot-4u kernel: [304090.289129] BTRFS: Transaction
aborted (error -28)
Aug  4 21:55:06 yemot-4u kernel: [304090.289131] Modules linked in:
binfmt_misc ipmi_ssif btrfs x86_pkg_temp_thermal intel_powerclamp
coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul
aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd
input_leds sb_edac serio_raw joydev edac_core lpc_ich
snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec
snd_hda_core snd_hwdep snd_pcm mei_me snd_timer mei snd soundcore
shpchp ipmi_si 8250_fintek ipmi_msghandler mac_hid nfsd auth_rpcgss
nfs_acl lockd grace sunrpc lp parport autofs4 raid0 multipath linear
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
raid6_pq libcrc32c raid10 raid1 ses enclosure igb dca ast ttm
drm_kms_helper syscopyarea hid_generic sysfillrect firewire_ohci
sysimgblt fb_sys_fops ahci usbhid firewire_core ptp psmouse libahci
isci hid drm crc_itu_t libsas pps_core i2c_algo_bit aacraid
scsi_transport_sas wmi fjes
Aug  4 21:55:06 yemot-4u kernel: [304090.289201] CPU: 1 PID: 4531
Comm: kworker/u16:28 Not tainted 4.4.0-31-generic #50-Ubuntu
Aug  4 21:55:06 yemot-4u kernel: [304090.289203] Hardware name: To Be
Filled By O.E.M. To Be Filled By O.E.M./EPC602D8A, BIOS P1.20
04/16/2014
Aug  4 21:55:06 yemot-4u kernel: [304090.289226] Workqueue:
btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
Aug  4 21:55:06 yemot-4u kernel: [304090.289229]  0286
b56c494e 880744f4bc98 813f1143
Aug  4 21:55:06 yemot-4u kernel: [304090.289232]  880744f4bce0
c06d8468 880744f4bcd0 81081102
Aug  4 21:55:06 yemot-4u kernel: [304090.289234]  8807fbff3000
880859b13000 880729f30b90 
Aug  4 21:55:06 yemot-4u kernel: [304090.289237] Call Trace:
Aug  4 21:55:06 yemot-4u kernel: [304090.289244]  []
dump_stack+0x63/0x90
Aug  4 21:55:06 yemot-4u kernel: [304090.289249]  []
warn_slowpath_common+0x82/0xc0
Aug  4 21:55:06 yemot-4u kernel: [304090.289252]  []
warn_slowpath_fmt+0x5c/0x80
Aug  4 21:55:06 yemot-4u kernel: [304090.289268]  []
btrfs_run_delayed_refs+0x26b/0x2a0 [btrfs]
Aug  4 21:55:06 yemot-4u kernel: [304090.289284]  []
delayed_ref_async_start+0x37/0x90 [btrfs]
Aug  4 21:55:06 yemot-4u kernel: [304090.289303]  []
btrfs_scrubparity_helper+0xca/0x2f0 [btrfs]
Aug  4 21:55:06 yemot-4u kernel: [304090.289307]  []
? tty_ldisc_deref+0x16/0x20
Aug  4 21:55:06 yemot-4u kernel: [304090.289326]  []
btrfs_extent_refs_helper+0xe/0x10 [btrfs]
Aug  4 21:55:06 yemot-4u kernel: [304090.289330]  []
process_one_work+0x165/0x480
Aug  4 21:55:06 yemot-4u kernel: [304090.289333]  []
worker_thread+0x4b/0x4c0
Aug  4 21:55:06 yemot-4u kernel: [304090.289336]  []
? process_one_work+0x480/0x480
Aug  4 21:55:06 yemot-4u kernel: [304090.289339]  []
kthread+0xd8/0xf0
Aug  4 21:55:06 yemot-4u kernel: [304090.289341]  []
? kthread_create_on_node+0x1e0/0x1e0
Aug  4 21:55:06 yemot-4u kernel: [304090.289345]  []
ret_from_fork+0x3f/0x70
Aug  4 21:55:06 yemot-4u kernel: [304090.289348]  []
? kthread_create_on_node+0x1e0/0x1e0
Aug  4 21:55:06 yemot-4u kernel: [304090.289350] ---[ end trace
90c37e7522254f86 ]---
Aug  4 21:55:06 yemot-4u kernel: [304090.289353] BTRFS: error (device
md1) in btrfs_run_delayed_refs:2927: errno=-28 No space left
Aug  4 21:55:06 yemot-4u kernel: [304090.328312] BTRFS: error (device
md1) in __btrfs_free_extent:6552: errno=-28 No space left
Aug  4 21:55:06 yemot-4u kernel: [304090.328344] BTRFS: error (device
md1) in btrfs_run_delayed_refs:2927: errno=-28 No space left

root@yemot-4u:~# uname -a
Linux yemot-4u 4.4.0-31-ge

possible bug - wrong path in 'btrfs subvolume show' when snapshot is in path below subvolume.

2016-08-04 Thread Peter Holm
'btrfs subvolume show' gives no path to the btrfs system root (volid=5)
when the snapshot is in the folder of the subvolume.

Step to reproduce.
1.btrfs subvolume create xyz
2.btrfs subvolume snapshot xyz xyz/xyz
3.btrfs subvolume snapshot /xyz
4.btrfs subvolume show xyz
output
.
Snapshot(s)
 xyz
 xyz
.
Picture from my console reproducing this. Watch out for my personal fs-layout:
my mountpoint for volid=5 is - as seen in the findmount command at the
top of the photo - /mnt/btrfs/sdc16-svid-5
https://s31.postimg.org/9f0d7xb7f/is_this_a_bug.png

If that adds anything, the same thing happens when the root volume is
mounted by path (for the moment it is mounted by volid).
/Peter Holm


Re: possible bug - wrong path in 'btrfs subvolume show' when snapshot is in path below subvolume.

2016-08-04 Thread Peter Holm
writing error.
replace "gives no path to" with "same path as"

/Peter Holm

2016-08-05 1:32 GMT+02:00, Peter Holm :
> 'btrfs subvolume show' gives no path to the btrfs system root (volid=5)
> when the snapshot is in the folder of the subvolume.
>
> Step to reproduce.
> 1.btrfs subvolume create xyz
> 2.btrfs subvolume snapshot xyz xyz/xyz
> 3.btrfs subvolume snapshot /xyz
> 4.btrfs subvolume show xyz
> output
> .
> Snapshot(s)
>  xyz
>  xyz
> .
> Picture from my console reproducing this. Watch out for my personal
> fs-layout:
> my mountpoint for volid=5 is - as seen in the findmount command at the
> top of the photo - /mnt/btrfs/sdc16-svid-5
> https://s31.postimg.org/9f0d7xb7f/is_this_a_bug.png
>
> If that adds anything, the same thing happens when the root volume is
> mounted by path (for the moment it is mounted by volid).
> /Peter Holm
>


Re: [PATCH] exportfs: be careful to only return expected errors.

2016-08-04 Thread NeilBrown
On Thu, Aug 04 2016, Christoph Hellwig wrote:

> On Thu, Aug 04, 2016 at 10:19:06AM +1000, NeilBrown wrote:
>> 
>> 
>> When nfsd calls fh_to_dentry, it expects ESTALE or ENOMEM as errors.
>> In particular it can be tempting to return ENOENT, but this is not
>> handled well by nfsd.
>> 
>> Rather than requiring strict adherence to error codes from filesystems,
>> treat all unexpected error codes the same as ESTALE.  This is safest.
>> 
>> Signed-off-by: NeilBrown 
>> ---
>> 
>> I didn't add a dprintk for unexpected error messages, partly
>> because dprintk isn't usable in exportfs.  I could have used pr_debug()
>> but I really didn't see much value.
>> 
>> This has been tested together with the btrfs change, and it restores
>> correct functionality.
>
> I don't really like all this magic, which is partially historic.  I think
> we should instead allow the fs to return any error from the export
> operations, and forbid returning NULL entirely.  Then the actual caller
> (nfsd) can sort out which errors it wants to send over the wire.

I'm certainly open to that possibility.
But is the "actual caller":
  nfsd_set_fh_dentry(), or
  fh_verify() or
  the various callers of fh_verify() which might have different rules
  about which error codess are acceptable?

I could probably make an argument for having fh_verify() be careful
about error codes, but as exportfs_decode_fh() is a more public
interface, I think it is more important that it have well defined error
options.

Are there *any* errors that could sensibly be returned from
exportfs_decode_fh() other than
  -ESTALE (there is no such file), or
  -ENOMEM (there probably is a file, but I cannot allocate a dentry for
   it) or
  -EACCES (there is such a file, but it isn't "acceptable")

???

If there aren't, why should we let them through?

NeilBrown




[PATCH v3] xfs: test attr_list_by_handle cursor iteration

2016-08-04 Thread Darrick J. Wong
Apparently the XFS attr_list_by_handle ioctl has never actually copied
the cursor contents back to user space, which means that iteration has
never worked.  Add a test case for this and see
"xfs: in _attrlist_by_handle, copy the cursor back to userspace".

v2: Use BULKSTAT_SINGLE for less confusion, fix build errors on RHEL6.
v3: Use path_to_handle instead of bulkstat.

Signed-off-by: Darrick J. Wong 
---
 .gitignore                            |    1 
 src/Makefile                          |    3 +
 src/attr-list-by-handle-cursor-test.c |  118 +
 tests/xfs/700                         |   64 ++
 tests/xfs/700.out                     |    5 +
 tests/xfs/group                       |    1 
 6 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 src/attr-list-by-handle-cursor-test.c
 create mode 100755 tests/xfs/700
 create mode 100644 tests/xfs/700.out

diff --git a/.gitignore b/.gitignore
index 28bd180..e184a6f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -38,6 +38,7 @@
 /src/alloc
 /src/append_reader
 /src/append_writer
+/src/attr-list-by-handle-cursor-test
 /src/bstat
 /src/bulkstat_unlink_test
 /src/bulkstat_unlink_test_modified
diff --git a/src/Makefile b/src/Makefile
index 1bf318b..ae06d50 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -20,7 +20,8 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize 
preallo_rw_pattern_reader \
bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
-   renameat2 t_getcwd e4compact test-nextquota punch-alternating
+   renameat2 t_getcwd e4compact test-nextquota punch-alternating \
+   attr-list-by-handle-cursor-test
 
 SUBDIRS =
 
diff --git a/src/attr-list-by-handle-cursor-test.c 
b/src/attr-list-by-handle-cursor-test.c
new file mode 100644
index 000..4269d1e
--- /dev/null
+++ b/src/attr-list-by-handle-cursor-test.c
@@ -0,0 +1,118 @@
+/*
+ * Copyright (C) 2016 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define ATTRBUFSZ  1024
+#define BSTATBUF_NR32
+
+/* Read all the extended attributes of a file handle. */
+void
+read_handle_xattrs(
+   struct xfs_handle   *handle)
+{
+   struct attrlist_cursor  cur;
+   charattrbuf[ATTRBUFSZ];
+   char*firstname = NULL;
+   struct attrlist *attrlist = (struct attrlist *)attrbuf;
+   struct attrlist_ent *ent;
+   int i;
+   int flags = 0;
+   int error;
+
+   memset(&cur, 0, sizeof(cur));
+   while ((error = attr_list_by_handle(handle, sizeof(*handle),
+   attrbuf, ATTRBUFSZ, flags,
+   &cur)) == 0) {
+   for (i = 0; i < attrlist->al_count; i++) {
+   ent = ATTR_ENTRY(attrlist, i);
+
+   if (i != 0)
+   continue;
+
+   if (firstname == NULL) {
+   firstname = malloc(ent->a_valuelen);
+   memcpy(firstname, ent->a_name, ent->a_valuelen);
+   } else {
+   if (memcmp(firstname, ent->a_name,
+  ent->a_valuelen) == 0)
+   fprintf(stderr,
+   "Saw duplicate xattr \"%s\", 
buggy XFS?\n",
+   ent->a_name);
+   else
+   fprintf(stderr,
+   "Test passes.\n");
+   goto out;
+   }
+   }
+
+   if (!attrlist->al_more)
+   break;
+   }
+
+out:
+   if (firstname)
+   free(firstname);
+   if (error)
+   p

[PATCH v2 0/3] Qgroup fix for dirty hack routines

2016-08-04 Thread Qu Wenruo
This patchset introduce 2 fixes for data extent owner hacks.

One can be triggered by balance, another one can be triggered by log replay
after power loss.

The root causes are all similar: the EXTENT_DATA owner is changed by dirty
hacks, ranging from swapping tree blocks containing EXTENT_DATA to manually
updating extent backrefs without using inc/dec_extent_ref.

The first patch introduces needed functions, then 2 fixes.

The reproducers are all merged into xfstests, as btrfs/123 and btrfs/119.

The 3rd patch stays untouched, while the 2nd patch gets an update thanks to
the report from Goldwyn.

Changelog:
v2:
  Update the 2nd patch to handle cases where the whole subtree, not only
  level 2 nodes, gets updated.

Qu Wenruo (3):
  btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
  btrfs: relocation: Fix leaking qgroups numbers on data extents
  btrfs: qgroup: Fix qgroup incorrectness caused by log replay

 fs/btrfs/delayed-ref.c |   5 +--
 fs/btrfs/extent-tree.c |  36 +++-
 fs/btrfs/qgroup.c  |  39 ++---
 fs/btrfs/qgroup.h  |  44 +--
 fs/btrfs/relocation.c  | 114 ++---
 fs/btrfs/tree-log.c|  16 +++
 6 files changed, 205 insertions(+), 49 deletions(-)

-- 
2.9.2





[PATCH v2 1/3] btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()

2016-08-04 Thread Qu Wenruo
Refactor the btrfs_qgroup_insert_dirty_extent() function into two functions:
1. _btrfs_qgroup_insert_dirty_extent()
   Almost the same as the original code.
   For delayed_ref usage, which has delayed refs locked.

   Change the return value type to int, since caller never needs the
   pointer, but only needs to know if they need to free the allocated
   memory.

2. btrfs_qgroup_record_dirty_extent()
   The more encapsulated version.

   Will do the delayed_refs lock, memory allocation, quota enabled check
   and other misc things.

The original design was to keep the number of exported functions minimal, but
more btrfs hacks have been exposed, like path replacement in balance, which
need us to record dirty extents manually, so we have to add such functions.

Also, add comments for both functions, to inform developers how to keep
qgroup correct when doing such hacks.

Cc: Mark Fasheh 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c |  5 +
 fs/btrfs/extent-tree.c | 36 +---
 fs/btrfs/qgroup.c  | 39 ++-
 fs/btrfs/qgroup.h  | 44 +---
 4 files changed, 81 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..5eed597 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -541,7 +541,6 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_head *existing;
struct btrfs_delayed_ref_head *head_ref = NULL;
struct btrfs_delayed_ref_root *delayed_refs;
-   struct btrfs_qgroup_extent_record *qexisting;
int count_mod = 1;
int must_insert_reserved = 0;
 
@@ -606,9 +605,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
qrecord->num_bytes = num_bytes;
qrecord->old_roots = NULL;
 
-   qexisting = btrfs_qgroup_insert_dirty_extent(delayed_refs,
-qrecord);
-   if (qexisting)
+   if(_btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
kfree(qrecord);
}
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9fcb8c9..47c85ff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8519,34 +8519,6 @@ reada:
wc->reada_slot = slot;
 }
 
-/*
- * These may not be seen by the usual inc/dec ref code so we have to
- * add them here.
- */
-static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
-struct btrfs_root *root, u64 bytenr,
-u64 num_bytes)
-{
-   struct btrfs_qgroup_extent_record *qrecord;
-   struct btrfs_delayed_ref_root *delayed_refs;
-
-   qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
-   if (!qrecord)
-   return -ENOMEM;
-
-   qrecord->bytenr = bytenr;
-   qrecord->num_bytes = num_bytes;
-   qrecord->old_roots = NULL;
-
-   delayed_refs = &trans->transaction->delayed_refs;
-   spin_lock(&delayed_refs->lock);
-   if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
-   kfree(qrecord);
-   spin_unlock(&delayed_refs->lock);
-
-   return 0;
-}
-
 static int account_leaf_items(struct btrfs_trans_handle *trans,
  struct btrfs_root *root,
  struct extent_buffer *eb)
@@ -8580,7 +8552,8 @@ static int account_leaf_items(struct btrfs_trans_handle 
*trans,
 
num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
 
-   ret = record_one_subtree_extent(trans, root, bytenr, num_bytes);
+   ret = btrfs_qgroup_record_dirty_extent(trans, root->fs_info,
+   bytenr, num_bytes, GFP_NOFS);
if (ret)
return ret;
}
@@ -8729,8 +8702,9 @@ walk_down:
btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
 
-   ret = record_one_subtree_extent(trans, root, 
child_bytenr,
-   root->nodesize);
+   ret = btrfs_qgroup_record_dirty_extent(trans,
+   root->fs_info, child_bytenr,
+   root->nodesize, GFP_NOFS);
if (ret)
goto out;
}
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9d4c05b..76d4f67 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1453,9 +1453,9 @@ int btrfs_qgroup_prepare_account_extents(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
-struct btrfs_qgroup_extent_record
-*btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs,
- struct btrfs_qgroup_extent_record *record)
+int _btrfs_qgroup_insert

[PATCH v2 2/3] btrfs: relocation: Fix leaking qgroups numbers on data extents

2016-08-04 Thread Qu Wenruo
When balancing data extents, qgroup will leak all its numbers for
relocated data extents.

The relocation is done in the following steps for data extents:
1) Create data reloc tree and inode
2) Copy all data extents to data reloc tree
   And commit transaction
3) Create tree reloc tree(special snapshot) for any related subvolumes
4) Replace file extent in tree reloc tree with new extents in data reloc
   tree
   And commit transaction
5) Merge tree reloc tree with original fs, by swapping tree blocks

For 1)~4), since the tree reloc tree and data reloc tree don't count towards
qgroup, everything is OK.

But for 5), the swapping of tree blocks will only inform qgroup to track
metadata extents.

If the metadata extents contain file extents, the qgroup numbers for those
file extents get lost, leading to corrupted qgroup accounting.

The fix is, before the transaction commit of step 5), to manually inform
qgroup to track all file extents in the data reloc tree.
Since at transaction commit time the tree swapping is done, qgroup will then
account these data extents correctly.

Cc: Mark Fasheh 
Reported-by: Mark Fasheh 
Reported-by: Filipe Manana 
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Iterate all file extents in data reloc tree, instead of iterating
  leafs of a swapped level 1 tree block.
  This fixes case where a level 2 or higher tree block is merged with
  original subvolume.
---
 fs/btrfs/relocation.c | 114 +++---
 1 file changed, 108 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index fc067b0..def7c9c 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -31,6 +31,7 @@
 #include "async-thread.h"
 #include "free-space-cache.h"
 #include "inode-map.h"
+#include "qgroup.h"
 
 /*
  * backref_node, mapping_node and tree_block start with this
@@ -3912,6 +3913,95 @@ int prepare_to_relocate(struct reloc_control *rc)
return 0;
 }
 
+/*
+ * Qgroup fixer for data chunk relocation.
+ * The data relocation is done in the following steps
+ * 1) Copy data extents into data reloc tree
+ * 2) Create tree reloc tree(special snapshot) for related subvolumes
+ * 3) Modify file extents in tree reloc tree
+ * 4) Merge tree reloc tree with original fs tree, by swapping tree blocks
+ *
+ * The problem is, data and tree reloc tree are not accounted to qgroup,
+ * and 4) will only info qgroup to track tree blocks change, not file extents
+ * in the tree blocks.
+ *
+ * The good news is, related data extents are all in data reloc tree, so we
+ * only need to info qgroup to track all file extents in data reloc tree
+ * before commit trans.
+ */
+static int qgroup_fix_relocated_data_extents(struct btrfs_trans_handle *trans,
+struct reloc_control *rc)
+{
+   struct btrfs_fs_info *fs_info = rc->extent_root->fs_info;
+   struct inode *inode = rc->data_inode;
+   struct btrfs_root *data_reloc_root = BTRFS_I(inode)->root;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   int ret = 0;
+
+   if (!fs_info->quota_enabled)
+   return 0;
+
+   /*
+* Only for stage where we update data pointers the qgroup fix is
+* valid.
+* For MOVING_DATA stage, we will miss the timing of swapping tree
+* blocks, and won't fix it.
+*/
+   if (!(rc->stage == UPDATE_DATA_PTRS && rc->extents_found))
+   return 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+   key.objectid = btrfs_ino(inode);
+   key.type = BTRFS_EXTENT_DATA_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, data_reloc_root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+
+   lock_extent(&BTRFS_I(inode)->io_tree, 0, (u64)-1);
+   while (1) {
+   struct btrfs_file_extent_item *fi;
+
+   btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+   if (key.objectid > btrfs_ino(inode))
+   break;
+   if (key.type != BTRFS_EXTENT_DATA_KEY)
+   goto next;
+   fi = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_file_extent_item);
+   if (btrfs_file_extent_type(path->nodes[0], fi) !=
+   BTRFS_FILE_EXTENT_REG)
+   goto next;
+   /*
+   pr_info("disk bytenr: %llu, num_bytes: %llu\n",
+   btrfs_file_extent_disk_bytenr(path->nodes[0], fi),
+   btrfs_file_extent_disk_num_bytes(path->nodes[0], fi));
+   */
+   ret = btrfs_qgroup_record_dirty_extent(trans, fs_info,
+   btrfs_file_extent_disk_bytenr(path->nodes[0], fi),
+   btrfs_file_extent_disk_num_bytes(path->nodes[0], fi),
+   GFP_NOFS);
+   if (ret < 0)
+

[PATCH v2 3/3] btrfs: qgroup: Fix qgroup incorrectness caused by log replay

2016-08-04 Thread Qu Wenruo
When doing log replay at mount time (after power loss), qgroup will leak
the numbers of replayed data extents.

The cause is almost the same as with balance.
So fix it by manually informing qgroup about owner-changed extents.

The bug can be detected by the btrfs/119 test case.

Cc: Mark Fasheh 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/tree-log.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c05f69a..80f8345 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -27,6 +27,7 @@
 #include "backref.h"
 #include "hash.h"
 #include "compression.h"
+#include "qgroup.h"
 
 /* magic values for the inode_only field in btrfs_log_inode:
  *
@@ -680,6 +681,21 @@ static noinline int replay_one_extent(struct 
btrfs_trans_handle *trans,
ins.type = BTRFS_EXTENT_ITEM_KEY;
offset = key->offset - btrfs_file_extent_offset(eb, item);
 
+   /*
+* Manually record dirty extent, as here we did a shallow
+* file extent item copy and skip normal backref update,
+* but modify extent tree all by ourselves.
+* So need to manually record dirty extent for qgroup,
+* as the owner of the file extent changed from log tree
+* (doesn't affect qgroup) to fs/file tree(affects qgroup)
+*/
+   ret = btrfs_qgroup_record_dirty_extent(trans, root->fs_info,
+   btrfs_file_extent_disk_bytenr(eb, item),
+   btrfs_file_extent_disk_num_bytes(eb, item),
+   GFP_NOFS);
+   if (ret < 0)
+   goto out;
+
if (ins.objectid > 0) {
u64 csum_start;
u64 csum_end;
-- 
2.9.2





Re: [4.8] btrfs heats my room with lock contention

2016-08-04 Thread Dave Chinner
On Thu, Aug 04, 2016 at 10:28:44AM -0400, Chris Mason wrote:
> 
> 
> On 08/04/2016 02:41 AM, Dave Chinner wrote:
> >
> >Simple test. 8GB pmem device on a 16p machine:
> >
> ># mkfs.btrfs /dev/pmem1
> ># mount /dev/pmem1 /mnt/scratch
> ># dbench -t 60 -D /mnt/scratch 16
> >
> >And heat your room with the warm air rising from your CPUs. Top
> >half of the btrfs profile looks like:
.
> >Performance vs CPu usage is:
> >
> >nprocs  throughput  cpu usage
> >1   440MB/s  50%
> >2   770MB/s 100%
> >4   880MB/s 250%
> >8   690MB/s 450%
> >16  280MB/s 950%
> >
> >In comparison, at 8-16 threads ext4 is running at ~2600MB/s and
> >XFS is running at ~3800MB/s. Even if I throw 300-400 processes at
> >ext4 and XFS, they only drop to ~1500-2000MB/s as they hit internal
> >limits.
> >
> Yes, with dbench btrfs does much much better if you make a subvol
> per dbench dir.  The difference is pretty dramatic.  I'm working on
> it this month, but focusing more on database workloads right now.

You've been giving this answer to lock contention reports for the
past 6-7 years, Chris.  I really don't care about getting big
benchmark numbers with contrived setups - the "use multiple
subvolumes" solution is simply not practical for users or their
workloads.  The default config should behave sanely and not
contribute to global warming like this.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com