Re: Extents for a particular subvolume
On 2016-08-03 17:55, Graham Cobb wrote:
> On 03/08/16 21:37, Adam Borowski wrote:
>> On Wed, Aug 03, 2016 at 08:56:01PM +0100, Graham Cobb wrote:
>>> Are there any btrfs commands (or APIs) to allow a script to create a
>>> list of all the extents referred to within a particular (mounted)
>>> subvolume? And is it a reasonably efficient process (i.e. doesn't
>>> involve backrefs and, preferably, doesn't involve following
>>> directory trees)?
>>
>> Since the size of your output is linear to the number of extents,
>> which is between the number of files and the sum of their sizes, I
>> see no gain in trying to avoid following the directory tree.
>
> Thanks for the help, Adam. There are a lot of files and a lot of
> directories - find, "ls -R" and similar operations take a very long
> time. I was hoping that I could query some sort of extent tree for the
> subvolume and get the answer back in seconds instead of multiple
> minutes. But I can follow the directory tree if I need to.
>
>>> I am not looking to relate the extents to files/inodes/paths. My
>>> particular need, at the moment, is to work out how much of two
>>> snapshots is shared data, but I can think of other uses for the
>>> information.
>>
>> Thus, unlike the question you asked above, you're not interested in
>> _all_ extents, merely those which changed. You may want to look at
>> "btrfs subv find-new" and "btrfs send --no-data".
>
> Unfortunately, the subvolumes do not have an ancestor-descendant
> relationship (although they do have some common ancestors), so I don't
> think find-new is much help (as far as I can see). But just looking at
> the size of the output from "send -c" would work well enough for the
> particular problem I am trying to solve tonight! Although I will need
> to take read-only snapshots of the subvolumes to allow send to work.
> Thanks for the suggestion.
FWIW, if you're not using any files in the subvolumes, you can run
"btrfs property set <subvolume> ro true" to mark them read-only so you
don't need the snapshots, and then run the same command with 'false' at
the end instead of 'true' to mark them writable again.

I would still be interested in the extent list, though. The main problem
with find-new and send is that they don't tell me how much has been
deleted, only added. I am thinking about using the extents to get a much
better handle on what is using up space and what I could recover if I
removed (or moved to another volume) various groups of related
subvolumes.

You may want to look into the 'btrfs filesystem usage' and 'btrfs
filesystem du' commands. I'm not sure if they'll cover what you need,
but they can show info about how much is shared.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
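The workflow discussed in this thread - marking the subvolumes
read-only, then using the size of a no-data send stream as a rough proxy
for unshared data - can be sketched like this. The paths are
placeholders, and treating the send-stream size as an estimate of
sharing is an assumption of this sketch, not an exact accounting:

```shell
# Mark both subvolumes read-only so 'btrfs send' will accept them
# (paths are placeholders for the real subvolumes).
btrfs property set /data/snap-a ro true
btrfs property set /data/snap-b ro true

# Size of a full no-data send of B: roughly proportional to its metadata.
full=$(btrfs send --no-data /data/snap-b | wc -c)

# Size of a send of B using A as a clone source: extents already present
# in A are referenced rather than re-sent, so the shrinkage hints at how
# much the two subvolumes share.
incr=$(btrfs send --no-data -c /data/snap-a /data/snap-b | wc -c)

echo "full: $full bytes, with clone source: $incr bytes"

# Make them writable again when done.
btrfs property set /data/snap-a ro false
btrfs property set /data/snap-b ro false
```

As noted in the thread, this only measures what send can deduplicate via
the clone source; it says nothing about data deleted from one side.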
Re: memory overflow or underflow in free space tree / space_info?
On 29.07.2016 23:03, Josef Bacik wrote:
> On 07/29/2016 03:14 PM, Omar Sandoval wrote:
>> On Fri, Jul 29, 2016 at 12:11:53PM -0700, Omar Sandoval wrote:
>>> On Fri, Jul 29, 2016 at 08:40:26PM +0200, Stefan Priebe - Profihost AG wrote:
>>>> Dear list,
>>>>
>>>> I'm seeing btrfs no space messages frequently on big filesystems
>>>> (> 30TB). In all cases I'm getting a trace like this one and a
>>>> space_info warning (since commit [1]). Could someone please be so
>>>> kind and help me debugging / fixing this bug? I'm using
>>>> space_cache=v2 on all those systems.
>>>
>>> Hm, so I think this indicates a bug in space accounting somewhere
>>> else rather than the free space tree itself. I haven't debugged one
>>> of these issues before; I'll see if I can reproduce it. Cc'ing
>>> Josef, too.
>>
>> I should've asked, what sort of filesystem activity triggers this?
>
> Chris just fixed this I think, try his next branch from his git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git

Thanks, now running a 4.4 with those patches backported. If that still
shows an error I will try that vanilla tree. Thanks!

Stefan

> and see if it still happens. Thanks,
>
> Josef
Re: [PATCH] exportfs: be careful to only return expected errors.
On Thu, Aug 04, 2016 at 10:19:06AM +1000, NeilBrown wrote:
>
> When nfsd calls fh_to_dentry, it expects ESTALE or ENOMEM as errors.
> In particular it can be tempting to return ENOENT, but this is not
> handled well by nfsd.
>
> Rather than requiring strict adherence to error codes from
> filesystems, treat all unexpected error codes the same as ESTALE.
> This is safest.
>
> Signed-off-by: NeilBrown
> ---
>
> I didn't add a dprintk for unexpected error messages, partly because
> dprintk isn't usable in exportfs. I could have used pr_debug() but I
> really didn't see much value.
>
> This has been tested together with the btrfs change, and it restores
> correct functionality.

I don't really like all this magic, which is partially historic. I think
we should instead allow the fs to return any error from the export
operations, and forbid returning NULL entirely. Then the actual caller
(nfsd) can sort out which errors it wants to send over the wire.
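The behaviour Neil's patch description asks for can be sketched in a few
lines. The helper name below is hypothetical (it is not from the actual
patch); it only illustrates the coercion being discussed:

```c
#include <errno.h>

/* Sketch of the patch's idea (helper name is hypothetical): nfsd only
 * copes with ESTALE and ENOMEM from fh_to_dentry, so any other error a
 * filesystem returns is coerced to ESTALE. */
int exportfs_sanitize_error(int err)
{
	switch (err) {
	case -ESTALE:
	case -ENOMEM:
		return err;	/* expected by nfsd: pass through */
	default:
		return -ESTALE;	/* e.g. -ENOENT from btrfs: treat as stale */
	}
}
```

Under this scheme a filesystem returning -ENOENT (the case that
prompted the btrfs change) is seen by nfsd as an ordinary stale handle.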
Re: [4.8] btrfs heats my room with lock contention
On 08/04/2016 02:41 AM, Dave Chinner wrote:
> Simple test. 8GB pmem device on a 16p machine:
>
> # mkfs.btrfs /dev/pmem1
> # mount /dev/pmem1 /mnt/scratch
> # dbench -t 60 -D /mnt/scratch 16
>
> And heat your room with the warm air rising from your CPUs. Top half
> of the btrfs profile looks like:
>
>  36.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
>  32.29%  [kernel]  [k] native_queued_spin_lock_slowpath
>   5.14%  [kernel]  [k] queued_write_lock_slowpath
>   2.46%  [kernel]  [k] _raw_spin_unlock_irq
>   2.15%  [kernel]  [k] queued_read_lock_slowpath
>   1.54%  [kernel]  [k] _find_next_bit.part.0
>   1.06%  [kernel]  [k] __crc32c_le
>   0.82%  [kernel]  [k] btrfs_tree_lock
>   0.79%  [kernel]  [k] steal_from_bitmap.part.29
>   0.70%  [kernel]  [k] __copy_user_nocache
>   0.69%  [kernel]  [k] btrfs_tree_read_lock
>   0.69%  [kernel]  [k] delay_tsc
>   0.64%  [kernel]  [k] btrfs_set_lock_blocking_rw
>   0.63%  [kernel]  [k] copy_user_generic_string
>   0.51%  [kernel]  [k] do_raw_read_unlock
>   0.48%  [kernel]  [k] do_raw_spin_lock
>   0.47%  [kernel]  [k] do_raw_read_lock
>   0.46%  [kernel]  [k] btrfs_clear_lock_blocking_rw
>   0.44%  [kernel]  [k] do_raw_write_lock
>   0.41%  [kernel]  [k] __do_softirq
>   0.28%  [kernel]  [k] __memcpy
>   0.24%  [kernel]  [k] map_private_extent_buffer
>   0.23%  [kernel]  [k] find_next_zero_bit
>   0.22%  [kernel]  [k] btrfs_tree_read_unlock
>
> Performance vs CPU usage is:
>
>  nprocs  throughput  cpu usage
>  1       440MB/s     50%
>  2       770MB/s     100%
>  4       880MB/s     250%
>  8       690MB/s     450%
>  16      280MB/s     950%
>
> In comparison, at 8-16 threads ext4 is running at ~2600MB/s and XFS is
> running at ~3800MB/s. Even if I throw 300-400 processes at ext4 and
> XFS, they only drop to ~1500-2000MB/s as they hit internal limits.

Yes, with dbench btrfs does much much better if you make a subvol per
dbench dir. The d
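The subvolume-per-client workaround mentioned in the reply can be
sketched as follows. It relies on dbench placing each client under
clients/clientN inside the -D directory, which is an assumption of this
sketch; device and mount point are placeholders from Dave's test:

```shell
# Pre-create one subvolume per dbench client so each client works in
# its own btrfs tree, reducing contention on shared tree locks.
mkfs.btrfs -f /dev/pmem1
mount /dev/pmem1 /mnt/scratch
mkdir -p /mnt/scratch/clients
for i in $(seq 0 15); do
    btrfs subvolume create "/mnt/scratch/clients/client$i"
done
dbench -t 60 -D /mnt/scratch 16
```

Each subvolume is a separate tree with its own locks, which is why this
sidesteps some of the contention visible in the profile above.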
Re: [PATCH 37/45] drivers: use req op accessor
On Wed, Aug 03, 2016 at 07:30:29PM -0500, Shaun Tancheff wrote:
> I think the translation in loop.c is suspicious here:
>
> "if use DIO && not (a flush_flag or discard_flag)"
> should translate to:
> "if use DIO && not ((a flush_flag) || op == discard)"
>
> But in the patch I read:
> "if use DIO && ((not a flush_flag) || op == discard)"
>
> Which would have DIO && discards follow the AIO path?

Indeed. Sorry for missing out on your patch; I just sent a fix in reply
to Dave's other report earlier which is pretty similar to yours.
Re: [PATCH 37/45] drivers: use req op accessor
On Thu, Aug 4, 2016 at 10:46 AM, Christoph Hellwig wrote:
> On Wed, Aug 03, 2016 at 07:30:29PM -0500, Shaun Tancheff wrote:
>> I think the translation in loop.c is suspicious here:
>>
>> "if use DIO && not (a flush_flag or discard_flag)"
>> should translate to:
>> "if use DIO && not ((a flush_flag) || op == discard)"
>>
>> But in the patch I read:
>> "if use DIO && ((not a flush_flag) || op == discard)"
>>
>> Which would have DIO && discards follow the AIO path?
>
> Indeed. Sorry for missing out on your patch, I just sent a fix in
> reply to Dave's other report earlier which is pretty similar to yours.

No worries. I prefer your switch over an if conditional here.

--
Shaun Tancheff
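The precedence mix-up being discussed is easy to reproduce in miniature.
The helpers below are illustrative only (the names and bool flags are
not the actual loop.c code): with the misplaced negation, a direct-I/O
discard incorrectly satisfies the predicate:

```c
#include <stdbool.h>

/* Intended predicate: take the AIO path only when DIO is enabled and
 * the request is neither a flush nor a discard. */
bool use_aio_intended(bool use_dio, bool is_flush, bool is_discard)
{
	return use_dio && !(is_flush || is_discard);
}

/* Buggy form from the patch: the negation binds only to the flush
 * flag, so "DIO + discard" wrongly evaluates to true. */
bool use_aio_buggy(bool use_dio, bool is_flush, bool is_discard)
{
	return use_dio && (!is_flush || is_discard);
}
```

For a DIO discard (use_dio = true, is_flush = false, is_discard = true)
the intended form yields false while the buggy form yields true, which
is exactly the misrouting Shaun describes.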
Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem
Hi,

I was today hit by what I think is probably the same bug: a btrfs on a
close-to-4TB sized block device, only half filled to almost exactly
2TB, suddenly says "no space left on device" upon any attempt to write
to it. The filesystem was NOT automatically switched to read-only by
the kernel, I should mention.

Re-mounting (which is a pain, as this filesystem is used for $HOMEs of
a multitude of active users who I have to kick from the server for
doing things like re-mounting) removed the symptom for now, but from
what I can read in the linux-btrfs mailing list archives, it is pretty
likely the symptom will re-appear.

Here are some more details:

Software versions:
  linux-4.6.1 (vanilla from kernel.org)
  btrfs-progs v4.1

Info obtained while the symptom occurred (before re-mount):

> btrfs filesystem show /data3
Label: 'data3'  uuid: f4c69d29-62ac-4e15-a825-c6283c8fd74c
        Total devices 1  FS bytes used 2.05TiB
        devid 1 size 3.64TiB used 2.16TiB path /dev/mapper/cryptedResourceData3

(/dev/mapper/cryptedResourceData3 is a dm-crypt device, which is based
on a DRBD block device, which is based on locally attached SATA disks
on two servers - no trouble with that setup for years, no I/O errors or
such; the same kind of block-device stack is also used for another
btrfs and some XFS filesystems.)

> btrfs filesystem df /data3
Data, single: total=2.11TiB, used=2.01TiB
System, single: total=4.00MiB, used=256.00KiB
Metadata, single: total=48.01GiB, used=36.67GiB
GlobalReserve, single: total=512.00MiB, used=5.52MiB

Currently, and at the time the bug occurred, no snapshots existed on
"/data3". A snapshot is created once per night, a backup created, then
the snapshot is removed again. There is lots of mixed I/O activity
during the day, both from interactive users and from automatic build
processes and such.
dmesg output from the time the "no space left on device" symptom appeared:

[5171203.601620] WARNING: CPU: 4 PID: 23208 at fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x263/0x2a0 [btrfs]
[5171203.602719] Modules linked in: dm_snapshot dm_bufio fuse btrfs xor raid6_pq nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter drbd lru_cache bridge stp llc kvm_amd kvm irqbypass ghash_clmulni_intel amd64_edac_mod ses edac_mce_amd enclosure edac_core sp5100_tco pcspkr k10temp fam15h_power sg i2c_piix4 shpchp acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c dm_crypt mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe crct10dif_pclmul crc32_pclmul crc32c_intel igb ahci libahci aesni_intel glue_helper libata lrw gf128mul ablk_helper mdio cryptd ptp serio_raw i2c_algo_bit pps_core i2c_core dca sd_mod dm_mirror dm_region_hash dm_log dm_mod
...
[5171203.617358] Call Trace:
[5171203.618543] [] dump_stack+0x4d/0x6c
[5171203.619568] [] __warn+0xe3/0x100
[5171203.620660] [] warn_slowpath_null+0x1d/0x20
[5171203.621779] [] btrfs_destroy_inode+0x263/0x2a0 [btrfs]
[5171203.622716] [] destroy_inode+0x3b/0x60
[5171203.623774] [] evict+0x11c/0x180
...
[5171230.306037] WARNING: CPU: 18 PID: 12656 at fs/btrfs/extent-tree.c:4233 btrfs_free_reserved_data_space_noquota+0xf3/0x100 [btrfs]
[5171230.310298] Modules linked in: dm_snapshot dm_bufio fuse btrfs xor raid6_pq nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter drbd lru_cache bridge stp llc kvm_amd kvm irqbypass ghash_clmulni_intel amd64_edac_mod ses edac_mce_amd enclosure edac_core sp5100_tco pcspkr k10temp fam15h_power sg i2c_piix4 shpchp acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c dm_crypt mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe crct10dif_pclmul crc32_pclmul crc32c_intel igb ahci libahci aesni_intel glue_helper libata lrw gf128mul ablk_helper mdio cryptd ptp serio_raw i2c_algo_bit pps_core i2c_core dca sd_mod dm_mirror dm_region_hash dm_log dm_mod
...
[5171230.341755] Call Trace:
[5171230.344119] [] dump_stack+0x4d/0x6c
[5171230.346444] [] __warn+0xe3/0x100
[5171230.348709] [] warn_slowpath_null+0x1d/0x20
[5171230.350976] [] btrfs_free_reserved_data_space_noquota+0xf3/0x100 [btrfs]
[5171230.353212] [] btrfs_clear_bit_hook+0x27f/0x350 [btrfs]
[5171230.355392] [] ? free_extent_state+0x1a/0x20 [btrfs]
[5171230.357556] [] clear_state_bit+0x66/0x1d0 [btrfs]
[5171230.359698] [] __clear_extent_bit+0x224/0x3a0 [btrfs]
[5171230.361810] [] ? btrfs_update_reserved_bytes+0x45/0x130 [btrfs]
[5171230.363960] [] extent_clear_unlock_delalloc+0x7a/0x2d0 [btrfs]
[5171230.366079] [] ? kmem_cache_alloc+0x17d/0x1f0
[5171230.368204] [] ? __btrfs_add_ordered_extent+0x43/0x310 [btrfs]
[5171230.370350] [] ? __btrfs_add_ordered_extent+0x1fb/0x310 [btrfs]
[5171230.372491] [] cow_file_range+0x28a/0x460 [btrfs]
[517
How to stress test raid6 on 122 disk array
Hi,

I would like to find rare raid6 bugs in btrfs, where I have the
following hw:

* 2x 8 core CPU
* 128GB ram
* 70 FC disk array (56x 500GB + 14x 1TB SATA disks)
* 24 FC or 2x SAS disk array (1TB SAS disks)
* 16 FC disk array (1TB SATA disks)
* 12 SAS disk array (3TB SATA disks)

The test can run for a month or so. I prefer CentOS/Fedora, but if
someone will write a script that configures and compiles a preferred
kernel, then we can do that on any preferred OS.

Can anyone give recommendations on how the setup should be configured
to most likely find rare raid6 bugs? And does there exist a script that
is good for testing this sort of thing?

Best regards,
Martin
Re: How to stress test raid6 on 122 disk array
On 2016-08-04 13:43, Martin wrote:
> Hi,
>
> I would like to find rare raid6 bugs in btrfs, where I have the
> following hw:
>
> * 2x 8 core CPU
> * 128GB ram
> * 70 FC disk array (56x 500GB + 14x 1TB SATA disks)
> * 24 FC or 2x SAS disk array (1TB SAS disks)
> * 16 FC disk array (1TB SATA disks)
> * 12 SAS disk array (3TB SATA disks)
>
> The test can run for a month or so. I prefer CentOS/Fedora, but if
> someone will write a script that configures and compiles a preferred
> kernel, then we can do that on any preferred OS.
>
> Can anyone give recommendations on how the setup should be configured
> to most likely find rare raid6 bugs? And does there exist a script
> that is good for testing this sort of thing?

I'm glad to hear there are people interested in testing BTRFS for the
purpose of finding bugs. Sadly, I can't provide much help in this
respect (I do testing, but it's all regression testing these days).

Regarding the OS, I'd avoid CentOS for testing something like BTRFS
unless you specifically want to help their development team fix issues.
They have a large number of back-ported patches, and it's not all that
practical for us to chase down bugs in such a situation, because it
could just as easily be a bug introduced by the back-porting process or
may be fixed in the mainline kernel anyway. Fedora should be fine
(they're good about staying up to date), but if possible you should
probably use Rawhide instead of a regular release, as that will give
you quite possibly one of the closest distribution kernels to a
mainline Linux kernel available, and will make sure everything is as up
to date as possible.

As far as testing, I don't know that there are any scripts for this
type of thing; you may want to look into dbench, fio, iozone, and
similar tools though, as well as xfstests (which is more about
regression testing, but is still worth looking at).
Most of the big known issues with RAID6 in BTRFS at the moment involve
device failures and array recovery, but most of them aren't well
characterized and nobody's really sure why they're happening, so if you
want to look for something specific, figuring out those issues would be
a great place to start (even if they aren't rare bugs).
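A minimal harness along the lines Austin suggests might look like the
following. Everything here is a sketch with placeholder device names;
the fio parameters are arbitrary rather than a vetted workload, and the
devid passed to "btrfs replace" depends on which member was lost:

```shell
# Build a raid6 array from six scratch devices (placeholders!).
mkfs.btrfs -f -d raid6 -m raid6 /dev/sd[b-g]
mount /dev/sdb /mnt/test

# Random-write load for ten minutes while the array is healthy.
fio --name=stress --directory=/mnt/test --rw=randwrite --size=2G \
    --numjobs=8 --time_based --runtime=600 --group_reporting &

# Mid-run, drop one member to exercise the degraded code paths.
echo 1 > /sys/block/sdd/device/delete
wait

# Remount degraded, replace the missing device, then scrub to check
# that data and parity come back consistent.
umount /mnt/test
mount -o degraded /dev/sdb /mnt/test
btrfs replace start -B 3 /dev/sdh /mnt/test   # "3" = devid of lost disk
btrfs scrub start -Bd /mnt/test
```

Repeating this cycle, and comparing scrub results before and after the
replace, targets exactly the degraded-recovery area where the known
raid56 issues live.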
[GIT PULL] Btrfs
Hi Linus,

This is part two of my btrfs pull, which is some cleanups and a batch of
fixes. Most of the code here is from Jeff Mahoney, making the pointers
we pass around internally more consistent and less confusing overall. I
noticed a small problem right before I sent this out yesterday, so I
fixed it up and re-tested overnight.

Please pull my for-linus-4.8 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.8

There are some minor conflicts against Mike Christie's changes in your
tree. I've put the conflict resolution I used for testing here:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.8-merged

Jeff Mahoney (14) commits (+754/-669):
    btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root (+19/-21)
    btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction (+1/-1)
    btrfs: btrfs_test_opt and friends should take a btrfs_fs_info (+135/-130)
    btrfs: cleanup, remove prototype for btrfs_find_root_ref (+0/-3)
    btrfs: btrfs_abort_transaction, drop root parameter (+147/-152)
    btrfs: convert nodesize macros to static inlines (+33/-15)
    btrfs: tests, move initialization into tests/ (+48/-77)
    btrfs: add btrfs_trans_handle->fs_info pointer (+6/-4)
    btrfs: copy_to_sk drop unused root parameter (+2/-3)
    btrfs: simpilify btrfs_subvol_inherit_props (+3/-3)
    btrfs: prefix fsid to all trace events (+186/-158)
    btrfs: tests, require fs_info for root (+103/-61)
    btrfs: plumb fs_info into btrfs_work (+63/-31)
    btrfs: introduce BTRFS_MAX_ITEM_SIZE (+8/-10)

Liu Bo (10) commits (+149/-49):
    Btrfs: change BUG_ON()'s to ASSERT()'s in backref_cache_cleanup() (+6/-6)
    Btrfs: error out if generic_bin_search get invalid arguments (+8/-0)
    Btrfs: check inconsistence between chunk and block group (+16/-1)
    Btrfs: fix unexpected balance crash due to BUG_ON (+24/-4)
    Btrfs: fix eb memory leak due to readpage failure (+22/-3)
    Btrfs: fix BUG_ON in btrfs_submit_compressed_write (+8/-2)
    Btrfs: fix read_node_slot to return errors (+52/-21)
    Btrfs: fix panic in balance due to EIO (+4/-0)
    Btrfs: cleanup BUG_ON in merge_bio (+6/-3)
    Btrfs: fix double free of fs root (+3/-9)

Nikolay Borisov (4) commits (+49/-20):
    btrfs: Ratelimit "no csum found" info message (+1/-1)
    btrfs: Handle uninitialised inode eviction (+8/-1)
    btrfs: Add ratelimit to btrfs printing (+24/-2)
    btrfs: Fix slab accounting flags (+16/-16)

Wang Xiaoguang (3) commits (+45/-13):
    btrfs: expand cow_file_range() to support in-band dedup and subpage-blocksize (+41/-11)
    btrfs: add missing bytes_readonly attribute file in sysfs (+2/-0)
    btrfs: fix free space calculation in dump_space_info() (+2/-2)

Anand Jain (2) commits (+40/-36):
    btrfs: make sure device is synced before return (+5/-0)
    btrfs: reorg btrfs_close_one_device() (+35/-36)

David Sterba (2) commits (+4/-3):
    btrfs: remove obsolete part of comment in statfs (+0/-3)
    btrfs: hide test-only member under ifdef (+4/-0)

Ashish Samant (1) commits (+35/-37):
    btrfs: Cleanup compress_file_range()

Chris Mason (1) commits (+3/-2):
    Btrfs: fix __MAX_CSUM_ITEMS

Chandan Rajendra (1) commits (+1/-1):
    Btrfs: subpage-blocksize: Rate limit scrub error message

Salah Triki (1) commits (+1/-2):
    btrfs: Replace -ENOENT by -ERANGE in btrfs_get_acl()

Hans van Kranenburg (1) commits (+1/-1):
    Btrfs: use the correct struct for BTRFS_IOC_LOGICAL_INO

Total: (40) commits (+1082/-833)

 fs/btrfs/acl.c              |   3 +-
 fs/btrfs/async-thread.c     |  31 +++-
 fs/btrfs/async-thread.h     |   6 +-
 fs/btrfs/backref.c          |   4 +-
 fs/btrfs/compression.c      |  10 +-
 fs/btrfs/ctree.c            |  91 ++
 fs/btrfs/ctree.h            | 101 ++-
 fs/btrfs/dedupe.h           |  24 +++
 fs/btrfs/delayed-inode.c    |   4 +-
 fs/btrfs/delayed-ref.c      |  17 +-
 fs/btrfs/dev-replace.c      |   4 +-
 fs/btrfs/disk-io.c          | 101 +--
 fs/btrfs/disk-io.h          |   3 +-
 fs/btrfs/extent-tree.c      | 124 --
 fs/btrfs/extent_io.c        |  30 +++-
 fs/btrfs/extent_map.c       |   2 +-
 fs/btrfs/file-item.c        |   4 +-
 fs/btrfs/file.c             |  12 +-
 fs/btrfs/free-space-cache.c |   8 +-
 fs/btrfs/free-space-tree.c  |  16 +-
 fs/btrfs/inode-map.c        |  16 +-
 fs/btrfs/inode.c            | 218
 fs/btrfs/ioctl.c            |  40 ++---
 fs/btrfs/ordered-data.c     |   2 +-
 fs/btrfs/props.c            |   6 +-
 fs/btrfs/qgroup.c           |  25 +--
 fs/btrfs/qgroup.h           |   9 +-
 fs/btrfs/relocation.c       |  20 ++-
 fs/btrfs
Re: How to stress test raid6 on 122 disk array
On Thu, Aug 4, 2016 at 1:05 PM, Austin S. Hemmelgarn wrote:
> Fedora should be fine (they're good about staying up to date), but if
> possible you should probably use Rawhide instead of a regular release,
> as that will give you quite possibly one of the closest distribution
> kernels to a mainline Linux kernel available, and will make sure
> everything is as up to date as possible.

Yes. It's possible to run on a release version (currently Fedora 23 and
Fedora 24) and run a Rawhide kernel. This is what I often do.

> As far as testing, I don't know that there are any scripts for this
> type of thing, you may want to look into dbench, fio, iozone, and
> similar tools though, as well as xfstests (which is more about
> regression testing, but is still worth looking at).
>
> Most of the big known issues with RAID6 in BTRFS at the moment involve
> device failures and array recovery, but most of them aren't well
> characterized and nobody's really sure why they're happening, so if
> you want to look for something specific, figuring out those issues
> would be a great place to start (even if they aren't rare bugs).

Yeah, it seems pretty reliable to do normal things with raid56 arrays.
The problem is when they're degraded; weird stuff seems to happen some
of the time. So it might be valid to have several raid56's that are
intentionally running in degraded mode, with some tests that will
tolerate that, and see when it breaks and why.

There is also in the archives the bug where parity is being computed
wrongly when a data strip is wrong (corrupt), and Btrfs sees this,
reports the mismatch, fixes the mismatch, recomputes parity for some
reason, and the parity is then wrong. It'd be nice to know when else
this can happen: if it's possible parity is recomputed (and wrongly) on
a normal read, or a balance, or if it's really restricted to scrub.

Another test might be raid1 or raid10 metadata vs raid56 for data.
That'd probably be more performance related, but there might be some
unexpected behaviors that crop up.

--
Chris Murphy
Re: [PATCH] exportfs: be careful to only return expected errors.
On Thu, Aug 04, 2016 at 05:47:19AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 04, 2016 at 10:19:06AM +1000, NeilBrown wrote:
> >
> > When nfsd calls fh_to_dentry, it expects ESTALE or ENOMEM as errors.
> > In particular it can be tempting to return ENOENT, but this is not
> > handled well by nfsd.
> >
> > Rather than requiring strict adherence to error codes from
> > filesystems, treat all unexpected error codes the same as ESTALE.
> > This is safest.
> >
> > Signed-off-by: NeilBrown
> > ---
> >
> > I didn't add a dprintk for unexpected error messages, partly because
> > dprintk isn't usable in exportfs. I could have used pr_debug() but I
> > really didn't see much value.
> >
> > This has been tested together with the btrfs change, and it restores
> > correct functionality.
>
> I don't really like all this magic which is partially historic. I
> think we should instead allow the fs to return any error from the
> export operations,

What errors other than ENOENT and ENOMEM do you think are reasonable?
ENOENT is going to screw up both nfsd and open_by_fhandle_at, which are
the only callers.

> and forbid returning NULL entirely. Then the actual caller (nfsd) can
> sort out which errors it wants to send over the wire.

The needs of those two callers don't look very different to me, and I
can't recall seeing a correct use of an error other than ESTALE or
ENOMEM, so I've been thinking of it more as a question of how to best
handle a misbehaving filesystem.

--b.
Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem
On Thu, Aug 4, 2016 at 10:53 AM, Lutz Vieweg wrote:
> The amount of threads on "lost or unused free space" without
> resolutions in the btrfs mailing list archive is really frightening.
> If these symptoms commonly re-appear with no fix in sight, I'm afraid
> I'll have to either resort to using XFS (with ugly block-device based
> snapshots for backup) or try my luck with OpenZFS :-(

Keep in mind the list is rather self-selecting for problems. People who
aren't having problems are unlikely to post their non-problems to the
list.

It'll be interesting to see what other suggestions you get, but I see
it as basically three options, in order of increasing risk+effort:

a. Try the clear_cache mount option (one time) and let the file system
stay mounted so the cache is recreated. If the problem happens soon
after again, try nospace_cache. This might buy you time before 4.8 is
out, which has a bunch of new enospc code in it.

b. Recreate the file system. For reasons not well understood, some file
systems just get stuck in this state with bogus enospc claims.

c. Take some risk and use 4.8 rc1 once it's out. Just make sure to keep
backups. I have no idea to what degree the new enospc code can help
well-used existing systems already having enospc issues, versus the
code preventing the problem from happening in the first place. So you
may end up at b. anyway.

--
Chris Murphy
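Option (a) as concrete commands, using the device and mount point from
the report earlier in the thread. This is a sketch: clear_cache only
takes effect on a fresh mount, so it assumes the $HOME users can be
kicked off briefly for the unmount:

```shell
# One-time cache rebuild: unmount, remount with clear_cache, and leave
# the filesystem mounted so the free space cache is regenerated.
umount /data3
mount -o clear_cache /dev/mapper/cryptedResourceData3 /data3

# If bogus ENOSPC returns, fall back to running without the cache:
umount /data3
mount -o nospace_cache /dev/mapper/cryptedResourceData3 /data3
```

Subsequent mounts can drop the special options; clear_cache is a
one-shot operation, while nospace_cache must be given on every mount
where the cache should stay disabled.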
Re: How to stress test raid6 on 122 disk array
Thanks for the benchmark tools and tips on where the issues might be.

Is Fedora 24 rawhide preferred over ArchLinux?

If I want to compile a mainline kernel, is there anything I need to
tune?

When I do the tests, how do I log the info you would like to see, if I
find a bug?

On 4 August 2016 at 22:01, Chris Murphy wrote:
> On Thu, Aug 4, 2016 at 1:05 PM, Austin S. Hemmelgarn wrote:
>
>> Fedora should be fine (they're good about staying up to date), but if
>> possible you should probably use Rawhide instead of a regular
>> release, as that will give you quite possibly one of the closest
>> distribution kernels to a mainline Linux kernel available, and will
>> make sure everything is as up to date as possible.
>
> Yes. It's possible to run on a release version (currently Fedora 23
> and Fedora 24) and run a Rawhide kernel. This is what I often do.
>
>> As far as testing, I don't know that there are any scripts for this
>> type of thing, you may want to look into dbench, fio, iozone, and
>> similar tools though, as well as xfstests (which is more about
>> regression testing, but is still worth looking at).
>>
>> Most of the big known issues with RAID6 in BTRFS at the moment
>> involve device failures and array recovery, but most of them aren't
>> well characterized and nobody's really sure why they're happening,
>> so if you want to look for something specific, figuring out those
>> issues would be a great place to start (even if they aren't rare
>> bugs).
>
> Yeah it seems pretty reliable to do normal things with raid56 arrays.
> The problem is when they're degraded, weird stuff seems to happen
> some of the time. So it might be valid to have several raid56's that
> are intentionally running in degraded mode with some tests that will
> tolerate that and see when it breaks and why.
>
> There is also in the archives the bug where parity is being computed
> wrongly when a data strip is wrong (corrupt), and Btrfs sees this,
> reports the mismatch, fixes the mismatch, recomputes parity for some
> reason, and the parity is then wrong. It'd be nice to know when else
> this can happen, if it's possible parity is recomputed (and wrongly)
> on a normal read, or a balance, or if it's really restricted to scrub.
>
> Another test might be raid 1 or raid10 metadata vs raid56 for data.
> That'd probably be more performance related, but there might be some
> unexpected behaviors that crop up.
>
> --
> Chris Murphy
Re: How to stress test raid6 on 122 disk array
On Thu, Aug 4, 2016 at 2:51 PM, Martin wrote:
> Thanks for the benchmark tools and tips on where the issues might be.
>
> Is Fedora 24 rawhide preferred over ArchLinux?

I'm not sure what Arch does any differently to their kernels from
kernel.org kernels. But bugzilla.kernel.org offers a Mainline and
Fedora drop down for identifying the kernel source tree.

> If I want to compile a mainline kernel. Are there anything I need to tune?

Fedora kernels do not have these options set:

# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set

The sanity and integrity tests are both compile time and mount time
options, i.e. each has to be compiled in for the mount option to do
anything. I can't recall any thread where a developer asked a user to
set any of these options for testing, though.

> When I do the tests, how do I log the info you would like to see, if I
> find a bug?

bugzilla.kernel.org for tracking, and then referencing the URL for the
bug with a summary in an email to the list is how I usually do it. The
main thing is going to be the exact reproduce steps. It's also better,
I think, to have the complete dmesg (or journalctl -k) attached to the
bug report, because not all problems are directly related to Btrfs;
they can have contributing factors elsewhere. And various MTAs, or more
commonly MUAs, have a tendency to wrap such wide text as found in
kernel or journald messages.

And then whatever Austin says.

--
Chris Murphy
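A small collection script along these lines keeps the wide log lines
out of the mail body entirely; the mount point and output filenames are
placeholders:

```shell
# Gather the basics for a bugzilla.kernel.org report; attach the files
# rather than pasting them, so mail clients can't re-wrap the lines.
{
    uname -a
    btrfs --version
    btrfs filesystem show
    btrfs filesystem df /mnt/test
} > btrfs-report.txt
journalctl -k --no-pager > kernel-log.txt   # or: dmesg > kernel-log.txt
```

Attaching btrfs-report.txt and kernel-log.txt to the bug, and linking
the bug URL from the list mail, matches the workflow described above.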
Re: How to stress test raid6 on 122 disk array
Excellent. Thanks.

In order to automate it, would it be OK if I dd some zeroes directly to
the devices to corrupt them, or do I need to physically take the disks
out while running?

The smallest disk of the 122 is 500GB. Is it possible to have btrfs see
each disk as only e.g. 10GB? That way I can corrupt and resilver more
disks over a month.

On 4 August 2016 at 23:12, Chris Murphy wrote:
> On Thu, Aug 4, 2016 at 2:51 PM, Martin wrote:
>> Thanks for the benchmark tools and tips on where the issues might be.
>>
>> Is Fedora 24 rawhide preferred over ArchLinux?
>
> I'm not sure what Arch does any differently to their kernels from
> kernel.org kernels. But bugzilla.kernel.org offers a Mainline and
> Fedora drop down for identifying the kernel source tree.
>
>> If I want to compile a mainline kernel. Are there anything I need to tune?
>
> Fedora kernels do not have these options set.
>
> # CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
> # CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
> # CONFIG_BTRFS_DEBUG is not set
> # CONFIG_BTRFS_ASSERT is not set
>
> The sanity and integrity tests are both compile time and mount time
> options, i.e. it has to be compiled enabled for the mount option to do
> anything. I can't recall any thread where a developer asked a user to
> set any of these options for testing though.
>
>> When I do the tests, how do I log the info you would like to see, if I
>> find a bug?
>
> bugzilla.kernel.org for tracking, and then reference the URL for the
> bug with a summary in an email to list is how I usually do it. The
> main thing is going to be the exact reproduce steps. It's also better,
> I think, to have complete dmesg (or journalctl -k) attached to the bug
> report because not all problems are directly related to Btrfs, they
> can have contributing factors elsewhere. And various MTAs, or more
> commonly MUAs, have a tendency to wrap such wide text as found in
> kernel or journald messages.
>
> And then whatever Austin says.
>
> --
> Chris Murphy
Re: BTRFS: Transaction aborted (error -28)
B.H. > On Fri, Jul 29, 2016 at 8:23 PM, Duncan <1i5t5.dun...@cox.net> wrote: >> So I'd recommend upgrading to the latest kernel 4.4 if you want to stay >> with the stable series, or 4.6 or 4.7 if you want current, and then (less >> important) upgrading the btrfs userspace as well. It's possible the >> newer kernel will handle the combined rsync and send stresses better, and >> if not, you're on a better base to provide bug reports, etc. > > OK, upgraded to 4.4 (Ubuntu 16.04 stock kernel) and the fresh > btrfs-progs 4.7. I'm assuming the error was due to some kind of bug or > race condition and the FS is clean. Let's see how it behaves. Thanks! Hello, I'm still getting ENOSPC errors. The latest time, the log looked like this: Aug 4 21:55:06 yemot-4u kernel: [304090.288927] [ cut here ] Aug 4 21:55:06 yemot-4u kernel: [304090.288961] WARNING: CPU: 1 PID: 4531 at /build/linux-dcxD3m/linux-4.4.0/fs/btrfs/extent-tree.c:2927 btrfs_run_delayed_refs+0x26b/0x2a0 [btrfs]() Aug 4 21:55:06 yemot-4u kernel: [304090.288965] BTRFS: error (device md1) in btrfs_run_delayed_refs:2927: errno=-28 No space left Aug 4 21:55:06 yemot-4u kernel: [304090.288968] BTRFS info (device md1): forced readonly Aug 4 21:55:06 yemot-4u kernel: [304090.288972] BTRFS: error (device md1) in btrfs_run_delayed_refs:2927: errno=-28 No space left Aug 4 21:55:06 yemot-4u kernel: [304090.289129] BTRFS: Transaction aborted (error -28) Aug 4 21:55:06 yemot-4u kernel: [304090.289131] Modules linked in: binfmt_misc ipmi_ssif btrfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd input_leds sb_edac serio_raw joydev edac_core lpc_ich snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm mei_me snd_timer mei snd soundcore shpchp ipmi_si 8250_fintek ipmi_msghandler mac_hid nfsd auth_rpcgss nfs_acl lockd grace sunrpc lp parport autofs4 raid0 
multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid10 raid1 ses enclosure igb dca ast ttm drm_kms_helper syscopyarea hid_generic sysfillrect firewire_ohci sysimgblt fb_sys_fops ahci usbhid firewire_core ptp psmouse libahci isci hid drm crc_itu_t libsas pps_core i2c_algo_bit aacraid scsi_transport_sas wmi fjes Aug 4 21:55:06 yemot-4u kernel: [304090.289201] CPU: 1 PID: 4531 Comm: kworker/u16:28 Not tainted 4.4.0-31-generic #50-Ubuntu Aug 4 21:55:06 yemot-4u kernel: [304090.289203] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EPC602D8A, BIOS P1.20 04/16/2014 Aug 4 21:55:06 yemot-4u kernel: [304090.289226] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] Aug 4 21:55:06 yemot-4u kernel: [304090.289229] 0286 b56c494e 880744f4bc98 813f1143 Aug 4 21:55:06 yemot-4u kernel: [304090.289232] 880744f4bce0 c06d8468 880744f4bcd0 81081102 Aug 4 21:55:06 yemot-4u kernel: [304090.289234] 8807fbff3000 880859b13000 880729f30b90 Aug 4 21:55:06 yemot-4u kernel: [304090.289237] Call Trace: Aug 4 21:55:06 yemot-4u kernel: [304090.289244] [] dump_stack+0x63/0x90 Aug 4 21:55:06 yemot-4u kernel: [304090.289249] [] warn_slowpath_common+0x82/0xc0 Aug 4 21:55:06 yemot-4u kernel: [304090.289252] [] warn_slowpath_fmt+0x5c/0x80 Aug 4 21:55:06 yemot-4u kernel: [304090.289268] [] btrfs_run_delayed_refs+0x26b/0x2a0 [btrfs] Aug 4 21:55:06 yemot-4u kernel: [304090.289284] [] delayed_ref_async_start+0x37/0x90 [btrfs] Aug 4 21:55:06 yemot-4u kernel: [304090.289303] [] btrfs_scrubparity_helper+0xca/0x2f0 [btrfs] Aug 4 21:55:06 yemot-4u kernel: [304090.289307] [] ? tty_ldisc_deref+0x16/0x20 Aug 4 21:55:06 yemot-4u kernel: [304090.289326] [] btrfs_extent_refs_helper+0xe/0x10 [btrfs] Aug 4 21:55:06 yemot-4u kernel: [304090.289330] [] process_one_work+0x165/0x480 Aug 4 21:55:06 yemot-4u kernel: [304090.289333] [] worker_thread+0x4b/0x4c0 Aug 4 21:55:06 yemot-4u kernel: [304090.289336] [] ? 
process_one_work+0x480/0x480 Aug 4 21:55:06 yemot-4u kernel: [304090.289339] [] kthread+0xd8/0xf0 Aug 4 21:55:06 yemot-4u kernel: [304090.289341] [] ? kthread_create_on_node+0x1e0/0x1e0 Aug 4 21:55:06 yemot-4u kernel: [304090.289345] [] ret_from_fork+0x3f/0x70 Aug 4 21:55:06 yemot-4u kernel: [304090.289348] [] ? kthread_create_on_node+0x1e0/0x1e0 Aug 4 21:55:06 yemot-4u kernel: [304090.289350] ---[ end trace 90c37e7522254f86 ]--- Aug 4 21:55:06 yemot-4u kernel: [304090.289353] BTRFS: error (device md1) in btrfs_run_delayed_refs:2927: errno=-28 No space left Aug 4 21:55:06 yemot-4u kernel: [304090.328312] BTRFS: error (device md1) in __btrfs_free_extent:6552: errno=-28 No space left Aug 4 21:55:06 yemot-4u kernel: [304090.328344] BTRFS: error (device md1) in btrfs_run_delayed_refs:2927: errno=-28 No space left root@yemot-4u:~# uname -a Linux yemot-4u 4.4.0-31-ge
possible bug - wrong path in 'btrfs subvolume show' when snapshot is in path below subvolume.
'btrfs subvolume show' gives no path to the btrfs system root (volid=5) when the snapshot is in a folder below the subvolume. Steps to reproduce: 1. btrfs subvolume create xyz 2. btrfs subvolume snapshot xyz xyz/xyz 3. btrfs subvolume snapshot /xyz 4. btrfs subvolume show xyz output . Snapshot(s) xyz xyz . Picture from my console reproducing this. Watch out for my personal fs layout: my mountpoint for volid=5 is, as seen in the findmnt command at the top of the photo, /mnt/btrfs/sdc16-svid-5 https://s31.postimg.org/9f0d7xb7f/is_this_a_bug.png If it adds anything, the same thing happens when the root volume is mounted by path (for the moment it is mounted by volid). /Peter Holm
Re: possible bug - wrong path in 'btrfs subvolume show' when snapshot is in path below subvolume.
Writing error: replace "gives no path to" with "the same path as". /Peter Holm 2016-08-05 1:32 GMT+02:00, Peter Holm : > 'btrfs subvolume show' gives no path to the btrfs system root (volid=5) > when the snapshot is in a folder below the subvolume. > > Steps to reproduce: > 1. btrfs subvolume create xyz > 2. btrfs subvolume snapshot xyz xyz/xyz > 3. btrfs subvolume snapshot /xyz > 4. btrfs subvolume show xyz > output > . > Snapshot(s) > xyz > xyz > . > Picture from my console reproducing this. Watch out for my personal > fs layout: > my mountpoint for volid=5 is, as seen in the findmnt command at the > top of the photo, /mnt/btrfs/sdc16-svid-5 > https://s31.postimg.org/9f0d7xb7f/is_this_a_bug.png > > If it adds anything, the same thing happens when the root volume is > mounted by path (for the moment it is mounted by volid). > /Peter Holm >
Re: [PATCH] exportfs: be careful to only return expected errors.
On Thu, Aug 04 2016, Christoph Hellwig wrote: > On Thu, Aug 04, 2016 at 10:19:06AM +1000, NeilBrown wrote: >> >> >> When nfsd calls fh_to_dentry, it expects ESTALE or ENOMEM as errors. >> In particular it can be tempting to return ENOENT, but this is not >> handled well by nfsd. >> >> Rather than requiring strict adherence to error codes by filesystems, >> treat all unexpected error codes the same as ESTALE. This is safest. >> >> Signed-off-by: NeilBrown >> --- >> >> I didn't add a dprintk for unexpected error messages, partly >> because dprintk isn't usable in exportfs. I could have used pr_debug() >> but I really didn't see much value. >> >> This has been tested together with the btrfs change, and it restores >> correct functionality. > > I don't really like all this magic which is partially historic. I think > we should instead allow the fs to return any error from the export > operations, and forbid returning NULL entirely. Then the actual caller > (nfsd) can sort out which errors it wants to send over the wire. I'm certainly open to that possibility. But is the "actual caller": nfsd_set_fh_dentry(), or fh_verify() or the various callers of fh_verify() which might have different rules about which error codes are acceptable? I could probably make an argument for having fh_verify() be careful about error codes, but as exportfs_decode_fh() is a more public interface, I think it is more important that it have well-defined error options. Are there *any* errors that could sensibly be returned from exportfs_decode_fh() other than -ESTALE (there is no such file), or -ENOMEM (there probably is a file, but I cannot allocate a dentry for it) or -EACCES (there is such a file, but it isn't "acceptable") ??? If there aren't, why should we let them through? NeilBrown
[PATCH v3] xfs: test attr_list_by_handle cursor iteration
Apparently the XFS attr_list_by_handle ioctl has never actually copied the cursor contents back to user space, which means that iteration has never worked. Add a test case for this and see "xfs: in _attrlist_by_handle, copy the cursor back to userspace". v2: Use BULKSTAT_SINGLE for less confusion, fix build errors on RHEL6. v3: Use path_to_handle instead of bulkstat. Signed-off-by: Darrick J. Wong --- .gitignore|1 src/Makefile |3 + src/attr-list-by-handle-cursor-test.c | 118 + tests/xfs/700 | 64 ++ tests/xfs/700.out |5 + tests/xfs/group |1 6 files changed, 191 insertions(+), 1 deletion(-) create mode 100644 src/attr-list-by-handle-cursor-test.c create mode 100755 tests/xfs/700 create mode 100644 tests/xfs/700.out diff --git a/.gitignore b/.gitignore index 28bd180..e184a6f 100644 --- a/.gitignore +++ b/.gitignore @@ -38,6 +38,7 @@ /src/alloc /src/append_reader /src/append_writer +/src/attr-list-by-handle-cursor-test /src/bstat /src/bulkstat_unlink_test /src/bulkstat_unlink_test_modified diff --git a/src/Makefile b/src/Makefile index 1bf318b..ae06d50 100644 --- a/src/Makefile +++ b/src/Makefile @@ -20,7 +20,8 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \ bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \ stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \ seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \ - renameat2 t_getcwd e4compact test-nextquota punch-alternating + renameat2 t_getcwd e4compact test-nextquota punch-alternating \ + attr-list-by-handle-cursor-test SUBDIRS = diff --git a/src/attr-list-by-handle-cursor-test.c b/src/attr-list-by-handle-cursor-test.c new file mode 100644 index 000..4269d1e --- /dev/null +++ b/src/attr-list-by-handle-cursor-test.c @@ -0,0 +1,118 @@ +/* + * Copyright (C) 2016 Oracle. All Rights Reserved. + * + * Author: Darrick J. 
Wong + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define ATTRBUFSZ 1024 +#define BSTATBUF_NR32 + +/* Read all the extended attributes of a file handle. */ +void +read_handle_xattrs( + struct xfs_handle *handle) +{ + struct attrlist_cursor cur; + charattrbuf[ATTRBUFSZ]; + char*firstname = NULL; + struct attrlist *attrlist = (struct attrlist *)attrbuf; + struct attrlist_ent *ent; + int i; + int flags = 0; + int error; + + memset(&cur, 0, sizeof(cur)); + while ((error = attr_list_by_handle(handle, sizeof(*handle), + attrbuf, ATTRBUFSZ, flags, + &cur)) == 0) { + for (i = 0; i < attrlist->al_count; i++) { + ent = ATTR_ENTRY(attrlist, i); + + if (i != 0) + continue; + + if (firstname == NULL) { + firstname = malloc(ent->a_valuelen); + memcpy(firstname, ent->a_name, ent->a_valuelen); + } else { + if (memcmp(firstname, ent->a_name, + ent->a_valuelen) == 0) + fprintf(stderr, + "Saw duplicate xattr \"%s\", buggy XFS?\n", + ent->a_name); + else + fprintf(stderr, + "Test passes.\n"); + goto out; + } + } + + if (!attrlist->al_more) + break; + } + +out: + if (firstname) + free(firstname); + if (error) + p
[PATCH v2 0/3] Qgroup fix for dirty hack routines
This patchset introduces 2 fixes for data extent owner hacks. One can be triggered by balance, the other can be triggered by log replay after power loss. The root causes are similar: EXTENT_DATA owner is changed by dirty hacks, from swapping tree blocks containing EXTENT_DATA to manually updating extent backrefs without using inc/dec_extent_ref. The first patch introduces the needed functions, then the 2 fixes. The reproducers are merged into xfstests, btrfs/123 and btrfs/119. The 3rd patch stays untouched while the 2nd patch gets updated, thanks to the report from Goldwyn. Changelog: v2: Update the 2nd patch to handle cases where the whole subtree, not only level 2 nodes, gets updated. Qu Wenruo (3): btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() btrfs: relocation: Fix leaking qgroups numbers on data extents btrfs: qgroup: Fix qgroup incorrectness caused by log replay fs/btrfs/delayed-ref.c | 5 +-- fs/btrfs/extent-tree.c | 36 +++- fs/btrfs/qgroup.c | 39 ++--- fs/btrfs/qgroup.h | 44 +-- fs/btrfs/relocation.c | 114 ++--- fs/btrfs/tree-log.c| 16 +++ 6 files changed, 205 insertions(+), 49 deletions(-) -- 2.9.2
[PATCH v2 1/3] btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent()
Refactor the btrfs_qgroup_insert_dirty_extent() function into two functions: 1. _btrfs_qgroup_insert_dirty_extent() Almost the same as the original code. For delayed_ref usage, which has the delayed refs locked. Change the return value type to int, since the caller never needs the pointer, only to know whether they need to free the allocated memory. 2. btrfs_qgroup_record_dirty_extent() The more encapsulated version. Will do the delayed_refs lock, memory allocation, quota-enabled check and other misc things. The original design was to keep exported functions to a minimum, but since more btrfs hacks are exposed, like path replacement in balance, which needs us to record dirty extents manually, we have to add such functions. Also, add comments for both functions, to inform developers how to keep qgroup correct when doing hacks. Cc: Mark Fasheh Signed-off-by: Qu Wenruo --- fs/btrfs/delayed-ref.c | 5 + fs/btrfs/extent-tree.c | 36 +--- fs/btrfs/qgroup.c | 39 ++- fs/btrfs/qgroup.h | 44 +--- 4 files changed, 81 insertions(+), 43 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 430b368..5eed597 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -541,7 +541,6 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info, struct btrfs_delayed_ref_head *existing; struct btrfs_delayed_ref_head *head_ref = NULL; struct btrfs_delayed_ref_root *delayed_refs; - struct btrfs_qgroup_extent_record *qexisting; int count_mod = 1; int must_insert_reserved = 0; @@ -606,9 +605,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info, qrecord->num_bytes = num_bytes; qrecord->old_roots = NULL; - qexisting = btrfs_qgroup_insert_dirty_extent(delayed_refs, -qrecord); - if (qexisting) + if(_btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord)) kfree(qrecord); } diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 9fcb8c9..47c85ff 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8519,34 +8519,6 @@ reada: wc->reada_slot = slot; } -/* - * These 
may not be seen by the usual inc/dec ref code so we have to - * add them here. - */ -static int record_one_subtree_extent(struct btrfs_trans_handle *trans, -struct btrfs_root *root, u64 bytenr, -u64 num_bytes) -{ - struct btrfs_qgroup_extent_record *qrecord; - struct btrfs_delayed_ref_root *delayed_refs; - - qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS); - if (!qrecord) - return -ENOMEM; - - qrecord->bytenr = bytenr; - qrecord->num_bytes = num_bytes; - qrecord->old_roots = NULL; - - delayed_refs = &trans->transaction->delayed_refs; - spin_lock(&delayed_refs->lock); - if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord)) - kfree(qrecord); - spin_unlock(&delayed_refs->lock); - - return 0; -} - static int account_leaf_items(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *eb) @@ -8580,7 +8552,8 @@ static int account_leaf_items(struct btrfs_trans_handle *trans, num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi); - ret = record_one_subtree_extent(trans, root, bytenr, num_bytes); + ret = btrfs_qgroup_record_dirty_extent(trans, root->fs_info, + bytenr, num_bytes, GFP_NOFS); if (ret) return ret; } @@ -8729,8 +8702,9 @@ walk_down: btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK); path->locks[level] = BTRFS_READ_LOCK_BLOCKING; - ret = record_one_subtree_extent(trans, root, child_bytenr, - root->nodesize); + ret = btrfs_qgroup_record_dirty_extent(trans, + root->fs_info, child_bytenr, + root->nodesize, GFP_NOFS); if (ret) goto out; } diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 9d4c05b..76d4f67 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1453,9 +1453,9 @@ int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans, return ret; } -struct btrfs_qgroup_extent_record -*btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs, - struct btrfs_qgroup_extent_record *record) +int _btrfs_qgroup_insert
[PATCH v2 2/3] btrfs: relocation: Fix leaking qgroups numbers on data extents
When balancing data extents, qgroup will leak all its numbers for the relocated data extents. The relocation is done in the following steps for data extents: 1) Create data reloc tree and inode 2) Copy all data extents to data reloc tree And commit transaction 3) Create tree reloc tree (special snapshot) for any related subvolumes 4) Replace file extents in tree reloc tree with new extents in data reloc tree And commit transaction 5) Merge tree reloc tree with original fs, by swapping tree blocks For 1)~4), since tree reloc tree and data reloc tree don't count toward qgroup, everything is OK. But for 5), the swapping of tree blocks will only inform qgroup to track metadata extents. If those metadata extents contain file extents, the qgroup numbers for the file extents get lost, leading to corrupted qgroup accounting. The fix is, before the transaction commit of step 5), manually inform qgroup to track all file extents in the data reloc tree. Since at transaction commit time the tree swapping is done, qgroup will account these data extents correctly. Cc: Mark Fasheh Reported-by: Mark Fasheh Reported-by: Filipe Manana Signed-off-by: Qu Wenruo --- changelog: v2: Iterate all file extents in data reloc tree, instead of iterating leaves of a swapped level 1 tree block. This fixes the case where a level 2 or higher tree block is merged with the original subvolume. --- fs/btrfs/relocation.c | 114 +++--- 1 file changed, 108 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index fc067b0..def7c9c 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -31,6 +31,7 @@ #include "async-thread.h" #include "free-space-cache.h" #include "inode-map.h" +#include "qgroup.h" /* * backref_node, mapping_node and tree_block start with this @@ -3912,6 +3913,95 @@ int prepare_to_relocate(struct reloc_control *rc) return 0; } +/* + * Qgroup fixer for data chunk relocation. 
+ * The data relocation is done in the following steps + * 1) Copy data extents into data reloc tree + * 2) Create tree reloc tree(special snapshot) for related subvolumes + * 3) Modify file extents in tree reloc tree + * 4) Merge tree reloc tree with original fs tree, by swapping tree blocks + * + * The problem is, data and tree reloc tree are not accounted to qgroup, + * and 4) will only info qgroup to track tree blocks change, not file extents + * in the tree blocks. + * + * The good news is, related data extents are all in data reloc tree, so we + * only need to info qgroup to track all file extents in data reloc tree + * before commit trans. + */ +static int qgroup_fix_relocated_data_extents(struct btrfs_trans_handle *trans, +struct reloc_control *rc) +{ + struct btrfs_fs_info *fs_info = rc->extent_root->fs_info; + struct inode *inode = rc->data_inode; + struct btrfs_root *data_reloc_root = BTRFS_I(inode)->root; + struct btrfs_path *path; + struct btrfs_key key; + int ret = 0; + + if (!fs_info->quota_enabled) + return 0; + + /* +* Only for stage where we update data pointers the qgroup fix is +* valid. +* For MOVING_DATA stage, we will miss the timing of swapping tree +* blocks, and won't fix it. 
+*/ + if (!(rc->stage == UPDATE_DATA_PTRS && rc->extents_found)) + return 0; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + key.objectid = btrfs_ino(inode); + key.type = BTRFS_EXTENT_DATA_KEY; + key.offset = 0; + + ret = btrfs_search_slot(NULL, data_reloc_root, &key, path, 0, 0); + if (ret < 0) + goto out; + + lock_extent(&BTRFS_I(inode)->io_tree, 0, (u64)-1); + while (1) { + struct btrfs_file_extent_item *fi; + + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); + if (key.objectid > btrfs_ino(inode)) + break; + if (key.type != BTRFS_EXTENT_DATA_KEY) + goto next; + fi = btrfs_item_ptr(path->nodes[0], path->slots[0], + struct btrfs_file_extent_item); + if (btrfs_file_extent_type(path->nodes[0], fi) != + BTRFS_FILE_EXTENT_REG) + goto next; + /* + pr_info("disk bytenr: %llu, num_bytes: %llu\n", + btrfs_file_extent_disk_bytenr(path->nodes[0], fi), + btrfs_file_extent_disk_num_bytes(path->nodes[0], fi)); + */ + ret = btrfs_qgroup_record_dirty_extent(trans, fs_info, + btrfs_file_extent_disk_bytenr(path->nodes[0], fi), + btrfs_file_extent_disk_num_bytes(path->nodes[0], fi), + GFP_NOFS); + if (ret < 0) +
[PATCH v2 3/3] btrfs: qgroup: Fix qgroup incorrectness caused by log replay
When doing log replay at mount time (after power loss), qgroup will leak the numbers of the replayed data extents. The cause is almost the same as with balance. So fix it by manually informing qgroup for owner-changed extents. The bug can be detected by the btrfs/119 test case. Cc: Mark Fasheh Signed-off-by: Qu Wenruo --- fs/btrfs/tree-log.c | 16 1 file changed, 16 insertions(+) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c05f69a..80f8345 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -27,6 +27,7 @@ #include "backref.h" #include "hash.h" #include "compression.h" +#include "qgroup.h" /* magic values for the inode_only field in btrfs_log_inode: * @@ -680,6 +681,21 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans, ins.type = BTRFS_EXTENT_ITEM_KEY; offset = key->offset - btrfs_file_extent_offset(eb, item); + /* +* Manually record dirty extent, as here we did a shallow +* file extent item copy and skip normal backref update, +* but modify extent tree all by ourselves. +* So need to manually record dirty extent for qgroup, +* as the owner of the file extent changed from log tree +* (doesn't affect qgroup) to fs/file tree(affects qgroup) +*/ + ret = btrfs_qgroup_record_dirty_extent(trans, root->fs_info, + btrfs_file_extent_disk_bytenr(eb, item), + btrfs_file_extent_disk_num_bytes(eb, item), + GFP_NOFS); + if (ret < 0) + goto out; + if (ins.objectid > 0) { u64 csum_start; u64 csum_end; -- 2.9.2
Re: [4.8] btrfs heats my room with lock contention
On Thu, Aug 04, 2016 at 10:28:44AM -0400, Chris Mason wrote: > > > On 08/04/2016 02:41 AM, Dave Chinner wrote: > > > >Simple test. 8GB pmem device on a 16p machine: > > > ># mkfs.btrfs /dev/pmem1 > ># mount /dev/pmem1 /mnt/scratch > ># dbench -t 60 -D /mnt/scratch 16 > > > >And heat your room with the warm air rising from your CPUs. Top > >half of the btrfs profile looks like: . > >Performance vs CPU usage is:
> >nprocs  throughput  cpu usage
> >1       440MB/s     50%
> >2       770MB/s     100%
> >4       880MB/s     250%
> >8       690MB/s     450%
> >16      280MB/s     950%
> >In comparison, at 8-16 threads ext4 is running at ~2600MB/s and > >XFS is running at ~3800MB/s. Even if I throw 300-400 processes at > >ext4 and XFS, they only drop to ~1500-2000MB/s as they hit internal > >limits. > > > Yes, with dbench btrfs does much much better if you make a subvol > per dbench dir. The difference is pretty dramatic. I'm working on > it this month, but focusing more on database workloads right now. You've been giving this answer to lock contention reports for the past 6-7 years, Chris. I really don't care about getting big benchmark numbers with contrived setups - the "use multiple subvolumes" solution is simply not practical for users or their workloads. The default config should behave sanely and not contribute to global warming like this. Cheers, Dave. -- Dave Chinner da...@fromorbit.com