Re: free space is missing after dist upgrade on lzo compressed vol
Ubuntu creates a snapshot before each release upgrade:

  sudo mount /dev/sda6 /mnt -o rw,subvol=/; ls /mnt

2015-11-14 9:16 GMT+03:00 Brenton Chapin:
> Thanks for the ideas. Sadly, no snapshots, unless btrfs does that by
> default. Never heard of snapper before.
>
> Don't see how open files could be a problem, since the computer has
> been rebooted several times.
>
> I wonder... could the distribution upgrade have moved all the old
> files into a hidden trash directory, rather than deleting them? But
> du picks up hidden directories, I believe. Doesn't seem like that
> could be it either.
>
> On Fri, Nov 13, 2015 at 4:38 PM, Hugo Mills wrote:
>> On Fri, Nov 13, 2015 at 04:33:23PM -0600, Brenton Chapin wrote:
>>> I was running Lubuntu 14.04 on btrfs with lzo compression on, with
>>> the following partition scheme:
>>>
>>> sda5  232M  /boot
>>> sda6   16G  /
>>> sda7  104G  /home
>>>
>>> (sda5 is ext4)
>>>
>>> I did 2 distribution upgrades, one after the other, to 15.04, then
>>> 15.10, since the upgrade utility would not go directly to the latest
>>> version. This process did a whole lot of reading and writing to the
>>> root volume, of course. Everything seems to be working, except most of
>>> the free space I had on sda6 is gone. I was using about 4G; now df
>>> reports that the usage is 12G. At first, I thought Lubuntu had not
>>> removed old files, but I can't find anything old left behind. I began
>>> to suspect btrfs, and on checking, find that du shows only 4G used on
>>> sda6. Where'd the other 8G go?
>>
>> Do you have snapshots? Are you running snapper, for example?
>>
>> The other place that large amounts of space can go over an upgrade
>> is in orphans -- files that are deleted, but still held open by
>> processes, and which therefore can't be reclaimed until the process is
>> restarted. I've been bitten by that one before.
>>
>> Hugo.
>
>>> "btrfs fi df /" reports the following:
>>>
>>> Data, single: total=11.01GiB, used=10.58GiB
>>> System, DUP: total=8.00MiB, used=16.00KiB
>>> System, single: total=4.00MiB, used=0.00B
>>> Metadata, DUP: total=1.00GiB, used=397.80MiB
>>> Metadata, single: total=8.00MiB, used=0.00B
>>> GlobalReserve, single: total=144.00MiB, used=0.00B
>>>
>>> "btrfs filesystem show /" gives:
>>>
>>> Label: none  uuid: 4ea4ac08-ff37-4b51-b1a3-d8b21fd43ddd
>>>   Total devices 1 FS bytes used 10.97GiB
>>>   devid 1 size 15.02GiB used 13.04GiB path /dev/sda6
>>>
>>> btrfs-progs v4.0
>>>
>>> "du --max-depth=1 -h -x" on / shows:
>>>
>>>  29M  ./etc
>>>    0  ./media
>>>  16M  ./bin
>>> 354M  ./lib
>>> 4.0K  ./lib64
>>>    0  ./mnt
>>> 160K  ./root
>>>  12M  ./sbin
>>>    0  ./srv
>>> 4.0K  ./tmp
>>> 3.1G  ./usr
>>> 442M  ./var
>>>    0  ./cdrom
>>> 3.8M  ./lib32
>>> 3.9G  .
>>>
>>> And of course df:
>>>
>>> /dev/sda6   16G  12G  2.5G  83%  /
>>> /dev/sda5  232M  53M  163M  25%  /boot
>>> /dev/sda7  104G  46G   57G  45%  /home
>>>
>>> And mount:
>>>
>>> mount | grep sda
>>> /dev/sda6 on / type btrfs (rw,relatime,compress=lzo,space_cache,subvolid=257,subvol=/@)
>>> /dev/sda5 on /boot type ext4 (rw,relatime,data=ordered)
>>> /dev/sda7 on /home type btrfs (rw,relatime,compress=lzo,space_cache,subvolid=257,subvol=/@home)
>>>
>>> uname -a
>>> Linux ichor 4.2.0-18-generic #22-Ubuntu SMP Fri Nov 6 18:25:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> I can live with the situation, but recovering that space would be nice.
>>
>> --
>> Hugo Mills | Happiness is mandatory. Are you happy?
>> hugo@... carfax.org.uk | http://carfax.org.uk/
>> PGP: E2AB1DE4 | Paranoia
>
> --
> http://brentonchapin.no-ip.biz
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Have a nice day,
Timofey.
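The orphan scenario Hugo describes (files deleted but still held open, so their space can't be reclaimed) is easy to demonstrate from a shell. Below is a minimal sketch using /proc directly, so it needs no extra tools; the fd number and temp file are arbitrary choices for illustration:

```shell
# A file deleted while still open keeps consuming disk space until the
# last file descriptor referring to it is closed.
tmp=$(mktemp)
exec 9>"$tmp"          # hold the file open on fd 9
rm -f "$tmp"           # unlink it -- the inode and its space live on
readlink /proc/$$/fd/9 # shows something like "/tmp/tmp.Xb2x1A (deleted)"
exec 9>&-              # closing the fd finally releases the space
```

System-wide, `lsof +L1` lists open files whose link count has dropped below one, which is the quickest way to find space "missing" to such orphans.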
Re: [RFCv3.2 00/12] xfstests: test the nfs/cifs/btrfs/xfs reflink/dedupe ioctls
Looks good,

Acked-by: Christoph Hellwig
Using Btrfs on single drives
I'm looking to make a "production copy" of my music and video library for use in our media server. It is not my intent to create any form of RAID array, but rather to treat each drive independently where the filesystem is concerned, and then to create a single view of the drives using mhddfs. As the data will remain relatively static I may also deploy SnapRAID in conjunction with mhddfs.

I'm considering using Btrfs as the underlying filesystem on each of the individual drives, principally to take advantage of metadata redundancy. Am I correct in surmising that I can turn checksumming off, given it's of no utility where a Btrfs volume is comprised of a single device only?
Re: Using Btrfs on single drives
On 2015-11-14 11:43, audio muze wrote:
> I can turn checksumming
> off given it's of no utility where a Btrfs volume is comprised of a
> single device only?

The checksums are used to detect data corruption; in the case of a btrfs RAID, the checksums are *also* used to pick the good copy.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
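For completeness: data checksumming can indeed be switched off with the `nodatasum` mount option (metadata checksums stay on). A hypothetical fstab entry follows; the UUID and mount point are made-up placeholders, not values from this thread:

```
# /etc/fstab fragment -- illustrative only; device and mount point are made up
UUID=0000aaaa-bbbb-cccc-dddd-eeeeffff0000  /srv/media1  btrfs  noatime,nodatasum  0  0
```

That said, as the answer above points out, on a single device the checksums still *detect* corruption (they just can't repair it without a second copy), so leaving them enabled is usually the safer choice.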
[PATCH v2 2/2] Btrfs: fix the number of transaction units needed to remove a block group
From: Filipe Manana

We were using only 1 transaction unit when attempting to delete an unused block group, but in reality we need 3 + N units, where N corresponds to the number of stripes. We were accounting only for the addition of the orphan item (for the block group's free space cache inode) but we were not accounting for the fact that we need to delete one block group item from the extent tree, one free space item from the tree of tree roots and N device extent items from the device tree.

While one unit is not enough, it worked most of the time because for each single unit we are too pessimistic and assume an entire tree path, with the highest possible height (8), needs to be COWed with eventual node splits at every possible level in the tree, so there was usually enough reserved space for removing all the items and adding the orphan item. However, after adding the orphan item, writepages() can be called by the VM subsystem against the btree inode when we are under memory pressure, which causes writeback to start for the nodes we COWed before; this forces the operation to remove the free space item to COW again some (or all of) the same nodes (in the tree of tree roots). Even without writepages() being called, we could fail with ENOSPC because these items are located in multiple trees and one of them might have a higher height and require node/leaf splits at many levels, exhausting all the reserved space before removing all the items and adding the orphan item.

In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group"), we attempted to fix a BUG_ON due to ENOSPC when trying to add the orphan item by making the cleaner kthread reserve one transaction unit before attempting to remove the block group, but this was not enough.
We had a couple of user reports still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on a 4.2-rc6 kernel for example:

http://www.spinics.net/lists/linux-btrfs/msg46070.html

So fix this by reserving all the necessary units of metadata.

Reported-by: Stefan Priebe
Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
Signed-off-by: Filipe Manana
---
V2: Added missing units to account for removing the device extent items
from the device tree (done at btrfs_remove_chunk through
btrfs_free_dev_extent).

 fs/btrfs/ctree.h       |  3 ++-
 fs/btrfs/extent-tree.c | 37 ++---
 fs/btrfs/volumes.c     |  3 ++-
 3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1573be6..d88994f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3480,7 +3480,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 			   u64 type, u64 chunk_objectid, u64 chunk_offset,
 			   u64 size);
 struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
-				struct btrfs_fs_info *fs_info);
+				struct btrfs_fs_info *fs_info,
+				const u64 chunk_offset);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root, u64 group_start,
 			     struct extent_map *em);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7820093..e97d6d6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10257,14 +10257,44 @@ out:
 }
 
 struct btrfs_trans_handle *
-btrfs_start_trans_remove_block_group(struct btrfs_fs_info *fs_info)
+btrfs_start_trans_remove_block_group(struct btrfs_fs_info *fs_info,
+				     const u64 chunk_offset)
 {
+	struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
+	struct extent_map *em;
+	struct map_lookup *map;
+	unsigned int num_items;
+
+	read_lock(&em_tree->lock);
+	em = lookup_extent_mapping(em_tree, chunk_offset, 1);
+	read_unlock(&em_tree->lock);
+	ASSERT(em && em->start == chunk_offset);
+	/*
+	 * We need to reserve 3 + N units from the metadata space info in order
+	 * to remove a block group (done at btrfs_remove_chunk() and at
+	 * btrfs_remove_block_group()), which are used for:
+	 *
 	 * 1 unit for adding the free space inode's orphan (located in the tree
 	 * of tree roots).
+	 * 1 unit for deleting the block group item (located in the extent
+	 * tree).
+	 * 1 unit for deleting the free space item (located in tree of tree
+	 * roots).
+	 * N units for deleting N device extent items corresponding to each
+	 * stripe (located in the device tree).
+	 *
+	 * In order to remove a block group we also need to reserve units in the
+	 * system space info in order to update the chunk tree (update one or
+	 * more device items
[PATCH v2 0/2] Btrfs: fixes for an ENOSPC issue that left a fs unusable
From: Filipe Manana

The following pair of changes fix an issue observed in a production environment where any file operations done by a package manager failed with ENOSPC. Forcing a commit of the current transaction (through "sync") didn't help, a balance operation with the filters -dusage=0 didn't help either, and the issue persisted even after rebooting the machine. There were many data block groups that were unused, but they weren't getting deleted by the cleaner kthread because whenever it tried to start a transaction to delete a block group it got an -ENOSPC error, which it silently ignores (as it does for any other error). So these changes just make sure we fall back to using the global reserve, if -ENOSPC is encountered through the standard allocation path, to delete block groups, as we already do for inode unlink operations.

Another issue fixed is hitting a BUG_ON() when removing a block group due to an -ENOSPC failure when creating the orphan item for its free space cache inode. This second issue has been reported by a few users on the mailing list and in bugzilla (for example at http://www.spinics.net/lists/linux-btrfs/msg46070.html).

These changes are also available at:
http://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=integration-4.4

Thanks.

Changes in v2: Updated the second patch to account for the space required
to remove the device extents from the device tree (was previously
ignored).
Filipe Manana (2):
  Btrfs: use global reserve when deleting unused block group after ENOSPC
  Btrfs: fix the number of transaction units needed to remove a block group

 fs/btrfs/ctree.h       |  3 +++
 fs/btrfs/extent-tree.c | 45 +++--
 fs/btrfs/inode.c       | 24 +---
 fs/btrfs/transaction.c | 32 
 fs/btrfs/transaction.h |  4 
 fs/btrfs/volumes.c     |  3 ++-
 6 files changed, 85 insertions(+), 26 deletions(-)

-- 
2.1.3
[PATCH v2 1/2] Btrfs: use global reserve when deleting unused block group after ENOSPC
From: Filipe Manana

It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space, and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens, try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC.

Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to an ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Unmounting and mounting the filesystem again didn't help either. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem.

Signed-off-by: Filipe Manana
Signed-off-by: Jeff Mahoney
---
V2: No changes. Only the second patch in the series was updated to
account for the space required to remove device extent items.
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/extent-tree.c | 14 --
 fs/btrfs/inode.c       | 24 +---
 fs/btrfs/transaction.c | 32 
 fs/btrfs/transaction.h |  4 
 fs/btrfs/volumes.c     |  2 +-
 6 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a2e73f6..1573be6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3479,6 +3479,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root, u64 bytes_used,
 			   u64 type, u64 chunk_objectid, u64 chunk_offset,
 			   u64 size);
+struct btrfs_trans_handle *btrfs_start_trans_remove_block_group(
+				struct btrfs_fs_info *fs_info);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root, u64 group_start,
 			     struct extent_map *em);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index acf3ed1..7820093 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10256,6 +10256,17 @@ out:
 	return ret;
 }
 
+struct btrfs_trans_handle *
+btrfs_start_trans_remove_block_group(struct btrfs_fs_info *fs_info)
+{
+	/*
+	 * 1 unit for adding the free space inode's orphan (located in the tree
+	 * of tree roots).
+	 */
+	return btrfs_start_transaction_fallback_global_rsv(fs_info->extent_root,
+							   1, 1);
+}
+
 /*
  * Process the unused_bgs list and remove any that don't have any allocated
  * space inside of them.
@@ -10322,8 +10333,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 	 * Want to do this before we do anything else so we can recover
 	 * properly if we fail to join the transaction.
 	 */
-	/* 1 for btrfs_orphan_reserve_metadata() */
-	trans = btrfs_start_transaction(root, 1);
+	trans = btrfs_start_trans_remove_block_group(fs_info);
 	if (IS_ERR(trans)) {
 		btrfs_dec_block_group_ro(root, block_group);
 		ret = PTR_ERR(trans);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6e93349..f82d1f4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4046,9 +4046,7 @@ int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
  */
 static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir)
 {
-	struct btrfs_trans_handle *trans;
 	struct btrfs_root *root = BTRFS_I(dir)->root;
-	int ret;
 
 	/*
 	 * 1 for the possible orphan item
@@ -4057,27 +4055,7 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir)
 	 * 1 for the inode ref
 	 * 1 for the inode
 	 */
-	trans = btrfs_start_transaction(root, 5);
-	if (!IS_ERR(trans) || PTR_ERR(trans) != -ENOSPC)
-		return trans;
-
-	if (PTR_ERR(trans) == -ENOSPC) {
-		u64 num_bytes = btrfs_calc_trans_metadata_size(root, 5);
-
-		trans = btrfs_start_transaction(root, 0);
-		if (IS_ERR(trans))
-			return trans;
-		ret = btrfs_cond_migrate_bytes(root->fs_info,
-					       &root->fs_info->trans_block_rsv,
-					       num_bytes, 5);
-		if (ret) {
-			btrfs_end_transaction(trans, root);
-			return ERR_PTR(ret);
-		}
-		trans->block_rsv = &root->fs_info->trans_block_rsv;
-
Re: [PATCH 00/15] btrfs: Hot spare and Auto replace
On 2015-11-13 11:20, Anand Jain wrote:
>
> Thanks for comments.
>
> On 11/13/2015 03:21 AM, Goffredo Baroncelli wrote:
>> On 2015-11-09 11:56, Anand Jain wrote:
>>> This set of patches provides btrfs hot spare and auto replace support
>>> for your review and comments.
>>
>> Hi Anand,
>>
>> is there any reason to put this kind of logic in the kernel space? [...]
>
>> Another feature of this daemon could be to add a disk when the disk
>> space is too low,
>
> That will be at the cost of a spare device, which the user should
> review the trade-offs of and do manually? I am not sure.

If you have more than one spare, you can do both automatically: a new disk is added when the space is low, and a disk is replaced in case of failure.

If you have only one spare, you may decide to reserve it only for replacing a failed disk. But this should be a configurable option: low space leads to an unavailable filesystem, while a failed disk means a higher likelihood of losing the whole filesystem. I am not sure which should be considered the more critical.

>> or to start a balance when there is no space to
>> allocate further chunks.
>
> Yep. As you noticed, the thread created here is casualty_kthread()
> (instead of replace_kthread()); over the long run I wish to provide
> that feature in this thread, as it is a mutually exclusive operation
> with replace.

A disk replacement should be a higher-priority operation. In case of a disk failure during a balance/defrag, these operations should be stopped to allow a replace. If you want to start a replace, you should stop other (long-running) operations like balance and defrag.

>
>> Of course all this logic could be implemented in kernel space,
>> but I think that we should avoid that when possible.
>
> Easy to handle the mutually exclusive parts within the kernel,
> and it's better to have the important logic in one place. Two heads
> operating on an org, looking at and feeling different things, will
> lead to wrong decisions.
What is the other logic you are referring to?

>
>> Moreover, in user space the logging is easier
>
> Thanks, Anand

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: More memory more jitters?
Duncan posted on Sat, 14 Nov 2015 16:37:14 + as excerpted:

> Hugo Mills posted on Sat, 14 Nov 2015 14:31:12 + as excerpted:
>
>>> I have read the Gotcha[1] page:
>>>
>>>    Files with a lot of random writes can become heavily fragmented
>>> (1+ extents) causing thrashing on HDDs and excessive multi-second
>>> spikes of CPU load on systems with an SSD or **large amount of RAM**.
>>>
>>> Why could a large amount of memory worsen the problem?
>>
>> Because the kernel will hang on to lots of changes in RAM for
>> longer. With less memory, there's more pressure to write out dirty
>> pages to disk, so the changes get written out in smaller pieces more
>> often. With more memory, the changes being written out get "lumpier".
>>
>>> If **too much** memory is a problem, is it possible to limit the
>>> memory btrfs uses?
>>
>> There's some VM knobs you can twiddle, I believe, but I haven't
>> really played with them myself -- I'm sure there's more knowledgeable
>> people around here who can suggest suitable things to play with.
>
> Yes. Don't have time to explain now, but I will later, if nobody beats
> me to it.

And now it's later... =:^)

The official kernel documentation for this is in $KERNELDIR/Documentation/filesystems/proc.txt, in CHAPTER 2: MODIFYING SYSTEM PARAMETERS (starting at line 1378 in the file as it exists in kernel 4.3), tho that's little more than an intro. As it states, $KERNELDIR/Documentation/sysctl/* contains rather more information. Of course there's also various resources on the net covering this material, and if google finds this post I suppose it might become one of them. =:^]

So in that Documentation/sysctl dir, the README file contains an intro, but what we're primarily interested in is covered in vm.txt.
The files discussed there are found in /proc/sys/vm, tho your distro almost certainly has an init service, sysctl (the systemd-sysctl.service on systemd based systems, configured with *.conf files in /usr/lib/sysctl.d/ and /etc/sysctl.d/), that pokes non-kernel-default distro-configured and admin-configured values into the appropriate /proc/sys/vm/* files at boot. Also check /etc/sysctl.conf, which at least here is symlinked from /etc/sysctl.d/99-sysctl.conf so systemd-sysctl loads it. That's actually the file with my settings, here.

So (as root) you can poke the files directly for experimentation, and when you've settled on values that work for you, you can put them in /etc/sysctl.d/*.conf or in /etc/sysctl.conf, or whatever your distro uses instead. But keep in mind that (for systemd based systems anyway) the settings in /usr/lib/sysctl.d/*.conf will be loaded first and thus will apply if not overridden by your own config, so you might want to check there too, to see what's being applied there, before going too wild on your overrides.

Of course the sysctl mechanism loads various other settings as well, network, core-file, magic-sysrq, others, but what we're focused on here are the vm files and settings. In particular, our files of interest are the /proc/sys/vm/dirty_* files and corresponding vm.dirty_* settings, tho while we're here, I'll mention that /proc/sys/vm/swappiness and the corresponding vm.swappiness setting is also quite commonly changed by users.

Basically, these dirty_* files control the amount of cached writes that can accumulate before the kernel will start writing them to storage at two different priority levels, the maximum time they are allowed to age before they're written back regardless, and the balance between these two writeback priorities.
Now, one thing that's important to keep in mind here is that the kernel defaults were originally set up back when 128 MiB RAM was a *LOT* of memory, and they aren't necessarily appropriate for systems with the GiB or often double-digit GiB RAM that most non-embedded systems come with today, particularly where people are still using legacy spinning rust -- SSDs are enough faster that the problem doesn't show up to the same degree, tho admins may still want to tweak the defaults in some cases.

Another thing to keep in mind for mobile systems in particular is that writing data out will of course spin up the drives, so you might want rather larger caches and longer timeouts on laptops and the like, and/or if you spin down your drives. But balance that against the knowledge that data still in the write cache will be lost if the system crashes before it hits storage, so don't go /too/ overboard on extending your timeouts. Timeouts of an hour could well save quite a bit of power, but they also risk losing an hour's worth of writes!

OK, from that rather high level view, let's jump to the lower level actual settings, tho not yet the actual values. I'll group the settings in my discussion, but you can read the description for each individual setting in the vm.txt file mentioned above, if you like. Note that there's a two-dimension parallel among the four
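The dirty_* knobs described above can be set persistently through the sysctl mechanism already covered. A hypothetical /etc/sysctl.d drop-in; the file name and values are illustrative assumptions, not recommendations from this post:

```
# /etc/sysctl.d/90-writeback.conf -- illustrative values, tune for your RAM/storage
# Start background writeback once 5% of RAM is dirty:
vm.dirty_background_ratio = 5
# Throttle writers (synchronous writeback) once 10% of RAM is dirty:
vm.dirty_ratio = 10
# Write back any dirty data older than 15 seconds (units are centiseconds):
vm.dirty_expire_centisecs = 1500
# Wake the flusher threads every 5 seconds:
vm.dirty_writeback_centisecs = 500
```

For experimentation, the same values can be echoed directly into the corresponding /proc/sys/vm/* files as root, then made permanent in the drop-in once you're happy with them.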
Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk
On Sun, 2015-11-15 at 09:29 +0800, Qu Wenruo wrote:
>>> If type is wrong, all the extents inside the chunk should be
>>> reported as mismatch type with chunk.
>> Isn't that the case? At least there are so many reported extents...
> If you posted all the output

Sure, I posted everything that the dump gave :)

> , that's just a little more than nothing.
> Just tens of errors reported, compared to millions of extents.
> And in your case, if a chunk is really bad, it will report about 65K
> errors.

I see.

> I think it's a btrfsck issue; at least from the dump info, your
> extent tree is OK.
> And if there is no other error reported from btrfsck, your filesystem
> should be OK.

Nope... there were no further errors.

>> In any case, I'll keep the fs in question for a while, so that I
>> can do verifications in case you have patches.
> Nice. Just tell me if you have something.

btw: I saw these:

Nov 15 02:01:42 heisenberg kernel: INFO: task btrfs-transacti:28379 blocked for more than 120 seconds.
Nov 15 02:01:42 heisenberg kernel:       Not tainted 4.2.0-1-amd64 #1
Nov 15 02:01:42 heisenberg kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 02:01:42 heisenberg kernel: btrfs-transacti D 8109a1b0 0 28379 2 0x
Nov 15 02:01:42 heisenberg kernel: 88016e3e6500 0046 007a 88040be88f00
Nov 15 02:01:42 heisenberg kernel: 2659 88013807 88041e355840 7fff
Nov 15 02:01:42 heisenberg kernel: 815508e0 88013806fbb8 0007 815500ff
Nov 15 02:01:42 heisenberg kernel: Call Trace:
Nov 15 02:01:42 heisenberg kernel: [] ? bit_wait_timeout+0x70/0x70
Nov 15 02:01:42 heisenberg kernel: [] ? schedule+0x2f/0x70
Nov 15 02:01:42 heisenberg kernel: [] ? schedule_timeout+0x1f7/0x290
Nov 15 02:01:42 heisenberg kernel: [] ? extent_write_cache_pages.isra.28.constprop.43+0x222/0x330 [btrfs]
Nov 15 02:01:42 heisenberg kernel: [] ? read_tsc+0x5/0x10
Nov 15 02:01:42 heisenberg kernel: [] ? bit_wait_timeout+0x70/0x70
Nov 15 02:01:42 heisenberg kernel: [] ? io_schedule_timeout+0x9d/0x110
Nov 15 02:01:42 heisenberg kernel: [] ? bit_wait_io+0x35/0x60
Nov 15 02:01:42 heisenberg kernel: [] ? __wait_on_bit+0x5a/0x90
Nov 15 02:01:42 heisenberg kernel: [] ? find_get_pages_tag+0x116/0x150
Nov 15 02:01:42 heisenberg kernel: [] ? wait_on_page_bit+0xb6/0xc0
Nov 15 02:01:42 heisenberg kernel: [] ? autoremove_wake_function+0x40/0x40
Nov 15 02:01:42 heisenberg kernel: [] ? filemap_fdatawait_range+0xc7/0x140
Nov 15 02:01:42 heisenberg kernel: [] ? btrfs_wait_ordered_range+0x73/0x110 [btrfs]
Nov 15 02:01:42 heisenberg kernel: [] ? btrfs_wait_cache_io+0x5d/0x1e0 [btrfs]
Nov 15 02:01:42 heisenberg kernel: [] ? btrfs_start_dirty_block_groups+0x17c/0x3f0 [btrfs]
Nov 15 02:01:42 heisenberg kernel: [] ? btrfs_commit_transaction+0x1b4/0xa90 [btrfs]
Nov 15 02:01:42 heisenberg kernel: [] ? start_transaction+0x90/0x580 [btrfs]
Nov 15 02:01:42 heisenberg kernel: [] ? transaction_kthread+0x224/0x240 [btrfs]
Nov 15 02:01:42 heisenberg kernel: [] ? btrfs_cleanup_transaction+0x510/0x510 [btrfs]
Nov 15 02:01:42 heisenberg kernel: [] ? kthread+0xc1/0xe0
Nov 15 02:01:42 heisenberg kernel: [] ? kthread_create_on_node+0x170/0x170
Nov 15 02:01:42 heisenberg kernel: [] ? ret_from_fork+0x3f/0x70
Nov 15 02:01:42 heisenberg kernel: [] ? kthread_create_on_node+0x170/0x170
Nov 15 02:03:42 heisenberg kernel: INFO: task btrfs-transacti:28379 blocked for more than 120 seconds.
Nov 15 02:03:42 heisenberg kernel:       Not tainted 4.2.0-1-amd64 #1
Nov 15 02:03:42 heisenberg kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 02:03:42 heisenberg kernel: btrfs-transacti D 8109a1b0 0 28379 2 0x
Nov 15 02:03:42 heisenberg kernel: 88016e3e6500 0046 007a 88040be88f00
Nov 15 02:03:42 heisenberg kernel: 2659 88013807 88041e355840 7fff
Nov 15 02:03:42 heisenberg kernel: 815508e0 88013806fbb8 0007 815500ff
Nov 15 02:03:42 heisenberg kernel: Call Trace:
Nov 15 02:03:42 heisenberg kernel: [] ? bit_wait_timeout+0x70/0x70
Nov 15 02:03:42 heisenberg kernel: [] ? schedule+0x2f/0x70
Nov 15 02:03:42 heisenberg kernel: [] ? schedule_timeout+0x1f7/0x290
Nov 15 02:03:42 heisenberg kernel: [] ? extent_write_cache_pages.isra.28.constprop.43+0x222/0x330 [btrfs]
Nov 15 02:03:42 heisenberg kernel: [] ? read_tsc+0x5/0x10
Nov 15 02:03:42 heisenberg kernel: [] ? bit_wait_timeout+0x70/0x70
Nov 15 02:03:42 heisenberg kernel: [] ? io_schedule_timeout+0x9d/0x110
Nov 15 02:03:42 heisenberg kernel: [] ? bit_wait_io+0x35/0x60
Nov 15 02:03:42 heisenberg kernel: [] ? __wait_on_bit+0x5a/0x90
Nov 15 02:03:42 heisenberg kernel: [] ? find_get_pages_tag+0x116/0x150
Nov 15 02:03:42 heisenberg kernel: [] ?
Re: Using Btrfs on single drives
I've gone ahead and created a single drive Btrfs filesystem on a 3TB drive and started copying content from a raid5 array to the Btrfs volume. Initially copy speeds were very good, sustained at ~145MB/s, and I left it to run overnight. This morning I ran "btrfs fi usage /mnt/btrfs" and it reported around 700GB free. I selected another folder containing 204GB and started a copy operation, again from the raid5 array to the Btrfs volume. Copying is now materially slower and slowing further... it started at ~105MB/s and after 141GB has slowed to around 97MB/s. Is this to be expected with Btrfs or have I come across a bug of some sort?

On Sat, Nov 14, 2015 at 12:43 PM, audio muze wrote:
> I'm looking to make a "production copy" of my music and video library
> for use in our media server. It is not my intent to create any form
> of RAID array, but rather to treat each drive independently where
> filesystem is concerned and then to create a single view of the drives
> using mhddfs. As the data will remain relatively static I may also
> deploy Snapraid in conjunction with mhddfs.
>
> I'm considering using Btrfs as the underlying filesystem on each of
> the individual drives, principally to take advantage of metadata
> redundancy. Am I correct in surmising that I can turn checksumming
> off given it's of no utility where a Btrfs volume is comprised of a
> single device only?
Re: Using Btrfs on single drives
audio muze posted on Sun, 15 Nov 2015 05:27:00 +0200 as excerpted: > I've gone ahead and created a single drive Btrfs filesystem on a 3TB > drive and started copying content from a raid5 array to the Btrfs > volume. Initially copy speeds were very good sustained at ~145MB/s and > I left it to run overnight. This morning I ran btrfs fi usage > /mnt/btrfs and it reported around 700GB free. I selected another folder > containing 204GB and started a copy operation, again from the raid5 > array to the Btrfs volume. Copying is now materially slower and slowing > further...it started at ~105MB/s and after 141GB has slowed to around > 97MB/s. Is this to be expected with Btrfs of have I come across a bug > of some sort? That looks to /me/ like native drive limitations. Due to the fact that a modern hard drive spins at the same speed no matter where the read/write head is located, when it's reading/writing to the first part of the drive -- the outside -- much more linear drive distance will pass under the read/write heads in say a tenth of a second than will be the case as the last part of the drive is filled -- the inside -- and throughput will be much higher at the first of the drive. You report a 3 TB drive with initial/outside speeds of ~145 MB/s, then after copying quite some data, in the morning it had ~700 GB free, so presumably you had written something over 2 TB to it. I'll leave the precise math to someone else, but you report that it started the second copy at 105 MB/s and was down to 97 MB/s after another 141 GB, so presumably ~550 GB free. That's a slowdown of roughly a third from the initial outside edge where it was covering perhaps twice as much linear drive distance per unit of time, so it doesn't sound at all unreasonable to me. What's the actual extended sequential write throughput rating on the drive? What do the online reviews of the product say it does? Have you used hdparm to test it? 
It's kinda late for this test now, but if, before creating a big filesystem out of the whole thing, you had created a small test partition at the beginning of the drive and another at the end, you could have used hdparm on each to see the relative speed difference between them. Further, if desired, you could have created small partitions at specific locations into the drive and done similar testing, to find the speed at say 1 TB in, 2 TB in, etc. Of course after testing you could erase those temporary partitions and make one big filesystem out of it, if desired.

This is one of the big differences with SSDs: since they aren't spinning, they have direct access to any part of the device with just an address change, so their speeds, in addition to being far faster, should normally be the same across the device. But of course they cost far more per GB or TB, and tend to be vastly more expensive in the TB+ size ranges, tho you can of course combine many smaller ones using raid technologies to create a larger logical one -- you'll still be paying a marked premium for the SSD technology.

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
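For what it's worth, reasonably recent hdparm versions can do this kind of positional test without creating any partitions, via the --offset option (a sketch only: it needs root, /dev/sdX is a placeholder for the actual drive, and the offsets assume a ~3 TB disk):

```shell
# Buffered read test at several positions into the drive (requires root).
# --offset takes a value in GiB; /dev/sdX is a placeholder device name.
for off in 0 1000 2000 2700; do
    echo "=== offset ${off} GiB ==="
    hdparm -t --offset "$off" /dev/sdX
done
```

On a healthy spinning drive the reported MB/s should fall steadily as the offset grows, matching the outer-to-inner-track slowdown described above.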
Re: Using Btrfs on single drives
On Sunday 15 November 2015 04:01:57 Duncan wrote:
> audio muze posted on Sun, 15 Nov 2015 05:27:00 +0200 as excerpted:
>> I've gone ahead and created a single-drive Btrfs filesystem on a 3TB
>> drive and started copying content from a raid5 array to the Btrfs
>> volume. Initially copy speeds were very good, sustained at ~145MB/s,
>> and I left it to run overnight. This morning I ran btrfs fi usage
>> /mnt/btrfs and it reported around 700GB free. I selected another folder
>> containing 204GB and started a copy operation, again from the raid5
>> array to the Btrfs volume. Copying is now materially slower and slowing
>> further... it started at ~105MB/s and after 141GB has slowed to around
>> 97MB/s. Is this to be expected with Btrfs, or have I come across a bug
>> of some sort?
>
> That looks to /me/ like native drive limitations.
>
[Snip nice explanation]

I'll just add that I see this with my 3TB USB3 HDD, too, but also with my internal HDDs. Old drives (the oldest I had were about 10 years old) also had this problem, only scaled appropriately (the worst was something like 40/60 MB/s min./max.).

You can also see this very nicely with scrub runs (I use dstat for this): they start out at the max., but gradually slow down as they progress.

HTH
-- Marc Joliet -- "People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup
Re: Where is the disk space?
Hi,

On Fri, Nov 13, 2015 at 09:41:01AM -0800, Marc MERLIN wrote:
> root@polgara:/mnt/btrfs_root# du -sh *
> 28G   @
> 28G   @_hourly.20151113_08:04:01
> 4.0K  @_last
> 4.0K  @_last_rw
> 28G   @_rw.20151113_00:02:01
>
> root@polgara:/mnt/btrfs_root# df -h .
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb5        56G   40G  5.4G  89% /mnt/btrfs_root
>
> root@polgara:/mnt/btrfs_root# btrfs fi df .
> Data, single: total=39.85GiB, used=38.52GiB
> System, DUP: total=8.00MiB, used=16.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=6.00GiB, used=579.17MiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=208.00MiB, used=0.00B
>
> root@polgara:/mnt/btrfs_root# btrfs fi show .
> Label: 'btrfs_root'  uuid: a2a1ed7b-6bfe-4e83-bc10-727126ed17bf
>         Total devices 1 FS bytes used 39.09GiB
>         devid 1 size 55.88GiB used 51.88GiB path /dev/sdb5
>
> btrfs-progs v4.0-dirty
>
> root@polgara:/mnt/btrfs_root# btrfs balance start -dusage=80 -v /mnt/btrfs_root
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=80
> Done, had to relocate 1 out of 55 chunks
>
> Sadly, it's only running 3.17.8 because of complicated reasons, but still:
>
> 1) I have 28GB used (modulo a few files between the btrfs send snapshots
> and current status)
> 2) fi show shows I'm using 39GB; not sure where the extra 11GB came from
> 3) fi df agrees with fi show
> 4) regular df agrees on used too, but shows 5GB free instead of 15GB
> despite the filesystem being balanced.
>
> I did have a bunch of snapshots that I deleted a while ago now, but it
> looks like their blocks aren't being reclaimed.
>
> Any ideas?
Since you said you have some snapshots in between... I can think of one case to show where the space goes. Say you have a file with size=10M on a freshly created partition (the total used data space is 10M), and you have a snapshot which owns this file. Then you modify the original file by overwriting the range [3M, 5M], and right now you'll find that the total used data space increases to 15M or maybe more (because of unaligned writes and extents being padded to 4K length). This comes from our COW and extent-reference implementation, so you get the benefit of COW, but meanwhile have to live with the un-reclaimed space. It's something I was trying to fix, but I found that my approach led to other problems, so I decided to give it up.

Thanks,
-liubo

> Thanks,
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/
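The snapshot-plus-overwrite scenario described above can be reproduced on a scratch filesystem; the sketch below uses a loopback image so nothing real is touched (it needs root for the mount, and all paths and sizes are made up for illustration):

```shell
# Demonstrate snapshot-pinned space growth on a throwaway image (run as root).
truncate -s 1G /tmp/btrfs-demo.img
mkfs.btrfs -q /tmp/btrfs-demo.img
mkdir -p /mnt/demo
mount -o loop /tmp/btrfs-demo.img /mnt/demo

btrfs subvolume create /mnt/demo/vol
dd if=/dev/urandom of=/mnt/demo/vol/file bs=1M count=10 status=none
sync
btrfs subvolume snapshot -r /mnt/demo/vol /mnt/demo/snap  # snapshot now owns the 10M extent

# Overwrite [3M, 5M) in place; COW writes a new 2M extent while the
# snapshot keeps the whole original 10M extent pinned on disk.
dd if=/dev/urandom of=/mnt/demo/vol/file bs=1M seek=3 count=2 conv=notrunc status=none
sync
btrfs fi df /mnt/demo   # "Data ... used" grows past 10M, though the file is still 10M

umount /mnt/demo && rm /tmp/btrfs-demo.img
```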
More memory more jitters?
Hi List,

I have read the Gotcha[1] page:

   Files with a lot of random writes can become heavily fragmented
   (1+ extents) causing thrashing on HDDs and excessive multi-second
   spikes of CPU load on systems with an SSD or **large amount of RAM**.

Why would a large amount of memory worsen the problem? If **too much** memory is a problem, is it possible to limit the memory btrfs uses?

Background info: I am running a heavy-write database server with 96GB ram. In the worst case it causes multiple minutes of high cpu load. Systemd keeps killing and restarting services, and old jobs don't die because they're stuck in uninterruptible wait... etc.

Tried with nodatacow, but it seems to only affect new files. It is not a subvolume option either...

Regards,
Daniel

[1] https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation
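On the nodatacow point above: since the No_COW attribute only takes effect for files that are created (empty) after it is set, the usual workaround is to recreate the database files inside a directory that already carries the attribute. A sketch, with made-up paths:

```shell
# Files created under a +C directory inherit the No_COW attribute.
mkdir /var/lib/db-nocow            # made-up path for illustration
chattr +C /var/lib/db-nocow
# Copying creates fresh files, which pick up +C from the directory;
# the originals keep their COW'd (and fragmented) layout.
cp -a /var/lib/db/. /var/lib/db-nocow/
lsattr -d /var/lib/db-nocow        # should show the 'C' flag on the directory
```

Note that nodatacow/+C also disables checksumming and compression for those files, which is usually an acceptable trade-off for database files that do their own integrity checking.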
Re: Btrfs device initialisation is quite slow
It might be that your metadata is quite scattered, and if the 320GB drive is an HDD and not an SSD, then this 11s is just what it takes. Scattered metadata might be caused by the autodefrag mount option, I think (and by the fs getting older and changing often).

What is the output of:

   btrfs fi df /

You could run:

   btrfs balance start -musage=50 /

(or a bit higher number) to compact the metadata.

If this does not help, it could be that there is some error in the filesystem that takes btrfs time to figure out, but I don't have an example of or experience with that. The only thing that could cause even more excessive mount delays is an interrupted (full) balance restarting, but that would not happen every time you boot. Maybe a btrfs scrub start / could help identify HDD sectors going bad, but that is unlikely to be the case.

On Sat, Nov 14, 2015 at 5:38 AM, Robbie Smith wrote:
> Hey all
>
> I've been trying to figure out why my system (home desktop) is taking
> so long to boot. Systemd-analyze tells me that my root filesystem
> partition (which is btrfs) takes ~11 seconds to become active, and I'm
> curious as to why and whether or not I can optimise this.
>
> The primary disk has 4 partitions: an EFI/BIOS boot partition (for GRUB);
> a /boot partition (ext4); a swap partition; and the root partition. The
> disk itself is not particularly large (320 GB), and I'm using
> subvolumes to emulate partitions in btrfs. There are three top-level
> subvolumes, for /, /home, and /var, none of which have quotas, and I'm
> not at present doing snapshots because I backup every day to an
> external drive formatted with ext4.
>
> I've got a second 5 TB drive for multimedia that is also btrfs, but it
> only takes ~3 seconds to come online.
> I had been using a number of bind mounts from the multimedia drive to
> my home folder, so that $HOME/music and $HOME/videos point to the
> library, and replacing them with symlinks reduced the time by ~3
> seconds, but it still doesn't account for why the root device takes so
> long.
>
> My fstab contains the following:
>
> # /dev/sdc4 LABEL=filesystem
> UUID=4ec80601-4799-4fa8-a711-0171c180f25b /            btrfs rw,noatime,space_cache,autodefrag,subvol=rootvol 0 0
>
> # /dev/sdc4 LABEL=filesystem
> UUID=4ec80601-4799-4fa8-a711-0171c180f25b /home        btrfs rw,noatime,space_cache,autodefrag,subvol=homevol 0 0
>
> # /dev/sdc4 LABEL=filesystem
> UUID=4ec80601-4799-4fa8-a711-0171c180f25b /var         btrfs rw,noatime,space_cache,autodefrag,subvol=var 0 0
>
> # /dev/sdc2 LABEL=boot
> UUID=ca281471-0aac-4090-8660-33b8b9fee5a3 /boot        ext4  rw,relatime,data=ordered 0 2
>
> # /dev/sdb1 LABEL=library
> UUID=97226949-50e0-4a78-899e-863f5b436bcc /mnt/library btrfs rw,noatime,space_cache,autodefrag 0 0
>
> Can anyone offer any insights or advice?
Re: Using Btrfs on single drives
Goffredo Baroncelli posted on Sat, 14 Nov 2015 12:09:21 +0100 as excerpted:

> On 2015-11-14 11:43, audio muze wrote:
>> I can turn checksumming off given it's of no utility where a Btrfs
>> volume is comprised of a single device only?
>
> The checksums are used to detect data corruption; in the case of a
> btrfs-raid, the checksums are used *also* to pick the good copy.

And yes, you can turn them off (for data, not metadata), using the nodatasum mount option. Tho personally, I prefer raid1, not just for the normal raid1 capabilities, but for the ability to scrub corrupt data as well, and thus would never turn off checksumming here (except possibly in the context of nocow, for vm images, etc).

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
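For illustration, an fstab entry with data checksumming disabled might look like the line below (the UUID and mount point are placeholders; note the option only affects data written while it is active, so files written earlier keep their checksums):

```
# Placeholder UUID and mount point -- adjust for the actual drive.
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/media  btrfs  rw,noatime,nodatasum  0  0
```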
Re: More memory more jitters?
On Sat, Nov 14, 2015 at 10:11:31PM +0800, CHENG Yuk-Pong, Daniel wrote:
> Hi List,
>
> I have read the Gotcha[1] page:
>
>    Files with a lot of random writes can become heavily fragmented
>    (1+ extents) causing thrashing on HDDs and excessive multi-second
>    spikes of CPU load on systems with an SSD or **large amount of RAM**.
>
> Why could a large amount of memory worsen the problem?

Because the kernel will hang on to lots of changes in RAM for longer. With less memory, there's more pressure to write out dirty pages to disk, so the changes get written out in smaller pieces more often. With more memory, the changes being written out get "lumpier".

> If **too much** memory is a problem, is it possible to limit the
> memory btrfs uses?

There are some VM knobs you can twiddle, I believe, but I haven't really played with them myself -- I'm sure there are more knowledgeable people around here who can suggest suitable things to play with.

Hugo.

> Background info:
>
> I am running a heavy-write database server with 96GB ram. In the worst
> case it causes multiple minutes of high cpu load. Systemd keeps killing
> and restarting services, and old jobs don't die because they're stuck
> in uninterruptible wait... etc.
>
> Tried with nodatacow, but it seems to only affect new files. It is not
> a subvolume option either...
>
> Regards,
> Daniel
>
> [1] https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

-- Hugo Mills | Anyone who says their system is completely secure hugo@... carfax.org.uk | understands neither systems nor security. http://carfax.org.uk/ | PGP: E2AB1DE4 | Bruce Schneier
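The knobs Hugo mentions are the kernel's dirty-page writeback sysctls. A sketch of the arithmetic and of the kind of settings involved (the byte limits shown are illustrative starting points, not tuned recommendations):

```shell
# On a 96GB box, the default vm.dirty_ratio of 20% lets ~19GB of dirty
# data pile up before writers are throttled -- hence the long stalls.
awk -v ram_gb=96 -v ratio=20 'BEGIN {
  printf "default dirty limit: ~%.1f GB\n", ram_gb * ratio / 100
}'

# Absolute byte limits avoid the percentages scaling up with RAM.
# Printed here rather than executed, since applying them needs root:
cat <<'EOF'
sysctl -w vm.dirty_bytes=1073741824             # hard limit: 1 GiB
sysctl -w vm.dirty_background_bytes=268435456   # start background writeback at 256 MiB
EOF
```

Setting the `_bytes` variants automatically zeroes the corresponding `_ratio` knobs, so the absolute limits take precedence.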
Re: More memory more jitters?
Hugo Mills posted on Sat, 14 Nov 2015 14:31:12 + as excerpted:

>> I have read the Gotcha[1] page:
>>
>>    Files with a lot of random writes can become heavily fragmented
>>    (1+ extents) causing thrashing on HDDs and excessive multi-second
>>    spikes of CPU load on systems with an SSD or **large amount of RAM**.
>>
>> Why could a large amount of memory worsen the problem?
>
> Because the kernel will hang on to lots of changes in RAM for longer.
> With less memory, there's more pressure to write out dirty pages to
> disk, so the changes get written out in smaller pieces more often.
> With more memory, the changes being written out get "lumpier".
>
>> If **too much** memory is a problem, is it possible to limit the
>> memory btrfs uses?
>
> There are some VM knobs you can twiddle, I believe, but I haven't
> really played with them myself -- I'm sure there are more knowledgeable
> people around here who can suggest suitable things to play with.

Yes. I don't have time to explain now, but I will later, if nobody beats me to it.

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk
On 2015-11-14 10:29, Christoph Anton Mitterer wrote:
> On Sat, 2015-11-14 at 09:22 +0800, Qu Wenruo wrote:
>> Manually checked them all.
> thanks a lot :-)
>> Strangely, they are all OK... although it's good news for you.
> Oh man... you're s mean ;-D
>> They are all tree blocks and are all in the metadata block group.
> and I guess that's... expected/intended?

Yes, that's the expected behavior. But it mismatches the btrfsck error report, so it seems to be a btrfsck false alert.

> that's a relief (for me)
>
> Well, I've already started to copy all files from the device to a new
> one... unfortunately I'll lose all older snapshots (at least on the
> new fs), but instead I get skinny-metadata, which wasn't the default
> back then.

Skinny metadata is quite a nice feature; it hugely reduces the size of metadata extent items.

> (being able to copy a full fs, with all subvols/snapshots, is IMHO
> really something that should be worked on)
>
>> If the type were wrong, all the extents inside the chunk should be
>> reported as mismatching the chunk type.
> Isn't that the case? At least there are so many reported extents...

If you posted all the output, that's just a little more than nothing: only tens of errors reported, compared to millions of extents. And in your case, if a chunk were really bad, it would report about 65K errors. Also, according to the dump result, the reported ones are not continuous: they have adjacent extents, but the adjacent ones are not reported.

> I'm not so deep into btrfs... is this kinda expected, and if not, how
> could all this happen? Or is it really just a check issue and the
> filesystem fully as it should be?

I think it's a btrfsck issue; at least from the dump info, your extent tree is OK. And if there is no other error reported from btrfsck, your filesystem should be OK.

>> Did you have any smaller btrfs with the same false alert?
> Uhm... I can check, but I don't think so, especially as all the other
> btrfs I have are newer and already have skinny-metadata. The only ones
> I had without are those two big 8TB HDDs...
> Unfortunately they contain sensitive data from work, which I don't
> think I can copy; otherwise I could have sent you the device or so...

I'll check the code to find what's wrong either way, but if you have any small enough image, debugging will be much, much faster.

> In any case, I'll keep the fs in question for a while, so that I can
> do verifications in case you have patches.

Nice.

Thanks,
Qu

> thanks a lot,
> Chris.