Re: [PATCH] btrfs: scrub: use do_div() for 64-by-32 division
On Sat, Apr 08, 2017 at 11:07:37PM +0200, Adam Borowski wrote:
> Unbreaks ARM and possibly other 32-bit architectures.

Turns out those "other 32-bit architectures" happen to include i386. A modular build:

ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

With the patch, i386 builds fine.

> Tested on amd64 where all is fine, and on arm (Odroid-U2) where scrub
> sometimes works, but, like most operations, randomly dies with some badness
> that doesn't look related: io_schedule, kunmap_high. That badness wasn't
> there in 4.11-rc5, needs investigating, but since it's not connected to our
> issue at hand, I consider this patch sort-of tested.

Looks like current -next is pretty broken: while amd64 is ok, on an i386 box (non-NX Pentium 4) it hangs very early during boot, way before filesystem modules would be loaded. Qemu boots but has random hangs. So it looks like it's compile-tested only for now...

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: About free space fragmentation, metadata write amplification and (no)ssd
On Sun, 9 Apr 2017 02:21:19 +0200, Hans van Kranenburg wrote:
> On 04/08/2017 11:55 PM, Peter Grandi wrote:
> >> [ ... ] This post is way too long [ ... ]
> >
> > Many thanks for your report, it is really useful, especially the
> > details.
>
> Thanks!
>
> >> [ ... ] using rsync with --link-dest to btrfs while still
> >> using rsync, but with btrfs subvolumes and snapshots [1]. [
> >> ... ] Currently there's ~35TiB of data present on the example
> >> filesystem, with a total of just a bit more than 90,000
> >> subvolumes, in groups of 32 snapshots per remote host (daily
> >> for 14 days, weekly for 3 months, monthly for a year), so
> >> that's about 2800 'groups' of them. Inside are millions and
> >> millions and millions of files. And the best part is... it
> >> just works. [ ... ]
> >
> > That kind of arrangement, with a single large pool and very many
> > files and many subdirectories, is a worst case scenario for
> > any filesystem type, so it is amazing-ish that it works well so
> > far, especially with 90,000 subvolumes.
>
> Yes, this is one of the reasons for this post. Instead of only hearing
> about problems all day on the mailing list and IRC, we need some more
> reports of success.
>
> The fundamental functionality of doing the cow snapshots, moo, and the
> related subvolume removal on filesystem trees is so awesome. I have no
> idea how we would have been able to continue this type of backup
> system when btrfs was not available. Hardlinks and rm -rf were a total
> dead end road.

I'm absolutely no expert with arrays of the sizes that you use, but I also stopped using the hardlink-and-remove approach: it was slow to manage (rsync works slowly for it, rm works slowly for it) and it was error-prone (due to the nature of hardlinks).
I used btrfs with snapshots and rsync for a while in my personal testbed, and experienced great slowness over time: rsync became slower and slower, a full backup took 4 hours with huge %IO usage, maintaining the backup history was also slow (removing backups took a while), and rebalancing was needed due to huge amounts of wasted space. I used rsync with --inplace and --no-whole-file to waste as little space as possible.

What I first found was an adaptive rebalancer script which I still use for the main filesystem: https://www.spinics.net/lists/linux-btrfs/msg52076.html (thanks to Lionel). It works pretty well and has no big IO overhead, thanks to the adaptive multi-pass approach. But it still did not help with the slowness.

I have now been testing borgbackup for a while, and it's fast: it does the same job in 30 minutes or less instead of 4 hours, it has much better backup density, and it comes with easy history maintenance, too. I can now store much more backup history in the same space. Full restore time is about the same as copying back with rsync.

For a professional deployment I'm planning to use XFS as the storage backend and borgbackup as the backup frontend, because my findings showed that XFS allocation groups span diagonally across the disk array. That is, if you used a simple JBOD of your iSCSI LUNs, XFS would spread writes across all the LUNs without you needing to do normal RAID striping, which should eliminate the need to migrate when adding more LUNs, and the underlying storage layer on the NetApp side will probably already do RAID for redundancy anyway. Just feed more space to XFS using LVM.

Borgbackup can do everything that btrfs can do for you, but it targets the job of doing backups only: it can compress, deduplicate, encrypt and do history thinning. The only downside I found is that only one backup job at a time can access the backup repository, so you'd have to use one backup repo per source machine.
That way you cannot benefit from deduplication across multiple sources. But I'm sure NetApp can do that. OTOH, maybe backup duration drops to a point where you could serialize the backups of some machines.

> OTOH, what we do with btrfs (taking a bulldozer and driving across all
> the boundaries of sanity according to all recommendations and
> warnings) on this scale of individual remotes is something that the
> NetApp people should totally be jealous of. Backups management
> (manual create, restore etc on top of the nightlies) is self service
> functionality for our customers, and being able to implement the
> magic behind the APIs with just a few commands like a btrfs sub snap
> and some rsync gives the right amount of freedom and flexibility we
> need.

This is something I'm planning here, too: self-service backups, do a btrfs snap, but then use borgbackup for archiving purposes.

BTW: I think the 2M size comes from the assumption that SSDs manage their storage in groups of erase-block size. The optimization here would be that btrfs deallocates (and maybe trims) only whole erase blocks, which typically are 2M. This has a performance benefit. But if your underlying storage layer is RAID anyway, this no longer maps
Re: About free space fragmentation, metadata write amplification and (no)ssd
On 04/09/2017 02:21 AM, Hans van Kranenburg wrote:
> [...]
> Notice that everyone who has rotational 0 in /sys is experiencing this
> behaviour right now, when removing snapshots... [...]

Eh, 1

-- 
Hans van Kranenburg
Re: About free space fragmentation, metadata write amplification and (no)ssd
On 04/08/2017 11:55 PM, Peter Grandi wrote:
>> [ ... ] This post is way too long [ ... ]
>
> Many thanks for your report, it is really useful, especially the
> details.

Thanks!

>> [ ... ] using rsync with --link-dest to btrfs while still
>> using rsync, but with btrfs subvolumes and snapshots [1]. [
>> ... ] Currently there's ~35TiB of data present on the example
>> filesystem, with a total of just a bit more than 90,000
>> subvolumes, in groups of 32 snapshots per remote host (daily
>> for 14 days, weekly for 3 months, monthly for a year), so
>> that's about 2800 'groups' of them. Inside are millions and
>> millions and millions of files. And the best part is... it
>> just works. [ ... ]
>
> That kind of arrangement, with a single large pool and very many
> files and many subdirectories, is a worst case scenario for
> any filesystem type, so it is amazing-ish that it works well so
> far, especially with 90,000 subvolumes.

Yes, this is one of the reasons for this post. Instead of only hearing about problems all day on the mailing list and IRC, we need some more reports of success.

The fundamental functionality of doing the cow snapshots, moo, and the related subvolume removal on filesystem trees is so awesome. I have no idea how we would have been able to continue this type of backup system if btrfs had not been available. Hardlinks and rm -rf were a total dead end road.

The growth has been slow but steady (oops, fast and steady, I immediately got corrected by our sales department), but anyway, steady. This makes it possible to just let it do its thing every day and spot small changes in behaviour over time, detect patterns that could be a ticking time bomb, and then deal with them in a way that allows conscious decisions, well-tested changes and continuous measurement of the results.

But, OK, it's surely not for the faint of heart, and the devil is in the details. If it breaks, you keep the pieces. Using the NetApp hardware is one of the relevant decisions made here.
The shameful state of the most basic case of recovering (or not being able to recover) from a failure in a two-disk btrfs RAID1 is enough of a sign that the whole multi-disk handling is a nice idea, but hasn't yet gotten the amount of attention it would deserve to be something to rely on (for me). Having the data safe in my NetApp filer gives me the opportunity to take regular (like, monthly) snapshots of the complete thing, so that I have something to go back to if disaster were to strike in Linux land. Yes, it's a bit inconvenient because I have to umount for a few minutes in a quiet moment of the week, but it's worth the effort, since I can keep the eggs in a shadow basket.

OTOH, what we do with btrfs (taking a bulldozer and driving across all the boundaries of sanity according to all recommendations and warnings) on this scale of individual remotes is something that the NetApp people should totally be jealous of. Backups management (manual create, restore etc. on top of the nightlies) is self-service functionality for our customers, and being able to implement the magic behind the APIs with just a few commands like a btrfs sub snap and some rsync gives the right amount of freedom and flexibility we need.

And, monitoring of trends is so. super. important. It's not a secret that when I work with technology, I want to see what's going on in there, crack the black box open and try to understand why the lights are blinking in a specific pattern. What does this balance -dusage=75 mean? Why does it know what's 75% full and I don't? Where does it get that information from? The open source kernel code and the IOCTL API are a source of many hours of happy hacking, because they allow all of this to be done.

> As I mentioned elsewhere
> I would rather do a rotation of smaller volumes, to reduce risk,
> like "Duncan" also on this mailing list likes to do (perhaps to
> the opposite extreme).

Well, like seen in my 'keeps allocating new chunks for no apparent reason' thread...
even small filesystems can have really weird problems. :)

> As to the 'ssd'/'nossd' issue that is as described in 'man 5
> btrfs' (and I wonder whether 'ssd_spread' was tried too) but it
> is not at all obvious it should impact so much metadata
> handling. I'll add a new item in the "gotcha" list.

I suspect that the -o ssd behaviour is a decent source of the "help! my filesystem is full but df says it's not" problems we see about every week. But I can't just argue that. Apart from the fact that it was the very same problem that btrfs greeted me with when I tried it out for the first time a few years ago (and it still is one of the first problems people who start using btrfs encounter), I haven't spent time debugging the behaviour when running fully allocated.

OTOH the two-step allocation process is also a nice thing, because I *know* when I still have unallocated space available, which makes for example the free space fragmentation debugging process much more
[PATCH] btrfs-progs: Fix missing newline in man 5 btrfs
The text compress_lzo:: would show up directly after 'bigger than the
page size' on the same line.
---
 Documentation/btrfs-man5.asciidoc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/btrfs-man5.asciidoc b/Documentation/btrfs-man5.asciidoc
index c8ef1c96..90f16057 100644
--- a/Documentation/btrfs-man5.asciidoc
+++ b/Documentation/btrfs-man5.asciidoc
@@ -455,6 +455,7 @@ big_metadata::
 (since: 3.4)
 +
 the filesystem uses 'nodesize' bigger than the page size
+
 compress_lzo::
 (since: 2.6.38)
 +
-- 
2.11.0
Re: About free space fragmentation, metadata write amplification and (no)ssd
> [ ... ] This post is way too long [ ... ]

Many thanks for your report, it is really useful, especially the details.

> [ ... ] using rsync with --link-dest to btrfs while still
> using rsync, but with btrfs subvolumes and snapshots [1]. [
> ... ] Currently there's ~35TiB of data present on the example
> filesystem, with a total of just a bit more than 90,000
> subvolumes, in groups of 32 snapshots per remote host (daily
> for 14 days, weekly for 3 months, monthly for a year), so
> that's about 2800 'groups' of them. Inside are millions and
> millions and millions of files. And the best part is... it
> just works. [ ... ]

That kind of arrangement, with a single large pool and very many files and many subdirectories, is a worst case scenario for any filesystem type, so it is amazing-ish that it works well so far, especially with 90,000 subvolumes. As I mentioned elsewhere, I would rather do a rotation of smaller volumes, to reduce risk, like "Duncan" also on this mailing list likes to do (perhaps to the opposite extreme).

As to the 'ssd'/'nossd' issue, that is as described in 'man 5 btrfs' (and I wonder whether 'ssd_spread' was tried too), but it is not at all obvious that it should impact metadata handling so much. I'll add a new item to the "gotcha" list.

It is sad that 'ssd' is used by default in your case, and it is quite perplexing that the "wandering trees" problem (that is, "write amplification") is so large with 64KiB write clusters for metadata (and the 'dup' profile for metadata).

* Probably the metadata and data cluster sizes should be create or mount parameters instead of being implicit in the 'ssd' option.

* A cluster size of 2MiB for metadata and/or data presumably has some downsides, otherwise it would be the default. I wonder whether the downsides are related to barriers...
[PATCH] btrfs: scrub: use do_div() for 64-by-32 division
Unbreaks ARM and possibly other 32-bit architectures.

Fixes: 7d0ef8b4d ("Btrfs: update scrub_parity to use u64 stripe_len")
Reported-by: Icenowy Zheng
Signed-off-by: Adam Borowski
---
You'd probably want to squash this with Liu's commit, to be nice to future bisects.

Tested on amd64 where all is fine, and on arm (Odroid-U2) where scrub sometimes works, but, like most operations, randomly dies with some badness that doesn't look related: io_schedule, kunmap_high. That badness wasn't there in 4.11-rc5, needs investigating, but since it's not connected to our issue at hand, I consider this patch sort-of tested.

 fs/btrfs/scrub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b6fe1cd08048..95372e3679f3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2407,7 +2407,7 @@ static inline void __scrub_mark_bitmap(struct scrub_parity *sparity,
 
 	start -= sparity->logic_start;
 	start = div64_u64_rem(start, sparity->stripe_len, &offset);
-	offset /= sectorsize;
+	do_div(offset, sectorsize);
 	nsectors = (int)len / sectorsize;
 
 	if (offset + nsectors <= sparity->nsectors) {
-- 
2.11.0
Re: Does btrfs get nlink on directories wrong? -- was Re: [PATCH 2/4] xfstests: Add first statx test [ver #5]
Eryu Guan wrote:
> > Overlayfs uses nlink = 1 for merge dirs to silence 'find' et al.
> > Ext4 uses nlink = 1 for directories with more than 32K subdirs
> > (EXT4_FEATURE_RO_COMPAT_DIR_NLINK).
> >
> > But in both those fs newly created directories will have nlink = 2.
>
> Is there a conclusion on this? Seems the test should be updated
> accordingly?

I've dropped the nlink check on directories.

David
About free space fragmentation, metadata write amplification and (no)ssd
So... today a real-life story / btrfs use case example from the trenches at work...

tl;dr:
1) btrfs is awesome, but you have to carefully choose which parts of it you want to use or avoid.
2) improvements can be made, but at least the problems relevant for this use case are manageable and behaviour is quite predictable.

This post is way too long, but I hope it's a fun read for a lazy sunday afternoon. :) Otherwise, skip some sections, they have headers.

The example filesystem for this post is one of the backup server filesystems we have, running btrfs for the data storage.

== About ==

In Q4 2014, we converted all our backup storage from ext4 and using rsync with --link-dest to btrfs while still using rsync, but with btrfs subvolumes and snapshots [1]. For every new backup, it creates a writable snapshot of the previous backup and then uses rsync on the file tree to get changes from the remote.

Currently there's ~35TiB of data present on the example filesystem, with a total of just a bit more than 90,000 subvolumes, in groups of 32 snapshots per remote host (daily for 14 days, weekly for 3 months, monthly for a year), so that's about 2800 'groups' of them. Inside are millions and millions and millions of files.

And the best part is... it just works. Well, almost, given the title of the post. But the effort needed for creating all backups and doing subvolume removal for expiries scales linearly with the amount of them.

== Hardware and filesystem setup ==

The actual disk storage is done using NetApp storage equipment, in this case a FAS2552 with 1.2T SAS disks and some extra disk shelves. Storage is exported over multipath iSCSI over ethernet, and then grouped together again with multipathd and LVM, striping (like RAID0) over active/active controllers. We've been using this setup for years now in different places, and it works really well. So, using this, we keep the whole RAID / multiple disks / hardware disk failure part outside the reach of btrfs.
And yes, checksums are done twice, but who cares. ;]

Since the maximum iSCSI LUN size is 16TiB, the maximum block device size that we use by combining two is 32TiB. This filesystem is already bigger, so at some point we added two new LUNs in a new LVM volume group, and added the result to the btrfs filesystem (yay!):

    Total devices 2 FS bytes used 35.10TiB
    devid 1 size 29.99TiB used 29.10TiB path /dev/xvdb
    devid 2 size 12.00TiB used 11.29TiB path /dev/xvdc

    Data, single: total=39.50TiB, used=34.67TiB
    System, DUP: total=40.00MiB, used=6.22MiB
    Metadata, DUP: total=454.50GiB, used=437.36GiB
    GlobalReserve, single: total=512.00MiB, used=0.00B

Yes, DUP metadata, more about that later...

I can also umount the filesystem for a short time, take a snapshot on NetApp level of the LUNs, clone them and then have a writable clone of a 40TiB btrfs filesystem, to be able to do crazy things and tests before really making changes, like kernel version upgrades or things like converting to the free space tree etc.

From end 2014 to september 2016, we used the 3.16 LTS kernel from Debian Jessie. Since september 2016, it's 4.7.5, after torturing it for two weeks on such a clone, replaying the daily workload on it.

== What's not so great... Allocated but unused space... ==

Since the beginning, the filesystem showed a tendency to accumulate allocated but unused space that didn't get reused by new writes. In the last months of using kernel 3.16 the situation worsened, ending up with about 30% allocated but unused space (11TiB...), while the filesystem kept allocating new space all the time instead of reusing it:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q23.png

Using balance with the 3.16 kernel and space cache v1 to fight this was almost impossible because of the amount of scattered-out metadata writes + amplification (a 1:40 overall read/write ratio during balance) and writing space cache information over and over again on every commit.
When making the switch to the 4.7 kernel, I also switched to the free space tree, eliminating the space cache flush problems, and did a mega-balance operation which brought it back down quite a bit. Here's what it looked like for the last 6 months:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q4-17-Q1.png

This is not too bad, but also not good enough. I want my picture to become brighter white than this:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-03-14-backups-heatmap-chunks.png

The picture shows that the unused space is scattered around the whole filesystem.

So about a month ago, I continued searching the kernel code for the cause of this behaviour. This is a fun, but time-consuming and often mind-boggling activity, because you run into 10 different interesting things at the same time and want to start finding out about all of them at once etc. :D

The first two things I found out about were:

1) the 'free space cluster' code, which is responsible
Re: Linux next-20170407 failed to build on ARM due to usage of mod in btrfs code
On Sat, Apr 08, 2017 at 02:45:34PM -0300, Fabio Estevam wrote:
> On Sat, Apr 8, 2017 at 1:02 PM, Icenowy Zheng wrote:
> > Hello everyone,
> > Today I tried to build a kernel with btrfs enabled on ARM, then when linking
> > I met such an error:
> >
> > ```
> > fs/built-in.o: In function `scrub_bio_end_io_worker':
> > acl.c:(.text+0x2f0450): undefined reference to `__aeabi_uldivmod'
> > fs/built-in.o: In function `scrub_extent_for_parity':
> > acl.c:(.text+0x2f0bcc): undefined reference to `__aeabi_uldivmod'
> > fs/built-in.o: In function `scrub_raid56_parity':
> > acl.c:(.text+0x2f12a8): undefined reference to `__aeabi_uldivmod'
> > acl.c:(.text+0x2f15c4): undefined reference to `__aeabi_uldivmod'
> > ```
> >
> > These functions are found in fs/btrfs/scrub.c .
> >
> > After disabling btrfs the kernel is successfully built.
>
> I see the same error with ARM imx_v6_v7_defconfig + btrfs support.
>
> Looks like it is caused by commit 7d0ef8b4dbbd220 ("Btrfs: update
> scrub_parity to use u64 stripe_len").

+1, my bisect just finished, same bad commit.

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!
Re: [PATCH 06/12] audit: Use timespec64 to represent audit timestamps
> I have no problem merging this patch into audit/next for v4.12, would
> you prefer me to do that so at least this patch is merged?

This would be fine. But I think whoever takes the last 2 deletion patches should also take them. I'm not sure how that part works out.

> It would probably make life a small bit easier for us in the audit
> world too as it would reduce the potential merge conflict. However,
> that's a relatively small thing to worry about.

-Deepa
Re: Linux next-20170407 failed to build on ARM due to usage of mod in btrfs code
On Sat, Apr 8, 2017 at 1:02 PM, Icenowy Zheng wrote:
> Hello everyone,
> Today I tried to build a kernel with btrfs enabled on ARM, then when linking
> I met such an error:
>
> ```
> fs/built-in.o: In function `scrub_bio_end_io_worker':
> acl.c:(.text+0x2f0450): undefined reference to `__aeabi_uldivmod'
> fs/built-in.o: In function `scrub_extent_for_parity':
> acl.c:(.text+0x2f0bcc): undefined reference to `__aeabi_uldivmod'
> fs/built-in.o: In function `scrub_raid56_parity':
> acl.c:(.text+0x2f12a8): undefined reference to `__aeabi_uldivmod'
> acl.c:(.text+0x2f15c4): undefined reference to `__aeabi_uldivmod'
> ```
>
> These functions are found in fs/btrfs/scrub.c .
>
> After disabling btrfs the kernel is successfully built.

I see the same error with ARM imx_v6_v7_defconfig + btrfs support.

Looks like it is caused by commit 7d0ef8b4dbbd220 ("Btrfs: update scrub_parity to use u64 stripe_len").
Linux next-20170407 failed to build on ARM due to usage of mod in btrfs code
Hello everyone,

Today I tried to build a kernel with btrfs enabled on ARM; when linking, I met such an error:

```
fs/built-in.o: In function `scrub_bio_end_io_worker':
acl.c:(.text+0x2f0450): undefined reference to `__aeabi_uldivmod'
fs/built-in.o: In function `scrub_extent_for_parity':
acl.c:(.text+0x2f0bcc): undefined reference to `__aeabi_uldivmod'
fs/built-in.o: In function `scrub_raid56_parity':
acl.c:(.text+0x2f12a8): undefined reference to `__aeabi_uldivmod'
acl.c:(.text+0x2f15c4): undefined reference to `__aeabi_uldivmod'
```

These functions are found in fs/btrfs/scrub.c.

After disabling btrfs, the kernel builds successfully.

For this problem, see also [1], which used to be a similar bug in the PL330 driver code.

[1] https://patchwork.kernel.org/patch/5299081/

Thanks,
Icenowy
Re: Does btrfs get nlink on directories wrong? -- was Re: [PATCH 2/4] xfstests: Add first statx test [ver #5]
On Wed, Apr 05, 2017 at 03:32:30PM +0300, Amir Goldstein wrote:
> On Wed, Apr 5, 2017 at 3:30 PM, David Sterba wrote:
> > On Wed, Apr 05, 2017 at 11:53:41AM +0100, David Howells wrote:
> >> I've added a test to xfstests that exercises the new statx syscall.
> >> However, it fails on btrfs:
> >>
> >> Test statx on a directory
> >> +[!] stx_nlink differs, 1 != 2
> >> +Failed
> >> +stat_test failed
> >>
> >> because a new directory it creates has an nlink of 1, not 2. Is this a
> >> case of my making an incorrect assumption or is it an fs bug?
> >
> > Afaik nlink == 1 means that there's no accounting of subdirectories, and
> > it's a valid value. The 'find' utility can use nlink to optimize
> > directory traversal but otherwise I'm not aware of other usage.
> >
> > All directories in btrfs have nlink == 1.
>
> FYI,
>
> Overlayfs uses nlink = 1 for merge dirs to silence 'find' et al.
> Ext4 uses nlink = 1 for directories with more than 32K subdirs
> (EXT4_FEATURE_RO_COMPAT_DIR_NLINK).
>
> But in both those fs newly created directories will have nlink = 2.

Is there a conclusion on this? Seems the test should be updated accordingly?

Thanks,
Eryu
Re: [PATCH 06/12] audit: Use timespec64 to represent audit timestamps
On Fri, Apr 7, 2017 at 8:57 PM, Deepa Dinamani wrote:
> struct timespec is not y2038 safe.
> Audit timestamps are recorded in string format into
> an audit buffer for a given context.
> These mark the entry timestamps for the syscalls.
> Use y2038 safe struct timespec64 to represent the times.
> The log strings can handle this transition as strings can
> hold up to 1024 characters.
>
> Signed-off-by: Deepa Dinamani
> Reviewed-by: Arnd Bergmann
> Acked-by: Paul Moore
> Acked-by: Richard Guy Briggs
> ---
>  include/linux/audit.h |  4 ++--
>  kernel/audit.c        | 10 +++++-----
>  kernel/audit.h        |  2 +-
>  kernel/auditsc.c      |  6 +++---
>  4 files changed, 11 insertions(+), 11 deletions(-)

I have no problem merging this patch into audit/next for v4.12, would you prefer me to do that so at least this patch is merged?

It would probably make life a small bit easier for us in the audit world too as it would reduce the potential merge conflict. However, that's a relatively small thing to worry about.

> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 6fdfefc..f830508 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -332,7 +332,7 @@ static inline void audit_ptrace(struct task_struct *t)
>  /* Private API (for audit.c only) */
>  extern unsigned int audit_serial(void);
>  extern int auditsc_get_stamp(struct audit_context *ctx,
> -			      struct timespec *t, unsigned int *serial);
> +			      struct timespec64 *t, unsigned int *serial);
>  extern int audit_set_loginuid(kuid_t loginuid);
>  
>  static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
> @@ -511,7 +511,7 @@ static inline void __audit_seccomp(unsigned long syscall, long signr, int code)
>  static inline void audit_seccomp(unsigned long syscall, long signr, int code)
>  { }
>  static inline int auditsc_get_stamp(struct audit_context *ctx,
> -			     struct timespec *t, unsigned int *serial)
> +			     struct timespec64 *t, unsigned int *serial)
>  {
>  	return 0;
>  }
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 2f4964c..fcbf377 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1625,10 +1625,10 @@ unsigned int audit_serial(void)
>  }
>  
>  static inline void audit_get_stamp(struct audit_context *ctx,
> -				   struct timespec *t, unsigned int *serial)
> +				   struct timespec64 *t, unsigned int *serial)
>  {
>  	if (!ctx || !auditsc_get_stamp(ctx, t, serial)) {
> -		*t = CURRENT_TIME;
> +		ktime_get_real_ts64(t);
>  		*serial = audit_serial();
>  	}
>  }
> @@ -1652,7 +1652,7 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
>  				     int type)
>  {
>  	struct audit_buffer *ab;
> -	struct timespec t;
> +	struct timespec64 t;
>  	unsigned int uninitialized_var(serial);
>  
>  	if (audit_initialized != AUDIT_INITIALIZED)
> @@ -1705,8 +1705,8 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
>  	}
>  
>  	audit_get_stamp(ab->ctx, &t, &serial);
> -	audit_log_format(ab, "audit(%lu.%03lu:%u): ",
> -			 t.tv_sec, t.tv_nsec/1000000, serial);
> +	audit_log_format(ab, "audit(%llu.%03lu:%u): ",
> +			 (unsigned long long)t.tv_sec, t.tv_nsec/1000000, serial);
>  
>  	return ab;
>  }
> diff --git a/kernel/audit.h b/kernel/audit.h
> index 0f1cf6d..cdf96f4 100644
> --- a/kernel/audit.h
> +++ b/kernel/audit.h
> @@ -112,7 +112,7 @@ struct audit_context {
>  	enum audit_state	state, current_state;
>  	unsigned int		serial;		/* serial number for record */
>  	int			major;		/* syscall number */
> -	struct timespec		ctime;		/* time of syscall entry */
> +	struct timespec64	ctime;		/* time of syscall entry */
>  	unsigned long		argv[4];	/* syscall arguments */
>  	long			return_code;	/* syscall return code */
>  	u64			prio;
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index e59ffc7..a2d9217 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -1532,7 +1532,7 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2,
>  		return;
>  
>  	context->serial     = 0;
> -	context->ctime      = CURRENT_TIME;
> +	ktime_get_real_ts64(&context->ctime);
>  	context->in_syscall = 1;
>  	context->current_state  = state;
>  	context->ppid       = 0;
> @@ -1941,13 +1941,13 @@ EXPORT_SYMBOL_GPL(__audit_inode_child);
>  /**
>   * auditsc_get_stamp - get local copies of audit_context values
>   * @ctx:
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
On 04/08/2017 01:16 PM, Hans van Kranenburg wrote:
> On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
>> Ok, I'm going to revive a year old mail thread here with interesting
>> new info:
>>
>> [...]
>>
>> Now, another surprise:
>>
>> From the exact moment I did mount -o remount,nossd on this filesystem,
>> the problem vanished.
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
>>
>> I don't have a new video yet, but I'll set up a cron tonight and post
>> it later.
>>
>> I'm going to send another mail specifically about the nossd/ssd
>> behaviour and other things I found out last week, but that'll probably
>> be tomorrow.
>
> Well, there it is:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
>
> Amazing... :)

I'll update the file later with extra frames.

By the way,

1. For the log files in /var/log... logrotate behaves as a defrag tool
of course. The small free space gaps left behind when scraping the
current log file together and rewriting it as 1 big gzipped file can be
reused throughout the next day or whatever interval by the slow writes
again.

2. For the /var/spool/postfix... small files come and go, and that's
fine now.

3. For the mailman mbox files, which get appended all the time... They
can either stay where they are, having some more extents scattered
around, or, an entry in the monthly cron to point defrag at the files
of last month (which will never change again) will solve that
efficiently.

All of that doesn't sound like abnormal things to do when punishing the
filesystem with a 'slow small write' workload. I'm happy to be able to
keep this thing on btrfs.

When moving all the mailman stuff over from a previous VM, I first made
it ext4 again, then immediately ended up with no inodes left (of
course!) while copying the mailman archive, and then thought .. arg ..
mkfs.btrfs, yay, unlimited inodes!
:) I was almost at the point of converting it back to ext4 after all
because of the exploding unused free space problems, but now that's
prevented just in time. :D

Moo,

-- 
Hans van Kranenburg
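The monthly cron entry for defragmenting last month's mbox files, as described above, could look something like this. This is only a sketch in system-crontab format; the archive path and the `-t` extent-size target are assumptions, not taken from the original mail:

```crontab
# Hypothetical /etc/cron.d entry: once a month, defragment the mailman
# mbox archives from the previous month, which will never be appended
# to again. -r recurses into the directory, -t sets the target extent
# size to consider an extent "fragmented".
# m h dom mon dow user  command
0 4 1 * * root  btrfs filesystem defragment -r -t 32M /var/lib/mailman/archives
```

Since the files in question are append-only until the month rolls over, defragmenting them once afterwards is a one-time cost rather than a recurring fight with the allocator.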
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
> Ok, I'm going to revive a year old mail thread here with interesting
> new info:
>
> [...]
>
> Now, another surprise:
>
> From the exact moment I did mount -o remount,nossd on this filesystem,
> the problem vanished.
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
>
> I don't have a new video yet, but I'll set up a cron tonight and post
> it later.
>
> I'm going to send another mail specifically about the nossd/ssd
> behaviour and other things I found out last week, but that'll probably
> be tomorrow.

Well, there it is:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

Amazing... :) I'll update the file later with extra frames.

-- 
Hans van Kranenburg
Re: During a btrfs balance nearly all quotas of the subvolumes became exceeded
Markus Baier posted on Fri, 07 Apr 2017 16:17:10 +0200 as excerpted:

> Hello btrfs-list,
>
> today a strange behaviour appeared during the btrfs balance process.
>
> I started a btrfs balance operation on the /home subvolume that
> contains, as children, all the subvolumes for the home directories of
> the users, every subvolume with its own quota.
>
> A short time after the start of the balance process no user was able
> to write into his home directory anymore. All users got the "your disc
> quota exceeded" message.
>
> Then I checked the qgroups and got the following result:
>
> btrfs qgroup show -r /home/
> qgroupid  rfer      excl      max_rfer
>
> 0/5       16.00KiB  16.00KiB  none
> 0/257     16.00KiB  16.00KiB  none
> 0/258     16.00EiB  16.00EiB  200.00GiB
> 0/259     16.00EiB  16.00EiB  200.00GiB
> 0/260     16.00EiB  16.00EiB  200.00GiB
> 0/261     16.00EiB  16.00EiB  200.00GiB
> 0/267     28.00KiB  28.00KiB  200.00GiB
>
> 1/1       16.00EiB  16.00EiB  900.00GiB
>
> For most of the subvolumes btrfs calculated 16.00EiB (I think this is
> the maximum possible size of the filesystem) as the amount of used
> space. A few subvolumes, all of them nearly empty like 0/267, were not
> affected and showed the normal size of 28.00KiB.
>
> I was able to fix the problem with the btrfs quota rescan /home
> command. But my question is, is this an already known bug, and what
> can I do to prevent this problem during the next balance run?
>
> uname -a
> Linux condor-control 4.4.39-gentoo [...]

Known bug. The btrfs quota subsystem remains somewhat buggy and
unstable, with negative quota issues (IIRC, 16 EiB is the unsigned
64-bit integer representation of a negative signed int, I believe -1)
being one of the continuing problems. Tho it's actively being worked on
and you may well find that the latest current kernel release (4.10) is
better in this regard, tho I'd still not entirely trust it, and there
remain quota-fix patches in the active submission queue (just check the
list).
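The "16 EiB is really a negative counter" point is easy to check with plain integer arithmetic; a minimal sketch (ordinary Python, not btrfs code):

```python
def u64(x: int) -> int:
    """Reinterpret an integer as an unsigned 64-bit value
    (two's complement), the way a u64 byte counter stores it."""
    return x & 0xFFFFFFFFFFFFFFFF

EIB = 2 ** 60  # one exbibyte

# A qgroup byte counter that underflows to -1 (or any small negative
# value) wraps around to just under 2^64 bytes, which reporting tools
# then round and display as "16.00EiB".
assert u64(-1) == 2 ** 64 - 1
print(f"{u64(-1) / EIB:.2f} EiB")      # 16.00 EiB
print(f"{u64(-16384) / EIB:.2f} EiB")  # still 16.00 EiB after rounding
```

So the 16.00EiB rows in the qgroup output above are consistent with accounting having gone slightly negative during the balance, not with any real 16 EiB of usage.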
Note that quotas seriously increase btrfs scaling issues as well,
typically increasing balance times multi-fold, particularly as they
interact with snapshots, which have scaling issues of their own, such
that a cap of a couple hundred snapshots per subvolume is strongly
recommended, even without quotas on top of it. Both memory usage and
processing time are affected, primarily for balance and check.

As a result of btrfs-quota's long continuing accuracy issues in
addition to the scaling issues, my recommendation has long been the
following. Generally, quota users fall into three categories, described
here with my recommendations for each:

1) Those who know the quota issues and are actively working with the
devs to test and correct them, helping to eventually stabilize this
feature into practical usability, tho it has taken some years and the
job, while getting closer to finished, remains yet unfinished. Bless
them! Keep it up! =:^)

2) Those who may find the quota feature generally useful, but don't
actually require it for their use-case. I recommend that these users
turn off quotas until such time as they've been generally demonstrated
to be reliable and stable. At this point they're simply not worth the
hassle. Even then, the scaling issues may remain.

3) Those who actually depend on quotas working correctly as a part of
their use-case. These users should really consider a more mature and
stable filesystem where the quota feature is known to work as reliably
as their use-case requires. Btrfs is certainly stabilizing and
maturing, but it's simply not there yet for this use-case.

One /possible/ alternative, if staying with btrfs for its other
features is desired, is the pre-quota solution of creating multiple
independent filesystems on top of lvm or partitions, and using the size
of the filesystems to enforce restrictions that quotas would otherwise
be used for. Of course independent VM images is a more complicated
variant of this.
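That pre-quota alternative can be sketched with stock LVM and btrfs-progs commands. The volume-group name, user name, mount point, and the 200G size (matching the max_rfer limits above) are all hypothetical values for illustration:

```shell
# One fixed-size logical volume per user: the filesystem size itself
# enforces the limit a qgroup would otherwise enforce.
lvcreate --size 200G --name home-alice vg0
mkfs.btrfs /dev/vg0/home-alice
mount /dev/vg0/home-alice /home/alice
```

The trade-off is that space cannot be shared or overcommitted between users the way a single filesystem with qgroups allows, though LVM does let you grow a volume (and then `btrfs filesystem resize` it) later.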
Unfortunately, given that you apparently have multiple users and are
using quotas as resource-sharing enforcement, you may well fall into
this third category. =:^(

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
Hans van Kranenburg posted on Fri, 07 Apr 2017 23:25:29 +0200 as
excerpted:

> So, this is why putting your /var/log, /var/lib/mailman and /var/spool
> on btrfs is a terrible idea.
>
> Because the allocator keeps walking forward, every file that is
> created and then removed leaves a blank spot behind.
>
> Autodefrag makes the situation only a little bit better, changing the
> resulting pattern from a sky full of stars into a snowstorm. The
> result of taking a few small writes and rewriting them again is that
> again the small parts of free space are left behind.
>
> [... B]ecause of the pattern we end up with, a large write apparently
> fails (the files downloaded when doing apt-get update by daily cron)
> which causes a new chunk allocation. This is clearly visible in the
> videos. Directly after that, the new chunk gets filled with the same
> pattern, because the extent allocator now continues there and next day
> same thing happens again etc...
>
> Now, another surprise:
>
> From the exact moment I did mount -o remount,nossd on this filesystem,
> the problem vanished.

That large write in the middle of small writes pattern might be why
I've not seen the problem on my btrfs', on ssd, here.

Remember, I'm the guy who keeps advocating multiple independent small
btrfs on partitioned-up larger devices, with the splits between
independent btrfs' based on tasks. So I have a quite tiny sub-GiB
independent log btrfs handling those slow incremental writes to
generally smaller files, a separate / with the main system on it that's
mounted read-only unless I'm actively updating it, a separate home with
my reasonably small size but written-at-once non-media user files, a
separate media partition/fs with my much larger but very seldom
rewritten media files, and a separate update partition/fs with the
local cache of the distro tree and overlays, sources (since it's
gentoo), built binpkg cache, etc, with small to medium-large files that
are comparatively frequently replaced.
So the relatively small slow-written and frequently rotated log files
are isolated to their own partition/fs, undisturbed by the much larger
update-writes to the updates and / partitions/fs, isolating them from
the update-trigger that triggers the chunk allocations on your larger
single general purpose filesystem/image, amongst all those fragmenting
slow logfile writes.

Very interesting and informative thread, BTW. I'm learning quite a bit.
=:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman