[PATCH V9] Btrfs: enhance raid1/10 balance heuristic
From: Timofey Titovets

Currently btrfs raid1/10 balances read requests across mirrors based on
pid % number of mirrors.

Add logic that considers:
- whether one of the devices is non-rotational
- the queue length of each device

By default the pid % num_mirrors guess is still used, but:
- if one of the mirrors is non-rotational, reroute requests to it
- if the other mirror has a shorter queue than the optimal one,
  repick that mirror

To avoid round-robin request balancing, use the absolute difference of
the queue lengths and only rebalance when the difference exceeds a
threshold:
- threshold 8 for rotational devices
- threshold 2 when all devices are non-rotational

Some bench results from the mailing list (Dmitrii Tcvetkov):

Benchmark summary (arithmetic mean of 3 runs):
         Mainline     Patch
RAID1  | 18.9 MiB/s | 26.5 MiB/s
RAID10 | 30.7 MiB/s | 30.7 MiB/s

mainline, fio got lucky to read from first HDD (quite slow HDD):
Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
 read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
 lat (msec): min=2, max=825, avg=60.17, stdev=65.06

mainline, fio got lucky to read from second HDD (much more modern):
Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
 read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
 lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56

mainline, fio got lucky to read from an SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
 read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec)
 lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36

With the patch, 2 HDDs:
Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
 read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
 lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14

With the patch, HDD(old one)+SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
 read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
 lat (usec): min=363, max=346752, avg=1381.73, stdev=6948.32

Changes:
v1 -> v2:
 - Use the part_in_flight() helper from genhd.c to get the queue length
 - Move the guess code to guess_optimal()
 - Change the balancer logic: use pid % num_mirrors by default and
   rebalance on spinning rust only if one of the underlying devices is
   overloaded
v2 -> v3:
 - Fix the argument for RAID10 - use sub_stripes instead of num_stripes
v3 -> v4:
 - Rebase on latest misc-next
v4 -> v5:
 - Rebase on latest misc-next
v5 -> v6:
 - Fix spelling
 - Include bench results
v6 -> v7:
 - Fixes based on Nikolay Borisov's review:
   * Assume num == 2
   * Remove "for" loops based on that assumption, where possible
v7 -> v8:
 - Add a comment about the magic '2' in the guess function
v8 -> v9:
 - Rebase on latest misc-next
 - Simplify code
 - Use abs() instead of round_down() for the approximation; abs() is
   fairer

Signed-off-by: Timofey Titovets
Tested-by: Dmitrii Tcvetkov
---
 block/genhd.c      |  1 +
 fs/btrfs/volumes.c | 88 +-
 2 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 703267865f14..fb35c85a7f42 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -84,6 +84,7 @@ unsigned int part_in_flight(struct request_queue *q, struct hd_struct *part)
 
 	return inflight;
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
 		       unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1c2a6e4b39da..8671c2bdced6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "ctree.h"
 #include "extent_map.h"
@@ -29,6 +30,8 @@
 #include "sysfs.h"
 #include "tree-checker.h"
 
+#define BTRFS_RAID_1_10_MAX_MIRRORS 2
+
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
 		.sub_stripes	= 2,
@@ -5482,6 +5485,88 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the in flight queue length of bdev
+ *
+ * @bdev: target bdev
+ */
+static uint32_t bdev_get_queue_len(struct block_device *bdev)
+{
+	struct hd_struct *bd_part = bdev->bd_part;
+	struct request_queue *rq = bdev_get_queue(bdev);
+
+	return part_in_flight(rq, bd_part);
+}
+
+/**
+ * guess_o
Re: psa, wiki needs updating now that Btrfs supports swapfiles in 5.0
FAQ updated. Thanks. -- Have a nice day, Timofey.
Re: [RFC PATCH] raid6_pq: Add module options to prefer algorithm
Gentle ping. пт, 4 мая 2018 г. в 03:15, Timofey Titovets : > > Skip testing unnecessary algorithms to speedup module initialization > > For my systems: > Before: 1.510s (initrd) > After: 977ms (initrd) # I set prefer to fastest algorithm > > Dmesg after patch: > [1.190042] raid6: avx2x4 gen() 28153 MB/s > [1.246683] raid6: avx2x4 xor() 19440 MB/s > [1.246684] raid6: using algorithm avx2x4 gen() 28153 MB/s > [1.246684] raid6: xor() 19440 MB/s, rmw enabled > [1.246685] raid6: using avx2x2 recovery algorithm > > Signed-off-by: Timofey Titovets > CC: linux-btrfs@vger.kernel.org > --- > lib/raid6/algos.c | 28 +--- > 1 file changed, 25 insertions(+), 3 deletions(-) > > diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c > index 5065b1e7e327..abfcb4107fc3 100644 > --- a/lib/raid6/algos.c > +++ b/lib/raid6/algos.c > @@ -30,6 +30,11 @@ EXPORT_SYMBOL(raid6_empty_zero_page); > #endif > #endif > > +static char *prefer_name; > + > +module_param(prefer_name, charp, 0); > +MODULE_PARM_DESC(prefer_name, "Prefer gen/xor() algorithm"); > + > struct raid6_calls raid6_call; > EXPORT_SYMBOL_GPL(raid6_call); > > @@ -155,10 +160,27 @@ static inline const struct raid6_calls > *raid6_choose_gen( > { > unsigned long perf, bestgenperf, bestxorperf, j0, j1; > int start = (disks>>1)-1, stop = disks-3; /* work on the second > half of the disks */ > - const struct raid6_calls *const *algo; > - const struct raid6_calls *best; > + const struct raid6_calls *const *algo = NULL; > + const struct raid6_calls *best = NULL; > + > + if (strlen(prefer_name)) { > + for (algo = raid6_algos; strlen(prefer_name) && *algo; > algo++) { > + if (!strncmp(prefer_name, (*algo)->name, 8)) { > + best = *algo; > + break; > + } > + } > + if (!best) > + pr_info("raid6: %-8s prefer not found\n", > prefer_name); > + } > + > + > + > + if (!algo) { > + algo = raid6_algos; > + } > > - for (bestgenperf = 0, bestxorperf = 0, best = NULL, algo = > raid6_algos; *algo; algo++) { > + for (bestgenperf = 0, bestxorperf = 0; *algo; algo++) { > if (!best || (*algo)->prefer >= best->prefer) { > if ((*algo)->valid && !(*algo)->valid()) > continue; > -- > 2.17.0 -- Have a nice day, Timofey.
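As a usage note, if this RFC were applied the new parameter would be set either at module load time or, for a built-in raid6_pq, on the kernel command line. The algorithm name must match one of the entries printed by the existing benchmark; avx2x4 below is taken from the dmesg output quoted above, so adjust it for your CPU:

    modprobe raid6_pq prefer_name=avx2x4
    # or, when raid6_pq is built in:
    raid6_pq.prefer_name=avx2x4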
Re: [PATCH V7] Btrfs: enhance raid1/10 balance heuristic
Oh, just forgot to answer. ср, 14 нояб. 2018 г. в 04:27, Anand Jain : > > I am ok with the least used path approach here for the IO routing > that's probably most reasonable in generic configurations. It can > be default read mirror policy as well. (thanks, that's pleasant for me %) ) > But as I mentioned. Not all configurations would agree to the heuristic > approach here. For example: To make use of the SAN storage cache to > get high IO throughput read must access based on the LBA, And this > heuristic would make matter worst. There are plans to add more > options read_mirror_policy [1]. > > [1] > https://patchwork.kernel.org/patch/10403299/ Can you please add some example of SAN stack were that will make something 'worst'? Moreover pid lb will also not play good for your example. In SAN stack client always see one device with N path. And no raid1 balancing can happen. Maybe i didn't see all setups in world, but configure raid1 from 2 remote devices sounds very bad for me. Even drbd only provide one logical device to end user. > I would rather provide the configuration tune-ables to the use > cases rather than fixing it using heuristic. Heuristic are good > only with the known set of IO pattern for which heuristic is > designed for. Yep, but how compex that tunables must be? i.e. we'll _always_ have cerner cases with bad behaviour. (Also i prefer have sysfs tunables for that, instead of adding another mount option.) > This is not the first time you are assuming heuristic would provide > the best possible performance in all use cases. As I mentioned > in the compression heuristic there was no problem statement that > you wanted to address using heuristic, theoretically the integrated > compression heuristic would have to do a lot more computation when > all the file-extents are compressible, its not convenience to me > how compression heuristic would help on a desktop machine where > most of the files are compressible. Different tools exists, because we have different use cases. If something adds more problems than it solves, it must be changed or just purged. Moreover, claim what on every desktop machine most of files are compressible not true. I don't want to make long discussion about "spherical cow" in space. Just my example: ➜ ~ sudo compsize /mnt Processed 1198830 files, 1404382 regular extents (1481132 refs), 675870 inline. Type Perc Disk Usage Uncompressed Referenced TOTAL 77% 240G 308G 285G none 100% 202G 202G 176G zlib37% 1.8G 5.1G 5.6G lzo 61% 64M 104M 126M zstd35% 36G 100G 103G That's are system + home. Home have different type of trash in it: videos, photos, source repos, steam games, docker images. DE Apps DB and other random things. Some data have NOCOW because i just lack of "mental strength" to finish fix of bad behaviour autodefrag with compressed data. As you can see most volume of data are not compressed. > IMO heuristic are good only for a set of types of workload. Giving > an option to move away from it for the manual tuning is desired. > > Thanks, Anand Any way, may be you right about demand in adding some control about internal behaviour. And we can combine our work to properly support that. (I don't like over engineering, and just try avoid way where users will start to switch random flags to make things better.) But before that we need some feedback from upstream. Bad or good. Because currently core btrfs devs works on companies, which internal use btrfs and/or sell that to customers (suse?). 
I'm afraid what devs afraid to change internal behaviour without 100% confident that it will be better. Thanks! > On 11/12/2018 07:58 PM, Timofey Titovets wrote: > > From: Timofey Titovets > > > > Currently btrfs raid1/10 balancer bаlance requests to mirrors, > > based on pid % num of mirrors. > > > > Make logic understood: > > - if one of underline devices are non rotational > > - Queue length to underline devices > > > > By default try use pid % num_mirrors guessing, but: > > - If one of mirrors are non rotational, repick optimal to it > > - If underline mirror have less queue length then optimal, > > repick to that mirror > > > > For avoid round-robin request balancing, > > lets round down queue length: > > - By 8 for rotational devs > > - By 2 for all non rotational devs > > > > Some bench results from mail list > > (Dmitrii Tcvetkov ): > > Benchmark summary (arithmetic mean of 3 runs): > > Mainline Patch > > > > RAID1 | 18.9 MiB/s | 26.5 MiB/s &
Re: Better data balancing over multiple devices for raid1/10 (read)
вт, 18 дек. 2018 г. в 19:43, Zdenek Kaspar : > > Hello, regarding Wiki/Idea pool: > Better data balancing over multiple devices for raid1/10 (read) > > By looking at [1] (which I currently use) and [2, 3, 4] > is there agreement to solve this? > > [1] https://patchwork.kernel.org/patch/10681671/ > [2] https://patchwork.kernel.org/patch/10403299/ > [3] https://patchwork.kernel.org/patch/10403303/ > [4] https://patchwork.kernel.org/patch/10403301/ > > TIA, Z. I think none of that, because moreover that patches try solve different problems from my point of view. [1] Just try to squeeze more iops from same hardware by simple playing around with some rules of queueing theory. [2..4] Try add some duct tape to control over raid1 mirror read for automatization and testing purposes (i'm really prefer use sysfs knobs for that type of things). If you try to read some mailing list conversations about above patches, you will see what currently we have no consensus about that solutions. With no consensus with some majority of devs no progress will be made. In theory that will be cool to do something similar to mdraid1, where guys implement internal scheduling policy, to help balance request over more complex and smart underline disk io schedulers. That solution can get consensus from devs POV. But currently no one work on that, so I guess what in the near future no agreement will be made. -- Have a nice day, Timofey.
[PATCH V8] Btrfs: enhance raid1/10 balance heuristic
From: Timofey Titovets Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue length to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue length then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue length: - By 8 for rotational devs - By 2 for all non rotational devs Some bench results from mail list (Dmitrii Tcvetkov ): Benchmark summary (arithmetic mean of 3 runs): Mainline Patch RAID1 | 18.9 MiB/s | 26.5 MiB/s RAID10 | 30.7 MiB/s | 30.7 MiB/s mainline, fio got lucky to read from first HDD (quite slow HDD): Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS] read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec) lat (msec): min=2, max=825, avg=60.17, stdev=65.06 mainline, fio got lucky to read from second HDD (much more modern): Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS] read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec) lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56 mainline, fio got lucky to read from an SSD: Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS] read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec) lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36 With the patch, 2 HDDs: Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS] read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec) lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14 With the patch, HDD(old one)+SSD: Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS] read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec) lat (usec): min=363, max=346752, avg=1381.73, stdev=6948.32 Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue length - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded v2 -> v3: - Fix arg for RAID10 - use sub_stripes, instead of num_stripes v3 -> v4: - Rebased on latest misc-next v4 -> v5: - Rebased on latest misc-next v5 -> v6: - Fix spelling - Include bench results v6 -> v7: - Fixes based on Nikolay Borisov review: * Assume num == 2 * Remove "for" loop based on that assumption, where possible v7 -> v8: - Add comment about magic '2' num in guess function Signed-off-by: Timofey Titovets Tested-by: Dmitrii Tcvetkov Reviewed-by: Dmitrii Tcvetkov --- block/genhd.c | 1 + fs/btrfs/volumes.c | 104 - 2 files changed, 104 insertions(+), 1 deletion(-) diff --git a/block/genhd.c b/block/genhd.c index cff6bdf27226..4ba5ede8969e 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f435d397019e..d9b5cf31514a 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -28,6 +29,8 @@ #include "dev-replace.h" #include "sysfs.h" +#define BTRFS_RAID_1_10_MAX_MIRRORS 2 + const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = { [BTRFS_RAID_RAID10] 
= { .sub_stripes= 2, @@ -5166,6 +5169,104 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue length of bdev + * + * @bdev: target bdev + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + +
Re: [PATCH v2] btrfs: add zstd compression level support
вт, 13 нояб. 2018 г. в 04:52, Nick Terrell : > > > > > On Nov 12, 2018, at 4:33 PM, David Sterba wrote: > > > > On Wed, Oct 31, 2018 at 11:11:08AM -0700, Nick Terrell wrote: > >> From: Jennifer Liu > >> > >> Adds zstd compression level support to btrfs. Zstd requires > >> different amounts of memory for each level, so the design had > >> to be modified to allow set_level() to allocate memory. We > >> preallocate one workspace of the maximum size to guarantee > >> forward progress. This feature is expected to be useful for > >> read-mostly filesystems, or when creating images. > >> > >> Benchmarks run in qemu on Intel x86 with a single core. > >> The benchmark measures the time to copy the Silesia corpus [0] to > >> a btrfs filesystem 10 times, then read it back. > >> > >> The two important things to note are: > >> - The decompression speed and memory remains constant. > >> The memory required to decompress is the same as level 1. > >> - The compression speed and ratio will vary based on the source. > >> > >> LevelRatio Compression Decompression Compression Memory > >> 12.59153 MB/s112 MB/s0.8 MB > >> 22.67136 MB/s113 MB/s1.0 MB > >> 32.72106 MB/s115 MB/s1.3 MB > >> 42.7886 MB/s109 MB/s0.9 MB > >> 52.8369 MB/s109 MB/s1.4 MB > >> 62.8953 MB/s110 MB/s1.5 MB > >> 72.9140 MB/s112 MB/s1.4 MB > >> 82.9234 MB/s110 MB/s1.8 MB > >> 92.9327 MB/s109 MB/s1.8 MB > >> 10 2.9422 MB/s109 MB/s1.8 MB > >> 11 2.9517 MB/s114 MB/s1.8 MB > >> 12 2.9513 MB/s113 MB/s1.8 MB > >> 13 2.9510 MB/s111 MB/s2.3 MB > >> 14 2.997 MB/s110 MB/s2.6 MB > >> 15 3.036 MB/s110 MB/s2.6 MB > >> > >> [0] > >> https://urldefense.proofpoint.com/v2/url?u=http-3A__sun.aei.polsl.pl_-7Esdeor_index.php-3Fpage-3Dsilesia&d=DwIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=HQM5IQdWOB8WaMoii2dYTw&m=5LQRTUqZnx_a8dGSa5bGsd0Fm4ejQQOcH50wi7nRewY&s=gFUm-SA3aeQI7PBe3zmxUuxk4AEEZegB0cRsbjWUToo&e= > >> > >> Signed-off-by: Jennifer Liu > >> Signed-off-by: Nick Terrell > >> Reviewed-by: Omar Sandoval > >> --- > >> v1 -> v2: > >> - Don't reflow the unchanged line. > >> > >> fs/btrfs/compression.c | 169 + > >> fs/btrfs/compression.h | 18 +++-- > >> fs/btrfs/lzo.c | 5 +- > >> fs/btrfs/super.c | 7 +- > >> fs/btrfs/zlib.c| 33 > >> fs/btrfs/zstd.c| 74 +- > >> 6 files changed, 202 insertions(+), 104 deletions(-) > >> > >> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > >> index 2955a4ea2fa8..b46652cb653e 100644 > >> --- a/fs/btrfs/compression.c > >> +++ b/fs/btrfs/compression.c > >> @@ -822,9 +822,12 @@ void __init btrfs_init_compress(void) > >> > >> /* > >> * Preallocate one workspace for each compression type so > >> - * we can guarantee forward progress in the worst case > >> + * we can guarantee forward progress in the worst case. > >> + * Provide the maximum compression level to guarantee large > >> + * enough workspace. > >> */ > >> -workspace = btrfs_compress_op[i]->alloc_workspace(); > >> +workspace = btrfs_compress_op[i]->alloc_workspace( > >> +btrfs_compress_op[i]->max_level); > > We provide the max level here, so we have at least one workspace per > compression type that is large enough. 
> > >> if (IS_ERR(workspace)) { > >> pr_warn("BTRFS: cannot preallocate compression > >> workspace, will try later\n"); > >> } else { > >> @@ -835,23 +838,78 @@ void __init btrfs_init_compress(void) > >> } > >> } > >> > >> +/* > >> + * put a workspace struct back on the list or free it if we have enough > >> + * idle ones sitting around > >> + */ > >> +static void __free_workspace(int type, struct list_head *workspace, > >> + bool heuristic) > >> +{ > >> +int idx = type - 1; > >> +struct list_head *idle_ws; > >> +spinlock_t *ws_lock; > >> +atomic_t *total_ws; > >> +wait_queue_head_t *ws_wait; > >> +int *free_ws; > >> + > >> +if (heuristic) { > >> +idle_ws = &btrfs_heuristic_ws.idle_ws; > >> +ws_lock = &btrfs_heuristic_ws.ws_lock; > >> +total_ws = &btrfs_heuristic_ws.total_ws; > >> +ws_wait = &btrfs_heuristic_ws.ws_wait; > >> +free_ws = &btrfs_heuristic_ws.free_ws; > >> +} else { > >> +idle_ws = &btrfs_comp_ws[idx].idle_ws; > >> +ws_lock = &btrfs_comp_ws[id
[PATCH V7] Btrfs: enhance raid1/10 balance heuristic
From: Timofey Titovets Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue length to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue length then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue length: - By 8 for rotational devs - By 2 for all non rotational devs Some bench results from mail list (Dmitrii Tcvetkov ): Benchmark summary (arithmetic mean of 3 runs): Mainline Patch RAID1 | 18.9 MiB/s | 26.5 MiB/s RAID10 | 30.7 MiB/s | 30.7 MiB/s mainline, fio got lucky to read from first HDD (quite slow HDD): Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS] read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec) lat (msec): min=2, max=825, avg=60.17, stdev=65.06 mainline, fio got lucky to read from second HDD (much more modern): Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS] read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec) lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56 mainline, fio got lucky to read from an SSD: Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS] read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec) lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36 With the patch, 2 HDDs: Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS] read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec) lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14 With the patch, HDD(old one)+SSD: Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS] read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec) lat (usec): min=363, max=346752, avg=1381.73, stdev=6948.32 Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue length - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded v2 -> v3: - Fix arg for RAID10 - use sub_stripes, instead of num_stripes v3 -> v4: - Rebased on latest misc-next v4 -> v5: - Rebased on latest misc-next v5 -> v6: - Fix spelling - Include bench results v6 -> v7: - Fixes based on Nikolay Borisov review: * Assume num == 2 * Remove "for" loop based on that assumption, where possible * No functional changes Signed-off-by: Timofey Titovets Tested-by: Dmitrii Tcvetkov Reviewed-by: Dmitrii Tcvetkov --- block/genhd.c | 1 + fs/btrfs/volumes.c | 100 - 2 files changed, 100 insertions(+), 1 deletion(-) diff --git a/block/genhd.c b/block/genhd.c index be5bab20b2ab..939f0c6a2d79 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f4405e430da6..a6632cc2bfab 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -5159,6 +5160,102 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue length of bdev + * + * @bdev: target bdev + * 
@round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + + sum = max_t(uint32_t, inflight[0], inflight[1]); + + /* +* Try prevent switch for every sneeze +* By roundup output num by some value +*/ + return ALIGN_DOWN(sum, round_down); +} + +/** + * guess_optimal - return guessed optimal mirror + * + * Optimal expected to be pid % num_strip
Re: [PATCH V6] Btrfs: enhance raid1/10 balance heuristic
пн, 12 нояб. 2018 г. в 10:28, Nikolay Borisov : > > > > On 25.09.18 г. 21:38 ч., Timofey Titovets wrote: > > Currently btrfs raid1/10 balancer bаlance requests to mirrors, > > based on pid % num of mirrors. > > > > Make logic understood: > > - if one of underline devices are non rotational > > - Queue length to underline devices > > > > By default try use pid % num_mirrors guessing, but: > > - If one of mirrors are non rotational, repick optimal to it > > - If underline mirror have less queue length then optimal, > >repick to that mirror > > > > For avoid round-robin request balancing, > > lets round down queue length: > > - By 8 for rotational devs > > - By 2 for all non rotational devs > > > > Some bench results from mail list > > (Dmitrii Tcvetkov ): > > Benchmark summary (arithmetic mean of 3 runs): > > Mainline Patch > > > > RAID1 | 18.9 MiB/s | 26.5 MiB/s > > RAID10 | 30.7 MiB/s | 30.7 MiB/s > > > > mainline, fio got lucky to read from first HDD (quite slow HDD): > > Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS] > > read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec) > > lat (msec): min=2, max=825, avg=60.17, stdev=65.06 > > > > mainline, fio got lucky to read from second HDD (much more modern): > > Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS] > > read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec) > > lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56 > > > > mainline, fio got lucky to read from an SSD: > > Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS] > > read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec) > > lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36 > > > > With the patch, 2 HDDs: > > Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS] > > read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec) > > lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14 > > > > With the patch, HDD(old one)+SSD: > > Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS] > > read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec) > > lat (usec): min=363, max=346752, avg=1381.73, stdev=6948.32 > > > > Changes: > > v1 -> v2: > > - Use helper part_in_flight() from genhd.c > > to get queue length > > - Move guess code to guess_optimal() > > - Change balancer logic, try use pid % mirror by default > > Make balancing on spinning rust if one of underline devices > > are overloaded > > v2 -> v3: > > - Fix arg for RAID10 - use sub_stripes, instead of num_stripes > > v3 -> v4: > > - Rebased on latest misc-next > > v4 -> v5: > > - Rebased on latest misc-next > > v5 -> v6: > > - Fix spelling > > - Include bench results > > > > Signed-off-by: Timofey Titovets > > Tested-by: Dmitrii Tcvetkov > > Reviewed-by: Dmitrii Tcvetkov > > --- > > block/genhd.c | 1 + > > fs/btrfs/volumes.c | 111 - > > 2 files changed, 110 insertions(+), 2 deletions(-) > > > > diff --git a/block/genhd.c b/block/genhd.c > > index 9656f9e9f99e..5ea5acc88d3c 100644 > > --- a/block/genhd.c > > +++ b/block/genhd.c > > @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct > > hd_struct *part, > > atomic_read(&part->in_flight[1]); > > } > > } > > +EXPORT_SYMBOL_GPL(part_in_flight); > > > > void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, > > unsigned int inflight[2]) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > index c95af358b71f..fa7dd6ac087f 100644 > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -16,6 +16,7 @@ > > #include > > 
#include > > #include > > +#include > > #include > > #include "ctree.h" > > #include "extent_map.h" > > @@ -5201,6 +5202,111 @@ int btrfs_i
Re: [PATCH V6] Btrfs: enhance raid1/10 balance heuristic
Gentle ping. вт, 25 сент. 2018 г. в 21:38, Timofey Titovets : > > Currently btrfs raid1/10 balancer bаlance requests to mirrors, > based on pid % num of mirrors. > > Make logic understood: > - if one of underline devices are non rotational > - Queue length to underline devices > > By default try use pid % num_mirrors guessing, but: > - If one of mirrors are non rotational, repick optimal to it > - If underline mirror have less queue length then optimal, >repick to that mirror > > For avoid round-robin request balancing, > lets round down queue length: > - By 8 for rotational devs > - By 2 for all non rotational devs > > Some bench results from mail list > (Dmitrii Tcvetkov ): > Benchmark summary (arithmetic mean of 3 runs): > Mainline Patch > > RAID1 | 18.9 MiB/s | 26.5 MiB/s > RAID10 | 30.7 MiB/s | 30.7 MiB/s > > mainline, fio got lucky to read from first HDD (quite slow HDD): > Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS] > read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec) > lat (msec): min=2, max=825, avg=60.17, stdev=65.06 > > mainline, fio got lucky to read from second HDD (much more modern): > Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS] > read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec) > lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56 > > mainline, fio got lucky to read from an SSD: > Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS] > read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec) > lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36 > > With the patch, 2 HDDs: > Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS] > read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec) > lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14 > > With the patch, HDD(old one)+SSD: > Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS] > read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec) > lat (usec): min=363, max=346752, avg=1381.73, stdev=6948.32 > > Changes: > v1 -> v2: > - Use helper part_in_flight() from genhd.c > to get queue length > - Move guess code to guess_optimal() > - Change balancer logic, try use pid % mirror by default > Make balancing on spinning rust if one of underline devices > are overloaded > v2 -> v3: > - Fix arg for RAID10 - use sub_stripes, instead of num_stripes > v3 -> v4: > - Rebased on latest misc-next > v4 -> v5: > - Rebased on latest misc-next > v5 -> v6: > - Fix spelling > - Include bench results > > Signed-off-by: Timofey Titovets > Tested-by: Dmitrii Tcvetkov > Reviewed-by: Dmitrii Tcvetkov > --- > block/genhd.c | 1 + > fs/btrfs/volumes.c | 111 - > 2 files changed, 110 insertions(+), 2 deletions(-) > > diff --git a/block/genhd.c b/block/genhd.c > index 9656f9e9f99e..5ea5acc88d3c 100644 > --- a/block/genhd.c > +++ b/block/genhd.c > @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct > hd_struct *part, > atomic_read(&part->in_flight[1]); > } > } > +EXPORT_SYMBOL_GPL(part_in_flight); > > void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, >unsigned int inflight[2]) > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index c95af358b71f..fa7dd6ac087f 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -16,6 +16,7 @@ > #include > #include > #include > +#include > #include > #include "ctree.h" > #include "extent_map.h" > @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info > *fs_info, u64 logical, u64 len) > return ret; > } > > +/** > + * 
bdev_get_queue_len - return rounded down in flight queue length of bdev > + * > + * @bdev: target bdev > + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 > + */ > +static int bdev_get_queue_len(struct block_device *bdev, int round_down) > +{ > + int sum; > + struct hd_struct *bd_part = bdev->bd_part; > + struct request_queue *rq = bdev_get_queue(
Re: [PATCH v15.1 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree
> 0) { > + /* > +* We only keep one hash in tree to save memory, so if > +* hash conflicts, free the one to insert. > +*/ > + rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root); > + kfree(ihash); > + ret = 0; > + goto out; > + } > + > + list_add(&ihash->lru_list, &dedupe_info->lru_list); > + dedupe_info->current_nr++; > + > + /* Remove the last dedupe hash if we exceed limit */ > + while (dedupe_info->current_nr > dedupe_info->limit_nr) { > + struct inmem_hash *last; > + > + last = list_entry(dedupe_info->lru_list.prev, > + struct inmem_hash, lru_list); > + __inmem_del(dedupe_info, last); > + } > +out: > + mutex_unlock(&dedupe_info->lock); > + return 0; > +} > + > +int btrfs_dedupe_add(struct btrfs_fs_info *fs_info, > +struct btrfs_dedupe_hash *hash) > +{ > + struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info; > + > + if (!fs_info->dedupe_enabled || !hash) > + return 0; > + > + if (WARN_ON(dedupe_info == NULL)) > + return -EINVAL; > + > + if (WARN_ON(!btrfs_dedupe_hash_hit(hash))) > + return -EINVAL; > + > + /* ignore old hash */ > + if (dedupe_info->blocksize != hash->num_bytes) > + return 0; > + > + if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) > + return inmem_add(dedupe_info, hash); > + return -EINVAL; > +} > -- > 2.19.1 > > > Reviewed-by: Timofey Titovets Thanks. -- Have a nice day, Timofey.
Re: [PATCH v15.1 02/13] btrfs: dedupe: Introduce function to initialize dedupe info
Backend specific check */ > + if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) { > + /* only one limit is accepted for enable*/ > + if (dargs->limit_nr && dargs->limit_mem) { > + dargs->limit_nr = 0; > + dargs->limit_mem = 0; > + return -EINVAL; > + } > + > + if (!limit_nr && !limit_mem) > + dargs->limit_nr = BTRFS_DEDUPE_LIMIT_NR_DEFAULT; > + else { > + u64 tmp = (u64)-1; > + > + if (limit_mem) { > + tmp = div_u64(limit_mem, > + (sizeof(struct inmem_hash)) + > + btrfs_hash_sizes[hash_algo]); > + /* Too small limit_mem to fill a hash item */ > + if (!tmp) { > + dargs->limit_mem = 0; > + dargs->limit_nr = 0; > + return -EINVAL; > + } > + } > + if (!limit_nr) > + limit_nr = (u64)-1; > + > + dargs->limit_nr = min(tmp, limit_nr); > + } > + } > + if (backend == BTRFS_DEDUPE_BACKEND_ONDISK) > + dargs->limit_nr = 0; > + > + return 0; > +} > + > +int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, > + struct btrfs_ioctl_dedupe_args *dargs) > +{ > + struct btrfs_dedupe_info *dedupe_info; > + int ret = 0; > + > + ret = check_dedupe_parameter(fs_info, dargs); > + if (ret < 0) > + return ret; > + > + dedupe_info = fs_info->dedupe_info; > + if (dedupe_info) { > + /* Check if we are re-enable for different dedupe config */ > + if (dedupe_info->blocksize != dargs->blocksize || > + dedupe_info->hash_algo != dargs->hash_algo || > + dedupe_info->backend != dargs->backend) { > + btrfs_dedupe_disable(fs_info); > + goto enable; > + } > + > + /* On-fly limit change is OK */ > + mutex_lock(&dedupe_info->lock); > + fs_info->dedupe_info->limit_nr = dargs->limit_nr; > + mutex_unlock(&dedupe_info->lock); > + return 0; > + } > + > +enable: > + dedupe_info = init_dedupe_info(dargs); > + if (IS_ERR(dedupe_info)) > + return PTR_ERR(dedupe_info); > + fs_info->dedupe_info = dedupe_info; > + /* We must ensure dedupe_bs is modified after dedupe_info */ > + smp_wmb(); > + fs_info->dedupe_enabled = 1; > + return ret; > +} > + > +int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info) > +{ > + /* Place holder for bisect, will be implemented in later patches */ > + return 0; > +} > diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h > index 222ce7b4d827..87f5b7ce7766 100644 > --- a/fs/btrfs/dedupe.h > +++ b/fs/btrfs/dedupe.h > @@ -52,6 +52,18 @@ static inline int btrfs_dedupe_hash_hit(struct > btrfs_dedupe_hash *hash) > return (hash && hash->bytenr); > } > > +static inline int btrfs_dedupe_hash_size(u16 algo) > +{ > + if (WARN_ON(algo >= ARRAY_SIZE(btrfs_hash_sizes))) > + return -EINVAL; > + return sizeof(struct btrfs_dedupe_hash) + btrfs_hash_sizes[algo]; > +} > + > +static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 algo) > +{ > + return kzalloc(btrfs_dedupe_hash_size(algo), GFP_NOFS); > +} > + > /* > * Initial inband dedupe info > * Called at dedupe enable time. > diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h > index 9cd15d2a40aa..ba879ac931f2 100644 > --- a/include/uapi/linux/btrfs.h > +++ b/include/uapi/linux/btrfs.h > @@ -683,6 +683,9 @@ struct btrfs_ioctl_get_dev_stats { > /* Hash algorithm, only support SHA256 yet */ > #define BTRFS_DEDUPE_HASH_SHA256 0 > > +/* Default dedupe limit on number of hash */ > +#define BTRFS_DEDUPE_LIMIT_NR_DEFAULT (32 * 1024) > + > /* > * This structure is used for dedupe enable/disable/configure > * and status ioctl. > -- > 2.19.1 > > > Reviewed-by: Timofey Titovets Thanks. -- Have a nice day, Timofey.
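As a worked example of the limit computation quoted above: if neither limit is given, limit_nr falls back to BTRFS_DEDUPE_LIMIT_NR_DEFAULT = 32 * 1024 = 32768 hashes. If only limit_mem is given, limit_nr becomes limit_mem / (sizeof(struct inmem_hash) + hash size); assuming roughly 96 bytes per entry for the bookkeeping struct plus a 32-byte SHA-256 digest (an illustrative figure, the real size depends on the struct layout), limit_mem = 16 MiB maps to about 16777216 / 96 ≈ 174762 in-memory hashes. Since passing both limits is rejected with -EINVAL, the final min(tmp, limit_nr) simply selects whichever limit was actually supplied.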
Re: [PATCH v15.1 01/13] btrfs: dedupe: Introduce dedupe framework and its header
eturn <0 for any error > + * (tree operation error for some backends) > + */ > +int btrfs_dedupe_search(struct btrfs_fs_info *fs_info, > + struct inode *inode, u64 file_pos, > + struct btrfs_dedupe_hash *hash); > + > +/* > + * Add a dedupe hash into dedupe info > + * Return 0 for success > + * Return <0 for any error > + * (tree operation error for some backends) > + */ > +int btrfs_dedupe_add(struct btrfs_fs_info *fs_info, > +struct btrfs_dedupe_hash *hash); > + > +/* > + * Remove a dedupe hash from dedupe info > + * Return 0 for success > + * Return <0 for any error > + * (tree operation error for some backends) > + * > + * NOTE: if hash deletion error is not handled well, it will lead > + * to corrupted fs, as later dedupe write can points to non-exist or even > + * wrong extent. > + */ > +int btrfs_dedupe_del(struct btrfs_fs_info *fs_info, u64 bytenr); > #endif > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index b0ab41da91d1..d1fa9d90cc8f 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -2678,6 +2678,7 @@ int open_ctree(struct super_block *sb, > mutex_init(&fs_info->reloc_mutex); > mutex_init(&fs_info->delalloc_root_mutex); > mutex_init(&fs_info->cleaner_delayed_iput_mutex); > + mutex_init(&fs_info->dedupe_ioctl_lock); > seqlock_init(&fs_info->profiles_lock); > > INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); > diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h > index 5ca1d21fc4a7..9cd15d2a40aa 100644 > --- a/include/uapi/linux/btrfs.h > +++ b/include/uapi/linux/btrfs.h > @@ -20,6 +20,7 @@ > #ifndef _UAPI_LINUX_BTRFS_H > #define _UAPI_LINUX_BTRFS_H > #include > +#include > #include > > #define BTRFS_IOCTL_MAGIC 0x94 > @@ -667,6 +668,39 @@ struct btrfs_ioctl_get_dev_stats { > __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ > }; > > +/* In-band dedupe related */ > +#define BTRFS_DEDUPE_BACKEND_INMEMORY 0 > +#define BTRFS_DEDUPE_BACKEND_ONDISK1 > + > +/* Only support inmemory yet, so count is still only 1 */ > +#define BTRFS_DEDUPE_BACKEND_COUNT 1 > + > +/* Dedup block size limit and default value */ > +#define BTRFS_DEDUPE_BLOCKSIZE_MAX SZ_8M > +#define BTRFS_DEDUPE_BLOCKSIZE_MIN SZ_16K > +#define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT SZ_128K > + > +/* Hash algorithm, only support SHA256 yet */ > +#define BTRFS_DEDUPE_HASH_SHA256 0 > + > +/* > + * This structure is used for dedupe enable/disable/configure > + * and status ioctl. > + * Reserved range should be set to 0xff. > + */ > +struct btrfs_ioctl_dedupe_args { > + __u16 cmd; /* In: command */ > + __u64 blocksize;/* In/Out: blocksize */ > + __u64 limit_nr; /* In/Out: limit nr for inmem backend */ > + __u64 limit_mem;/* In/Out: limit mem for inmem backend */ > + __u64 current_nr; /* Out: current hash nr */ > + __u16 backend; /* In/Out: current backend */ > + __u16 hash_algo;/* In/Out: hash algorithm */ > + u8 status; /* Out: enabled or disabled */ > + u8 flags; /* In: special flags for ioctl */ > + u8 __unused[472]; /* Pad to 512 bytes */ > +}; > + > #define BTRFS_QUOTA_CTL_ENABLE 1 > #define BTRFS_QUOTA_CTL_DISABLE2 > #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED3 > -- > 2.19.1 Reviewed-by: Timofey Titovets Thanks. -- Have a nice day, Timofey.
Re: [PATCH v2] btrfs: add zstd compression level support
s, > unsigned char *data_in, > return ret; > } > > -static void zlib_set_level(struct list_head *ws, unsigned int type) > -{ > - struct workspace *workspace = list_entry(ws, struct workspace, list); > - unsigned level = (type & 0xF0) >> 4; > - > - if (level > 9) > - level = 9; > - > - workspace->level = level > 0 ? level : 3; > -} > - > const struct btrfs_compress_op btrfs_zlib_compress = { > .alloc_workspace= zlib_alloc_workspace, > .free_workspace = zlib_free_workspace, > .compress_pages = zlib_compress_pages, > .decompress_bio = zlib_decompress_bio, > .decompress = zlib_decompress, > - .set_level = zlib_set_level, > + .set_level = zlib_set_level, > + .max_level = BTRFS_ZLIB_MAX_LEVEL, > + .default_level = BTRFS_ZLIB_DEFAULT_LEVEL, > }; > diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c > index af6ec59972f5..e5d7c2eae65c 100644 > --- a/fs/btrfs/zstd.c > +++ b/fs/btrfs/zstd.c > @@ -19,12 +19,13 @@ > > #define ZSTD_BTRFS_MAX_WINDOWLOG 17 > #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG) > -#define ZSTD_BTRFS_DEFAULT_LEVEL 3 > +#define BTRFS_ZSTD_DEFAULT_LEVEL 3 > +#define BTRFS_ZSTD_MAX_LEVEL 15 > > -static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len) > +static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len, > +unsigned int level) > { > - ZSTD_parameters params = ZSTD_getParams(ZSTD_BTRFS_DEFAULT_LEVEL, > - src_len, 0); > + ZSTD_parameters params = ZSTD_getParams(level, src_len, 0); > > if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG) > params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG; > @@ -37,10 +38,25 @@ struct workspace { > size_t size; > char *buf; > struct list_head list; > + unsigned int level; > ZSTD_inBuffer in_buf; > ZSTD_outBuffer out_buf; > }; > > +static bool zstd_reallocate_mem(struct workspace *workspace, int size) > +{ > + void *new_mem; > + > + new_mem = kvmalloc(size, GFP_KERNEL); > + if (new_mem) { > + kvfree(workspace->mem); > + workspace->mem = new_mem; > + workspace->size = size; > + return true; > + } > + return false; > +} > + > static void zstd_free_workspace(struct list_head *ws) > { > struct workspace *workspace = list_entry(ws, struct workspace, list); > @@ -50,10 +66,34 @@ static void zstd_free_workspace(struct list_head *ws) > kfree(workspace); > } > > -static struct list_head *zstd_alloc_workspace(void) > +static bool zstd_set_level(struct list_head *ws, unsigned int level) > +{ > + struct workspace *workspace = list_entry(ws, struct workspace, list); > + ZSTD_parameters params; > + int size; > + > + if (level > BTRFS_ZSTD_MAX_LEVEL) > + level = BTRFS_ZSTD_MAX_LEVEL; > + > + if (level == 0) > + level = BTRFS_ZSTD_DEFAULT_LEVEL; > + > + params = ZSTD_getParams(level, ZSTD_BTRFS_MAX_INPUT, 0); > + size = max_t(size_t, > + ZSTD_CStreamWorkspaceBound(params.cParams), > + ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT)); > + if (size > workspace->size) { > + if (!zstd_reallocate_mem(workspace, size)) > + return false; > + } > + workspace->level = level; > + return true; > +} > + > +static struct list_head *zstd_alloc_workspace(unsigned int level) > { > ZSTD_parameters params = > - zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT); > + zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT, > level); > struct workspace *workspace; > > workspace = kzalloc(sizeof(*workspace), GFP_KERNEL); > @@ -69,6 +109,7 @@ static struct list_head *zstd_alloc_workspace(void) > goto fail; > > INIT_LIST_HEAD(&workspace->list); > + zstd_set_level(&workspace->list, level); > > return &workspace->list; > fail: > @@ -95,7 +136,8 @@ static int 
zstd_compress_pages(struct list_head *ws, > unsigned long len = *total_out; > const unsigned long nr_dest_pages = *out_pages; > unsigned long max_out = nr_dest_pages * PAGE_SIZE; > - ZSTD_parameters params = zstd_get_btrfs_parameters(len); > + ZSTD_parameters params = zstd_get_btrfs_parameters(len, > + workspace->level); > > *out_pages = 0; > *total_out = 0; > @@ -419,15 +461,13 @@ static int zstd_decompress(struct list_head *ws, > unsigned char *data_in, > return ret; > } > > -static void zstd_set_level(struct list_head *ws, unsigned int type) > -{ > -} > - > const struct btrfs_compress_op btrfs_zstd_compress = { > - .alloc_workspace = zstd_alloc_workspace, > - .free_workspace = zstd_free_workspace, > - .compress_pages = zstd_compress_pages, > - .decompress_bio = zstd_decompress_bio, > - .decompress = zstd_decompress, > - .set_level = zstd_set_level, > + .alloc_workspace= zstd_alloc_workspace, > + .free_workspace = zstd_free_workspace, > + .compress_pages = zstd_compress_pages, > + .decompress_bio = zstd_decompress_bio, > + .decompress = zstd_decompress, > + .set_level = zstd_set_level, > + .max_level = BTRFS_ZSTD_MAX_LEVEL, > + .default_level = BTRFS_ZSTD_DEFAULT_LEVEL, > }; > -- > 2.17.1 Reviewed-by: Timofey Titovets You didn't mention, so: Did you test compression ratio/performance with compress-force or just compress? Thanks. -- Have a nice day, Timofey.
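For context, the user-facing knob for this series is the compression mount option with a per-algorithm level. The colon syntax below is the form that was being discussed for zstd levels and is shown as an assumption about how the patch would be exercised, not something quoted from it; levels 1-15 correspond to the benchmark table earlier in the thread:

    mount -o compress=zstd:3 /dev/sdX /mnt      # default level, low memory
    mount -o remount,compress=zstd:15 /mnt      # max level in this series, for read-mostly data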
Re: [PATCH RESEND] Btrfs: make should_defrag_range() understand compressed extents
вт, 18 сент. 2018 г. в 13:09, Timofey Titovets : > > From: Timofey Titovets > > Both, defrag ioctl and autodefrag - call btrfs_defrag_file() > for file defragmentation. > > Kernel default target extent size - 256KiB. > Btrfs progs default - 32MiB. > > Both bigger then maximum size of compressed extent - 128KiB. > That lead to rewrite all compressed data on disk. > > Fix that by check compression extents with different logic. > > As addition, make should_defrag_range() understood compressed extent type, > if requested target compression are same as current extent compression type. > Just don't recompress/rewrite extents. > To avoid useless recompression of compressed extents. > > Signed-off-by: Timofey Titovets > --- > fs/btrfs/ioctl.c | 28 +--- > 1 file changed, 25 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index a990a9045139..0a5ea1ccc89d 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -1142,7 +1142,7 @@ static bool defrag_check_next_extent(struct inode > *inode, struct extent_map *em) > > static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, >u64 *last_len, u64 *skip, u64 *defrag_end, > - int compress) > + int compress, int compress_type) > { > struct extent_map *em; > int ret = 1; > @@ -1177,8 +1177,29 @@ static int should_defrag_range(struct inode *inode, > u64 start, u32 thresh, > * real extent, don't bother defragging it > */ > if (!compress && (*last_len == 0 || *last_len >= thresh) && > - (em->len >= thresh || (!next_mergeable && !prev_mergeable))) > + (em->len >= thresh || (!next_mergeable && !prev_mergeable))) { > ret = 0; > + goto out; > + } > + > + > + /* > +* Try not recompress compressed extents > +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to > +* recompress all compressed extents > +*/ > + if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) { > + if (!compress) { > + if (em->len == BTRFS_MAX_UNCOMPRESSED) > + ret = 0; > + } else { > + if (em->compress_type != compress_type) > + goto out; > + if (em->len == BTRFS_MAX_UNCOMPRESSED) > + ret = 0; > + } > + } > + > out: > /* > * last_len ends up being a counter of how many bytes we've defragged. > @@ -1477,7 +1498,8 @@ int btrfs_defrag_file(struct inode *inode, struct file > *file, > > if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, > extent_thresh, &last_len, &skip, > -&defrag_end, do_compress)){ > +&defrag_end, do_compress, > +compress_type)){ > unsigned long next; > /* > * the should_defrag function tells us how much to > skip > -- > 2.19.0 Ok, If no one like that patch, may be at least fix autodefarag on compressed files? By change default extent target size 256K -> 128K? -- Have a nice day, Timofey.
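Until a default change lands, the rewrite of already-compressed extents can be worked around from user space by capping the defrag target extent size at the 128 KiB compressed-extent limit, for example (illustrative invocation, path is a placeholder):

    btrfs filesystem defragment -r -t 128K /path/to/subvolume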
[PATCH V6] Btrfs: enhance raid1/10 balance heuristic
Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue length to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue length then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue length: - By 8 for rotational devs - By 2 for all non rotational devs Some bench results from mail list (Dmitrii Tcvetkov ): Benchmark summary (arithmetic mean of 3 runs): Mainline Patch RAID1 | 18.9 MiB/s | 26.5 MiB/s RAID10 | 30.7 MiB/s | 30.7 MiB/s mainline, fio got lucky to read from first HDD (quite slow HDD): Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS] read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec) lat (msec): min=2, max=825, avg=60.17, stdev=65.06 mainline, fio got lucky to read from second HDD (much more modern): Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS] read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec) lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56 mainline, fio got lucky to read from an SSD: Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS] read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec) lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36 With the patch, 2 HDDs: Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS] read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec) lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14 With the patch, HDD(old one)+SSD: Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS] read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec) lat (usec): min=363, max=346752, avg=1381.73, stdev=6948.32 Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue length - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded v2 -> v3: - Fix arg for RAID10 - use sub_stripes, instead of num_stripes v3 -> v4: - Rebased on latest misc-next v4 -> v5: - Rebased on latest misc-next v5 -> v6: - Fix spelling - Include bench results Signed-off-by: Timofey Titovets Tested-by: Dmitrii Tcvetkov Reviewed-by: Dmitrii Tcvetkov --- block/genhd.c | 1 + fs/btrfs/volumes.c | 111 - 2 files changed, 110 insertions(+), 2 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 9656f9e9f99e..5ea5acc88d3c 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c95af358b71f..fa7dd6ac087f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue length of bdev + * + * @bdev: target bdev + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct 
hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + + sum = max_t(uint32_t, inflight[0], inflight[1]); + + /* +* Try prevent switch for every sneeze +* By roundup output num by some value +*/ + return ALIGN_DOWN(sum, round_down); +} + +/** + * guess_optimal - return guessed optimal mirror + * + * Optimal expected to be pid % num_stripes + * + * That's generaly ok for spread load + * Add some balancer based on queue length to device + * + * Basic ideas: + * - Sequential read generate low amount of request + *so if load of drives are equal
Re: [PATCH V5 RESEND] Btrfs: enhance raid1/10 balance heuristic
Thu, 20 Sep 2018 at 12:05, Peter Becker:
>
> i like the idea.
> do you have any benchmarks for this change?
>
> the general logic looks good for me.

https://patchwork.kernel.org/patch/10137909/

> Tested-by: Dmitrii Tcvetkov
>
> Benchmark summary (arithmetic mean of 3 runs):
>          Mainline     Patch
> ------------------------------
> RAID1  | 18.9 MiB/s | 26.5 MiB/s
> RAID10 | 30.7 MiB/s | 30.7 MiB/s
>
> fio configuration:
> [global]
> ioengine=libaio
> buffered=0
> direct=1
> bssplit=32k/100
> size=8G
> directory=/mnt/
> iodepth=16
> time_based
> runtime=900
>
> [test-fio]
> rw=randread
>
> All tests were run on a 4-HDD btrfs filesystem in a VM with 4 GB
> of RAM on an idle host. Full results attached to the email.

Also: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg71758.html

- - -

So, IIRC, it works at least.
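To reproduce the numbers, the fio configuration above can be saved verbatim to a job file (the file name below is arbitrary) and run against a directory on the filesystem under test:

    fio raid1-randread.fio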
[PATCH V5 RESEND] Btrfs: enhance raid1/10 balance heuristic
From: Timofey Titovets Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue leght to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue leght then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue leght: - By 8 for rotational devs - By 2 for all non rotational devs Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue lenght - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded v2 -> v3: - Fix arg for RAID10 - use sub_stripes, instead of num_stripes v3 -> v4: - Rebased on latest misc-next v4 -> v5: - Rebased on latest misc-next Signed-off-by: Timofey Titovets --- block/genhd.c | 1 + fs/btrfs/volumes.c | 111 - 2 files changed, 110 insertions(+), 2 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 9656f9e9f99e..5ea5acc88d3c 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c95af358b71f..fa7dd6ac087f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev + * + * @bdev: target bdev + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + + sum = max_t(uint32_t, inflight[0], inflight[1]); + + /* +* Try prevent switch for every sneeze +* By roundup output num by some value +*/ + return ALIGN_DOWN(sum, round_down); +} + +/** + * guess_optimal - return guessed optimal mirror + * + * Optimal expected to be pid % num_stripes + * + * That's generaly ok for spread load + * Add some balancer based on queue leght to device + * + * Basic ideas: + * - Sequential read generate low amount of request + *so if load of drives are equal, use pid % num_stripes balancing + * - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal + *and repick if other dev have "significant" less queue lenght + * - Repick optimal if queue leght of other mirror are less + */ +static int guess_optimal(struct map_lookup *map, int num, int optimal) +{ + int i; + int round_down = 8; + int qlen[num]; + bool is_nonrot[num]; + bool all_bdev_nonrot = true; + bool all_bdev_rotate = true; + struct block_device *bdev; + + if (num == 1) + return optimal; + + /* Check accessible bdevs */ + for (i = 0; i < num; i++) { + /* Init for missing bdevs */ + is_nonrot[i] = false; + qlen[i] = INT_MAX; + bdev = map->stripes[i].dev->bdev; + if (bdev) { + qlen[i] = 0; + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev)); + if 
(is_nonrot[i]) + all_bdev_rotate = false; + else + all_bdev_nonrot = false; + } + } + + /* +* Don't bother with computation +* if only one of two bdevs are accessible +*/ + if (num == 2 && qlen[0] != qlen[1]) { + if (qlen[0] < qlen[1]) + return 0; + else + return 1; + } + + if (all_bdev_nonrot) + round_down = 2; + + for (i = 0; i < num; i++) { + if (qlen[i]) + continue; + bdev = map->stripes[i].dev->bdev; + qlen[i] = bdev_get_queue_len(bdev, round_down); + } + + /* For mixed case, pick non rotational dev as optimal
[PATCH RESEND] Btrfs: make should_defrag_range() understand compressed extents
From: Timofey Titovets Both, defrag ioctl and autodefrag - call btrfs_defrag_file() for file defragmentation. Kernel default target extent size - 256KiB. Btrfs progs default - 32MiB. Both bigger then maximum size of compressed extent - 128KiB. That lead to rewrite all compressed data on disk. Fix that by check compression extents with different logic. As addition, make should_defrag_range() understood compressed extent type, if requested target compression are same as current extent compression type. Just don't recompress/rewrite extents. To avoid useless recompression of compressed extents. Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 28 +--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index a990a9045139..0a5ea1ccc89d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1142,7 +1142,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em) static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, u64 *last_len, u64 *skip, u64 *defrag_end, - int compress) + int compress, int compress_type) { struct extent_map *em; int ret = 1; @@ -1177,8 +1177,29 @@ static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, * real extent, don't bother defragging it */ if (!compress && (*last_len == 0 || *last_len >= thresh) && - (em->len >= thresh || (!next_mergeable && !prev_mergeable))) + (em->len >= thresh || (!next_mergeable && !prev_mergeable))) { ret = 0; + goto out; + } + + + /* +* Try not recompress compressed extents +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to +* recompress all compressed extents +*/ + if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) { + if (!compress) { + if (em->len == BTRFS_MAX_UNCOMPRESSED) + ret = 0; + } else { + if (em->compress_type != compress_type) + goto out; + if (em->len == BTRFS_MAX_UNCOMPRESSED) + ret = 0; + } + } + out: /* * last_len ends up being a counter of how many bytes we've defragged. @@ -1477,7 +1498,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, extent_thresh, &last_len, &skip, -&defrag_end, do_compress)){ +&defrag_end, do_compress, +compress_type)){ unsigned long next; /* * the should_defrag function tells us how much to skip -- 2.19.0
Re: [PATCH V5] Btrfs: enhance raid1/10 balance heuristic
сб, 7 июл. 2018 г. в 18:24, Timofey Titovets : > > From: Timofey Titovets > > Currently btrfs raid1/10 balancer bаlance requests to mirrors, > based on pid % num of mirrors. > > Make logic understood: > - if one of underline devices are non rotational > - Queue leght to underline devices > > By default try use pid % num_mirrors guessing, but: > - If one of mirrors are non rotational, repick optimal to it > - If underline mirror have less queue leght then optimal, >repick to that mirror > > For avoid round-robin request balancing, > lets round down queue leght: > - By 8 for rotational devs > - By 2 for all non rotational devs > > Changes: > v1 -> v2: > - Use helper part_in_flight() from genhd.c > to get queue lenght > - Move guess code to guess_optimal() > - Change balancer logic, try use pid % mirror by default > Make balancing on spinning rust if one of underline devices > are overloaded > v2 -> v3: > - Fix arg for RAID10 - use sub_stripes, instead of num_stripes > v3 -> v4: > - Rebased on latest misc-next > v4 -> v5: > - Rebased on latest misc-next > > Signed-off-by: Timofey Titovets > --- > block/genhd.c | 1 + > fs/btrfs/volumes.c | 111 - > 2 files changed, 110 insertions(+), 2 deletions(-) > > diff --git a/block/genhd.c b/block/genhd.c > index 9656f9e9f99e..5ea5acc88d3c 100644 > --- a/block/genhd.c > +++ b/block/genhd.c > @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct > hd_struct *part, > atomic_read(&part->in_flight[1]); > } > } > +EXPORT_SYMBOL_GPL(part_in_flight); > > void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, >unsigned int inflight[2]) > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index c95af358b71f..fa7dd6ac087f 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -16,6 +16,7 @@ > #include > #include > #include > +#include > #include > #include "ctree.h" > #include "extent_map.h" > @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info > *fs_info, u64 logical, u64 len) > return ret; > } > > +/** > + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev > + * > + * @bdev: target bdev > + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 > + */ > +static int bdev_get_queue_len(struct block_device *bdev, int round_down) > +{ > + int sum; > + struct hd_struct *bd_part = bdev->bd_part; > + struct request_queue *rq = bdev_get_queue(bdev); > + uint32_t inflight[2] = {0, 0}; > + > + part_in_flight(rq, bd_part, inflight); > + > + sum = max_t(uint32_t, inflight[0], inflight[1]); > + > + /* > +* Try prevent switch for every sneeze > +* By roundup output num by some value > +*/ > + return ALIGN_DOWN(sum, round_down); > +} > + > +/** > + * guess_optimal - return guessed optimal mirror > + * > + * Optimal expected to be pid % num_stripes > + * > + * That's generaly ok for spread load > + * Add some balancer based on queue leght to device > + * > + * Basic ideas: > + * - Sequential read generate low amount of request > + *so if load of drives are equal, use pid % num_stripes balancing > + * - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal > + *and repick if other dev have "significant" less queue lenght > + * - Repick optimal if queue leght of other mirror are less > + */ > +static int guess_optimal(struct map_lookup *map, int num, int optimal) > +{ > + int i; > + int round_down = 8; > + int qlen[num]; > + bool is_nonrot[num]; > + bool all_bdev_nonrot = true; > + bool all_bdev_rotate = true; > + struct block_device *bdev; > + > + if (num == 1) > 
+ return optimal; > + > + /* Check accessible bdevs */ > + for (i = 0; i < num; i++) { > + /* Init for missing bdevs */ > + is_nonrot[i] = false; > + qlen[i] = INT_MAX; > + bdev = map->stripes[i].dev->bdev; > + if (bdev) { > + qlen[i] = 0; > + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev)); > + if (is_nonrot[i]) > + all_bdev_rotate = false; > + else > + all_bdev_no
Re: dduper - Offline btrfs deduplication tool
Fri, 24 Aug 2018 at 7:41, Lakshmipathi.G : > > Hi - > > dduper is an offline dedupe tool. Instead of reading whole file blocks and > computing checksums, it works by fetching checksums from the BTRFS csum tree. This > hugely improves the performance. > > dduper works like: > - Read csum for given two files. > - Find matching location. > - Pass the location to ioctl_ficlonerange directly > instead of ioctl_fideduperange > > By default, dduper adds a safety check to the above steps by creating a > backup reflink file and comparing the md5sum after dedupe. > If the backup file matches the new deduped file, then the backup file is > removed. You can skip this check by passing the --skip option. Here is > sample cli usage [1] and a quick demo [2] > > Some performance numbers: (with -skip option) > > Dedupe two 1GB files with same content - 1.2 seconds > Dedupe two 5GB files with same content - 8.2 seconds > Dedupe two 10GB files with same content - 13.8 seconds > > dduper requires the `btrfs inspect-internal dump-csum` command, you can use > this branch [3] or apply the patch yourself [4] > > [1] > https://gitlab.collabora.com/laks/btrfs-progs/blob/dump_csum/Documentation/dduper_usage.md > [2] http://giis.co.in/btrfs_dedupe.gif > [3] git clone https://gitlab.collabora.com/laks/btrfs-progs.git -b dump_csum > [4] https://patchwork.kernel.org/patch/10540229/ > > Please remember it's version-0.1, so test it out before you use dduper on > real data. > Let me know if you have suggestions, feedback or bugs :) > > Cheers. > Lakshmipathi.G > One question: Why not ioctl_fideduperange? I.e. by cloning directly you give up the main benefit of that ioctl - atomicity (the kernel verifies both ranges are identical while they are locked). -- Have a nice day, Timofey.
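For readers unfamiliar with the two ioctls being compared, the fragment below sketches the FIDEDUPERANGE path that the question refers to. It is illustrative userspace code, not dduper code; the file paths, offsets and the 128KiB length are assumptions, and the csum-matching step is assumed to have already produced the candidate range.

```
/*
 * Illustrative only, not dduper code: dedupe one candidate range via
 * FIDEDUPERANGE. The kernel locks both ranges, compares the bytes and
 * only then shares the extent; a plain FICLONERANGE call skips that
 * comparison. Paths, offsets and the 128KiB length are assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
	int src = open("/mnt/btrfs/a", O_RDONLY);
	int dst = open("/mnt/btrfs/b", O_RDWR);
	struct file_dedupe_range *arg;

	if (src < 0 || dst < 0)
		return 1;

	/* one destination range -> one file_dedupe_range_info entry */
	arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
	if (!arg)
		return 1;
	arg->src_offset = 0;
	arg->src_length = 128 * 1024;	/* candidate range found via csums */
	arg->dest_count = 1;
	arg->info[0].dest_fd = dst;
	arg->info[0].dest_offset = 0;

	if (ioctl(src, FIDEDUPERANGE, arg) < 0) {
		perror("FIDEDUPERANGE");
		return 1;
	}

	/* status is FILE_DEDUPE_RANGE_SAME or FILE_DEDUPE_RANGE_DIFFERS */
	printf("status %d, deduped %llu bytes\n", arg->info[0].status,
	       (unsigned long long)arg->info[0].bytes_deduped);
	return 0;
}
```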
[PATCH V5] Btrfs: enhance raid1/10 balance heuristic
From: Timofey Titovets Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue leght to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue leght then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue leght: - By 8 for rotational devs - By 2 for all non rotational devs Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue lenght - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded v2 -> v3: - Fix arg for RAID10 - use sub_stripes, instead of num_stripes v3 -> v4: - Rebased on latest misc-next v4 -> v5: - Rebased on latest misc-next Signed-off-by: Timofey Titovets --- block/genhd.c | 1 + fs/btrfs/volumes.c | 111 - 2 files changed, 110 insertions(+), 2 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 9656f9e9f99e..5ea5acc88d3c 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c95af358b71f..fa7dd6ac087f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev + * + * @bdev: target bdev + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + + sum = max_t(uint32_t, inflight[0], inflight[1]); + + /* +* Try prevent switch for every sneeze +* By roundup output num by some value +*/ + return ALIGN_DOWN(sum, round_down); +} + +/** + * guess_optimal - return guessed optimal mirror + * + * Optimal expected to be pid % num_stripes + * + * That's generaly ok for spread load + * Add some balancer based on queue leght to device + * + * Basic ideas: + * - Sequential read generate low amount of request + *so if load of drives are equal, use pid % num_stripes balancing + * - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal + *and repick if other dev have "significant" less queue lenght + * - Repick optimal if queue leght of other mirror are less + */ +static int guess_optimal(struct map_lookup *map, int num, int optimal) +{ + int i; + int round_down = 8; + int qlen[num]; + bool is_nonrot[num]; + bool all_bdev_nonrot = true; + bool all_bdev_rotate = true; + struct block_device *bdev; + + if (num == 1) + return optimal; + + /* Check accessible bdevs */ + for (i = 0; i < num; i++) { + /* Init for missing bdevs */ + is_nonrot[i] = false; + qlen[i] = INT_MAX; + bdev = map->stripes[i].dev->bdev; + if (bdev) { + qlen[i] = 0; + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev)); + if 
(is_nonrot[i]) + all_bdev_rotate = false; + else + all_bdev_nonrot = false; + } + } + + /* +* Don't bother with computation +* if only one of two bdevs are accessible +*/ + if (num == 2 && qlen[0] != qlen[1]) { + if (qlen[0] < qlen[1]) + return 0; + else + return 1; + } + + if (all_bdev_nonrot) + round_down = 2; + + for (i = 0; i < num; i++) { + if (qlen[i]) + continue; + bdev = map->stripes[i].dev->bdev; + qlen[i] = bdev_get_queue_len(bdev, round_down); + } + + /* For mixed case, pick non rotational dev as optimal
[PATCH RESEND V4] Btrfs: enhance raid1/10 balance heuristic
From: Timofey Titovets Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue leght to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue leght then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue leght: - By 8 for rotational devs - By 2 for all non rotational devs Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue lenght - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded v2 -> v3: - Fix arg for RAID10 - use sub_stripes, instead of num_stripes v3 -> v4: - Rebased on latest misc-next Signed-off-by: Timofey Titovets --- block/genhd.c | 1 + fs/btrfs/volumes.c | 111 - 2 files changed, 110 insertions(+), 2 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 9656f9e9f99e..5ea5acc88d3c 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); void part_in_flight_rw(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c95af358b71f..fa7dd6ac087f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev + * + * @bdev: target bdev + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + + sum = max_t(uint32_t, inflight[0], inflight[1]); + + /* +* Try prevent switch for every sneeze +* By roundup output num by some value +*/ + return ALIGN_DOWN(sum, round_down); +} + +/** + * guess_optimal - return guessed optimal mirror + * + * Optimal expected to be pid % num_stripes + * + * That's generaly ok for spread load + * Add some balancer based on queue leght to device + * + * Basic ideas: + * - Sequential read generate low amount of request + *so if load of drives are equal, use pid % num_stripes balancing + * - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal + *and repick if other dev have "significant" less queue lenght + * - Repick optimal if queue leght of other mirror are less + */ +static int guess_optimal(struct map_lookup *map, int num, int optimal) +{ + int i; + int round_down = 8; + int qlen[num]; + bool is_nonrot[num]; + bool all_bdev_nonrot = true; + bool all_bdev_rotate = true; + struct block_device *bdev; + + if (num == 1) + return optimal; + + /* Check accessible bdevs */ + for (i = 0; i < num; i++) { + /* Init for missing bdevs */ + is_nonrot[i] = false; + qlen[i] = INT_MAX; + bdev = map->stripes[i].dev->bdev; + if (bdev) { + qlen[i] = 0; + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev)); + if (is_nonrot[i]) + all_bdev_rotate = false; + 
else + all_bdev_nonrot = false; + } + } + + /* +* Don't bother with computation +* if only one of two bdevs are accessible +*/ + if (num == 2 && qlen[0] != qlen[1]) { + if (qlen[0] < qlen[1]) + return 0; + else + return 1; + } + + if (all_bdev_nonrot) + round_down = 2; + + for (i = 0; i < num; i++) { + if (qlen[i]) + continue; + bdev = map->stripes[i].dev->bdev; + qlen[i] = bdev_get_queue_len(bdev, round_down); + } + + /* For mixed case, pick non rotational dev as optimal */ + if (all_bdev_rotate == all_bdev_nonro
Re: [PATCH 2/4] Btrfs: make should_defrag_range() understand compressed extents
вт, 19 дек. 2017 г. в 13:02, Timofey Titovets : > Both, defrag ioctl and autodefrag - call btrfs_defrag_file() > for file defragmentation. > Kernel default target extent size - 256KiB. > Btrfs progs default - 32MiB. > Both bigger then maximum size of compressed extent - 128KiB. > That lead to rewrite all compressed data on disk. > Fix that by check compression extents with different logic. > As addition, make should_defrag_range() understood compressed extent type, > if requested target compression are same as current extent compression type. > Just don't recompress/rewrite extents. > To avoid useless recompression of compressed extents. > Signed-off-by: Timofey Titovets > --- > fs/btrfs/ioctl.c | 28 +--- > 1 file changed, 25 insertions(+), 3 deletions(-) > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 45a47d0891fc..b29ea1f0f621 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em) > static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, > u64 *last_len, u64 *skip, u64 *defrag_end, > - int compress) > + int compress, int compress_type) > { > struct extent_map *em; > int ret = 1; > @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, > * real extent, don't bother defragging it > */ > if (!compress && (*last_len == 0 || *last_len >= thresh) && > - (em->len >= thresh || (!next_mergeable && !prev_mergeable))) > + (em->len >= thresh || (!next_mergeable && !prev_mergeable))) { > ret = 0; > + goto out; > + } > + > + > + /* > +* Try not recompress compressed extents > +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to > +* recompress all compressed extents > +*/ > + if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) { > + if (!compress) { > + if (em->len == BTRFS_MAX_UNCOMPRESSED) > + ret = 0; > + } else { > + if (em->compress_type != compress_type) > + goto out; > + if (em->len == BTRFS_MAX_UNCOMPRESSED) > + ret = 0; > + } > + } > + > out: > /* > * last_len ends up being a counter of how many bytes we've defragged. > @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, > if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, > extent_thresh, &last_len, &skip, > -&defrag_end, do_compress)){ > +&defrag_end, do_compress, > +compress_type)){ > unsigned long next; > /* > * the should_defrag function tells us how much to skip > -- > 2.15.1 May be, then, if we don't want add some duck tape for compressed extents and defrag, we can just change default kernel target extent size 256KiB -> 128KiB? That will also fix the issue with autodefrag and compression enabled. Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
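The alternative suggested at the end of this mail would amount to roughly a one-line change where btrfs_defrag_file() falls back to a 256KiB threshold when the caller passes none. A hedged sketch only, assuming the fallback still looks like this; BTRFS_MAX_UNCOMPRESSED is 128KiB:

```
	/* Sketch of the alternative floated above, not a submitted patch */
	if (extent_thresh == 0)
		extent_thresh = BTRFS_MAX_UNCOMPRESSED;	/* instead of SZ_256K */
```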
Re: Any chance to get snapshot-aware defragmentation?
пн, 21 мая 2018 г. в 16:16, Austin S. Hemmelgarn : > On 2018-05-19 04:54, Niccolò Belli wrote: > > On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote: > >> With a bit of work, it's possible to handle things sanely. You can > >> deduplicate data from snapshots, even if they are read-only (you need > >> to pass the `-A` option to duperemove and run it as root), so it's > >> perfectly reasonable to only defrag the main subvolume, and then > >> deduplicate the snapshots against that (so that they end up all being > >> reflinks to the main subvolume). Of course, this won't work if you're > >> short on space, but if you're dealing with snapshots, you should have > >> enough space that this will work (because even without defrag, it's > >> fully possible for something to cause the snapshots to suddenly take > >> up a lot more space). > > > > Been there, tried that. Unfortunately even if I skip the defreg a simple > > > > duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs > > > > is going to eat more space than it was previously available (probably > > due to autodefrag?). > It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME > ioctl). There's two things involved here: > * BTRFS has somewhat odd and inefficient handling of partial extents. > When part of an extent becomes unused (because of a CLONE ioctl, or an > EXTENT_SAME ioctl, or something similar), that part stays allocated > until the whole extent would be unused. > * You're using the default deduplication block size (128k), which is > larger than your filesystem block size (which is at most 64k, most > likely 16k, but might be 4k if it's an old filesystem), so deduplicating > can split extents. That's a metadata node leaf != fs block size. btrfs fs block size == machine page size currently. > Because of this, if a duplicate region happens to overlap the front of > an already shared extent, and the end of said shared extent isn't > aligned with the deduplication block size, the EXTENT_SAME call will > deduplicate the first part, creating a new shared extent, but not the > tail end of the existing shared region, and all of that original shared > region will stick around, taking up extra space that it wasn't before. > Additionally, if only part of an extent is duplicated, then that area of > the extent will stay allocated, because the rest of the extent is still > referenced (so you won't necessarily see any actual space savings). > You can mitigate this by telling duperemove to use the same block size > as your filesystem using the `-b` option. Note that using a smaller > block size will also slow down the deduplication process and greatly > increase the size of the hash file. duperemove -b control "how hash data", not more or less and only support 4KiB..1MiB And size of block for dedup will change efficiency of deduplication, when count of hash-block pairs, will change hash file size and time complexity. Let's assume that: 'A' - 1KiB of data '' - 4KiB with repeated pattern. So, example, you have 2 of 2x4KiB blocks: 1: '' 2: '' With -b 8KiB hash of first block not same as second. But with -b 4KiB duperemove will see both '' and '' And then that blocks will be deduped. Even, duperemove have 2 modes of deduping: 1. By extents 2. By blocks Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
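As a concrete illustration of the block-size argument above (a toy standalone sketch, not duperemove's hashing code): two 8KiB regions that hold the same two 4KiB patterns in opposite order never match when compared as whole 8KiB blocks, but both halves find a duplicate at the 4KiB filesystem block size.

```
/* Toy illustration of dedupe block-size granularity, not duperemove code */
#include <stdio.h>
#include <string.h>

#define K4 (4 * 1024)

static int duplicate_pairs(const char *a, const char *b, size_t blk, size_t len)
{
	int matches = 0;

	/* compare every block of 'a' against every block of 'b' */
	for (size_t i = 0; i + blk <= len; i += blk)
		for (size_t j = 0; j + blk <= len; j += blk)
			if (!memcmp(a + i, b + j, blk))
				matches++;
	return matches;
}

int main(void)
{
	static char one[2 * K4], two[2 * K4];

	memset(one, 'A', K4); memset(one + K4, 'B', K4);	/* AAAA BBBB */
	memset(two, 'B', K4); memset(two + K4, 'A', K4);	/* BBBB AAAA */

	printf("8KiB blocks: %d duplicate pairs\n",
	       duplicate_pairs(one, two, 2 * K4, sizeof(one)));	/* 0 */
	printf("4KiB blocks: %d duplicate pairs\n",
	       duplicate_pairs(one, two, K4, sizeof(one)));	/* 2 */
	return 0;
}
```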
Re: [PATCH] btrfs: inode: Don't compress if NODATASUM or NODATACOW set
пн, 14 мая 2018 г. в 20:32, David Sterba : > On Mon, May 14, 2018 at 03:02:10PM +0800, Qu Wenruo wrote: > > As btrfs(5) specified: > > > > Note > > If nodatacow or nodatasum are enabled, compression is disabled. > > > > If NODATASUM or NODATACOW set, we should not compress the extent. > > > > And in fact, we have bug report about corrupted compressed extent > > leading to memory corruption in mail list. > Link please. > > Although it's mostly buggy lzo implementation causing the problem, btrfs > > still needs to be fixed to meet the specification. > That's very vague, what's the LZO bug? If the input is garbage and lzo > decompression cannot decompress it, it's not a lzo bug. > > Reported-by: James Harvey > > Signed-off-by: Qu Wenruo > > --- > > fs/btrfs/inode.c | 8 > > 1 file changed, 8 insertions(+) > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index d241285a0d2a..dbef3f404559 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -396,6 +396,14 @@ static inline int inode_need_compress(struct inode *inode, u64 start, u64 end) > > { > > struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > > > > + /* > > + * Btrfs doesn't support compression without csum or CoW. > > + * This should have the highest priority. > > + */ > > + if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW || > > + BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM) > > + return 0; > This is also the wrong place to fix that, NODATASUM or NODATACOW inode > should never make it to compress_file_range (that calls > inode_need_compress). David, i've talk about that some time ago: https://www.spinics.net/lists/linux-btrfs/msg73137.html NoCow files can be *easy* compressed. ``` ➜ ~ touch test ➜ ~ chattr +C test ➜ ~ lsattr test ---C-- test ➜ ~ dd if=/dev/zero of=./test bs=1M count=1 1+0 records in 1+0 records out 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00099878 s, 1.0 GB/s ➜ ~ sync ➜ ~ filefrag -v test Filesystem type is: 9123683e File size of test is 1048576 (256 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 255: 88592741.. 88592996:256: last,eof test: 1 extent found ➜ ~ btrfs fi def -vrczstd test test ➜ ~ filefrag -v test Filesystem type is: 9123683e File size of test is 1048576 (256 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 31: 3125.. 3156: 32: encoded 1: 32.. 63: 3180.. 3211: 32: 3157: encoded 2: 64.. 95: 3185.. 3216: 32: 3212: encoded 3: 96.. 127: 3188.. 3219: 32: 3217: encoded 4: 128.. 159: 3263.. 3294: 32: 3220: encoded 5: 160.. 191: 3355.. 3386: 32: 3295: encoded 6: 192.. 223: 3376.. 3407: 32: 3387: encoded 7: 224.. 255: 3411.. 3442: 32: 3408: last,encoded,eof test: 8 extents found ``` -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data
пт, 11 мая 2018 г. в 20:32, Omar Sandoval : > On Fri, May 11, 2018 at 06:49:16PM +0200, David Sterba wrote: > > On Fri, May 11, 2018 at 05:25:50PM +0100, Filipe Manana wrote: > > > On Fri, May 11, 2018 at 4:57 PM, David Sterba wrote: > > > > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the > > > > arrays can be 32KiB large. To avoid allocation failures due to > > > > fragmented memory, use the allocation with fallback to vmalloc. > > > > > > > > Signed-off-by: David Sterba > > > > --- > > > > > > > > This depends on the patches that remove the 16MiB restriction in the > > > > dedupe ioctl, but contextually can be applied to the current code too. > > > > > > > > https://patchwork.kernel.org/patch/10374941/ > > > > > > > > fs/btrfs/ioctl.c | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > > > > index b572e38b4b64..a7f517009cd7 100644 > > > > --- a/fs/btrfs/ioctl.c > > > > +++ b/fs/btrfs/ioctl.c > > > > @@ -3178,8 +3178,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, > > > > * locking. We use an array for the page pointers. Size of the array is > > > > * bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN. > > > > */ > > > > - cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); > > > > - cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); > > > > + cmp.src_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL); > > > > + cmp.dst_pages = kvzalloc(num_pages, sizeof(struct page *), GFP_KERNEL); > > > > > > Kvzalloc should take 2 parameters and not 3. > > > > And the right function is kvmalloc_array. > > > > > Also, aren't the corresponding kvfree() calls missing? > > > > Yes, thanks for catching it. The updated version: > > > > From: David Sterba > > Subject: [PATCH] btrfs: use kvzalloc for EXTENT_SAME temporary data > > > > The dedupe range is 16 MiB, with 4KiB pages and 8 byte pointers, the > > arrays can be 32KiB large. To avoid allocation failures due to > > fragmented memory, use the allocation with fallback to vmalloc. > > > > Signed-off-by: David Sterba > > --- > > fs/btrfs/ioctl.c | 16 +--- > > 1 file changed, 9 insertions(+), 7 deletions(-) > > > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > > index b572e38b4b64..4fcfa05ed960 100644 > > --- a/fs/btrfs/ioctl.c > > +++ b/fs/btrfs/ioctl.c > > @@ -3178,12 +3178,13 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, > >* locking. We use an array for the page pointers. Size of the array is > >* bounded by len, which is in turn bounded by BTRFS_MAX_DEDUPE_LEN. > >*/ > > - cmp.src_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); > > - cmp.dst_pages = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); > > + cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *), > > +GFP_KERNEL); > > + cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *), > > +GFP_KERNEL); > kcalloc() implies __GFP_ZERO, do we need that here? AFAIK, yes, because: btrfs_cmp_data_free(): ... pg = cmp->src_pages[i]; if (pg) {...} .. And we will catch that, if errors happens in gather_extent_pages(). Thanks. 
> > if (!cmp.src_pages || !cmp.dst_pages) { > > - kfree(cmp.src_pages); > > - kfree(cmp.dst_pages); > > - return -ENOMEM; > > + ret = -ENOMEM; > > + goto out_free; > > } > > > > if (same_inode) > > @@ -3211,8 +3212,9 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, > > else > > btrfs_double_inode_unlock(src, dst); > > > > - kfree(cmp.src_pages); > > - kfree(cmp.dst_pages); > > +out_free: > > + kvfree(cmp.src_pages); > > + kvfree(cmp.dst_pages); > > > > return ret; > > } > > -- > > 2.16.2 > > -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
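Tying the two points in this thread together, here is a minimal sketch of the allocation with the zeroing kept. This is not the committed patch, just the variant being discussed: kvmalloc_array() does not zero memory, so __GFP_ZERO is passed explicitly, and newer kernels also provide kvcalloc() as a shorthand for exactly this combination.

```
	/* Sketch of the discussed variant, not the committed patch */
	cmp.src_pages = kvmalloc_array(num_pages, sizeof(struct page *),
				       GFP_KERNEL | __GFP_ZERO);
	cmp.dst_pages = kvmalloc_array(num_pages, sizeof(struct page *),
				       GFP_KERNEL | __GFP_ZERO);
	if (!cmp.src_pages || !cmp.dst_pages) {
		/* kvfree(NULL) is a no-op, so a shared exit label is fine */
		ret = -ENOMEM;
		goto out_free;
	}
```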
[RFC PATCH] raid6_pq: Add module options to prefer algorithm
Skip testing unnecessary algorithms to speedup module initialization For my systems: Before: 1.510s (initrd) After: 977ms (initrd) # I set prefer to fastest algorithm Dmesg after patch: [1.190042] raid6: avx2x4 gen() 28153 MB/s [1.246683] raid6: avx2x4 xor() 19440 MB/s [1.246684] raid6: using algorithm avx2x4 gen() 28153 MB/s [1.246684] raid6: xor() 19440 MB/s, rmw enabled [1.246685] raid6: using avx2x2 recovery algorithm Signed-off-by: Timofey Titovets CC: linux-btrfs@vger.kernel.org --- lib/raid6/algos.c | 28 +--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c index 5065b1e7e327..abfcb4107fc3 100644 --- a/lib/raid6/algos.c +++ b/lib/raid6/algos.c @@ -30,6 +30,11 @@ EXPORT_SYMBOL(raid6_empty_zero_page); #endif #endif +static char *prefer_name; + +module_param(prefer_name, charp, 0); +MODULE_PARM_DESC(prefer_name, "Prefer gen/xor() algorithm"); + struct raid6_calls raid6_call; EXPORT_SYMBOL_GPL(raid6_call); @@ -155,10 +160,27 @@ static inline const struct raid6_calls *raid6_choose_gen( { unsigned long perf, bestgenperf, bestxorperf, j0, j1; int start = (disks>>1)-1, stop = disks-3; /* work on the second half of the disks */ - const struct raid6_calls *const *algo; - const struct raid6_calls *best; + const struct raid6_calls *const *algo = NULL; + const struct raid6_calls *best = NULL; + + if (strlen(prefer_name)) { + for (algo = raid6_algos; strlen(prefer_name) && *algo; algo++) { + if (!strncmp(prefer_name, (*algo)->name, 8)) { + best = *algo; + break; + } + } + if (!best) + pr_info("raid6: %-8s prefer not found\n", prefer_name); + } + + + + if (!algo) { + algo = raid6_algos; + } - for (bestgenperf = 0, bestxorperf = 0, best = NULL, algo = raid6_algos; *algo; algo++) { + for (bestgenperf = 0, bestxorperf = 0; *algo; algo++) { if (!best || (*algo)->prefer >= best->prefer) { if ((*algo)->valid && !(*algo)->valid()) continue; -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V3 1/3] Btrfs: split btrfs_extent_same() for simplification
Split btrfs_extent_same() for simplification and preparation for call several times over target files Move most logic to __btrfs_extent_same() And leave in btrfs_extent_same() things which must happens only once Changes: v3: - Splited from one to 3 patches Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 64 ++-- 1 file changed, 35 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index f0e62e4f8fe7..fb8beedb0359 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2882,8 +2882,8 @@ static int extent_same_check_offsets(struct inode *inode, u64 off, u64 *plen, return 0; } -static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, -struct inode *dst, u64 dst_loff) +static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, + struct inode *dst, u64 dst_loff) { int ret; u64 len = olen; @@ -2892,21 +2892,13 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, u64 same_lock_start = 0; u64 same_lock_len = 0; - if (len == 0) - return 0; - - if (same_inode) - inode_lock(src); - else - btrfs_double_inode_lock(src, dst); - ret = extent_same_check_offsets(src, loff, &len, olen); if (ret) - goto out_unlock; + return ret; ret = extent_same_check_offsets(dst, dst_loff, &len, olen); if (ret) - goto out_unlock; + return ret; if (same_inode) { /* @@ -2923,32 +2915,21 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, * allow an unaligned length so long as it ends at * i_size. */ - if (len != olen) { - ret = -EINVAL; - goto out_unlock; - } + if (len != olen) + return -EINVAL; /* Check for overlapping ranges */ - if (dst_loff + len > loff && dst_loff < loff + len) { - ret = -EINVAL; - goto out_unlock; - } + if (dst_loff + len > loff && dst_loff < loff + len) + return -EINVAL; same_lock_start = min_t(u64, loff, dst_loff); same_lock_len = max_t(u64, loff, dst_loff) + len - same_lock_start; } - /* don't make the dst file partly checksummed */ - if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) != - (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) { - ret = -EINVAL; - goto out_unlock; - } - again: ret = btrfs_cmp_data_prepare(src, loff, dst, dst_loff, olen, &cmp); if (ret) - goto out_unlock; + return ret; if (same_inode) ret = lock_extent_range(src, same_lock_start, same_lock_len, @@ -2998,7 +2979,32 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, btrfs_double_extent_unlock(src, loff, dst, dst_loff, len); btrfs_cmp_data_free(&cmp); -out_unlock: + + return ret; +} + +static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, +struct inode *dst, u64 dst_loff) +{ + int ret; + bool same_inode = (src == dst); + + if (olen == 0) + return 0; + + /* don't make the dst file partly checksummed */ + if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) != + (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) { + return -EINVAL; + } + + if (same_inode) + inode_lock(src); + else + btrfs_double_inode_lock(src, dst); + + ret = __btrfs_extent_same(src, loff, olen, dst, dst_loff); + if (same_inode) inode_unlock(src); else -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V3 0/3] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
At now btrfs_dedupe_file_range() restricted to 16MiB range for limit locking time and memory requirement for dedup ioctl() For too big input range code silently set range to 16MiB Let's remove that restriction by do iterating over dedup range. That's backward compatible and will not change anything for request less then 16MiB. Changes: v1 -> v2: - Refactor btrfs_cmp_data_prepare and btrfs_extent_same - Store memory of pages array between iterations - Lock inodes once, not on each iteration - Small inplace cleanups v2 -> v3: - Split to several patches Timofey Titovets (3): Btrfs: split btrfs_extent_same() for simplification Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction Btrfs: btrfs_extent_same() reuse cmp workspace fs/btrfs/ioctl.c | 161 ++- 1 file changed, 91 insertions(+), 70 deletions(-) -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
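A worked example of the iteration described above (the numbers are illustrative; the real loop is in patch 2/3): a 100MiB dedupe request is split into full 16MiB chunks plus a tail, and the per-chunk helper simply runs once per piece.

```
	/* Illustrative arithmetic only; see patch 2/3 for the real loop */
	u64 olen = 100ULL * SZ_1M;			/* caller passed 100MiB */
	u64 chunk_count = div_u64(olen, SZ_16M);	/* 6 full 16MiB chunks */
	u64 tail_len = olen % SZ_16M;			/* 4MiB remainder */
	/* __btrfs_extent_same() runs 6 times for 16MiB and once for 4MiB */
```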
[PATCH V3 3/3] Btrfs: btrfs_extent_same() reuse cmp workspace
We support big dedup requests by split range to several smaller, and call dedup logic over each of them. Instead of alloc/dealloc on each, let's reuse allocated memory. Changes: v3: - Splited from one to 3 patches Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 80 +--- 1 file changed, 41 insertions(+), 39 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 38ce990e9b4c..f2521bc0b069 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2769,8 +2769,6 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp) put_page(pg); } } - kfree(cmp->src_pages); - kfree(cmp->dst_pages); } static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, @@ -2779,40 +2777,14 @@ static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, { int ret; int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; - struct page **src_pgarr, **dst_pgarr; - /* -* We must gather up all the pages before we initiate our -* extent locking. We use an array for the page pointers. Size -* of the array is bounded by len, which is in turn bounded by -* BTRFS_MAX_DEDUPE_LEN. -*/ - src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); - dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); - if (!src_pgarr || !dst_pgarr) { - kfree(src_pgarr); - kfree(dst_pgarr); - return -ENOMEM; - } cmp->num_pages = num_pages; - cmp->src_pages = src_pgarr; - cmp->dst_pages = dst_pgarr; - /* -* If deduping ranges in the same inode, locking rules make it mandatory -* to always lock pages in ascending order to avoid deadlocks with -* concurrent tasks (such as starting writeback/delalloc). -*/ - if (src == dst && dst_loff < loff) { - swap(src_pgarr, dst_pgarr); - swap(loff, dst_loff); - } - - ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff); + ret = gather_extent_pages(src, cmp->src_pages, num_pages, loff); if (ret) goto out; - ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff); + ret = gather_extent_pages(dst, cmp->dst_pages, num_pages, dst_loff); out: if (ret) @@ -2883,11 +2855,11 @@ static int extent_same_check_offsets(struct inode *inode, u64 off, u64 *plen, } static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, - struct inode *dst, u64 dst_loff) + struct inode *dst, u64 dst_loff, + struct cmp_pages *cmp) { int ret; u64 len = olen; - struct cmp_pages cmp; bool same_inode = (src == dst); u64 same_lock_start = 0; u64 same_lock_len = 0; @@ -2927,7 +2899,7 @@ static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, } again: - ret = btrfs_cmp_data_prepare(src, loff, dst, dst_loff, olen, &cmp); + ret = btrfs_cmp_data_prepare(src, loff, dst, dst_loff, olen, cmp); if (ret) return ret; @@ -2950,7 +2922,7 @@ static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, * Ranges in the io trees already unlocked. Now unlock all * pages before waiting for all IO to complete. 
*/ - btrfs_cmp_data_free(&cmp); + btrfs_cmp_data_free(cmp); if (same_inode) { btrfs_wait_ordered_range(src, same_lock_start, same_lock_len); @@ -2963,12 +2935,12 @@ static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, ASSERT(ret == 0); if (WARN_ON(ret)) { /* ranges in the io trees already unlocked */ - btrfs_cmp_data_free(&cmp); + btrfs_cmp_data_free(cmp); return ret; } /* pass original length for comparison so we stay within i_size */ - ret = btrfs_cmp_data(olen, &cmp); + ret = btrfs_cmp_data(olen, cmp); if (ret == 0) ret = btrfs_clone(src, dst, loff, olen, len, dst_loff, 1); @@ -2978,7 +2950,7 @@ static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, else btrfs_double_extent_unlock(src, loff, dst, dst_loff, len); - btrfs_cmp_data_free(&cmp); + btrfs_cmp_data_free(cmp); return ret; } @@ -2989,6 +2961,8 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, struct inode *dst, u64 dst_loff) { int ret; + struct cmp_pages cmp; + int num_pages = PAGE_ALIGN(BTRFS_MAX_DEDUPE_LEN) >>
[PATCH V3 2/3] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
At now btrfs_dedupe_file_range() restricted to 16MiB range for limit locking time and memory requirement for dedup ioctl() For too big input range code silently set range to 16MiB Let's remove that restriction by do iterating over dedup range. That's backward compatible and will not change anything for request less then 16MiB. Changes: v3: - Splited from one to 3 patches Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index fb8beedb0359..38ce990e9b4c 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2983,11 +2983,14 @@ static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, return ret; } +#define BTRFS_MAX_DEDUPE_LEN SZ_16M + static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, struct inode *dst, u64 dst_loff) { int ret; bool same_inode = (src == dst); + u64 i, tail_len, chunk_count; if (olen == 0) return 0; @@ -2998,13 +3001,28 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, return -EINVAL; } + tail_len = olen % BTRFS_MAX_DEDUPE_LEN; + chunk_count = div_u64(olen, BTRFS_MAX_DEDUPE_LEN); + if (same_inode) inode_lock(src); else btrfs_double_inode_lock(src, dst); - ret = __btrfs_extent_same(src, loff, olen, dst, dst_loff); + for (i = 0; i < chunk_count; i++) { + ret = __btrfs_extent_same(src, loff, BTRFS_MAX_DEDUPE_LEN, + dst, dst_loff); + if (ret) + goto out; + + loff += BTRFS_MAX_DEDUPE_LEN; + dst_loff += BTRFS_MAX_DEDUPE_LEN; + } + + if (tail_len > 0) + ret = __btrfs_extent_same(src, loff, tail_len, dst, dst_loff); +out: if (same_inode) inode_unlock(src); else @@ -3013,8 +3031,6 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, return ret; } -#define BTRFS_MAX_DEDUPE_LEN SZ_16M - ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen, struct file *dst_file, u64 dst_loff) { @@ -3023,9 +3039,6 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen, u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize; ssize_t res; - if (olen > BTRFS_MAX_DEDUPE_LEN) - olen = BTRFS_MAX_DEDUPE_LEN; - if (WARN_ON_ONCE(bs < PAGE_SIZE)) { /* * Btrfs does not support blocksize < page_size. As a -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] [RESEND] Btrfs: reduce size of struct btrfs_inode
чт, 26 апр. 2018 г. в 16:44, David Sterba : > On Wed, Apr 25, 2018 at 02:37:17AM +0300, Timofey Titovets wrote: > > Currently btrfs_inode have size equal 1136 bytes. (On x86_64). > > > > struct btrfs_inode store several vars releated to compression code, > > all states use 1 or 2 bits. > > > > Lets declare bitfields for compression releated vars, to reduce > > sizeof btrfs_inode to 1128 bytes. > Unfortunatelly, this has no big effect. The inodes are allocated from a > slab page, that's 4k and there are at most 3 inodes there. Snippet from > /proc/slabinfo: > # name > btrfs_inode 256043 278943 109631 > The size on my box is 1096 as it's 4.14, but this should not matter to > demonstrate the idea. > objperslab is 3 here, ie. there are 3 btrfs_inode in the page, and > there's 4096 - 3 * 1096 = 808 of slack space. In order to pack 4 inodes > per page, we'd have to squeeze the inode size to 1024 bytes. I've looked > into that and did not see enough members to remove or substitute. IIRC > there were like 24-32 bytes possible to shave, but that was it. > Once we'd get to 1024, adding anything new to btrfs_inode would be quite > difficult and as it goes, there's always something to add to the inode. > So I'd take a different approach, to regroup items and decide by > cacheline access patterns what to put together and what to separate. > The maximum size of inode before going to 2 objects per page is 1365, so > there's enough space for cacheline alignments. May be i misunderstood something, but i was think that slab combine several pages in continuous range, so object in slab can cross page boundary. So, all calculation will be very depends on scale of slab size. i.e. on my machine that looks quite different: name btrfs_inode 142475 146272 1136 28 8 So, PAGE_SIZE * pagesperslab / objperslab 4096 * 8 / 28 = 1170.28 4096*8 - 1136*28 = 960 That's looks like object can cross page boundary in slab. So, if size reduced to 1128, 4096 * 8 / 29 = 1129.93 4096*8 - 1128*29 = 56 Did i miss something? Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
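The objperslab arithmetic in the reply can be checked with a few lines of userspace C (a sketch, not kernel code). It assumes the order-3 slab shown in the slabinfo line, i.e. 8 pages per slab, and simply divides the slab by the object size.

```
/* Quick check of the objperslab arithmetic above (userspace sketch) */
#include <stdio.h>

int main(void)
{
	const unsigned int slab = 4096 * 8;		/* pagesperslab = 8 */
	const unsigned int sizes[] = { 1136, 1128 };

	for (int i = 0; i < 2; i++) {
		unsigned int obj = slab / sizes[i];

		printf("objsize %u: %u objects per slab, %u bytes slack\n",
		       sizes[i], obj, slab - obj * sizes[i]);
	}
	return 0;	/* prints 28 objects/960 slack for 1136, 29/56 for 1128 */
}
```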
Re: [PATCH 1/4] [RESEND] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
чт, 26 апр. 2018 г. в 17:05, David Sterba : > On Wed, Apr 25, 2018 at 02:37:14AM +0300, Timofey Titovets wrote: > > At now btrfs_dedupe_file_range() restricted to 16MiB range for > > limit locking time and memory requirement for dedup ioctl() > > > > For too big input range code silently set range to 16MiB > > > > Let's remove that restriction by do iterating over dedup range. > > That's backward compatible and will not change anything for request > > less then 16MiB. > > > > Changes: > > v1 -> v2: > > - Refactor btrfs_cmp_data_prepare and btrfs_extent_same > > - Store memory of pages array between iterations > > - Lock inodes once, not on each iteration > > - Small inplace cleanups > I think this patch should be split into more, there are several logical > changes mixed together. > I can add the patch to for-next to see if there are any problems caught > by the existing test, but will expect more revisions of the patch. I > don't see any fundamental problems so far. > Suggested changes: > * factor out __btrfs_extent_same > * adjust parameters if needed by the followup patches > * add the chunk counting logic > * any other cleanups Thanks, i will try split it out. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V4] Btrfs: enhance raid1/10 balance heuristic
2018-04-25 10:54 GMT+03:00 Misono Tomohiro : > On 2018/04/25 9:20, Timofey Titovets wrote: >> Currently btrfs raid1/10 balancer bаlance requests to mirrors, >> based on pid % num of mirrors. >> >> Make logic understood: >> - if one of underline devices are non rotational >> - Queue leght to underline devices >> >> By default try use pid % num_mirrors guessing, but: >> - If one of mirrors are non rotational, repick optimal to it >> - If underline mirror have less queue leght then optimal, >>repick to that mirror >> >> For avoid round-robin request balancing, >> lets round down queue leght: >> - By 8 for rotational devs >> - By 2 for all non rotational devs >> >> Changes: >> v1 -> v2: >> - Use helper part_in_flight() from genhd.c >> to get queue lenght >> - Move guess code to guess_optimal() >> - Change balancer logic, try use pid % mirror by default >> Make balancing on spinning rust if one of underline devices >> are overloaded >> v2 -> v3: >> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes >> v3 -> v4: >> - Rebased on latest misc-next >> >> Signed-off-by: Timofey Titovets >> --- >> block/genhd.c | 1 + >> fs/btrfs/volumes.c | 111 - >> 2 files changed, 110 insertions(+), 2 deletions(-) >> >> diff --git a/block/genhd.c b/block/genhd.c >> index 9656f9e9f99e..5ea5acc88d3c 100644 >> --- a/block/genhd.c >> +++ b/block/genhd.c >> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct >> hd_struct *part, >> atomic_read(&part->in_flight[1]); >> } >> } >> +EXPORT_SYMBOL_GPL(part_in_flight); >> >> struct hd_struct *__disk_get_part(struct gendisk *disk, int partno) >> { >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index c95af358b71f..fa7dd6ac087f 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -16,6 +16,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include "ctree.h" >> #include "extent_map.h" >> @@ -5148,7 +5149,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, >> u64 logical, u64 len) >> /* >>* There could be two corrupted data stripes, we need >>* to loop retry in order to rebuild the correct data. >> - * >> + * >>* Fail a stripe at a time on every retry except the >>* stripe under reconstruction. 
>>*/ >> @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info >> *fs_info, u64 logical, u64 len) >> return ret; >> } >> >> +/** >> + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev >> + * >> + * @bdev: target bdev >> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 >> + */ >> +static int bdev_get_queue_len(struct block_device *bdev, int round_down) >> +{ >> + int sum; >> + struct hd_struct *bd_part = bdev->bd_part; >> + struct request_queue *rq = bdev_get_queue(bdev); >> + uint32_t inflight[2] = {0, 0}; >> + >> + part_in_flight(rq, bd_part, inflight); >> + >> + sum = max_t(uint32_t, inflight[0], inflight[1]); >> + >> + /* >> + * Try prevent switch for every sneeze >> + * By roundup output num by some value >> + */ >> + return ALIGN_DOWN(sum, round_down); >> +} >> + >> +/** >> + * guess_optimal - return guessed optimal mirror >> + * >> + * Optimal expected to be pid % num_stripes >> + * >> + * That's generaly ok for spread load >> + * Add some balancer based on queue leght to device >> + * >> + * Basic ideas: >> + * - Sequential read generate low amount of request >> + *so if load of drives are equal, use pid % num_stripes balancing >> + * - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal >> + *and repick if other dev have "significant" less queue lenght > > The code looks always choosing the queue with the lowest length regardless > of the amount of queue length difference. So, this "significant" may be wrong? yes, but before code looks at queue len, we do round_down by 8, may be you confused because i hide ALIGN_DOWN in bdev_get_queue_len() I'm not think wha
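To illustrate the point about rounding before the comparison, here is a toy userspace sketch, not the kernel code; plain integer division stands in for the kernel's ALIGN_DOWN(). With a round-down factor of 8, queue lengths that differ by less than the factor usually collapse to the same value, so the pid-based pick is kept and the balancer does not ping-pong between mirrors on every small fluctuation.

```
/* Toy model of the round_down hysteresis, not the kernel implementation */
#include <stdio.h>

static int round_down_to(int x, int factor)
{
	return x / factor * factor;	/* stands in for ALIGN_DOWN() */
}

int main(void)
{
	const int factor = 8;		/* rotational case; 2 for all-SSD */
	const int pairs[][2] = { { 3, 6 }, { 7, 12 }, { 5, 21 } };

	for (int i = 0; i < 3; i++) {
		int a = round_down_to(pairs[i][0], factor);
		int b = round_down_to(pairs[i][1], factor);

		printf("qlen %2d vs %2d -> %2d vs %2d: %s\n",
		       pairs[i][0], pairs[i][1], a, b,
		       a == b ? "keep pid-based pick" : "repick mirror");
	}
	return 0;	/* 3/6 keep, 7/12 repick, 5/21 repick */
}
```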
[PATCH V4] Btrfs: enhance raid1/10 balance heuristic
Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue leght to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue leght then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue leght: - By 8 for rotational devs - By 2 for all non rotational devs Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue lenght - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded v2 -> v3: - Fix arg for RAID10 - use sub_stripes, instead of num_stripes v3 -> v4: - Rebased on latest misc-next Signed-off-by: Timofey Titovets --- block/genhd.c | 1 + fs/btrfs/volumes.c | 111 - 2 files changed, 110 insertions(+), 2 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 9656f9e9f99e..5ea5acc88d3c 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); struct hd_struct *__disk_get_part(struct gendisk *disk, int partno) { diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c95af358b71f..fa7dd6ac087f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -5148,7 +5149,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len) /* * There could be two corrupted data stripes, we need * to loop retry in order to rebuild the correct data. -* +* * Fail a stripe at a time on every retry except the * stripe under reconstruction. 
*/ @@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev + * + * @bdev: target bdev + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + + sum = max_t(uint32_t, inflight[0], inflight[1]); + + /* +* Try prevent switch for every sneeze +* By roundup output num by some value +*/ + return ALIGN_DOWN(sum, round_down); +} + +/** + * guess_optimal - return guessed optimal mirror + * + * Optimal expected to be pid % num_stripes + * + * That's generaly ok for spread load + * Add some balancer based on queue leght to device + * + * Basic ideas: + * - Sequential read generate low amount of request + *so if load of drives are equal, use pid % num_stripes balancing + * - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal + *and repick if other dev have "significant" less queue lenght + * - Repick optimal if queue leght of other mirror are less + */ +static int guess_optimal(struct map_lookup *map, int num, int optimal) +{ + int i; + int round_down = 8; + int qlen[num]; + bool is_nonrot[num]; + bool all_bdev_nonrot = true; + bool all_bdev_rotate = true; + struct block_device *bdev; + + if (num == 1) + return optimal; + + /* Check accessible bdevs */ + for (i = 0; i < num; i++) { + /* Init for missing bdevs */ + is_nonrot[i] = false; + qlen[i] = INT_MAX; + bdev = map->stripes[i].dev->bdev; + if (bdev) { + qlen[i] = 0; + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev)); + if (is_nonrot[i]) + all_bdev_rotate = false; + else + all_bdev_nonrot = false; + } + } + + /* +* Don't bother with computation +* if only one of two bdevs are accessible +*/ + if (num == 2 && qlen[0] != qlen[1]) { + if (qlen[0] < qlen[1]) + return 0; + else + return 1; + } + + if (all_bdev_nonrot) + r
[PATCH 2/4] [RESEND] Btrfs: make should_defrag_range() understand compressed extents
Both, defrag ioctl and autodefrag - call btrfs_defrag_file() for file defragmentation. Kernel default target extent size - 256KiB. Btrfs progs default - 32MiB. Both bigger then maximum size of compressed extent - 128KiB. That lead to rewrite all compressed data on disk. Fix that by check compression extents with different logic. As addition, make should_defrag_range() understood compressed extent type, if requested target compression are same as current extent compression type. Just don't recompress/rewrite extents. To avoid useless recompression of compressed extents. Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 28 +--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 45a47d0891fc..b29ea1f0f621 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em) static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, u64 *last_len, u64 *skip, u64 *defrag_end, - int compress) + int compress, int compress_type) { struct extent_map *em; int ret = 1; @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, * real extent, don't bother defragging it */ if (!compress && (*last_len == 0 || *last_len >= thresh) && - (em->len >= thresh || (!next_mergeable && !prev_mergeable))) + (em->len >= thresh || (!next_mergeable && !prev_mergeable))) { ret = 0; + goto out; + } + + + /* +* Try not recompress compressed extents +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to +* recompress all compressed extents +*/ + if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) { + if (!compress) { + if (em->len == BTRFS_MAX_UNCOMPRESSED) + ret = 0; + } else { + if (em->compress_type != compress_type) + goto out; + if (em->len == BTRFS_MAX_UNCOMPRESSED) + ret = 0; + } + } + out: /* * last_len ends up being a counter of how many bytes we've defragged. @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, extent_thresh, &last_len, &skip, -&defrag_end, do_compress)){ +&defrag_end, do_compress, +compress_type)){ unsigned long next; /* * the should_defrag function tells us how much to skip -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] [RESEND] Btrfs: reduce size of struct btrfs_inode
Currently struct btrfs_inode has a size of 1136 bytes (on x86_64). It stores several variables related to the compression code, and all of their states fit in 1 or 2 bits. Let's declare bitfields for the compression related variables to reduce the size of btrfs_inode to 1128 bytes.

Signed-off-by: Timofey Titovets --- fs/btrfs/btrfs_inode.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 9eb0c92ee4b4..9d29d7e68757 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -181,13 +181,13 @@ struct btrfs_inode { /* * Cached values of inode properties */ - unsigned prop_compress; /* per-file compression algorithm */ + unsigned prop_compress : 2; /* per-file compression algorithm */ /* * Force compression on the file using the defrag ioctl, could be * different from prop_compress and takes precedence if set */ - unsigned defrag_compress; - unsigned change_compress; + unsigned defrag_compress : 2; + unsigned change_compress : 1; struct btrfs_delayed_node *delayed_node; -- 2.15.1
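As a self-contained illustration of why the three fields collapse from 12 bytes to a single word, here is a generic userspace demo. It is not the actual btrfs_inode layout, and the exact totals depend on ABI padding; on x86_64 it typically prints 12 and 4.

#include <stdio.h>

/* Three compression-related fields as full unsigned ints. */
struct before {
	unsigned prop_compress;
	unsigned defrag_compress;
	unsigned change_compress;
};

/* The same information packed via bitfields: 2 + 2 + 1 bits share one word. */
struct after {
	unsigned prop_compress : 2;
	unsigned defrag_compress : 2;
	unsigned change_compress : 1;
};

int main(void)
{
	printf("before: %zu bytes\n", sizeof(struct before));
	printf("after:  %zu bytes\n", sizeof(struct after));
	return 0;
}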
[PATCH 1/4] [RESEND] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
At now btrfs_dedupe_file_range() restricted to 16MiB range for limit locking time and memory requirement for dedup ioctl() For too big input range code silently set range to 16MiB Let's remove that restriction by do iterating over dedup range. That's backward compatible and will not change anything for request less then 16MiB. Changes: v1 -> v2: - Refactor btrfs_cmp_data_prepare and btrfs_extent_same - Store memory of pages array between iterations - Lock inodes once, not on each iteration - Small inplace cleanups Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 160 --- 1 file changed, 94 insertions(+), 66 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index be5bd81b3669..45a47d0891fc 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2965,8 +2965,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp) put_page(pg); } } - kfree(cmp->src_pages); - kfree(cmp->dst_pages); + + cmp->num_pages = 0; } static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, @@ -2974,41 +2974,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, u64 len, struct cmp_pages *cmp) { int ret; - int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; - struct page **src_pgarr, **dst_pgarr; - - /* -* We must gather up all the pages before we initiate our -* extent locking. We use an array for the page pointers. Size -* of the array is bounded by len, which is in turn bounded by -* BTRFS_MAX_DEDUPE_LEN. -*/ - src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); - dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); - if (!src_pgarr || !dst_pgarr) { - kfree(src_pgarr); - kfree(dst_pgarr); - return -ENOMEM; - } - cmp->num_pages = num_pages; - cmp->src_pages = src_pgarr; - cmp->dst_pages = dst_pgarr; /* * If deduping ranges in the same inode, locking rules make it mandatory * to always lock pages in ascending order to avoid deadlocks with * concurrent tasks (such as starting writeback/delalloc). */ - if (src == dst && dst_loff < loff) { - swap(src_pgarr, dst_pgarr); + if (src == dst && dst_loff < loff) swap(loff, dst_loff); - } - ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff); + cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; + + ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff); if (ret) goto out; - ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff); + ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, dst_loff); out: if (ret) @@ -3078,31 +3059,23 @@ static int extent_same_check_offsets(struct inode *inode, u64 off, u64 *plen, return 0; } -static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, -struct inode *dst, u64 dst_loff) +static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, + struct inode *dst, u64 dst_loff, + struct cmp_pages *cmp) { int ret; u64 len = olen; - struct cmp_pages cmp; bool same_inode = (src == dst); u64 same_lock_start = 0; u64 same_lock_len = 0; - if (len == 0) - return 0; - - if (same_inode) - inode_lock(src); - else - btrfs_double_inode_lock(src, dst); - ret = extent_same_check_offsets(src, loff, &len, olen); if (ret) - goto out_unlock; + return ret; ret = extent_same_check_offsets(dst, dst_loff, &len, olen); if (ret) - goto out_unlock; + return ret; if (same_inode) { /* @@ -3119,32 +3092,21 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, * allow an unaligned length so long as it ends at * i_size. 
*/ - if (len != olen) { - ret = -EINVAL; - goto out_unlock; - } + if (len != olen) + return -EINVAL; /* Check for overlapping ranges */ - if (dst_loff + len > loff && dst_loff < loff + len) { - ret = -EINVAL; - goto out_unlock; - } + if (dst_loff + len > loff && dst_loff < loff + len) + return -EINVAL; same_lo
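The core of the change is simply walking the requested range in BTRFS_MAX_DEDUPE_LEN pieces. Below is a minimal userspace sketch of that splitting; dedupe_one_chunk() is a made-up stand-in for the per-chunk work, and the real patch additionally reuses the page arrays between iterations and takes the inode locks once outside the loop.

#include <stdio.h>
#include <stdint.h>

#define BTRFS_MAX_DEDUPE_LEN (16ULL * 1024 * 1024)

/* Stub standing in for the real per-chunk dedupe work. */
static int dedupe_one_chunk(uint64_t loff, uint64_t dst_loff, uint64_t len)
{
	printf("dedupe %llu bytes: src@%llu -> dst@%llu\n",
	       (unsigned long long)len, (unsigned long long)loff,
	       (unsigned long long)dst_loff);
	return 0;
}

/*
 * Split an arbitrarily large request into 16MiB pieces, so every
 * iteration has the same bounded locking time and page-array size
 * the old single-shot code had.
 */
static int dedupe_range_chunked(uint64_t loff, uint64_t dst_loff, uint64_t olen)
{
	int ret = 0;

	while (olen) {
		uint64_t len = olen > BTRFS_MAX_DEDUPE_LEN ?
			       BTRFS_MAX_DEDUPE_LEN : olen;

		ret = dedupe_one_chunk(loff, dst_loff, len);
		if (ret)
			break;
		loff += len;
		dst_loff += len;
		olen -= len;
	}
	return ret;
}

int main(void)
{
	/* 40MiB request -> 16MiB + 16MiB + 8MiB */
	return dedupe_range_chunked(0, 1ULL << 30, 40ULL * 1024 * 1024);
}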
[PATCH 0/4] [RESEND] Btrfs: just bunch of patches to ioctl.c
The 1st patch removes the 16MiB restriction from the extent_same ioctl() by iterating over the passed range. I did not see much difference in performance, so it just removes a logical restriction.

Patches 2-3 update the defrag ioctl(): - Fix the bad behaviour of fully rewriting all compressed extents in the defrag range (that also makes autodefrag on a compressed fs much less expensive) - Allow userspace to specify NONE as the target compression type, which lets users uncompress files by defragmentation with btrfs-progs - Make the defrag ioctl understand the requested compression type and the current compression type of extents, so that btrfs fi def -rc becomes an idempotent operation. I.e. it is now possible to say "make all extents compressed with lzo", and btrfs will not recompress data that is already lzo compressed. Same for zlib, zstd, none. (The patch to btrfs-progs is in a PR on kdave's GitHub.)

The 4th patch reduces the size of struct btrfs_inode: btrfs_inode stores fields like prop_compress, defrag_compress and, after the 3rd patch, change_compress. They use unsigned as a type and take 12 bytes in sum. But change_compress is a bitflag, and prop_compress/defrag_compress only store the compression type, which currently uses values 0-3 out of 2^32-1. So declare bitfields for those variables and reduce the size of btrfs_inode: 1136 -> 1128.

Timofey Titovets (4): Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction Btrfs: make should_defrag_range() understand compressed extents Btrfs: allow btrfs_defrag_file() to uncompress files on defragmentation Btrfs: reduce size of struct btrfs_inode fs/btrfs/btrfs_inode.h | 5 +- fs/btrfs/inode.c | 4 +- fs/btrfs/ioctl.c | 203 +++-- 3 files changed, 133 insertions(+), 79 deletions(-) -- 2.15.1
[PATCH 3/4] [RESEND] Btrfs: allow btrfs_defrag_file() to uncompress files on defragmentation
Currently defrag ioctl only support recompress files with specified compression type. Allow set compression type to none, while call defrag, and use BTRFS_DEFRAG_RANGE_COMPRESS as flag, that user request change of compression type. Signed-off-by: Timofey Titovets --- fs/btrfs/btrfs_inode.h | 1 + fs/btrfs/inode.c | 4 ++-- fs/btrfs/ioctl.c | 17 ++--- 3 files changed, 13 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 63f0ccc92a71..9eb0c92ee4b4 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -187,6 +187,7 @@ struct btrfs_inode { * different from prop_compress and takes precedence if set */ unsigned defrag_compress; + unsigned change_compress; struct btrfs_delayed_node *delayed_node; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 46df5e2a64e7..7af8f1784788 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -412,8 +412,8 @@ static inline int inode_need_compress(struct inode *inode, u64 start, u64 end) if (btrfs_test_opt(fs_info, FORCE_COMPRESS)) return 1; /* defrag ioctl */ - if (BTRFS_I(inode)->defrag_compress) - return 1; + if (BTRFS_I(inode)->change_compress) + return BTRFS_I(inode)->defrag_compress; /* bad compression ratios */ if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) return 0; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index b29ea1f0f621..40f5e5678eac 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1276,7 +1276,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, unsigned long cluster = max_cluster; u64 new_align = ~((u64)SZ_128K - 1); struct page **pages = NULL; - bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS; + bool change_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS; if (isize == 0) return 0; @@ -1284,11 +1284,10 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (range->start >= isize) return -EINVAL; - if (do_compress) { + if (change_compress) { if (range->compress_type > BTRFS_COMPRESS_TYPES) return -EINVAL; - if (range->compress_type) - compress_type = range->compress_type; + compress_type = range->compress_type; } if (extent_thresh == 0) @@ -1363,7 +1362,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, extent_thresh, &last_len, &skip, -&defrag_end, do_compress, +&defrag_end, change_compress, compress_type)){ unsigned long next; /* @@ -1392,8 +1391,11 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, } inode_lock(inode); - if (do_compress) + if (change_compress) { + BTRFS_I(inode)->change_compress = change_compress; BTRFS_I(inode)->defrag_compress = compress_type; + } + ret = cluster_pages_for_defrag(inode, pages, i, cluster); if (ret < 0) { inode_unlock(inode); @@ -1449,8 +1451,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, ret = defrag_count; out_ra: - if (do_compress) { + if (change_compress) { inode_lock(inode); + BTRFS_I(inode)->change_compress = 0; BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE; inode_unlock(inode); } -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
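For the userspace side, uncompressing a file then becomes a matter of passing BTRFS_DEFRAG_RANGE_COMPRESS with compress_type set to none. Here is a hedged sketch of such a call, assuming the uapi header exposes struct btrfs_ioctl_defrag_range_args and BTRFS_IOC_DEFRAG_RANGE as current kernels do, and that a kernel with this series applied interprets compress_type 0 as "uncompress"; error handling is trimmed.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_defrag_range_args range;
	int fd, ret;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&range, 0, sizeof(range));
	range.len = (__u64)-1;				/* whole file */
	range.flags = BTRFS_DEFRAG_RANGE_COMPRESS;	/* change compression */
	range.compress_type = 0;			/* 0 == no compression */

	ret = ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &range);
	if (ret < 0)
		perror("BTRFS_IOC_DEFRAG_RANGE");
	close(fd);
	return ret < 0;
}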
Re: Recovery from full metadata with all device space consumed?
2018-04-20 1:08 GMT+03:00 Drew Bloechl : > I've got a btrfs filesystem that I can't seem to get back to a useful > state. The symptom I started with is that rename() operations started > dying with ENOSPC, and it looks like the metadata allocation on the > filesystem is full: > > # btrfs fi df /broken > Data, RAID0: total=3.63TiB, used=67.00GiB > System, RAID1: total=8.00MiB, used=224.00KiB > Metadata, RAID1: total=3.00GiB, used=2.50GiB > GlobalReserve, single: total=512.00MiB, used=0.00B > > All of the consumable space on the backing devices also seems to be in > use: > > # btrfs fi show /broken > Label: 'mon_data' uuid: 85e52555-7d6d-4346-8b37-8278447eb590 > Total devices 4 FS bytes used 69.50GiB > devid1 size 931.51GiB used 931.51GiB path /dev/sda1 > devid2 size 931.51GiB used 931.51GiB path /dev/sdb1 > devid3 size 931.51GiB used 931.51GiB path /dev/sdc1 > devid4 size 931.51GiB used 931.51GiB path /dev/sdd1 > > Even the smallest balance operation I can start fails (this doesn't > change even with an extra temporary device added to the filesystem): > > # btrfs balance start -v -dusage=1 /broken > Dumping filters: flags 0x1, state 0x0, force is off > DATA (flags 0x2): balancing, usage=1 > ERROR: error during balancing '/broken': No space left on device > There may be more info in syslog - try dmesg | tail > # dmesg | tail -1 > [11554.296805] BTRFS info (device sdc1): 757 enospc errors during > balance > > The current kernel is 4.15.0 from Debian's stretch-backports > (specifically linux-image-4.15.0-0.bpo.2-amd64), but it was Debian's > 4.9.30 when the filesystem got into this state. I upgraded it in the > hopes that a newer kernel would be smarter, but no dice. > > btrfs-progs is currently at v4.7.3. > > Most of what this filesystem stores is Prometheus 1.8's TSDB for its > metrics, which are constantly written at around 50MB/second. The > filesystem never really gets full as far as data goes, but there's a lot > of never-ending churn for what data is there. > > Question 1: Are there other steps that can be tried to rescue a > filesystem in this state? I still have it mounted in the same state, and > I'm willing to try other things or extract debugging info. > > Question 2: Is there something I could have done to prevent this from > happening in the first place? > > Thanks! > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Not sure why this happening, but if you stuck at that state: - Reboot to ensure no other problems will exists - Add any other external device temporary to FS, as example zram. After you free small part of fs, delete external dev from FS and continue balance chunks. Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3] Btrfs: enhance raid1/10 balance heuristic
Gentle ping. 2018-01-03 0:23 GMT+03:00 Timofey Titovets : > 2018-01-02 21:31 GMT+03:00 Liu Bo : >> On Sat, Dec 30, 2017 at 11:32:04PM +0300, Timofey Titovets wrote: >>> Currently btrfs raid1/10 balancer bаlance requests to mirrors, >>> based on pid % num of mirrors. >>> >>> Make logic understood: >>> - if one of underline devices are non rotational >>> - Queue leght to underline devices >>> >>> By default try use pid % num_mirrors guessing, but: >>> - If one of mirrors are non rotational, repick optimal to it >>> - If underline mirror have less queue leght then optimal, >>>repick to that mirror >>> >>> For avoid round-robin request balancing, >>> lets round down queue leght: >>> - By 8 for rotational devs >>> - By 2 for all non rotational devs >>> >> >> Sorry for making a late comment on v3. >> >> It's good to choose non-rotational if it could. >> >> But I'm not sure whether it's a good idea to guess queue depth here >> because filesystem is still at a high position of IO stack. It'd >> probably get good results when running tests, but in practical mixed >> workloads, the underlying queue depth will be changing all the time. > > First version supposed for SSD, SSD + HDD only cases. > At that version that just a "attempt", make LB on hdd. > That can be easy dropped, if we decide that's a bad behaviour. > > If i understood correctly, which counters used, > we check count of I/O ops that device processing currently > (i.e. after merging & etc), > not queue what not send (i.e. before merging & etc). > > i.e. we guessed based on low level block io stuff. > As example that not work on zram devs (AFAIK, as zram don't have that > counters). > > So, no matter at which level we check that. > >> In fact, I think for rotational disks, more merging and less seeking >> make more sense, even in raid1/10 case. >> >> Thanks, >> >> -liubo > > queue_depth changing must not create big problems there, > i.e. round_down must make all changes "equal". > > For hdd, if we have a "big" (8..16?) queue depth, > with high probability that hdd overloaded, > and if other hdd have much less load > (may be instead of round_down, that better use abs diff > 8) > we try to requeue requests to other hdd. > > That will not show true equal distribution, but in case where > one disks have more load, and pid based mechanism fail to make LB, > we will just move part of load to other hdd. > > Until load distribution will not changed. > > May be for HDD that need to make threshold more aggressive, like 16 > (i.e. afaik SATA drives have hw rq len 31, so just use half of that). > > Thanks. 
> >>> Changes: >>> v1 -> v2: >>> - Use helper part_in_flight() from genhd.c >>> to get queue lenght >>> - Move guess code to guess_optimal() >>> - Change balancer logic, try use pid % mirror by default >>> Make balancing on spinning rust if one of underline devices >>> are overloaded >>> v2 -> v3: >>> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes >>> >>> Signed-off-by: Timofey Titovets >>> --- >>> block/genhd.c | 1 + >>> fs/btrfs/volumes.c | 115 >>> - >>> 2 files changed, 114 insertions(+), 2 deletions(-) >>> >>> diff --git a/block/genhd.c b/block/genhd.c >>> index 96a66f671720..a77426a7 100644 >>> --- a/block/genhd.c >>> +++ b/block/genhd.c >>> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct >>> hd_struct *part, >>> atomic_read(&part->in_flight[1]); >>> } >>> } >>> +EXPORT_SYMBOL_GPL(part_in_flight); >>> >>> struct hd_struct *__disk_get_part(struct gendisk *disk, int partno) >>> { >>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >>> index 49810b70afd3..a3b80ba31d4d 100644 >>> --- a/fs/btrfs/volumes.c >>> +++ b/fs/btrfs/volumes.c >>> @@ -27,6 +27,7 @@ >>> #include >>> #include >>> #include >>> +#include >>> #include >>> #include "ctree.h" >>> #include "extent_map.h" >>> @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info >>> *fs_info, u64 logical, u64 len) &
Re: [PATCH 0/4] Btrfs: just bunch of patches to ioctl.c
Gentle ping 2018-01-09 13:53 GMT+03:00 Timofey Titovets : > Gentle ping > > 2017-12-19 13:02 GMT+03:00 Timofey Titovets : >> 1st patch, remove 16MiB restriction from extent_same ioctl(), >> by doing iterations over passed range. >> >> I did not see much difference in performance, so it's just remove >> logic restriction. >> >> 2-3 pathes, update defrag ioctl(): >> - Fix bad behaviour with full rewriting all compressed >>extents in defrag range. (that also make autodefrag on compressed fs >>not so expensive) >> - Allow userspace specify NONE as target compression type, >>that allow users to uncompress files by defragmentation with btrfs-progs >> - Make defrag ioctl understood requested compression type and current >>compression type of extents, to make btrfs fi def -rc >>idempotent operation. >>i.e. now possible to say, make all extents compressed with lzo, >>and btrfs will not recompress lzo compressed data. >>Same for zlib, zstd, none. >>(patch to btrfs-progs in PR on kdave GitHub). >> >> 4th patch, reduce size of struct btrfs_inode >> - btrfs_inode store fields like: prop_compress, defrag_compress and >>after 3rd patch, change_compress. >>They use unsigned as a type, and use 12 bytes in sum. >>But change_compress is a bitflag, and prop_compress/defrag_compress >>only store compression type, that currently use 0-3 of 2^32-1. >> >>So, set a bitfields on that vars, and reduce size of btrfs_inode: >>1136 -> 1128. >> >> Timofey Titovets (4): >> Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction >> Btrfs: make should_defrag_range() understood compressed extents >> Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation >> Btrfs: reduce size of struct btrfs_inode >> >> fs/btrfs/btrfs_inode.h | 5 +- >> fs/btrfs/inode.c | 4 +- >> fs/btrfs/ioctl.c | 203 >> +++-- >> 3 files changed, 133 insertions(+), 79 deletions(-) >> >> -- >> 2.15.1 > > > > -- > Have a nice day, > Timofey. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: invalid file names, btrfs check can't repair it
2018-01-13 0:04 GMT+03:00 Sebastian Andrzej Siewior : > Hi, > > so I had bad memory and before I realized it and removed it btrfs took some > damage. Now I have this: > > |ls -lh crap/ > |ls: cannot access 'crap/2f3f379b2a3d7499471edb74869efe-1948311.d': No such > file or directory > |ls: cannot access 'crap/454bf066ddfbf42e0f3b77ea71c82f-878732.o': No such > file or directory > |total 0 > |-? ? ? ? ?? 2f3f379b2a3d7499471edb74869efe-1948311.d > |-? ? ? ? ?? 454bf066ddfbf42e0f3b77ea71c82f-878732.o > > and in dmesg I see: > > | BTRFS critical (device sda4): invalid dir item type: 33 > | BTRFS critical (device sda4): invalid dir item name len: 8231 > > `btrfs check' (from v4.14.1) finds them and prints them but has no idea > what to do with it. Would it be possible to let the check tool rename > the offended filename to something (like its inode number) put it in > lost+found if it has any data attached to it and otherwise simply remove > it? Right now I can't remove that folder. > > Sebastian > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Deletion: If that happens in subvol, you can create new subvol, reflink data and delete old vol. I don't know other ways to fix that entries. P.S. I have that issue without bad ram, but by some system hangs/resets (I've use notreelog as workaround for now). Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs subvolume mount with different options
2018-01-12 20:49 GMT+03:00 Konstantin V. Gavrilenko : > Hi list, > > just wondering whether it is possible to mount two subvolumes with different > mount options, i.e. > > | > |- /a defaults,compress-force=lza > | > |- /b defaults,nodatacow > > > since, when both subvolumes are mounted, and when I change the option for one > it is changed for all of them. > > > thanks in advance. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Not possible for now. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Recommendations for balancing as part of regular maintenance?
2018-01-10 21:33 GMT+03:00 Tom Worster : > On 10 Jan 2018, at 12:01, Austin S. Hemmelgarn wrote: > >> On 2018-01-10 11:30, Tom Worster wrote: >> >> Also, for future reference, the term we typically use is ENOSPC, as that's >> the symbolic name for the error code you get when this happens (or when your >> filesystem is just normally full), but I actually kind of like your name for >> it too, it conveys the exact condition being discussed in a way that should >> be a bit easier for non-technical types to understand. > > > Iiuc, ENOSPC is _exhaustion_ of unallocated space, which is a specific case > of depletion. > > I sought a term to refer to the phenomenon of unallocated space shrinking > beyond what filesystem use would demand and how it ratchets down. Hence a > sysop needs to manage DoUS. ENOSPC is likely a failure of such management. > > >>> - Some experienced users say that, to resolve a problem with DoUS, they >>> would rather recreate the filesystem than run balance. >> >> This is kind of independent of BTRFS. > > > Yes. I mentioned it only because it was, to me, a striking statement of lack > of confidence in balance. > > >>> But if Duncan is right (which, for me, is practically the same as >>> consensus on the proposition) that problems with corruption while running >>> balance are associated with heavy coincident IO activity, then I can see a >>> reasonable way forwards. I can even see how general recommendations for >>> BTRFS maintenance might develop. >> >> As I commented above, I would tend to believe Duncan is right in this case >> (both because it makes sense, and because he seems to generally be right >> about this type of thing). That said, I really do think that normal user >> I/O is probably not the issue, but low-level filesystem operations are. >> That said, there is no reason that BTRFS shouldn't either: >> 1. Handle this just fine without causing corruption. >> or: >> 2. Extend the mutex used to prevent concurrent balances to cover other >> operations that might cause issues (that is, make it so you can't scrub a >> filesystem while it's being balanced, or defragment it, or whatever else). > > > Yes, but backtracking a bit, I think there's another really important point > here. Assuming Duncan's right, it's not so hard to develop guidelines for > general BTRFS management that include DoUS among other topics. Duncan's > other email today contains or implies quite a lot of those guidelines. > > Or, to put it another way, it's enough for me. I think I know what to do > now. And that much could be written down for the benefit of others. > > Tom > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html My two cents, I've about ~50 different systems (VCS Systems, MySQL DB, Web Servers, Elastic Search nodes & etc.). All running btrfs only and run fine, even with auto snapshot rotating on some of them, (btrfs make my life easier and i like it). Most of them are small VMs From 3GiB..512GiB (I use compression everywhere). And no one of them need balance, only that i care, i try have always some unallocated space on it. Most of them are stuck with some used/allocated/unallocated ratio. I.e. as i see from conversation point of view. We run balance for reallocate data -> make more unallocated space, but if someone have plenty of it, that useless, no? ex. 
I have 60% allocated by data/metadata chunks on my notebook, and only 40% is really used by data; even when I have 90% allocated and 85% used, I don't run into ENOSPC problems (256GiB SSD). And if I run balance, I run it only to fight the btrfs discard processing bug, which leads to trimming only unallocated space (probably fixed already). So if we talk about "regular" running of balance, maybe it makes sense to check free space first, i.e. only if the system has some percentage of space allocated, like 80%, and has plenty of allocated but unused space, is balance actually needed, no? (I'm not saying that btrfs has no problems; I see some rare hateful bugs on some systems, but most of them are internal btrfs problems or problems with how applications cooperate with btrfs.) Thanks. -- Have a nice day, Timofey.
Re: [PATCH 0/4] Btrfs: just bunch of patches to ioctl.c
Gentle ping 2017-12-19 13:02 GMT+03:00 Timofey Titovets : > 1st patch, remove 16MiB restriction from extent_same ioctl(), > by doing iterations over passed range. > > I did not see much difference in performance, so it's just remove > logic restriction. > > 2-3 pathes, update defrag ioctl(): > - Fix bad behaviour with full rewriting all compressed >extents in defrag range. (that also make autodefrag on compressed fs >not so expensive) > - Allow userspace specify NONE as target compression type, >that allow users to uncompress files by defragmentation with btrfs-progs > - Make defrag ioctl understood requested compression type and current >compression type of extents, to make btrfs fi def -rc >idempotent operation. >i.e. now possible to say, make all extents compressed with lzo, >and btrfs will not recompress lzo compressed data. >Same for zlib, zstd, none. >(patch to btrfs-progs in PR on kdave GitHub). > > 4th patch, reduce size of struct btrfs_inode > - btrfs_inode store fields like: prop_compress, defrag_compress and >after 3rd patch, change_compress. >They use unsigned as a type, and use 12 bytes in sum. >But change_compress is a bitflag, and prop_compress/defrag_compress >only store compression type, that currently use 0-3 of 2^32-1. > >So, set a bitfields on that vars, and reduce size of btrfs_inode: >1136 -> 1128. > > Timofey Titovets (4): > Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction > Btrfs: make should_defrag_range() understood compressed extents > Btrfs: allow btrfs_defrag_file() uncompress files on defragmentation > Btrfs: reduce size of struct btrfs_inode > > fs/btrfs/btrfs_inode.h | 5 +- > fs/btrfs/inode.c | 4 +- > fs/btrfs/ioctl.c | 203 > +++-- > 3 files changed, 133 insertions(+), 79 deletions(-) > > -- > 2.15.1 -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] generic/015: Change the test filesystem size to 101mb
2018-01-08 15:54 GMT+03:00 Qu Wenruo : > > > On 2018年01月08日 16:43, Nikolay Borisov wrote: >> This test has been failing for btrfs for quite some time, >> at least since 4.7. There are 2 implementation details of btrfs that >> it exposes: >> >> 1. Currently btrfs filesystem under 100mb are created in Mixed block >> group mode. Freespace accounting for it is not 100% accurate - I've >> observed around 100-200kb discrepancy between a newly created filesystem, >> then writing a file and deleting it and checking the free space. This >> falls within %3 and not %1 as hardcoded in the test. >> >> 2. BTRFS won't flush it's delayed allocation on file deletion if less >> than 32mb are deleted. On such files we need to perform sync (missing >> in the test) or wait until time elapses for transaction commit. > > I'm a little confused about the 32mb limit. > > My personal guess about the reason to delay space freeing would be: > 1) Performance >Btrfs tree operation (at least for write) is slow due to its tree >design. >So it may makes sense to delay space freeing. > >But in that case, 32MB may seems to small to really improve the >performance. (Max file extent size is 128M, delaying one item >deletion doesn't really improve performance) > > 2) To avoid later new allocation to rewrite the data. >It's possible that freed space of deleted inode A get allocated to >new file extents. And a power loss happens before we commit the >transaction. > >In that case, if everything else works fine, we should be reverted to >previous transaction where deleted inode A still exists. >But we lost its data, as its data is overwritten by other file >extents. And any read will just cause csum error. > >But in that case, there shouldn't be any 32MB limit, but all deletion >of orphan inodes should be delayed. > >And further more, this can be addressed using log tree, to log such >deletion so at recovery time, we just delete that inode. > > So I'm wonder if we can improve btrfs deletion behavior. > > >> >> Since mixed mode is somewhat deprecated and btrfs is not really intended >> to be used on really small devices let's just adjust the test to >> create a 101mb fs, which doesn't use mixed mode and really test >> freespace accounting. > > Despite of some btrfs related questions, I'm wondering if there is any > standard specifying (POSIX?) how a filesystem should behave when > unlinking a file. > > Should the space freeing be synchronized? And how should statfs report > available space? > > In short, I'm wondering if this test and its expected behavior is > generic enough for all filesystems. > > Thanks, > Qu > >> >> Signed-off-by: Nikolay Borisov >> --- >> tests/generic/015 | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/tests/generic/015 b/tests/generic/015 >> index 78f2b13..416c4ae 100755 >> --- a/tests/generic/015 >> +++ b/tests/generic/015 >> @@ -53,7 +53,7 @@ _supported_os Linux >> _require_scratch >> _require_no_large_scratch_dev >> >> -_scratch_mkfs_sized `expr 50 \* 1024 \* 1024` >/dev/null 2>&1 \ >> +_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \ >> || _fail "mkfs failed" >> _scratch_mount || _fail "mount failed" >> out=$SCRATCH_MNT/fillup.$$ >> > All fs, including btrfs (AFAIK) return unlink(), (if file not open) only then space has been freed. So free space after return of unlink() must be freed. Proofs: [1] [2] [3] [4] - Posix, looks like do not describe behaviour. 1. http://man7.org/linux/man-pages/man2/unlink.2.html 2. 
https://stackoverflow.com/questions/31448693/why-system-call-unlink-so-slow 3. https://www.spinics.net/lists/linux-btrfs/msg59901.html 4. https://www.unix.com/man-page/posix/1P/unlink/ Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
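To see what the test is really asserting, a small userspace probe can be used. This is a sketch only; whether the "freed" number has already caught up right after unlink(), without the sync(), is exactly the filesystem-specific detail discussed above.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/statfs.h>
#include <unistd.h>

static long long avail_kib(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) != 0) {
		perror("statfs");
		exit(1);
	}
	return (long long)st.f_bavail * st.f_bsize / 1024;
}

int main(void)
{
	static char buf[1 << 20];	/* 1MiB of data per write */
	const char *dir = ".";
	int fd, i;

	memset(buf, 0xab, sizeof(buf));
	printf("before: %lld KiB available\n", avail_kib(dir));

	fd = open("fillup.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < 40; i++)	/* ~40MiB, above the 32MiB threshold */
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			perror("write");
	fsync(fd);
	close(fd);
	printf("full:   %lld KiB available\n", avail_kib(dir));

	unlink("fillup.tmp");
	sync();				/* flush delayed freeing */
	printf("freed:  %lld KiB available\n", avail_kib(dir));
	return 0;
}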
Re: [PATCH 0/2] Remove custom crc32c init code from btrfs
2018-01-08 12:45 GMT+03:00 Nikolay Borisov : > So here is a small 2 patch set which removes btrfs' manual initialisation of > the lower level crc32c module. Explanation why is ok can be found in Patch > 2/2. > > Patch 1/2 just adds a function to the generic crc32c header which allows > querying the actual crc32c implementaiton used (i.e. software or > hw-accelerated) > to retain current btrfs behavior. This is mainly used for debugging purposes > and is independent. > > Nikolay Borisov (2): > libcrc32c: Add crc32c_impl function > btrfs: Remove custom crc32c init code > > fs/btrfs/Kconfig | 3 +-- > fs/btrfs/Makefile | 2 +- > fs/btrfs/check-integrity.c | 4 ++-- > fs/btrfs/ctree.h | 16 ++ > fs/btrfs/dir-item.c| 1 - > fs/btrfs/disk-io.c | 4 ++-- > fs/btrfs/extent-tree.c | 10 - > fs/btrfs/hash.c| 54 > -- > fs/btrfs/hash.h| 43 > fs/btrfs/inode-item.c | 1 - > fs/btrfs/inode.c | 1 - > fs/btrfs/props.c | 2 +- > fs/btrfs/send.c| 4 ++-- > fs/btrfs/super.c | 14 > fs/btrfs/tree-log.c| 2 +- > include/linux/crc32c.h | 1 + > lib/libcrc32c.c| 6 ++ > 17 files changed, 42 insertions(+), 126 deletions(-) > delete mode 100644 fs/btrfs/hash.c > delete mode 100644 fs/btrfs/hash.h > > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Reviewed-by: Timofey Titovets P.S. May that are overkill to remove hash.c completely? i.e. if we have a "plan" to support another hash algo, we still need some abstractions for that. Inband dedup don't touch hash.* so, no one else must be affected. Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
2017-12-20 0:23 GMT+03:00 Darrick J. Wong : > On Tue, Dec 19, 2017 at 01:02:44PM +0300, Timofey Titovets wrote: >> At now btrfs_dedupe_file_range() restricted to 16MiB range for >> limit locking time and memory requirement for dedup ioctl() >> >> For too big input range code silently set range to 16MiB >> >> Let's remove that restriction by do iterating over dedup range. >> That's backward compatible and will not change anything for request >> less then 16MiB. >> >> Changes: >> v1 -> v2: >> - Refactor btrfs_cmp_data_prepare and btrfs_extent_same >> - Store memory of pages array between iterations >> - Lock inodes once, not on each iteration >> - Small inplace cleanups > > /me wonders if you could take advantage of vfs_clone_file_prep_inodes, > which takes care of the content comparison (and flushing files, and inode > checks, etc.) ? > > (ISTR Qu Wenruo(??) or someone remarking that this might not work well > with btrfs locking model, but I could be mistaken about all that...) > > --D Sorry, not enough knowledge to give an authoritative answer. I can only say that, i try lightly test that, by add call before btrfs_extent_same() with inode_locks, at least that works. Thanks. >> >> Signed-off-by: Timofey Titovets >> --- >> fs/btrfs/ioctl.c | 160 >> --- >> 1 file changed, 94 insertions(+), 66 deletions(-) >> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >> index be5bd81b3669..45a47d0891fc 100644 >> --- a/fs/btrfs/ioctl.c >> +++ b/fs/btrfs/ioctl.c >> @@ -2965,8 +2965,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp) >> put_page(pg); >> } >> } >> - kfree(cmp->src_pages); >> - kfree(cmp->dst_pages); >> + >> + cmp->num_pages = 0; >> } >> >> static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, >> @@ -2974,41 +2974,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, >> u64 loff, >> u64 len, struct cmp_pages *cmp) >> { >> int ret; >> - int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; >> - struct page **src_pgarr, **dst_pgarr; >> - >> - /* >> - * We must gather up all the pages before we initiate our >> - * extent locking. We use an array for the page pointers. Size >> - * of the array is bounded by len, which is in turn bounded by >> - * BTRFS_MAX_DEDUPE_LEN. >> - */ >> - src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); >> - dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); >> - if (!src_pgarr || !dst_pgarr) { >> - kfree(src_pgarr); >> - kfree(dst_pgarr); >> - return -ENOMEM; >> - } >> - cmp->num_pages = num_pages; >> - cmp->src_pages = src_pgarr; >> - cmp->dst_pages = dst_pgarr; >> >> /* >>* If deduping ranges in the same inode, locking rules make it >> mandatory >>* to always lock pages in ascending order to avoid deadlocks with >>* concurrent tasks (such as starting writeback/delalloc). 
>>*/ >> - if (src == dst && dst_loff < loff) { >> - swap(src_pgarr, dst_pgarr); >> + if (src == dst && dst_loff < loff) >> swap(loff, dst_loff); >> - } >> >> - ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff); >> + cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; >> + >> + ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff); >> if (ret) >> goto out; >> >> - ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff); >> + ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, >> dst_loff); >> >> out: >> if (ret) >> @@ -3078,31 +3059,23 @@ static int extent_same_check_offsets(struct inode >> *inode, u64 off, u64 *plen, >> return 0; >> } >> >> -static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, >> - struct inode *dst, u64 dst_loff) >> +static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, >> +struct inode *dst, u64 dst_loff, >> +struct cmp_pages *cmp) >> { >> int ret; >> u64 len = ol
Re: [PATCH] Btrfs: replace raid56 stripe bubble sort with insert sort
2018-01-03 14:40 GMT+03:00 Filipe Manana : > On Thu, Dec 28, 2017 at 3:28 PM, Timofey Titovets > wrote: >> Insert sort are generaly perform better then bubble sort, >> by have less iterations on avarage. >> That version also try place element to right position >> instead of raw swap. >> >> I'm not sure how many stripes per bio raid56, >> btrfs try to store (and try to sort). > > If you don't know it, besides unlikely to be doing the best possible > thing here, you might actually make things worse or not offering any > benefit. IOW, you should know it for sure before submitting such > changes. > > You should know if the number of elements to sort is big enough such > that an insertion sort is faster than a bubble sort, and more > importantly, measure it and mention it in the changelog. > As it is, you are showing lack of understanding of the code and > component you are touching, and leaving many open questions such as > how faster this is, why insertion sort and not a > quick/merge/heap/whatever sort, etc. > -- > Filipe David Manana, > > “Whether you think you can, or you think you can't — you're right.”

Sorry, you are right, I should have done some tests and investigation before sending the patch. (I just tried to believe in some magic math things.)

The input size depends on the number of devices, so on small arrays, like 3-5 elements, there is no meaningful difference. Example: raid6 (with 4 disks) produces many stripe line addresses like:
1. 4641783808 4641849344 4641914880 18446744073709551614
2. 4641652736 4641718272 18446744073709551614 4641587200
3. 18446744073709551614 4636475392 4636540928 4636606464
4. 4641521664 18446744073709551614 4641390592 4641456128

For that count of elements any sorting algorithm will work fast enough. Let's consider those addresses as random non-repeating numbers. We can use a tool like Sound Of Sorting (SoS) to make some easy to interpret tests of the algorithms. (Sorry, no script to reproduce, as SoS does not provide a CLI; the numbers were gathered by hand by running SoS with different parameters.)

Table (source data points are also in the attachment):

Disks                     |   3 |    4 |    6 |    8 |    10 |    12 |    14 | AVG
Bubble, comparisons       |   3 |    6 |   15 |   28 |    45 |    66 |    91 | 36.2857142857143
Bubble, array accesses    | 7.8 | 18.2 | 45.8 | 81.8 | 133.4 | 192   | 268.6 | 106.8
Insertion, comparisons    | 2.8 |    5 | 11.6 |   17 |  28.6 |  39.4 |  55.2 | 22.8
Insertion, array accesses | 8.4 | 13.6 |   31 | 48.8 |  80.4 | 109.6 | 155.8 | 63.9428571428571

I.e. at sizes like 3-4 there is not much difference; insertion sort works faster on bigger arrays (up to 1.7x for a 14 disk array). Does that make sense? I think yes: in any case that is several dozen machine instructions which can be spent elsewhere.

P.S. Heap sort, which is also available in the kernel via sort(), has too much overhead for such a small number of devices, i.e. heap sort only shows a profit over insertion sort at 16+ cells in the array.

/* Snob mode on */ P.P.S. Heap sort and the like need additional memory, so they are useless to compare in our case, but they would work faster, of course. /* Snob mode off */

Thanks. -- Have a nice day, Timofey.

Bubble_vs_Insertion.ods Description: application/vnd.oasis.opendocument.spreadsheet
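For reference, here is what the "place the element into its slot instead of raw swaps" variant looks like as a standalone program, run on one of the 4-entry stripe lines quoted above. This is illustrative only and is not the kernel patch itself.

#include <stdio.h>
#include <stdint.h>

/*
 * Insertion sort over per-stripe physical addresses: shift larger
 * elements right and drop the new one directly into its slot, instead
 * of repeatedly swapping neighbours as bubble sort does.
 */
static void sort_stripes(uint64_t *addr, int n)
{
	for (int i = 1; i < n; i++) {
		uint64_t cur = addr[i];
		int j = i - 1;

		while (j >= 0 && addr[j] > cur) {
			addr[j + 1] = addr[j];
			j--;
		}
		addr[j + 1] = cur;
	}
}

int main(void)
{
	/* Values taken from one of the example stripe lines in the mail. */
	uint64_t addr[] = { 4641652736ULL, 4641718272ULL,
			    18446744073709551614ULL, 4641587200ULL };
	int n = sizeof(addr) / sizeof(addr[0]);

	sort_stripes(addr, n);
	for (int i = 0; i < n; i++)
		printf("%llu\n", (unsigned long long)addr[i]);
	return 0;
}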
Re: [PATCH V3] Btrfs: enhance raid1/10 balance heuristic
2018-01-02 21:31 GMT+03:00 Liu Bo : > On Sat, Dec 30, 2017 at 11:32:04PM +0300, Timofey Titovets wrote: >> Currently btrfs raid1/10 balancer bаlance requests to mirrors, >> based on pid % num of mirrors. >> >> Make logic understood: >> - if one of underline devices are non rotational >> - Queue leght to underline devices >> >> By default try use pid % num_mirrors guessing, but: >> - If one of mirrors are non rotational, repick optimal to it >> - If underline mirror have less queue leght then optimal, >>repick to that mirror >> >> For avoid round-robin request balancing, >> lets round down queue leght: >> - By 8 for rotational devs >> - By 2 for all non rotational devs >> > > Sorry for making a late comment on v3. > > It's good to choose non-rotational if it could. > > But I'm not sure whether it's a good idea to guess queue depth here > because filesystem is still at a high position of IO stack. It'd > probably get good results when running tests, but in practical mixed > workloads, the underlying queue depth will be changing all the time. First version supposed for SSD, SSD + HDD only cases. At that version that just a "attempt", make LB on hdd. That can be easy dropped, if we decide that's a bad behaviour. If i understood correctly, which counters used, we check count of I/O ops that device processing currently (i.e. after merging & etc), not queue what not send (i.e. before merging & etc). i.e. we guessed based on low level block io stuff. As example that not work on zram devs (AFAIK, as zram don't have that counters). So, no matter at which level we check that. > In fact, I think for rotational disks, more merging and less seeking > make more sense, even in raid1/10 case. > > Thanks, > > -liubo queue_depth changing must not create big problems there, i.e. round_down must make all changes "equal". For hdd, if we have a "big" (8..16?) queue depth, with high probability that hdd overloaded, and if other hdd have much less load (may be instead of round_down, that better use abs diff > 8) we try to requeue requests to other hdd. That will not show true equal distribution, but in case where one disks have more load, and pid based mechanism fail to make LB, we will just move part of load to other hdd. Until load distribution will not changed. May be for HDD that need to make threshold more aggressive, like 16 (i.e. afaik SATA drives have hw rq len 31, so just use half of that). Thanks. 
>> Changes: >> v1 -> v2: >> - Use helper part_in_flight() from genhd.c >> to get queue lenght >> - Move guess code to guess_optimal() >> - Change balancer logic, try use pid % mirror by default >> Make balancing on spinning rust if one of underline devices >> are overloaded >> v2 -> v3: >> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes >> >> Signed-off-by: Timofey Titovets >> --- >> block/genhd.c | 1 + >> fs/btrfs/volumes.c | 115 >> - >> 2 files changed, 114 insertions(+), 2 deletions(-) >> >> diff --git a/block/genhd.c b/block/genhd.c >> index 96a66f671720..a77426a7 100644 >> --- a/block/genhd.c >> +++ b/block/genhd.c >> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct >> hd_struct *part, >> atomic_read(&part->in_flight[1]); >> } >> } >> +EXPORT_SYMBOL_GPL(part_in_flight); >> >> struct hd_struct *__disk_get_part(struct gendisk *disk, int partno) >> { >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index 49810b70afd3..a3b80ba31d4d 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -27,6 +27,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include "ctree.h" >> #include "extent_map.h" >> @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info >> *fs_info, u64 logical, u64 len) >> return ret; >> } >> >> +/** >> + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev >> + * >> + * @bdev: target bdev >> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 >> + */ >> +static int bdev_get_queue_len(struct block_device *bdev, int round_down) >> +{ >> + int sum; >> + struct hd_struct *bd_part = bdev->bd_part; >> + struct request_queue *rq = bdev_get_queue(bdev); >> + uint32_t infligh
Re: [PATCH 1/2] Btrfs: heuristic: replace workspace managment code by mempool API
2017-12-24 7:55 GMT+03:00 Timofey Titovets : > Currently compression code have custom workspace/memory cache > for guarantee forward progress on high memory pressure. > > That api can be replaced with mempool API, which can guarantee the same. > Main goal is simplify/cleanup code and replace it with general solution. > > I try avoid use of atomic/lock/wait stuff, > as that all already hidden in mempool API. > Only thing that must be racy safe is initialization of > mempool. > > So i create simple mempool_alloc_wrap, which will handle > mempool_create failures, and sync threads work by cmpxchg() > on mempool_t pointer. > > Another logic difference between our custom stuff and mempool: > - ws find/free mosly reuse current workspaces whenever possible. > - mempool use alloc/free of provided helpers with more >aggressive use of __GFP_NOMEMALLOC, __GFP_NORETRY, GFP_NOWARN, >and only use already preallocated space when memory get tight. > > Not sure which approach are better, but simple stress tests with > writing stuff on compressed fs on ramdisk show negligible difference on > 8 CPU Virtual Machine with Intel Xeon E5-2420 0 @ 1.90GHz (+-1%). > > Other needed changes to use mempool: > - memalloc_nofs_{save,restore} move to each place where kvmalloc >will be used in call chain. > - mempool_create return pointer to mampool or NULL, > no error, so macros like IS_ERR(ptr) can't be used. > > Signed-off-by: Timofey Titovets > --- > fs/btrfs/compression.c | 197 > ++--- > 1 file changed, 106 insertions(+), 91 deletions(-) > > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index 208334aa6c6e..02bd60357f04 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -34,6 +34,7 @@ > #include > #include > #include > +#include > #include "ctree.h" > #include "disk-io.h" > #include "transaction.h" > @@ -768,46 +769,46 @@ struct heuristic_ws { > struct bucket_item *bucket; > /* Sorting buffer */ > struct bucket_item *bucket_b; > - struct list_head list; > }; > > -static void free_heuristic_ws(struct list_head *ws) > +static void heuristic_ws_free(void *element, void *pool_data) > { > - struct heuristic_ws *workspace; > + struct heuristic_ws *ws = (struct heuristic_ws *) element; > > - workspace = list_entry(ws, struct heuristic_ws, list); > - > - kvfree(workspace->sample); > - kfree(workspace->bucket); > - kfree(workspace->bucket_b); > - kfree(workspace); > + kfree(ws->sample); > + kfree(ws->bucket); > + kfree(ws->bucket_b); > + kfree(ws); > } > > -static struct list_head *alloc_heuristic_ws(void) > +static void *heuristic_ws_alloc(gfp_t gfp_mask, void *pool_data) > { > - struct heuristic_ws *ws; > + struct heuristic_ws *ws = kzalloc(sizeof(*ws), gfp_mask); > > - ws = kzalloc(sizeof(*ws), GFP_KERNEL); > if (!ws) > - return ERR_PTR(-ENOMEM); > + return NULL; > > - ws->sample = kvmalloc(MAX_SAMPLE_SIZE, GFP_KERNEL); > + /* > +* We can handle allocation failures and > +* slab have caches for 8192 byte allocations > +*/ > + ws->sample = kmalloc(MAX_SAMPLE_SIZE, gfp_mask); > if (!ws->sample) > goto fail; > > - ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), GFP_KERNEL); > + ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), gfp_mask); > if (!ws->bucket) > goto fail; > > - ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), > GFP_KERNEL); > + ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), gfp_mask); > if (!ws->bucket_b) > goto fail; > > - INIT_LIST_HEAD(&ws->list); > - return &ws->list; > + return ws; > + > fail: > - free_heuristic_ws(&ws->list); > - 
return ERR_PTR(-ENOMEM); > + heuristic_ws_free(ws, NULL); > + return NULL; > } > > struct workspaces_list { > @@ -821,9 +822,12 @@ struct workspaces_list { > wait_queue_head_t ws_wait; > }; > > -static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES]; > +struct workspace_stor { > + mempool_t *pool; > +}; > > -static struct workspaces_list btrfs_heuristic_ws; > +static struct workspace_stor btrfs_heuristic_ws_stor; > +static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES]; > > s
[PATCH V3] Btrfs: enhance raid1/10 balance heuristic
Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue leght to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue leght then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue leght: - By 8 for rotational devs - By 2 for all non rotational devs Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue lenght - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded v2 -> v3: - Fix arg for RAID10 - use sub_stripes, instead of num_stripes Signed-off-by: Timofey Titovets --- block/genhd.c | 1 + fs/btrfs/volumes.c | 115 - 2 files changed, 114 insertions(+), 2 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 96a66f671720..a77426a7 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); struct hd_struct *__disk_get_part(struct gendisk *disk, int partno) { diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 49810b70afd3..a3b80ba31d4d 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev + * + * @bdev: target bdev + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + + sum = max_t(uint32_t, inflight[0], inflight[1]); + + /* +* Try prevent switch for every sneeze +* By roundup output num by some value +*/ + return ALIGN_DOWN(sum, round_down); +} + +/** + * guess_optimal - return guessed optimal mirror + * + * Optimal expected to be pid % num_stripes + * + * That's generaly ok for spread load + * Add some balancer based on queue leght to device + * + * Basic ideas: + * - Sequential read generate low amount of request + *so if load of drives are equal, use pid % num_stripes balancing + * - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal + *and repick if other dev have "significant" less queue lenght + * - Repick optimal if queue leght of other mirror are less + */ +static int guess_optimal(struct map_lookup *map, int num, int optimal) +{ + int i; + int round_down = 8; + int qlen[num]; + bool is_nonrot[num]; + bool all_bdev_nonrot = true; + bool all_bdev_rotate = true; + struct block_device *bdev; + + if (num == 1) + return optimal; + + /* Check accessible bdevs */ + for (i = 0; i < num; i++) { + /* Init for missing bdevs */ + is_nonrot[i] = false; + qlen[i] = INT_MAX; + bdev = map->stripes[i].dev->bdev; + if (bdev) { + qlen[i] = 0; + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev)); + if (is_nonrot[i]) + all_bdev_rotate = false; + else + all_bdev_nonrot = false; + } + } + + /* +* Don't bother with computation +* if only one 
of two bdevs are accessible +*/ + if (num == 2 && qlen[0] != qlen[1]) { + if (qlen[0] < qlen[1]) + return 0; + else + return 1; + } + + if (all_bdev_nonrot) + round_down = 2; + + for (i = 0; i < num; i++) { + if (qlen[i]) + continue; + bdev = map->stripes[i].dev->bdev; + qlen[i] = bdev_get_queue_len(bdev, round_down); + } + + /* For mixed case, pick non rotational dev as optimal */ + if (all_bdev_rotate == all_bdev_nonrot) { + for (i = 0; i < num; i++) { + if (is_nonrot[i]) +
Re: [PATCH v2] Btrfs: enhance raid1/10 balance heuristic
2017-12-30 11:14 GMT+03:00 Dmitrii Tcvetkov : > On Sat, 30 Dec 2017 03:15:20 +0300 > Timofey Titovets wrote: > >> 2017-12-29 22:14 GMT+03:00 Dmitrii Tcvetkov : >> > On Fri, 29 Dec 2017 21:44:19 +0300 >> > Dmitrii Tcvetkov wrote: >> >> > +/** >> >> > + * guess_optimal - return guessed optimal mirror >> >> > + * >> >> > + * Optimal expected to be pid % num_stripes >> >> > + * >> >> > + * That's generaly ok for spread load >> >> > + * Add some balancer based on queue leght to device >> >> > + * >> >> > + * Basic ideas: >> >> > + * - Sequential read generate low amount of request >> >> > + *so if load of drives are equal, use pid % num_stripes >> >> > balancing >> >> > + * - For mixed rotate/non-rotate mirrors, pick non-rotate as >> >> > optimal >> >> > + *and repick if other dev have "significant" less queue >> >> > lenght >> >> > + * - Repick optimal if queue leght of other mirror are less >> >> > + */ >> >> > +static int guess_optimal(struct map_lookup *map, int optimal) >> >> > +{ >> >> > + int i; >> >> > + int round_down = 8; >> >> > + int num = map->num_stripes; >> >> >> >> num has to be initialized from map->sub_stripes if we're reading >> >> RAID10, otherwise there will be NULL pointer dereference >> >> >> > >> > Check can be like: >> > if (map->type & BTRFS_BLOCK_GROUP_RAID10) >> > num = map->sub_stripes; >> > >> >>@@ -5804,10 +5914,12 @@ static int __btrfs_map_block(struct >> >>btrfs_fs_info *fs_info, >> >> stripe_index += mirror_num - 1; >> >> else { >> >> int old_stripe_index = stripe_index; >> >>+ optimal = guess_optimal(map, >> >>+ current->pid % >> >>map->num_stripes); >> >> stripe_index = find_live_mirror(fs_info, map, >> >> stripe_index, >> >> map->sub_stripes, >> >> stripe_index + >> >>-current->pid % >> >>map->sub_stripes, >> >>+optimal, >> >> dev_replace_is_ongoing); >> >> mirror_num = stripe_index - old_stripe_index >> >> + 1; } >> >>-- >> >>2.15.1 >> > >> > Also here calculation should be with map->sub_stripes too. >> > -- >> > To unsubscribe from this list: send the line "unsubscribe >> > linux-btrfs" in the body of a message to majord...@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> Why you think we need such check? >> I.e. guess_optimal always called for find_live_mirror() >> Both in same context, like that: >> >> if (map->type & BTRFS_BLOCK_GROUP_RAID10) { >> u32 factor = map->num_stripes / map->sub_stripes; >> >> stripe_nr = div_u64_rem(stripe_nr, factor, &stripe_index); >> stripe_index *= map->sub_stripes; >> >> if (need_full_stripe(op)) >> num_stripes = map->sub_stripes; >> else if (mirror_num) >> stripe_index += mirror_num - 1; >> else { >> int old_stripe_index = stripe_index; >> stripe_index = find_live_mirror(fs_info, map, >> stripe_index, >> map->sub_stripes, stripe_index + >> current->pid % map->sub_stripes, >> dev_replace_is_ongoing); >> mirror_num = stripe_index - old_stripe_index + 1; >> } >> >> That useless to check that internally > > My bad, so only need to call > guess_optimal(map, current->pid % map->sub_stripes) > in RAID10 branch. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Yes, my bad, copy-paste error, will be fixed in v3 Thanks -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: enhance raid1/10 balance heuristic
2017-12-29 22:14 GMT+03:00 Dmitrii Tcvetkov : > On Fri, 29 Dec 2017 21:44:19 +0300 > Dmitrii Tcvetkov wrote: >> > +/** >> > + * guess_optimal - return guessed optimal mirror >> > + * >> > + * Optimal expected to be pid % num_stripes >> > + * >> > + * That's generaly ok for spread load >> > + * Add some balancer based on queue leght to device >> > + * >> > + * Basic ideas: >> > + * - Sequential read generate low amount of request >> > + *so if load of drives are equal, use pid % num_stripes >> > balancing >> > + * - For mixed rotate/non-rotate mirrors, pick non-rotate as >> > optimal >> > + *and repick if other dev have "significant" less queue lenght >> > + * - Repick optimal if queue leght of other mirror are less >> > + */ >> > +static int guess_optimal(struct map_lookup *map, int optimal) >> > +{ >> > + int i; >> > + int round_down = 8; >> > + int num = map->num_stripes; >> >> num has to be initialized from map->sub_stripes if we're reading >> RAID10, otherwise there will be NULL pointer dereference >> > > Check can be like: > if (map->type & BTRFS_BLOCK_GROUP_RAID10) > num = map->sub_stripes; > >>@@ -5804,10 +5914,12 @@ static int __btrfs_map_block(struct >>btrfs_fs_info *fs_info, >> stripe_index += mirror_num - 1; >> else { >> int old_stripe_index = stripe_index; >>+ optimal = guess_optimal(map, >>+ current->pid % >>map->num_stripes); >> stripe_index = find_live_mirror(fs_info, map, >> stripe_index, >> map->sub_stripes, >> stripe_index + >>-current->pid % >>map->sub_stripes, >>+optimal, >> dev_replace_is_ongoing); >> mirror_num = stripe_index - old_stripe_index >> + 1; } >>-- >>2.15.1 > > Also here calculation should be with map->sub_stripes too. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Why you think we need such check? I.e. 
guess_optimal always called for find_live_mirror() Both in same context, like that: if (map->type & BTRFS_BLOCK_GROUP_RAID10) { u32 factor = map->num_stripes / map->sub_stripes; stripe_nr = div_u64_rem(stripe_nr, factor, &stripe_index); stripe_index *= map->sub_stripes; if (need_full_stripe(op)) num_stripes = map->sub_stripes; else if (mirror_num) stripe_index += mirror_num - 1; else { int old_stripe_index = stripe_index; stripe_index = find_live_mirror(fs_info, map, stripe_index, map->sub_stripes, stripe_index + current->pid % map->sub_stripes, dev_replace_is_ongoing); mirror_num = stripe_index - old_stripe_index + 1; } That useless to check that internally --- Also, fio results for all hdd raid1, results from waxhead: Original: Disk-4k-randread-depth-32: (g=0): rw=randread, bs=(R) 4096B-512KiB, (W) 4096B-512KiB, (T) 4096B-512KiB, ioengine=libaio, iodepth=32 Disk-4k-read-depth-8: (g=0): rw=read, bs=(R) 4096B-512KiB, (W) 4096B-512KiB, (T) 4096B-512KiB, ioengine=libaio, iodepth=8 Disk-4k-randwrite-depth-8: (g=0): rw=randwrite, bs=(R) 4096B-512KiB, (W) 4096B-512KiB, (T) 4096B-512KiB, ioengine=libaio, iodepth=8 fio-3.1 Starting 3 processes Disk-4k-randread-depth-32: Laying out IO file (1 file / 65536MiB) Jobs: 3 (f=3): [r(1),R(1),w(1)][100.0%][r=120MiB/s,w=9.88MiB/s][r=998,w=96 IOPS][eta 00m:00s] Disk-4k-randread-depth-32: (groupid=0, jobs=1): err= 0: pid=3132: Fri Dec 29 16:16:33 2017 read: IOPS=375, BW=41.3MiB/s (43.3MB/s)(24.2GiB/600128msec) slat (usec): min=15, max=206039, avg=88.71, stdev=990.35 clat (usec): min=357, max=3487.1k, avg=85022.93, stdev=141872.25 lat (usec): min=399, max=3487.2k, avg=85112.58, stdev=141880.31 clat percentiles (msec): | 1.00th=[5], 5.00th=[7], 10.00th=[9], 20.00th=[ 13], | 30.00th=[ 19], 40.00th=[ 27], 50.00th=[ 39], 60.00th=[ 56], | 70.00th=[ 83], 80.00th=[ 127], 90.00th=[ 209], 95.00th=[ 300], | 99.00th=[ 600], 99.50th=[ 852], 99.90th=[ 1703], 99.95th=[ 2165], | 99.99th=[ 2937] bw ( KiB/s): min= 392, max=75824, per=30.46%, avg=42736.09, stdev=12019.09, samples=1186 iops: min=3, max= 500, avg=380.24, stdev=99.50, samples=1186 lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.29%, 10=12.33%, 20=19.67%, 50=24.92% lat (msec) : 100=17.51%, 250=18.05%, 500=5.72%, 750=0.85%, 1000=0.28% lat (msec) : 2000=0.29%, >=2000=0.07% cpu : usr=0.67%, sys=4.62%, ctx=215716, majf=0, minf=526 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8
[PATCH v2] Btrfs: enhance raid1/10 balance heuristic
Currently btrfs raid1/10 balancer bаlance requests to mirrors, based on pid % num of mirrors. Make logic understood: - if one of underline devices are non rotational - Queue leght to underline devices By default try use pid % num_mirrors guessing, but: - If one of mirrors are non rotational, repick optimal to it - If underline mirror have less queue leght then optimal, repick to that mirror For avoid round-robin request balancing, lets round down queue leght: - By 8 for rotational devs - By 2 for all non rotational devs Changes: v1 -> v2: - Use helper part_in_flight() from genhd.c to get queue lenght - Move guess code to guess_optimal() - Change balancer logic, try use pid % mirror by default Make balancing on spinning rust if one of underline devices are overloaded Signed-off-by: Timofey Titovets --- block/genhd.c | 1 + fs/btrfs/volumes.c | 116 - 2 files changed, 115 insertions(+), 2 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 96a66f671720..a77426a7 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part, atomic_read(&part->in_flight[1]); } } +EXPORT_SYMBOL_GPL(part_in_flight); struct hd_struct *__disk_get_part(struct gendisk *disk, int partno) { diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 9a04245003ab..1c84534df9a5 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include "ctree.h" #include "extent_map.h" @@ -5216,6 +5217,112 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +/** + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev + * + * @bdev: target bdev + * @round_down: round factor big for hdd and small for ssd, like 8 and 2 + */ +static int bdev_get_queue_len(struct block_device *bdev, int round_down) +{ + int sum; + struct hd_struct *bd_part = bdev->bd_part; + struct request_queue *rq = bdev_get_queue(bdev); + uint32_t inflight[2] = {0, 0}; + + part_in_flight(rq, bd_part, inflight); + + sum = max_t(uint32_t, inflight[0], inflight[1]); + + /* +* Try prevent switch for every sneeze +* By roundup output num by some value +*/ + return ALIGN_DOWN(sum, round_down); +} + +/** + * guess_optimal - return guessed optimal mirror + * + * Optimal expected to be pid % num_stripes + * + * That's generaly ok for spread load + * Add some balancer based on queue leght to device + * + * Basic ideas: + * - Sequential read generate low amount of request + *so if load of drives are equal, use pid % num_stripes balancing + * - For mixed rotate/non-rotate mirrors, pick non-rotate as optimal + *and repick if other dev have "significant" less queue lenght + * - Repick optimal if queue leght of other mirror are less + */ +static int guess_optimal(struct map_lookup *map, int optimal) +{ + int i; + int round_down = 8; + int num = map->num_stripes; + int qlen[num]; + bool is_nonrot[num]; + bool all_bdev_nonrot = true; + bool all_bdev_rotate = true; + struct block_device *bdev; + + if (num == 1) + return optimal; + + /* Check accessible bdevs */ + for (i = 0; i < num; i++) { + /* Init for missing bdevs */ + is_nonrot[i] = false; + qlen[i] = INT_MAX; + bdev = map->stripes[i].dev->bdev; + if (bdev) { + qlen[i] = 0; + is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev)); + if (is_nonrot[i]) + all_bdev_rotate = false; + else + all_bdev_nonrot = false; + } + } + + /* +* Don't bother with computation +* if only one of two bdevs are accessible +*/ + if (num == 2 && 
qlen[0] != qlen[1]) { + if (qlen[0] < qlen[1]) + return 0; + else + return 1; + } + + if (all_bdev_nonrot) + round_down = 2; + + for (i = 0; i < num; i++) { + if (qlen[i]) + continue; + bdev = map->stripes[i].dev->bdev; + qlen[i] = bdev_get_queue_len(bdev, round_down); + } + + /* For mixed case, pick non rotational dev as optimal */ + if (all_bdev_rotate == all_bdev_nonrot) { + for (i = 0; i < num; i++) { + if (is_nonrot[i]) + optimal = i; + } +
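A compact userspace sketch of the hysteresis the commit message describes: queue lengths are compared only after rounding them down, so the optimal mirror is not switched "for every sneeze". The helper below is a stand-in for the kernel's ALIGN_DOWN and assumes a power-of-two rounding factor, as in the patch (8 for rotational devices, 2 when all mirrors are non-rotational):

/*
 * Userspace sketch, not kernel code. ALIGN_DOWN here assumes a
 * power-of-two factor, like the kernel macro it imitates.
 */
#include <stdio.h>

#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

/* Repick the other mirror only if its rounded queue length is smaller. */
static int should_repick(int qlen_optimal, int qlen_other, int round)
{
	return ALIGN_DOWN(qlen_other, round) < ALIGN_DOWN(qlen_optimal, round);
}

int main(void)
{
	printf("%d\n", should_repick(10, 9, 8));	/* 8 vs 8 -> 0, stay */
	printf("%d\n", should_repick(10, 1, 8));	/* 0 < 8 -> 1, switch */
	printf("%d\n", should_repick(3, 1, 2));		/* 0 < 2 -> 1, switch */
	return 0;
}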
Re: [PATCH] Btrfs: enhance raid1/10 balance heuristic for non-rotating devices
2017-12-28 11:06 GMT+03:00 Dmitrii Tcvetkov : > On Thu, 28 Dec 2017 01:39:31 +0300 > Timofey Titovets wrote: > >> Currently btrfs raid1/10 balancer blance requests to mirrors, >> based on pid % num of mirrors. >> >> Update logic and make it understood if underline device are non rotational. >> >> If one of mirrors are non rotational, then all read requests will be moved to >> non rotational device. >> >> If both of mirrors are non rotational, calculate sum of >> pending and in flight request for queue on that bdev and use >> device with least queue leght. >> >> P.S. >> Inspired by md-raid1 read balancing >> >> Signed-off-by: Timofey Titovets >> --- >> fs/btrfs/volumes.c | 59 >> ++ 1 file changed, 59 >> insertions(+) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index 9a04245003ab..98bc2433a920 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info >> *fs_info, u64 logical, u64 len) return ret; >> } >> >> +static inline int bdev_get_queue_len(struct block_device *bdev) >> +{ >> + int sum = 0; >> + struct request_queue *rq = bdev_get_queue(bdev); >> + >> + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC]; >> + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC]; >> + > > This won't work as expected if bdev is controlled by blk-mq, these > counters will be zero. AFAIK to get this info in block layer agnostic way > part_in_flight[1] has to be used. It extracts these counters approriately. > > But it needs to be EXPORT_SYMBOL()'ed in block/genhd.c so we can continue > to build btrfs as module. > >> + /* >> + * Try prevent switch for every sneeze >> + * By roundup output num by 2 >> + */ >> + return ALIGN(sum, 2); >> +} >> + >> static int find_live_mirror(struct btrfs_fs_info *fs_info, >> struct map_lookup *map, int first, int num, >> int optimal, int dev_replace_is_ongoing) >> { >> int i; >> int tolerance; >> + struct block_device *bdev; >> struct btrfs_device *srcdev; >> + bool all_bdev_nonrot = true; >> >> if (dev_replace_is_ongoing && >> fs_info->dev_replace.cont_reading_from_srcdev_mode == >> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info >> *fs_info, else >> srcdev = NULL; >> >> + /* >> + * Optimal expected to be pid % num >> + * That's generaly ok for spinning rust drives >> + * But if one of mirror are non rotating, >> + * that bdev can show better performance >> + * >> + * if one of disks are non rotating: >> + * - set optimal to non rotating device >> + * if both disk are non rotating >> + * - set optimal to bdev with least queue >> + * If both disks are spinning rust: >> + * - leave old pid % nu, >> + */ >> + for (i = 0; i < num; i++) { >> + bdev = map->stripes[i].dev->bdev; >> + if (!bdev) >> + continue; >> + if (blk_queue_nonrot(bdev_get_queue(bdev))) >> + optimal = i; >> + else >> + all_bdev_nonrot = false; >> + } >> + >> + if (all_bdev_nonrot) { >> + int qlen; >> + /* Forse following logic choise by init with some big number >> */ >> + int optimal_dev_rq_count = 1 << 24; > > Probably better to use INT_MAX macro instead. > > [1] https://elixir.free-electrons.com/linux/v4.15-rc5/source/block/genhd.c#L68 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Thank you very much! -- Have a nice day, Timofey. 
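The review point above is that nr_rqs/in_flight are legacy-path counters and that blk-mq needs part_in_flight() instead. Not part of the patch, but for watching the same numbers from userspace while testing, recent kernels expose a per-disk "inflight" attribute in sysfs (this is an assumption about the running kernel; older ones may lack the file):

/* Reads /sys/block/<dev>/inflight: "<reads> <writes>" currently in flight. */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "sda";
	char path[128];
	unsigned long reads, writes;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/inflight", dev);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%lu %lu", &reads, &writes) != 2) {
		fprintf(stderr, "unexpected format in %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("%s: %lu reads + %lu writes in flight\n", dev, reads, writes);
	return 0;
}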
[PATCH] Btrfs: replace raid56 stripe bubble sort with insertion sort
Insert sort are generaly perform better then bubble sort, by have less iterations on avarage. That version also try place element to right position instead of raw swap. I'm not sure how many stripes per bio raid56, btrfs try to store (and try to sort). So, that a bit shorter just in the name of a great justice. Signed-off-by: Timofey Titovets --- fs/btrfs/volumes.c | 29 - 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 98bc2433a920..7195fc8c49b1 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5317,29 +5317,24 @@ static inline int parity_smaller(u64 a, u64 b) return a > b; } -/* Bubble-sort the stripe set to put the parity/syndrome stripes last */ +/* Insertion-sort the stripe set to put the parity/syndrome stripes last */ static void sort_parity_stripes(struct btrfs_bio *bbio, int num_stripes) { struct btrfs_bio_stripe s; - int i; + int i, j; u64 l; - int again = 1; - while (again) { - again = 0; - for (i = 0; i < num_stripes - 1; i++) { - if (parity_smaller(bbio->raid_map[i], - bbio->raid_map[i+1])) { - s = bbio->stripes[i]; - l = bbio->raid_map[i]; - bbio->stripes[i] = bbio->stripes[i+1]; - bbio->raid_map[i] = bbio->raid_map[i+1]; - bbio->stripes[i+1] = s; - bbio->raid_map[i+1] = l; - - again = 1; - } + for (i = 1; i < num_stripes; i++) { + s = bbio->stripes[i]; + l = bbio->raid_map[i]; + for (j = i - 1; j >= 0; j--) { + if (!parity_smaller(bbio->raid_map[j], l)) + break; + bbio->stripes[j+1] = bbio->stripes[j]; + bbio->raid_map[j+1] = bbio->raid_map[j]; } + bbio->stripes[j+1] = s; + bbio->raid_map[j+1] = l; } } -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
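For anyone who wants to sanity-check the loop outside the kernel, here is the same shift-based insertion sort reduced to bare keys (the sentinel values only stand in for the parity/syndrome entries of raid_map; this is an illustration, not the btrfs code path):

/*
 * Standalone check of the insertion-sort-with-shift pattern: the kernel
 * version moves bbio->stripes[] alongside bbio->raid_map[], this sketch
 * keeps only the keys.
 */
#include <stdio.h>
#include <stdint.h>

/* Same ordering rule as parity_smaller(): larger raid_map sorts later. */
static int parity_smaller(uint64_t a, uint64_t b)
{
	return a > b;
}

static void sort_keys(uint64_t *map, int n)
{
	int i, j;

	for (i = 1; i < n; i++) {
		uint64_t l = map[i];

		for (j = i - 1; j >= 0; j--) {
			if (!parity_smaller(map[j], l))
				break;
			map[j + 1] = map[j];	/* shift instead of swap */
		}
		map[j + 1] = l;
	}
}

int main(void)
{
	/* UINT64_MAX and UINT64_MAX - 1 stand in for the Q/P stripe markers. */
	uint64_t map[] = { UINT64_MAX, 4096, UINT64_MAX - 1, 0, 8192 };
	int n = sizeof(map) / sizeof(map[0]), i;

	sort_keys(map, n);
	for (i = 0; i < n; i++)
		printf("%llu\n", (unsigned long long)map[i]);
	return 0;
}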
Re: [PATCH] Btrfs: enhance raid1/10 balance heuristic for non-rotating devices
2017-12-28 3:44 GMT+03:00 Qu Wenruo : > > > On 2017年12月28日 06:39, Timofey Titovets wrote: >> Currently btrfs raid1/10 balancer blance requests to mirrors, >> based on pid % num of mirrors. >> >> Update logic and make it understood if underline device are non rotational. >> >> If one of mirrors are non rotational, then all read requests will be moved to >> non rotational device. >> >> If both of mirrors are non rotational, calculate sum of >> pending and in flight request for queue on that bdev and use >> device with least queue leght. >> >> P.S. >> Inspired by md-raid1 read balancing >> >> Signed-off-by: Timofey Titovets >> --- >> fs/btrfs/volumes.c | 59 >> ++ >> 1 file changed, 59 insertions(+) >> >> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c >> index 9a04245003ab..98bc2433a920 100644 >> --- a/fs/btrfs/volumes.c >> +++ b/fs/btrfs/volumes.c >> @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info >> *fs_info, u64 logical, u64 len) >> return ret; >> } >> >> +static inline int bdev_get_queue_len(struct block_device *bdev) >> +{ >> + int sum = 0; >> + struct request_queue *rq = bdev_get_queue(bdev); >> + >> + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC]; >> + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC]; >> + >> + /* >> + * Try prevent switch for every sneeze >> + * By roundup output num by 2 >> + */ >> + return ALIGN(sum, 2); >> +} >> + >> static int find_live_mirror(struct btrfs_fs_info *fs_info, >> struct map_lookup *map, int first, int num, >> int optimal, int dev_replace_is_ongoing) >> { >> int i; >> int tolerance; >> + struct block_device *bdev; >> struct btrfs_device *srcdev; >> + bool all_bdev_nonrot = true; >> >> if (dev_replace_is_ongoing && >> fs_info->dev_replace.cont_reading_from_srcdev_mode == >> @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info >> *fs_info, >> else >> srcdev = NULL; >> >> + /* >> + * Optimal expected to be pid % num >> + * That's generaly ok for spinning rust drives >> + * But if one of mirror are non rotating, >> + * that bdev can show better performance >> + * >> + * if one of disks are non rotating: >> + * - set optimal to non rotating device >> + * if both disk are non rotating >> + * - set optimal to bdev with least queue >> + * If both disks are spinning rust: >> + * - leave old pid % nu, > > And I'm wondering why this case can't use the same bdev queue length? > > Any special reason spinning disk can't benifit from a shorter queue? > > Thanks, > Qu I didn't have spinning rust to test it, But i expect that queue based balancing will kill sequential io balancing. (Also, it's better to balance by avg Latency per request, i think, but we just didn't have that property and need much more calculation) i.e. with spinning rust "true way" (in theory), is just trying to calculate where head at now. As example: based on last queryes, and send request to hdd which have a shorter path to blocks. That in theory will show best random read and sequential read, from hdd raid1 array. But for that we need some tracking of io queue: - Write it own, as done in mdraid and just believe no-one else will touch our disk - Make some analisys of queue linked to bdev, not sure if we have another way. In theory, user with that patch just can switch rotational to 0 on spinning rust, but that can lead to misbehaving of io scheduler.. so may be it's a bad idea to test that by flags. 
--- About benchmarks: Sorry, didn't have a real hardware to test, so i don't think it's representative, but: Fio config: [global] ioengine=libaio buffered=0 direct=1 bssplit=32k/100 size=1G directory=/mnt/ iodepth=16 time_based runtime=60 [test-fio] rw=randread VM KVM: - Debian 9.3 - Scheduler: noop - Image devid 1 on Notebook SSD. - Image devid 2 on Fast Enough USB Stick. - Both formatted to btrfs raid1. - Kernel patched 4.15-rc3 from misc-next kdave (that i have compiled..) - (I see same on backported 4.13 debian kernel) --- Pid choice image on SSD: test-fio: (g=0): rw=randread, bs=32K-32K/32K-32K/32K-32K, ioengine=libaio, iodepth=16 fio-2.16 Starting 1 process Jobs: 1 (f=1):
[PATCH] Btrfs: enhance raid1/10 balance heuristic for non-rotating devices
Currently btrfs raid1/10 balancer blance requests to mirrors, based on pid % num of mirrors. Update logic and make it understood if underline device are non rotational. If one of mirrors are non rotational, then all read requests will be moved to non rotational device. If both of mirrors are non rotational, calculate sum of pending and in flight request for queue on that bdev and use device with least queue leght. P.S. Inspired by md-raid1 read balancing Signed-off-by: Timofey Titovets --- fs/btrfs/volumes.c | 59 ++ 1 file changed, 59 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 9a04245003ab..98bc2433a920 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5216,13 +5216,30 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return ret; } +static inline int bdev_get_queue_len(struct block_device *bdev) +{ + int sum = 0; + struct request_queue *rq = bdev_get_queue(bdev); + + sum += rq->nr_rqs[BLK_RW_SYNC] + rq->nr_rqs[BLK_RW_ASYNC]; + sum += rq->in_flight[BLK_RW_SYNC] + rq->in_flight[BLK_RW_ASYNC]; + + /* +* Try prevent switch for every sneeze +* By roundup output num by 2 +*/ + return ALIGN(sum, 2); +} + static int find_live_mirror(struct btrfs_fs_info *fs_info, struct map_lookup *map, int first, int num, int optimal, int dev_replace_is_ongoing) { int i; int tolerance; + struct block_device *bdev; struct btrfs_device *srcdev; + bool all_bdev_nonrot = true; if (dev_replace_is_ongoing && fs_info->dev_replace.cont_reading_from_srcdev_mode == @@ -5231,6 +5248,48 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info, else srcdev = NULL; + /* +* Optimal expected to be pid % num +* That's generaly ok for spinning rust drives +* But if one of mirror are non rotating, +* that bdev can show better performance +* +* if one of disks are non rotating: +* - set optimal to non rotating device +* if both disk are non rotating +* - set optimal to bdev with least queue +* If both disks are spinning rust: +* - leave old pid % nu, +*/ + for (i = 0; i < num; i++) { + bdev = map->stripes[i].dev->bdev; + if (!bdev) + continue; + if (blk_queue_nonrot(bdev_get_queue(bdev))) + optimal = i; + else + all_bdev_nonrot = false; + } + + if (all_bdev_nonrot) { + int qlen; + /* Forse following logic choise by init with some big number */ + int optimal_dev_rq_count = 1 << 24; + + for (i = 0; i < num; i++) { + bdev = map->stripes[i].dev->bdev; + if (!bdev) + continue; + + qlen = bdev_get_queue_len(bdev); + + if (qlen < optimal_dev_rq_count) { + optimal = i; + optimal_dev_rq_count = qlen; + } + } + } + /* * try to avoid the drive that is the source drive for a * dev-replace procedure, only choose it if no other non-missing -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] Btrfs: compression: replace workspace management code by mempool API
Mostly cleanup of old code and replace old API with new one. 1. Drop old linked list based approach 2. Replace all ERR_PTR(-ENOMEM) with NULL, as mempool code only understood NULL 3. mempool call alloc methods on create/resize, so for be sure, move nofs_{save,restore} to appropriate places 4. Update btrfs_comp_op to use void *ws, instead of list_head *ws 5. LZO more aggressive use of kmalloc on order 1 alloc, for more aggressive fallback to mempool 6. Refactor alloc functions to check every allocation, because mempool flags are aggressive and can fail more frequently. Signed-off-by: Timofey Titovets --- fs/btrfs/compression.c | 213 + fs/btrfs/compression.h | 12 +-- fs/btrfs/lzo.c | 64 +-- fs/btrfs/zlib.c| 56 +++-- fs/btrfs/zstd.c| 49 +++- 5 files changed, 148 insertions(+), 246 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 02bd60357f04..869df3f5bd1b 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -811,23 +811,12 @@ static void *heuristic_ws_alloc(gfp_t gfp_mask, void *pool_data) return NULL; } -struct workspaces_list { - struct list_head idle_ws; - spinlock_t ws_lock; - /* Number of free workspaces */ - int free_ws; - /* Total number of allocated workspaces */ - atomic_t total_ws; - /* Waiters for a free workspace */ - wait_queue_head_t ws_wait; -}; - struct workspace_stor { mempool_t *pool; }; static struct workspace_stor btrfs_heuristic_ws_stor; -static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES]; +static struct workspace_stor btrfs_comp_stor[BTRFS_COMPRESS_TYPES]; static const struct btrfs_compress_op * const btrfs_compress_op[] = { &btrfs_zlib_compress, @@ -837,14 +826,14 @@ static const struct btrfs_compress_op * const btrfs_compress_op[] = { void __init btrfs_init_compress(void) { - struct list_head *workspace; int i; - mempool_t *pool = btrfs_heuristic_ws_stor.pool; + mempool_t *pool; /* * Preallocate one workspace for heuristic so * we can guarantee forward progress in the worst case */ + pool = btrfs_heuristic_ws_stor.pool; pool = mempool_create(1, heuristic_ws_alloc, heuristic_ws_free, NULL); @@ -852,23 +841,17 @@ void __init btrfs_init_compress(void) pr_warn("BTRFS: cannot preallocate heuristic workspace, will try later\n"); for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) { - INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws); - spin_lock_init(&btrfs_comp_ws[i].ws_lock); - atomic_set(&btrfs_comp_ws[i].total_ws, 0); - init_waitqueue_head(&btrfs_comp_ws[i].ws_wait); - + pool = btrfs_comp_stor[i].pool; /* * Preallocate one workspace for each compression type so * we can guarantee forward progress in the worst case */ - workspace = btrfs_compress_op[i]->alloc_workspace(); - if (IS_ERR(workspace)) { + pool = mempool_create(1, btrfs_compress_op[i]->alloc_workspace, + btrfs_compress_op[i]->free_workspace, + NULL); + + if (pool == NULL) pr_warn("BTRFS: cannot preallocate compression workspace, will try later\n"); - } else { - atomic_set(&btrfs_comp_ws[i].total_ws, 1); - btrfs_comp_ws[i].free_ws = 1; - list_add(workspace, &btrfs_comp_ws[i].idle_ws); - } } } @@ -881,6 +864,7 @@ static void *mempool_alloc_wrap(struct workspace_stor *stor) int ncpu = num_online_cpus(); while (unlikely(stor->pool == NULL)) { + int i; mempool_t *pool; void *(*ws_alloc)(gfp_t gfp_mask, void *pool_data); void (*ws_free)(void *element, void *pool_data); @@ -888,6 +872,13 @@ static void *mempool_alloc_wrap(struct workspace_stor *stor) if (stor == &btrfs_heuristic_ws_stor) { ws_alloc = heuristic_ws_alloc; ws_free = heuristic_ws_free; + } else { + for (i = 0; i 
< BTRFS_COMPRESS_TYPES; i++) { + if (stor == &btrfs_comp_stor[i]) + break; + } + ws_alloc = btrfs_compress_op[i]->alloc_workspace; + ws_free = btrfs_compress_op[i]->free_workspace; } pool = mempool_create(1, ws_alloc, ws_free, NULL); @@ -915,7 +90
[PATCH 0/2] Btrfs: heuristic/compression convert workspace memory cache
Attempt to simplify/clean up the compression code. Lightly tested under high memory pressure; so far everything appears to work as expected.

The first patch contains preparation work for replacing the old linked-list based approach with one based on the mempool API. It converts only one part as a proof of concept: heuristic memory management. It defines the usage pattern and mempool_alloc_wrap(), which handles pool resizing and pool init errors.

The second patch moves zlib/lzo/zstd to the new mempool API.

Timofey Titovets (2):
  Btrfs: heuristic: replace workspace management code by mempool API
  Btrfs: compression: replace workspace management code by mempool API

 fs/btrfs/compression.c | 332 -
 fs/btrfs/compression.h |  12 +-
 fs/btrfs/lzo.c         |  64 ++
 fs/btrfs/zlib.c        |  56 +
 fs/btrfs/zstd.c        |  49 +---
 5 files changed, 215 insertions(+), 298 deletions(-)

-- 
2.15.1
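A userspace model of the lazy, race-safe pool initialization the cover letter describes for mempool_alloc_wrap(): whichever thread wins the compare-and-swap installs the pool, the losers free their copy and use the winner's. C11 atomics stand in for the kernel's cmpxchg(), and the "pool" is a toy struct rather than a real mempool_t:

/* Userspace sketch only; not the btrfs implementation. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct pool {
	int min_nr;	/* stand-in for a real mempool */
};

static _Atomic(struct pool *) global_pool;

static struct pool *get_pool(void)
{
	struct pool *p = atomic_load(&global_pool);

	while (!p) {
		struct pool *fresh = malloc(sizeof(*fresh));
		struct pool *expected = NULL;

		if (!fresh)
			return NULL;	/* caller retries or falls back */
		fresh->min_nr = 1;

		if (atomic_compare_exchange_strong(&global_pool, &expected, fresh))
			return fresh;	/* we won the race */

		free(fresh);		/* somebody else installed a pool */
		p = expected;
	}
	return p;
}

int main(void)
{
	struct pool *p = get_pool();

	printf("pool ready: %s\n", p ? "yes" : "no");
	free(p);
	return 0;
}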
[PATCH 1/2] Btrfs: heuristic: replace workspace management code by mempool API
Currently compression code have custom workspace/memory cache for guarantee forward progress on high memory pressure. That api can be replaced with mempool API, which can guarantee the same. Main goal is simplify/cleanup code and replace it with general solution. I try avoid use of atomic/lock/wait stuff, as that all already hidden in mempool API. Only thing that must be racy safe is initialization of mempool. So i create simple mempool_alloc_wrap, which will handle mempool_create failures, and sync threads work by cmpxchg() on mempool_t pointer. Another logic difference between our custom stuff and mempool: - ws find/free mosly reuse current workspaces whenever possible. - mempool use alloc/free of provided helpers with more aggressive use of __GFP_NOMEMALLOC, __GFP_NORETRY, GFP_NOWARN, and only use already preallocated space when memory get tight. Not sure which approach are better, but simple stress tests with writing stuff on compressed fs on ramdisk show negligible difference on 8 CPU Virtual Machine with Intel Xeon E5-2420 0 @ 1.90GHz (+-1%). Other needed changes to use mempool: - memalloc_nofs_{save,restore} move to each place where kvmalloc will be used in call chain. - mempool_create return pointer to mampool or NULL, no error, so macros like IS_ERR(ptr) can't be used. Signed-off-by: Timofey Titovets --- fs/btrfs/compression.c | 197 ++--- 1 file changed, 106 insertions(+), 91 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 208334aa6c6e..02bd60357f04 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -34,6 +34,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -768,46 +769,46 @@ struct heuristic_ws { struct bucket_item *bucket; /* Sorting buffer */ struct bucket_item *bucket_b; - struct list_head list; }; -static void free_heuristic_ws(struct list_head *ws) +static void heuristic_ws_free(void *element, void *pool_data) { - struct heuristic_ws *workspace; + struct heuristic_ws *ws = (struct heuristic_ws *) element; - workspace = list_entry(ws, struct heuristic_ws, list); - - kvfree(workspace->sample); - kfree(workspace->bucket); - kfree(workspace->bucket_b); - kfree(workspace); + kfree(ws->sample); + kfree(ws->bucket); + kfree(ws->bucket_b); + kfree(ws); } -static struct list_head *alloc_heuristic_ws(void) +static void *heuristic_ws_alloc(gfp_t gfp_mask, void *pool_data) { - struct heuristic_ws *ws; + struct heuristic_ws *ws = kzalloc(sizeof(*ws), gfp_mask); - ws = kzalloc(sizeof(*ws), GFP_KERNEL); if (!ws) - return ERR_PTR(-ENOMEM); + return NULL; - ws->sample = kvmalloc(MAX_SAMPLE_SIZE, GFP_KERNEL); + /* +* We can handle allocation failures and +* slab have caches for 8192 byte allocations +*/ + ws->sample = kmalloc(MAX_SAMPLE_SIZE, gfp_mask); if (!ws->sample) goto fail; - ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), GFP_KERNEL); + ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), gfp_mask); if (!ws->bucket) goto fail; - ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL); + ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), gfp_mask); if (!ws->bucket_b) goto fail; - INIT_LIST_HEAD(&ws->list); - return &ws->list; + return ws; + fail: - free_heuristic_ws(&ws->list); - return ERR_PTR(-ENOMEM); + heuristic_ws_free(ws, NULL); + return NULL; } struct workspaces_list { @@ -821,9 +822,12 @@ struct workspaces_list { wait_queue_head_t ws_wait; }; -static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES]; +struct workspace_stor { + 
mempool_t *pool; +}; -static struct workspaces_list btrfs_heuristic_ws; +static struct workspace_stor btrfs_heuristic_ws_stor; +static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES]; static const struct btrfs_compress_op * const btrfs_compress_op[] = { &btrfs_zlib_compress, @@ -835,21 +839,17 @@ void __init btrfs_init_compress(void) { struct list_head *workspace; int i; + mempool_t *pool = btrfs_heuristic_ws_stor.pool; - INIT_LIST_HEAD(&btrfs_heuristic_ws.idle_ws); - spin_lock_init(&btrfs_heuristic_ws.ws_lock); - atomic_set(&btrfs_heuristic_ws.total_ws, 0); - init_waitqueue_head(&btrfs_heuristic_ws.ws_wait); + /* +* Preallocate one workspace for heuristic so +* we can guarantee forward progress in the worst case +*/ + pool = memp
[RFC PATCH] Btrfs: replace custom heuristic ws allocation logic with mempool API
Currently btrfs compression code use custom wrapper for store allocated compression/heuristic workspaces. That logic try store at least ncpu+1 each type of workspaces. As far, as i can see that logic fully reimplement mempool API. So i think, that use of mempool api can simplify code and allow for cleanup it. That a proof of concept patch, i have tested it (at least that works), future version will looks mostly same. If that acceptable, next step will be: 1. Create mempool_alloc_w() that will resize mempool to apropriate size ncpu+1 And will create apropriate mempool, if creating failed in __init. 2. Convert per compression ws to mempool. Thanks. Signed-off-by: Timofey Titovets Cc: David Sterba --- fs/btrfs/compression.c | 123 - 1 file changed, 39 insertions(+), 84 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 208334aa6c6e..cf47089b9ec0 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -34,6 +34,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -768,14 +769,11 @@ struct heuristic_ws { struct bucket_item *bucket; /* Sorting buffer */ struct bucket_item *bucket_b; - struct list_head list; }; -static void free_heuristic_ws(struct list_head *ws) +static void heuristic_ws_free(void *element, void *pool_data) { - struct heuristic_ws *workspace; - - workspace = list_entry(ws, struct heuristic_ws, list); + struct heuristic_ws *workspace = (struct heuristic_ws *) element; kvfree(workspace->sample); kfree(workspace->bucket); @@ -783,13 +781,12 @@ static void free_heuristic_ws(struct list_head *ws) kfree(workspace); } -static struct list_head *alloc_heuristic_ws(void) +static void *heuristic_ws_alloc(gfp_t gfp_mask, void *pool_data) { - struct heuristic_ws *ws; + struct heuristic_ws *ws = kmalloc(sizeof(*ws), GFP_KERNEL); - ws = kzalloc(sizeof(*ws), GFP_KERNEL); if (!ws) - return ERR_PTR(-ENOMEM); + return ws; ws->sample = kvmalloc(MAX_SAMPLE_SIZE, GFP_KERNEL); if (!ws->sample) @@ -803,11 +800,14 @@ static struct list_head *alloc_heuristic_ws(void) if (!ws->bucket_b) goto fail; - INIT_LIST_HEAD(&ws->list); - return &ws->list; + return ws; + fail: - free_heuristic_ws(&ws->list); - return ERR_PTR(-ENOMEM); + kvfree(ws->sample); + kfree(ws->bucket); + kfree(ws->bucket_b); + kfree(ws); + return NULL; } struct workspaces_list { @@ -821,10 +821,9 @@ struct workspaces_list { wait_queue_head_t ws_wait; }; +static mempool_t *btrfs_heuristic_ws_pool; static struct workspaces_list btrfs_comp_ws[BTRFS_COMPRESS_TYPES]; -static struct workspaces_list btrfs_heuristic_ws; - static const struct btrfs_compress_op * const btrfs_compress_op[] = { &btrfs_zlib_compress, &btrfs_lzo_compress, @@ -836,20 +835,15 @@ void __init btrfs_init_compress(void) struct list_head *workspace; int i; - INIT_LIST_HEAD(&btrfs_heuristic_ws.idle_ws); - spin_lock_init(&btrfs_heuristic_ws.ws_lock); - atomic_set(&btrfs_heuristic_ws.total_ws, 0); - init_waitqueue_head(&btrfs_heuristic_ws.ws_wait); + /* +* Try preallocate pool with minimum size for successful +* initialization of btrfs module +*/ + btrfs_heuristic_ws_pool = mempool_create(1, heuristic_ws_alloc, + heuristic_ws_free, NULL); - workspace = alloc_heuristic_ws(); - if (IS_ERR(workspace)) { - pr_warn( - "BTRFS: cannot preallocate heuristic workspace, will try later\n"); - } else { - atomic_set(&btrfs_heuristic_ws.total_ws, 1); - btrfs_heuristic_ws.free_ws = 1; - list_add(workspace, &btrfs_heuristic_ws.idle_ws); - } + if (IS_ERR(btrfs_heuristic_ws_pool)) + pr_warn("BTRFS: 
cannot preallocate heuristic workspace, will try later\n"); for (i = 0; i < BTRFS_COMPRESS_TYPES; i++) { INIT_LIST_HEAD(&btrfs_comp_ws[i].idle_ws); @@ -878,7 +872,7 @@ void __init btrfs_init_compress(void) * Preallocation makes a forward progress guarantees and we do not return * errors. */ -static struct list_head *__find_workspace(int type, bool heuristic) +static struct list_head *find_workspace(int type) { struct list_head *workspace; int cpus = num_online_cpus(); @@ -890,19 +884,11 @@ static struct list_head *__find_workspace(int type, bool heuristic) wait_queue_head_t *ws_wait; int *free_ws; - if (heuristic) { - idle_ws = &b
Does Btrfs allow compression on NoDataCow files? (AFAIK it should not, but it does)
How to reproduce:

touch test_file
chattr +C test_file
dd if=/dev/zero of=test_file bs=1M count=1
btrfs fi def -vrczlib test_file
filefrag -v test_file

test_file
Filesystem type is: 9123683e
File size of test_file is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:   72917050..  72917081:     32:             encoded
   1:       32..      63:   72917118..  72917149:     32:   72917082: encoded
   2:       64..      95:   72919494..  72919525:     32:   72917150: encoded
   3:       96..     127:   72927576..  72927607:     32:   72919526: encoded
   4:      128..     159:   72943261..  72943292:     32:   72927608: encoded
   5:      160..     191:   72944929..  72944960:     32:   72943293: encoded
   6:      192..     223:   72944952..  72944983:     32:   72944961: encoded
   7:      224..     255:   72967084..  72967115:     32:   72944984: last,encoded,eof
test_file: 8 extents found

I can't find right now where that error happens in the code, but it's reproducible on Linux 4.14.8.

Thanks.

-- 
Have a nice day,
Timofey.
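A small helper that makes the precondition of the report explicit: verify the file really carries the NOCOW attribute set by chattr +C before running the compressing defrag. It uses the generic FS_IOC_GETFLAGS ioctl, so it is an assumption about the common attribute interface rather than btrfs-specific code:

/* Print the NOCOW and NOCOMP attribute state of a file. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	int fd, flags = 0;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		perror(argv[1]);
		return 1;
	}
	printf("%s: NOCOW %s, NOCOMP %s\n", argv[1],
	       (flags & FS_NOCOW_FL) ? "set" : "clear",
	       (flags & FS_NOCOMP_FL) ? "set" : "clear");
	close(fd);
	return 0;
}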
[PATCH 3/4] Btrfs: allow btrfs_defrag_file() to uncompress files on defragmentation
Currently defrag ioctl only support recompress files with specified compression type. Allow set compression type to none, while call defrag, and use BTRFS_DEFRAG_RANGE_COMPRESS as flag, that user request change of compression type. Signed-off-by: Timofey Titovets --- fs/btrfs/btrfs_inode.h | 1 + fs/btrfs/inode.c | 4 ++-- fs/btrfs/ioctl.c | 17 ++--- 3 files changed, 13 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 63f0ccc92a71..9eb0c92ee4b4 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -187,6 +187,7 @@ struct btrfs_inode { * different from prop_compress and takes precedence if set */ unsigned defrag_compress; + unsigned change_compress; struct btrfs_delayed_node *delayed_node; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 46df5e2a64e7..7af8f1784788 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -412,8 +412,8 @@ static inline int inode_need_compress(struct inode *inode, u64 start, u64 end) if (btrfs_test_opt(fs_info, FORCE_COMPRESS)) return 1; /* defrag ioctl */ - if (BTRFS_I(inode)->defrag_compress) - return 1; + if (BTRFS_I(inode)->change_compress) + return BTRFS_I(inode)->defrag_compress; /* bad compression ratios */ if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) return 0; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index b29ea1f0f621..40f5e5678eac 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1276,7 +1276,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, unsigned long cluster = max_cluster; u64 new_align = ~((u64)SZ_128K - 1); struct page **pages = NULL; - bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS; + bool change_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS; if (isize == 0) return 0; @@ -1284,11 +1284,10 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (range->start >= isize) return -EINVAL; - if (do_compress) { + if (change_compress) { if (range->compress_type > BTRFS_COMPRESS_TYPES) return -EINVAL; - if (range->compress_type) - compress_type = range->compress_type; + compress_type = range->compress_type; } if (extent_thresh == 0) @@ -1363,7 +1362,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, extent_thresh, &last_len, &skip, -&defrag_end, do_compress, +&defrag_end, change_compress, compress_type)){ unsigned long next; /* @@ -1392,8 +1391,11 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, } inode_lock(inode); - if (do_compress) + if (change_compress) { + BTRFS_I(inode)->change_compress = change_compress; BTRFS_I(inode)->defrag_compress = compress_type; + } + ret = cluster_pages_for_defrag(inode, pages, i, cluster); if (ret < 0) { inode_unlock(inode); @@ -1449,8 +1451,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, ret = defrag_count; out_ra: - if (do_compress) { + if (change_compress) { inode_lock(inode); + BTRFS_I(inode)->change_compress = 0; BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE; inode_unlock(inode); } -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
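With this patch applied, userspace can request "rewrite without compression" by setting the COMPRESS flag together with compress_type 0. A sketch of such a caller, using the defrag-range ioctl and struct from linux/btrfs.h (driving it over the whole file is my assumption of how btrfs-progs would use it, not taken from the progs patch):

/* Minimal "uncompress by defrag" caller; depends on this kernel patch. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_defrag_range_args range;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}

	memset(&range, 0, sizeof(range));
	range.len = (__u64)-1;				/* whole file */
	range.flags = BTRFS_DEFRAG_RANGE_COMPRESS;	/* "change compression" */
	range.compress_type = 0;			/* 0 == none -> uncompress */

	if (ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &range) < 0)
		perror("BTRFS_IOC_DEFRAG_RANGE");
	close(fd);
	return 0;
}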
[PATCH 1/4] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
At now btrfs_dedupe_file_range() restricted to 16MiB range for limit locking time and memory requirement for dedup ioctl() For too big input range code silently set range to 16MiB Let's remove that restriction by do iterating over dedup range. That's backward compatible and will not change anything for request less then 16MiB. Changes: v1 -> v2: - Refactor btrfs_cmp_data_prepare and btrfs_extent_same - Store memory of pages array between iterations - Lock inodes once, not on each iteration - Small inplace cleanups Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 160 --- 1 file changed, 94 insertions(+), 66 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index be5bd81b3669..45a47d0891fc 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2965,8 +2965,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp) put_page(pg); } } - kfree(cmp->src_pages); - kfree(cmp->dst_pages); + + cmp->num_pages = 0; } static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, @@ -2974,41 +2974,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, u64 len, struct cmp_pages *cmp) { int ret; - int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; - struct page **src_pgarr, **dst_pgarr; - - /* -* We must gather up all the pages before we initiate our -* extent locking. We use an array for the page pointers. Size -* of the array is bounded by len, which is in turn bounded by -* BTRFS_MAX_DEDUPE_LEN. -*/ - src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); - dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); - if (!src_pgarr || !dst_pgarr) { - kfree(src_pgarr); - kfree(dst_pgarr); - return -ENOMEM; - } - cmp->num_pages = num_pages; - cmp->src_pages = src_pgarr; - cmp->dst_pages = dst_pgarr; /* * If deduping ranges in the same inode, locking rules make it mandatory * to always lock pages in ascending order to avoid deadlocks with * concurrent tasks (such as starting writeback/delalloc). */ - if (src == dst && dst_loff < loff) { - swap(src_pgarr, dst_pgarr); + if (src == dst && dst_loff < loff) swap(loff, dst_loff); - } - ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff); + cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; + + ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff); if (ret) goto out; - ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff); + ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, dst_loff); out: if (ret) @@ -3078,31 +3059,23 @@ static int extent_same_check_offsets(struct inode *inode, u64 off, u64 *plen, return 0; } -static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, -struct inode *dst, u64 dst_loff) +static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, + struct inode *dst, u64 dst_loff, + struct cmp_pages *cmp) { int ret; u64 len = olen; - struct cmp_pages cmp; bool same_inode = (src == dst); u64 same_lock_start = 0; u64 same_lock_len = 0; - if (len == 0) - return 0; - - if (same_inode) - inode_lock(src); - else - btrfs_double_inode_lock(src, dst); - ret = extent_same_check_offsets(src, loff, &len, olen); if (ret) - goto out_unlock; + return ret; ret = extent_same_check_offsets(dst, dst_loff, &len, olen); if (ret) - goto out_unlock; + return ret; if (same_inode) { /* @@ -3119,32 +3092,21 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, * allow an unaligned length so long as it ends at * i_size. 
*/ - if (len != olen) { - ret = -EINVAL; - goto out_unlock; - } + if (len != olen) + return -EINVAL; /* Check for overlapping ranges */ - if (dst_loff + len > loff && dst_loff < loff + len) { - ret = -EINVAL; - goto out_unlock; - } + if (dst_loff + len > loff && dst_loff < loff + len) + return -EINVAL; same_lo
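For context on the restriction being lifted: today a userspace deduper has to slice big requests into pieces of at most 16MiB itself. A sketch of that workaround via the generic FIDEDUPERANGE interface from linux/fs.h (the 16MiB constant mirrors BTRFS_MAX_DEDUPE_LEN and is hard-coded here):

/* Chunked dedupe request, illustrating the pre-patch userspace workaround. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

#define CHUNK (16ULL * 1024 * 1024)

int main(int argc, char **argv)
{
	struct file_dedupe_range *r;
	__u64 off, len;
	int src, dst;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <src> <dst> <len>\n", argv[0]);
		return 1;
	}
	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_RDWR);
	len = strtoull(argv[3], NULL, 0);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	r = calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
	if (!r)
		return 1;
	r->dest_count = 1;
	r->info[0].dest_fd = dst;

	for (off = 0; off < len; off += CHUNK) {
		r->src_offset = off;
		r->src_length = (len - off < CHUNK) ? len - off : CHUNK;
		r->info[0].dest_offset = off;
		if (ioctl(src, FIDEDUPERANGE, r) < 0) {
			perror("FIDEDUPERANGE");
			break;
		}
		printf("off %llu: deduped %llu bytes, status %d\n",
		       (unsigned long long)off,
		       (unsigned long long)r->info[0].bytes_deduped,
		       r->info[0].status);
	}
	free(r);
	close(src);
	close(dst);
	return 0;
}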
[PATCH 2/4] Btrfs: make should_defrag_range() understand compressed extents
Both, defrag ioctl and autodefrag - call btrfs_defrag_file() for file defragmentation. Kernel default target extent size - 256KiB. Btrfs progs default - 32MiB. Both bigger then maximum size of compressed extent - 128KiB. That lead to rewrite all compressed data on disk. Fix that by check compression extents with different logic. As addition, make should_defrag_range() understood compressed extent type, if requested target compression are same as current extent compression type. Just don't recompress/rewrite extents. To avoid useless recompression of compressed extents. Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 28 +--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 45a47d0891fc..b29ea1f0f621 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em) static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, u64 *last_len, u64 *skip, u64 *defrag_end, - int compress) + int compress, int compress_type) { struct extent_map *em; int ret = 1; @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, * real extent, don't bother defragging it */ if (!compress && (*last_len == 0 || *last_len >= thresh) && - (em->len >= thresh || (!next_mergeable && !prev_mergeable))) + (em->len >= thresh || (!next_mergeable && !prev_mergeable))) { ret = 0; + goto out; + } + + + /* +* Try not recompress compressed extents +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to +* recompress all compressed extents +*/ + if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) { + if (!compress) { + if (em->len == BTRFS_MAX_UNCOMPRESSED) + ret = 0; + } else { + if (em->compress_type != compress_type) + goto out; + if (em->len == BTRFS_MAX_UNCOMPRESSED) + ret = 0; + } + } + out: /* * last_len ends up being a counter of how many bytes we've defragged. @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, extent_thresh, &last_len, &skip, -&defrag_end, do_compress)){ +&defrag_end, do_compress, +compress_type)){ unsigned long next; /* * the should_defrag function tells us how much to skip -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
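A quick way to see what should_defrag_range() is reacting to is to dump extent lengths and the "encoded" flag with FIEMAP, the same data filefrag -v prints: compressed btrfs extents show up as encoded and at most 128KiB (BTRFS_MAX_UNCOMPRESSED) long. A minimal sketch:

/* List extent lengths and whether each extent is encoded (e.g. compressed). */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define NR_EXTENTS 64

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}

	fm = calloc(1, sizeof(*fm) + NR_EXTENTS * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_length = FIEMAP_MAX_OFFSET;
	fm->fm_extent_count = NR_EXTENTS;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	for (i = 0; i < fm->fm_mapped_extents; i++)
		printf("extent %u: len %llu %s\n", i,
		       (unsigned long long)fm->fm_extents[i].fe_length,
		       (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_ENCODED) ?
		       "encoded" : "plain");
	free(fm);
	close(fd);
	return 0;
}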
[PATCH 0/4] Btrfs: just a bunch of patches to ioctl.c
The 1st patch removes the 16MiB restriction from the extent_same ioctl() by iterating over the passed range. I did not see much difference in performance, so it simply removes a logical restriction.

Patches 2-3 update the defrag ioctl():
- Fix the bad behaviour of fully rewriting all compressed extents in the defrag range (that also makes autodefrag on a compressed fs much less expensive)
- Allow userspace to specify NONE as the target compression type, which lets users uncompress files by defragmentation with btrfs-progs
- Make the defrag ioctl understand the requested compression type and the current compression type of extents, so that btrfs fi def -rc becomes an idempotent operation. I.e. it is now possible to say "make all extents compressed with lzo" and btrfs will not recompress data that is already lzo-compressed. Same for zlib, zstd, none. (The patch to btrfs-progs is in a PR on the kdave GitHub.)

The 4th patch reduces the size of struct btrfs_inode:
btrfs_inode stores fields like prop_compress, defrag_compress and, after the 3rd patch, change_compress. They use unsigned as the type and take 12 bytes in total. But change_compress is a bit flag, and prop_compress/defrag_compress only store the compression type, which currently uses 0-3 out of 2^32-1. So declare bitfields for those vars and reduce the size of btrfs_inode: 1136 -> 1128.

Timofey Titovets (4):
  Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
  Btrfs: make should_defrag_range() understand compressed extents
  Btrfs: allow btrfs_defrag_file() to uncompress files on defragmentation
  Btrfs: reduce size of struct btrfs_inode

 fs/btrfs/btrfs_inode.h |   5 +-
 fs/btrfs/inode.c       |   4 +-
 fs/btrfs/ioctl.c       | 203 +++--
 3 files changed, 133 insertions(+), 79 deletions(-)

-- 
2.15.1
[PATCH 4/4] Btrfs: reduce size of struct btrfs_inode
Currently btrfs_inode have size equal 1136 bytes. (On x86_64). struct btrfs_inode store several vars releated to compression code, all states use 1 or 2 bits. Lets declare bitfields for compression releated vars, to reduce sizeof btrfs_inode to 1128 bytes. Signed-off-by: Timofey Titovets --- fs/btrfs/btrfs_inode.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 9eb0c92ee4b4..9d29d7e68757 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -181,13 +181,13 @@ struct btrfs_inode { /* * Cached values of inode properties */ - unsigned prop_compress; /* per-file compression algorithm */ + unsigned prop_compress : 2; /* per-file compression algorithm */ /* * Force compression on the file using the defrag ioctl, could be * different from prop_compress and takes precedence if set */ - unsigned defrag_compress; - unsigned change_compress; + unsigned defrag_compress : 2; + unsigned change_compress : 1; struct btrfs_delayed_node *delayed_node; -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
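A standalone illustration of why the bitfields shrink the structure; the struct below is a toy with only the three compression fields, not btrfs_inode itself:

/* Compare the footprint of plain unsigned fields vs. packed bitfields. */
#include <stdio.h>

struct wide {
	unsigned prop_compress;
	unsigned defrag_compress;
	unsigned change_compress;
};

struct packed {
	unsigned prop_compress : 2;	/* compression types fit in 0-3 */
	unsigned defrag_compress : 2;
	unsigned change_compress : 1;	/* plain flag */
};

int main(void)
{
	printf("plain unsigned fields: %zu bytes\n", sizeof(struct wide));
	printf("bitfields:             %zu bytes\n", sizeof(struct packed));
	return 0;
}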
[PATCH] Btrfs: allow btrfs_defrag_file() to uncompress files on defragmentation
Currently defrag ioctl only support compress files with specified compression type. Allow set compression type to none, while call defrag. Signed-off-by: Timofey Titovets --- fs/btrfs/btrfs_inode.h | 1 + fs/btrfs/inode.c | 4 ++-- fs/btrfs/ioctl.c | 17 ++--- 3 files changed, 13 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 63f0ccc92a71..9eb0c92ee4b4 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -187,6 +187,7 @@ struct btrfs_inode { * different from prop_compress and takes precedence if set */ unsigned defrag_compress; + unsigned change_compress; struct btrfs_delayed_node *delayed_node; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 46df5e2a64e7..7af8f1784788 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -412,8 +412,8 @@ static inline int inode_need_compress(struct inode *inode, u64 start, u64 end) if (btrfs_test_opt(fs_info, FORCE_COMPRESS)) return 1; /* defrag ioctl */ - if (BTRFS_I(inode)->defrag_compress) - return 1; + if (BTRFS_I(inode)->change_compress) + return BTRFS_I(inode)->defrag_compress; /* bad compression ratios */ if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) return 0; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 12d4fa5d6dec..b777c8f53153 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1276,7 +1276,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, unsigned long cluster = max_cluster; u64 new_align = ~((u64)SZ_128K - 1); struct page **pages = NULL; - bool do_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS; + bool change_compress = range->flags & BTRFS_DEFRAG_RANGE_COMPRESS; if (isize == 0) return 0; @@ -1284,11 +1284,10 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (range->start >= isize) return -EINVAL; - if (do_compress) { + if (change_compress) { if (range->compress_type > BTRFS_COMPRESS_TYPES) return -EINVAL; - if (range->compress_type) - compress_type = range->compress_type; + compress_type = range->compress_type; } if (extent_thresh == 0) @@ -1363,7 +1362,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, extent_thresh, &last_len, &skip, -&defrag_end, do_compress, +&defrag_end, change_compress, compress_type)){ unsigned long next; /* @@ -1392,8 +1391,11 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, } inode_lock(inode); - if (do_compress) + if (change_compress) { + BTRFS_I(inode)->change_compress = change_compress; BTRFS_I(inode)->defrag_compress = compress_type; + } + ret = cluster_pages_for_defrag(inode, pages, i, cluster); if (ret < 0) { inode_unlock(inode); @@ -1449,8 +1451,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, ret = defrag_count; out_ra: - if (do_compress) { + if (change_compress) { inode_lock(inode); + BTRFS_I(inode)->change_compress = 0; BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE; inode_unlock(inode); } -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: make should_defrag_range() understand compressed extents
Also, in theory that break following case: When fs mounted with compress=no, you can uncompress data by btrfs fi def -vr , because of "bug" in logic/defaults. And man page show that btrfs "Currently it’s not possible to select no compression. See also section EXAMPLES." I have a two simple patches that can provide a way (one to btrfs-progs and one to btrfs.ko), to uncompress data on fs mounted with compress=no, by run btrfs fi def -vrcnone But behavior on fs mounted with compress, will be recompress data with selected by "compress" algo, because of inode_need_compression() logic. In theory that also must be fixed. 2017-12-14 16:37 GMT+03:00 Timofey Titovets : > Compile tested and "battle" tested > > 2017-12-14 16:35 GMT+03:00 Timofey Titovets : >> Both, defrag ioctl and autodefrag - call btrfs_defrag_file() >> for file defragmentation. >> >> Kernel target extent size default is 256KiB >> Btrfs progs by default, use 32MiB. >> >> Both bigger then max (not fragmented) compressed extent size 128KiB. >> That lead to rewrite all compressed data on disk. >> >> Fix that and also make should_defrag_range() understood >> if requested target compression are same as current extent compression type. >> To avoid useless recompression of compressed extents. >> >> Signed-off-by: Timofey Titovets >> --- >> fs/btrfs/ioctl.c | 28 +--- >> 1 file changed, 25 insertions(+), 3 deletions(-) >> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >> index be5bd81b3669..12d4fa5d6dec 100644 >> --- a/fs/btrfs/ioctl.c >> +++ b/fs/btrfs/ioctl.c >> @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode >> *inode, struct extent_map *em) >> >> static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, >>u64 *last_len, u64 *skip, u64 *defrag_end, >> - int compress) >> + int compress, int compress_type) >> { >> struct extent_map *em; >> int ret = 1; >> @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, >> u64 start, u32 thresh, >> * real extent, don't bother defragging it >> */ >> if (!compress && (*last_len == 0 || *last_len >= thresh) && >> - (em->len >= thresh || (!next_mergeable && !prev_mergeable))) >> + (em->len >= thresh || (!next_mergeable && !prev_mergeable))) { >> ret = 0; >> + goto out; >> + } >> + >> + >> + /* >> +* Try not recompress compressed extents >> +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to >> +* recompress all compressed extents >> +*/ >> + if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) { >> + if (!compress) { >> + if (em->len == BTRFS_MAX_UNCOMPRESSED) >> + ret = 0; >> + } else { >> + if (em->compress_type != compress_type) >> + goto out; >> + if (em->len == BTRFS_MAX_UNCOMPRESSED) >> + ret = 0; >> + } >> + } >> + >> out: >> /* >> * last_len ends up being a counter of how many bytes we've >> defragged. >> @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file >> *file, >> >> if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, >> extent_thresh, &last_len, &skip, >> -&defrag_end, do_compress)){ >> +&defrag_end, do_compress, >> +compress_type)){ >> unsigned long next; >> /* >> * the should_defrag function tells us how much to >> skip >> -- >> 2.15.1 >> > > > > -- > Have a nice day, > Timofey. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG?] Defrag on compressed FS does massive data rewrites
I send: [PATCH] Btrfs: make should_defrag_range() understood compressed extents If you want you can test that, it fix btrfs fi def and autodefrag behavior with compressed data. Also it understood if user try recompress data with old/new compression algo. 2017-12-14 14:27 GMT+03:00 Timofey Titovets : > 2017-12-14 8:58 GMT+03:00 Duncan <1i5t5.dun...@cox.net>: >> Timofey Titovets posted on Thu, 14 Dec 2017 02:05:35 +0300 as excerpted: >> >>> Also, same problem exist for autodefrag case i.e.: >>> write 4KiB at start of compressed file autodefrag code add that file to >>> autodefrag queue, call btrfs_defrag_file, set range from start to u64-1. >>> That will trigger to full file rewrite, as all extents are smaller then >>> 256KiB. >>> >>> (if i understood all correctly). >> >> If so, it's rather ironic, because that's how I believed autodefrag to >> work, whole-file, for quite some time. Then I learned otherwise, but I >> always enable both autodefrag and compress=lzo on all my btrfs, so it >> looks like at least for my use-case, I was correct with the whole-file >> assumption after all. (Well, at least for files that are actually >> compressed, I don't run compress-force=lzo, just compress=lzo, so a >> reasonable number of files aren't actually compressed anyway, and they'd >> do the partial-file rewrite I had learned to be normal.) >> >> -- >> Duncan - List replies preferred. No HTML msgs. >> "Every nonfree program has a lord, a master -- >> and if you use the program, he is your master." Richard Stallman >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > btrfs fi def can easy avoid that problem if properly documented, > but i think that fix must be in place, because "full rewrite" by > autodefrag are unacceptable behaviour. > > How i see, "How it works": > Both, defrag ioctl and autodefrag - call btrfs_defrag_file() (ioctl.c). > btrfs_defrag_file() set extent_thresh to args, if args not provided > (autodefrag not initialize range->extent_thresh); > use default: > if (extent_thresh == 0) > extent_thresh = SZ_256K; > > Later btrfs_defrag_file() try defrag file from start by "index" (page > number from start, file virtually splitted to page sized blocks). > Than it call should_defrag_range(), if need make i+1 or skip range by > info from should_defrag_range(). > > should_defrag_range() get extent for specified start offset: > em = defrag_lookup_extent(inode, start); > > Later (em->len >= thresh || (!next_mergeable && !prev_mergeable)) will > fail condition because len (<128KiB) < thresh (256KiB or 32MiB usual). > So extent will be rewritten. > > struct extent_map{}; have two potential useful info: > > ... > u64 len; > ... > > unsigned int compress_type; > ... > > As i see by code len store "real" length, so > in theory that just need to add additional check later like: > if (em->len = BTRFS_MAX_UNCOMPRESSED && em->compress_type > 0) > ret = 0; > > That must fix problem by "true" insight check for compressed extents. > > Thanks. > > P.S. > (May be someone from more experienced devs can comment that?) > > -- > Have a nice day, > Timofey. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: make should_defrag_range() understood compressed extents
Compile tested and "battle" tested 2017-12-14 16:35 GMT+03:00 Timofey Titovets : > Both, defrag ioctl and autodefrag - call btrfs_defrag_file() > for file defragmentation. > > Kernel target extent size default is 256KiB > Btrfs progs by default, use 32MiB. > > Both bigger then max (not fragmented) compressed extent size 128KiB. > That lead to rewrite all compressed data on disk. > > Fix that and also make should_defrag_range() understood > if requested target compression are same as current extent compression type. > To avoid useless recompression of compressed extents. > > Signed-off-by: Timofey Titovets > --- > fs/btrfs/ioctl.c | 28 +--- > 1 file changed, 25 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index be5bd81b3669..12d4fa5d6dec 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode > *inode, struct extent_map *em) > > static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, >u64 *last_len, u64 *skip, u64 *defrag_end, > - int compress) > + int compress, int compress_type) > { > struct extent_map *em; > int ret = 1; > @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, > u64 start, u32 thresh, > * real extent, don't bother defragging it > */ > if (!compress && (*last_len == 0 || *last_len >= thresh) && > - (em->len >= thresh || (!next_mergeable && !prev_mergeable))) > + (em->len >= thresh || (!next_mergeable && !prev_mergeable))) { > ret = 0; > + goto out; > + } > + > + > + /* > +* Try not recompress compressed extents > +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to > +* recompress all compressed extents > +*/ > + if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) { > + if (!compress) { > + if (em->len == BTRFS_MAX_UNCOMPRESSED) > + ret = 0; > + } else { > + if (em->compress_type != compress_type) > + goto out; > + if (em->len == BTRFS_MAX_UNCOMPRESSED) > + ret = 0; > + } > + } > + > out: > /* > * last_len ends up being a counter of how many bytes we've defragged. > @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file > *file, > > if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, > extent_thresh, &last_len, &skip, > -&defrag_end, do_compress)){ > +&defrag_end, do_compress, > +compress_type)){ > unsigned long next; > /* > * the should_defrag function tells us how much to > skip > -- > 2.15.1 > -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: make should_defrag_range() understood compressed extents
Both, defrag ioctl and autodefrag - call btrfs_defrag_file() for file defragmentation. Kernel target extent size default is 256KiB Btrfs progs by default, use 32MiB. Both bigger then max (not fragmented) compressed extent size 128KiB. That lead to rewrite all compressed data on disk. Fix that and also make should_defrag_range() understood if requested target compression are same as current extent compression type. To avoid useless recompression of compressed extents. Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 28 +--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index be5bd81b3669..12d4fa5d6dec 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1008,7 +1008,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em) static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, u64 *last_len, u64 *skip, u64 *defrag_end, - int compress) + int compress, int compress_type) { struct extent_map *em; int ret = 1; @@ -1043,8 +1043,29 @@ static int should_defrag_range(struct inode *inode, u64 start, u32 thresh, * real extent, don't bother defragging it */ if (!compress && (*last_len == 0 || *last_len >= thresh) && - (em->len >= thresh || (!next_mergeable && !prev_mergeable))) + (em->len >= thresh || (!next_mergeable && !prev_mergeable))) { ret = 0; + goto out; + } + + + /* +* Try not recompress compressed extents +* thresh >= BTRFS_MAX_UNCOMPRESSED will lead to +* recompress all compressed extents +*/ + if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) { + if (!compress) { + if (em->len == BTRFS_MAX_UNCOMPRESSED) + ret = 0; + } else { + if (em->compress_type != compress_type) + goto out; + if (em->len == BTRFS_MAX_UNCOMPRESSED) + ret = 0; + } + } + out: /* * last_len ends up being a counter of how many bytes we've defragged. @@ -1342,7 +1363,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT, extent_thresh, &last_len, &skip, -&defrag_end, do_compress)){ +&defrag_end, do_compress, +compress_type)){ unsigned long next; /* * the should_defrag function tells us how much to skip -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] Btrfs: btrfs_defrag_file() force use target extent size SZ_128KiB for compressed data
Ignore that patch please, i will send another 2017-12-14 2:25 GMT+03:00 Timofey Titovets : > Defrag heuristic use extent lengh as threshold, > kernel autodefrag use SZ_256KiB and btrfs-progs use SZ_32MiB as > target extent lengh. > > Problem: > Compressed extents always have lengh at < 128KiB (BTRFS_MAX_COMPRESSED) > So btrfs_defrag_file() always rewrite all extents in defrag range. > > Hot fix that by force set target extent size to BTRFS_MAX_COMPRESSED, > if file allowed to be compressed. > > Signed-off-by: Timofey Titovets > --- > fs/btrfs/ioctl.c | 23 +++ > 1 file changed, 23 insertions(+) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index be5bd81b3669..952364ff4108 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -1232,6 +1232,26 @@ static int cluster_pages_for_defrag(struct inode > *inode, > > } > > +static inline int inode_use_compression(struct inode *inode) > +{ > + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); > + > + /* force compress */ > + if (btrfs_test_opt(fs_info, FORCE_COMPRESS)) > + return 1; > + /* defrag ioctl */ > + if (BTRFS_I(inode)->defrag_compress) > + return 1; > + /* bad compression ratios */ > + if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) > + return 0; > + if (btrfs_test_opt(fs_info, COMPRESS) || > + BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS || > + BTRFS_I(inode)->prop_compress) > + return 1; > + return 0; > +} > + > int btrfs_defrag_file(struct inode *inode, struct file *file, > struct btrfs_ioctl_defrag_range_args *range, > u64 newer_than, unsigned long max_to_defrag) > @@ -1270,6 +1290,9 @@ int btrfs_defrag_file(struct inode *inode, struct file > *file, > compress_type = range->compress_type; > } > > + if (inode_use_compression(inode)) > + extent_thresh = BTRFS_MAX_COMPRESSED; > + > if (extent_thresh == 0) > extent_thresh = SZ_256K; > > -- > 2.15.1 -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG?] Defrag on compressed FS do massive data rewrites
2017-12-14 8:58 GMT+03:00 Duncan <1i5t5.dun...@cox.net>: > Timofey Titovets posted on Thu, 14 Dec 2017 02:05:35 +0300 as excerpted: > >> Also, same problem exist for autodefrag case i.e.: >> write 4KiB at start of compressed file autodefrag code add that file to >> autodefrag queue, call btrfs_defrag_file, set range from start to u64-1. >> That will trigger to full file rewrite, as all extents are smaller then >> 256KiB. >> >> (if i understood all correctly). > > If so, it's rather ironic, because that's how I believed autodefrag to > work, whole-file, for quite some time. Then I learned otherwise, but I > always enable both autodefrag and compress=lzo on all my btrfs, so it > looks like at least for my use-case, I was correct with the whole-file > assumption after all. (Well, at least for files that are actually > compressed, I don't run compress-force=lzo, just compress=lzo, so a > reasonable number of files aren't actually compressed anyway, and they'd > do the partial-file rewrite I had learned to be normal.) > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html btrfs fi def can easy avoid that problem if properly documented, but i think that fix must be in place, because "full rewrite" by autodefrag are unacceptable behaviour. How i see, "How it works": Both, defrag ioctl and autodefrag - call btrfs_defrag_file() (ioctl.c). btrfs_defrag_file() set extent_thresh to args, if args not provided (autodefrag not initialize range->extent_thresh); use default: if (extent_thresh == 0) extent_thresh = SZ_256K; Later btrfs_defrag_file() try defrag file from start by "index" (page number from start, file virtually splitted to page sized blocks). Than it call should_defrag_range(), if need make i+1 or skip range by info from should_defrag_range(). should_defrag_range() get extent for specified start offset: em = defrag_lookup_extent(inode, start); Later (em->len >= thresh || (!next_mergeable && !prev_mergeable)) will fail condition because len (<128KiB) < thresh (256KiB or 32MiB usual). So extent will be rewritten. struct extent_map{}; have two potential useful info: ... u64 len; ... unsigned int compress_type; ... As i see by code len store "real" length, so in theory that just need to add additional check later like: if (em->len = BTRFS_MAX_UNCOMPRESSED && em->compress_type > 0) ret = 0; That must fix problem by "true" insight check for compressed extents. Thanks. P.S. (May be someone from more experienced devs can comment that?) -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
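To make the length comparison above concrete, here is a small userspace toy (plain C, not kernel code) that mirrors only the em->len vs thresh check; last_len and the next/prev mergeable conditions are deliberately left out. The constants follow the values discussed above and the helper names are made up for illustration.

#include <stdbool.h>
#include <stdio.h>

#define MAX_COMPRESSED_EXTENT	(128 * 1024)		/* BTRFS_MAX_UNCOMPRESSED */
#define KERNEL_DEFAULT_THRESH	(256 * 1024)		/* SZ_256K */
#define PROGS_DEFAULT_THRESH	(32 * 1024 * 1024)	/* btrfs-progs default */

/* current behaviour: any extent shorter than the threshold gets rewritten */
static bool rewritten_today(unsigned long extent_len, unsigned long thresh)
{
	return extent_len < thresh;
}

/* proposed behaviour: a full-sized compressed extent is not fragmented */
static bool rewritten_proposed(unsigned long extent_len, unsigned long thresh,
			       bool compressed)
{
	if (compressed && extent_len == MAX_COMPRESSED_EXTENT)
		return false;
	return extent_len < thresh;
}

int main(void)
{
	/* a fully written compressed extent is at most 128KiB long */
	printf("kernel default, today:    %d\n",
	       rewritten_today(MAX_COMPRESSED_EXTENT, KERNEL_DEFAULT_THRESH));
	printf("progs default,  today:    %d\n",
	       rewritten_today(MAX_COMPRESSED_EXTENT, PROGS_DEFAULT_THRESH));
	printf("kernel default, proposed: %d\n",
	       rewritten_proposed(MAX_COMPRESSED_EXTENT, KERNEL_DEFAULT_THRESH,
				  true));
	return 0;
}

With the kernel default of 256KiB (and even more so with the 32MiB btrfs-progs default), a full 128KiB compressed extent always counts as fragmented today, which is the full-rewrite case described above; the "proposed" variant shows the effect of the extra compressed-extent check.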
[RFC PATCH] Btrfs: btrfs_defrag_file() force use target extent size SZ_128KiB for compressed data
Defrag heuristic use extent lengh as threshold, kernel autodefrag use SZ_256KiB and btrfs-progs use SZ_32MiB as target extent lengh. Problem: Compressed extents always have lengh at < 128KiB (BTRFS_MAX_COMPRESSED) So btrfs_defrag_file() always rewrite all extents in defrag range. Hot fix that by force set target extent size to BTRFS_MAX_COMPRESSED, if file allowed to be compressed. Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index be5bd81b3669..952364ff4108 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1232,6 +1232,26 @@ static int cluster_pages_for_defrag(struct inode *inode, } +static inline int inode_use_compression(struct inode *inode) +{ + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + + /* force compress */ + if (btrfs_test_opt(fs_info, FORCE_COMPRESS)) + return 1; + /* defrag ioctl */ + if (BTRFS_I(inode)->defrag_compress) + return 1; + /* bad compression ratios */ + if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) + return 0; + if (btrfs_test_opt(fs_info, COMPRESS) || + BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS || + BTRFS_I(inode)->prop_compress) + return 1; + return 0; +} + int btrfs_defrag_file(struct inode *inode, struct file *file, struct btrfs_ioctl_defrag_range_args *range, u64 newer_than, unsigned long max_to_defrag) @@ -1270,6 +1290,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, compress_type = range->compress_type; } + if (inode_use_compression(inode)) + extent_thresh = BTRFS_MAX_COMPRESSED; + if (extent_thresh == 0) extent_thresh = SZ_256K; -- 2.15.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG?] Defrag on compressed FS do massive data rewrites
2017-12-14 1:09 GMT+03:00 Timofey Titovets : > Hi, i see massive data rewrites of defragmented files when work with > btrfs fi def . > Before, i just thought it's a design problem - i.e. defrag always > rewrite data to new place. > > At now, i read the code and see 2 bad cases: > 1. With -c all extents of data will be rewriten, always. > 2. btrfs use "bad" default target extent size, i.e. kernel by default > try get 256KiB extent, btrfs-progres use 32MiB as a threshold. > Both of them make ioctl code should_defrag_range() think, that extent > are "too" fragmented, and rewrite all compressed extents. > > Does that behavior expected? > > i.e. only way that i can safely use on my data are: > btrfs fi def -vr -t 128KiB > That will defrag all fragmented compressed extents. > > "Hacky" solution that i see for now, is a create copy of inode_need_compress() > for defrag ioctl, and if file must be compressed, force use of 128KiB > as target extent. > or at least document that not obvious behaviour. > > Thanks! > > -- > Have a nice day, > Timofey. Also, same problem exist for autodefrag case i.e.: write 4KiB at start of compressed file autodefrag code add that file to autodefrag queue, call btrfs_defrag_file, set range from start to u64-1. That will trigger to full file rewrite, as all extents are smaller then 256KiB. (if i understood all correctly). Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[BUG?] Defrag on compressed FS do massive data rewrites
Hi, i see massive data rewrites of defragmented files when work with btrfs fi def . Before, i just thought it's a design problem - i.e. defrag always rewrite data to new place. At now, i read the code and see 2 bad cases: 1. With -c all extents of data will be rewriten, always. 2. btrfs use "bad" default target extent size, i.e. kernel by default try get 256KiB extent, btrfs-progres use 32MiB as a threshold. Both of them make ioctl code should_defrag_range() think, that extent are "too" fragmented, and rewrite all compressed extents. Does that behavior expected? i.e. only way that i can safely use on my data are: btrfs fi def -vr -t 128KiB That will defrag all fragmented compressed extents. "Hacky" solution that i see for now, is a create copy of inode_need_compress() for defrag ioctl, and if file must be compressed, force use of 128KiB as target extent. or at least document that not obvious behaviour. Thanks! -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
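For completeness, the same "-t 128KiB" workaround can be asked for directly through the defrag range ioctl. The sketch below is only an illustration (roughly what the -t option of btrfs fi def ends up passing as extent_thresh), not part of any patch in this thread, and error handling is minimal.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	struct btrfs_ioctl_defrag_range_args args;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&args, 0, sizeof(args));
	args.start = 0;
	args.len = (__u64)-1;			/* whole file */
	args.extent_thresh = 128 * 1024;	/* don't touch 128KiB extents */
	args.flags = BTRFS_DEFRAG_RANGE_START_IO;

	if (ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args) < 0)
		perror("BTRFS_IOC_DEFRAG_RANGE");

	close(fd);
	return 0;
}

With a 128KiB threshold, extents that are already at least that long pass the "em->len >= thresh" test, which is why the workaround leaves maximally sized compressed extents alone.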
Re: [PATCH v2] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
Compile tested && battle tested by btrfs-extent-same from duperemove. At performance, i see a negligible difference. Thanks 2017-12-13 3:45 GMT+03:00 Timofey Titovets : > At now btrfs_dedupe_file_range() restricted to 16MiB range for > limit locking time and memory requirement for dedup ioctl() > > For too big input range code silently set range to 16MiB > > Let's remove that restriction by do iterating over dedup range. > That's backward compatible and will not change anything for request > less then 16MiB. > > Changes: > v1 -> v2: > - Refactor btrfs_cmp_data_prepare and btrfs_extent_same > - Store memory of pages array between iterations > - Lock inodes once, not on each iteration > - Small inplace cleanups > > Signed-off-by: Timofey Titovets > --- > fs/btrfs/ioctl.c | 160 > --- > 1 file changed, 94 insertions(+), 66 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index d136ff0522e6..b17dcab1bb0c 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -2985,8 +2985,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp) > put_page(pg); > } > } > - kfree(cmp->src_pages); > - kfree(cmp->dst_pages); > + > + cmp->num_pages = 0; > } > > static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, > @@ -2994,41 +2994,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, > u64 loff, > u64 len, struct cmp_pages *cmp) > { > int ret; > - int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; > - struct page **src_pgarr, **dst_pgarr; > - > - /* > -* We must gather up all the pages before we initiate our > -* extent locking. We use an array for the page pointers. Size > -* of the array is bounded by len, which is in turn bounded by > -* BTRFS_MAX_DEDUPE_LEN. > -*/ > - src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); > - dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); > - if (!src_pgarr || !dst_pgarr) { > - kfree(src_pgarr); > - kfree(dst_pgarr); > - return -ENOMEM; > - } > - cmp->num_pages = num_pages; > - cmp->src_pages = src_pgarr; > - cmp->dst_pages = dst_pgarr; > > /* > * If deduping ranges in the same inode, locking rules make it > mandatory > * to always lock pages in ascending order to avoid deadlocks with > * concurrent tasks (such as starting writeback/delalloc). 
> */ > - if (src == dst && dst_loff < loff) { > - swap(src_pgarr, dst_pgarr); > + if (src == dst && dst_loff < loff) > swap(loff, dst_loff); > - } > > - ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff); > + cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; > + > + ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff); > if (ret) > goto out; > > - ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff); > + ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, > dst_loff); > > out: > if (ret) > @@ -3098,31 +3079,23 @@ static int extent_same_check_offsets(struct inode > *inode, u64 off, u64 *plen, > return 0; > } > > -static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, > -struct inode *dst, u64 dst_loff) > +static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, > + struct inode *dst, u64 dst_loff, > + struct cmp_pages *cmp) > { > int ret; > u64 len = olen; > - struct cmp_pages cmp; > bool same_inode = (src == dst); > u64 same_lock_start = 0; > u64 same_lock_len = 0; > > - if (len == 0) > - return 0; > - > - if (same_inode) > - inode_lock(src); > - else > - btrfs_double_inode_lock(src, dst); > - > ret = extent_same_check_offsets(src, loff, &len, olen); > if (ret) > - goto out_unlock; > + return ret; > > ret = extent_same_check_offsets(dst, dst_loff, &len, olen); > if (ret) > - goto out_unlock; > + return ret; > > if (same_inode) { > /* > @@ -
Re: [PATCH v3 0/5] define BTRFS_DEV_STATE
2017-12-13 5:26 GMT+03:00 David Sterba : > On Wed, Dec 13, 2017 at 06:38:12AM +0800, Anand Jain wrote: >> >> >> On 12/13/2017 01:42 AM, David Sterba wrote: >> > On Sun, Dec 10, 2017 at 05:15:17PM +0800, Anand Jain wrote: >> >> As of now device properties and states are being represented as int >> >> variable, patches here makes them bit flags instead. Further, wip >> >> patches such as device failed state needs this cleanup. >> >> >> >> v2: >> >> Adds BTRFS_DEV_STATE_REPLACE_TGT >> >> Adds BTRFS_DEV_STATE_FLUSH_SENT >> >> Drops BTRFS_DEV_STATE_CAN_DISCARD >> >> Starts bit flag from the bit 0 >> >> Drops unrelated change - declare btrfs_device >> >> >> >> v3: >> >> Fix static checker warning, define respective dev state as bit number >> > >> > The define numbers are fixed but the whitespace changes that I made in >> > misc-next >> >> Will do next time. Thanks. I don't see misc-next. Is it for-next ? > > The kernel.org repository only gets the latest for-next, that is > assembled from the pending branches, and also after some testing. You > could still find 'misc-next' inside the for-next branch, but it's not > obvious. > > All the development branches are pushed to > > https://github.com/kdave/btrfs-devel or > http://repo.or.cz/linux-2.6/btrfs-unstable.git > > more frequently than the k.org/for-next is updated. I thought this has > become a common knowledge, but yet it's not documented on the wiki so. > Let's fix that. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Didn't know about your GitHub copy, may be that have sense to update: https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories And add new links? To: - k.org/for-next - https://github.com/kdave/btrfs-devel or - http://repo.or.cz/linux-2.6/btrfs-unstable.git Because current link to git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git looks like a bit outdated. Thanks. -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
At now btrfs_dedupe_file_range() restricted to 16MiB range for limit locking time and memory requirement for dedup ioctl() For too big input range code silently set range to 16MiB Let's remove that restriction by do iterating over dedup range. That's backward compatible and will not change anything for request less then 16MiB. Changes: v1 -> v2: - Refactor btrfs_cmp_data_prepare and btrfs_extent_same - Store memory of pages array between iterations - Lock inodes once, not on each iteration - Small inplace cleanups Signed-off-by: Timofey Titovets --- fs/btrfs/ioctl.c | 160 --- 1 file changed, 94 insertions(+), 66 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index d136ff0522e6..b17dcab1bb0c 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2985,8 +2985,8 @@ static void btrfs_cmp_data_free(struct cmp_pages *cmp) put_page(pg); } } - kfree(cmp->src_pages); - kfree(cmp->dst_pages); + + cmp->num_pages = 0; } static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, @@ -2994,41 +2994,22 @@ static int btrfs_cmp_data_prepare(struct inode *src, u64 loff, u64 len, struct cmp_pages *cmp) { int ret; - int num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; - struct page **src_pgarr, **dst_pgarr; - - /* -* We must gather up all the pages before we initiate our -* extent locking. We use an array for the page pointers. Size -* of the array is bounded by len, which is in turn bounded by -* BTRFS_MAX_DEDUPE_LEN. -*/ - src_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); - dst_pgarr = kcalloc(num_pages, sizeof(struct page *), GFP_KERNEL); - if (!src_pgarr || !dst_pgarr) { - kfree(src_pgarr); - kfree(dst_pgarr); - return -ENOMEM; - } - cmp->num_pages = num_pages; - cmp->src_pages = src_pgarr; - cmp->dst_pages = dst_pgarr; /* * If deduping ranges in the same inode, locking rules make it mandatory * to always lock pages in ascending order to avoid deadlocks with * concurrent tasks (such as starting writeback/delalloc). */ - if (src == dst && dst_loff < loff) { - swap(src_pgarr, dst_pgarr); + if (src == dst && dst_loff < loff) swap(loff, dst_loff); - } - ret = gather_extent_pages(src, src_pgarr, cmp->num_pages, loff); + cmp->num_pages = PAGE_ALIGN(len) >> PAGE_SHIFT; + + ret = gather_extent_pages(src, cmp->src_pages, cmp->num_pages, loff); if (ret) goto out; - ret = gather_extent_pages(dst, dst_pgarr, cmp->num_pages, dst_loff); + ret = gather_extent_pages(dst, cmp->dst_pages, cmp->num_pages, dst_loff); out: if (ret) @@ -3098,31 +3079,23 @@ static int extent_same_check_offsets(struct inode *inode, u64 off, u64 *plen, return 0; } -static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, -struct inode *dst, u64 dst_loff) +static int __btrfs_extent_same(struct inode *src, u64 loff, u64 olen, + struct inode *dst, u64 dst_loff, + struct cmp_pages *cmp) { int ret; u64 len = olen; - struct cmp_pages cmp; bool same_inode = (src == dst); u64 same_lock_start = 0; u64 same_lock_len = 0; - if (len == 0) - return 0; - - if (same_inode) - inode_lock(src); - else - btrfs_double_inode_lock(src, dst); - ret = extent_same_check_offsets(src, loff, &len, olen); if (ret) - goto out_unlock; + return ret; ret = extent_same_check_offsets(dst, dst_loff, &len, olen); if (ret) - goto out_unlock; + return ret; if (same_inode) { /* @@ -3139,32 +3112,21 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen, * allow an unaligned length so long as it ends at * i_size. 
*/ - if (len != olen) { - ret = -EINVAL; - goto out_unlock; - } + if (len != olen) + return -EINVAL; /* Check for overlapping ranges */ - if (dst_loff + len > loff && dst_loff < loff + len) { - ret = -EINVAL; - goto out_unlock; - } + if (dst_loff + len > loff && dst_loff < loff + len) + return -EINVAL; same_lo
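As a usage illustration of the interface being changed, below is a sketch of how a userspace caller has to drive FIDEDUPERANGE today, chunking the request into 16MiB pieces by hand (roughly what duperemove's btrfs-extent-same helper does). With the iteration moved into the kernel, a single large request would give the same result. The helper is illustrative only and error handling is minimal.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>

#define CHUNK	(16ULL * 1024 * 1024)	/* current per-call kernel limit */

static int dedupe_range(int src_fd, int dst_fd, __u64 off, __u64 len)
{
	struct file_dedupe_range *range;
	int ret = 0;

	range = calloc(1, sizeof(*range) +
			  sizeof(struct file_dedupe_range_info));
	if (!range)
		return -1;

	while (len) {
		__u64 step = len < CHUNK ? len : CHUNK;

		range->src_offset = off;
		range->src_length = step;
		range->dest_count = 1;
		range->info[0].dest_fd = dst_fd;
		range->info[0].dest_offset = off;
		range->info[0].bytes_deduped = 0;
		range->info[0].status = 0;

		if (ioctl(src_fd, FIDEDUPERANGE, range) < 0 ||
		    range->info[0].status < 0) {
			ret = -1;
			break;
		}

		off += step;
		len -= step;
	}

	free(range);
	return ret;
}

int main(int argc, char **argv)
{
	struct stat st;
	int src, dst;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}
	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_RDWR);
	if (src < 0 || dst < 0 || fstat(src, &st) < 0) {
		perror("open/fstat");
		return 1;
	}
	if (dedupe_range(src, dst, 0, st.st_size) < 0) {
		perror("dedupe");
		return 1;
	}
	close(src);
	close(dst);
	return 0;
}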
Re: [PATCH 0/3] Minor compression heuristic cleanups
2017-12-12 23:55 GMT+03:00 David Sterba : > The callback pointers for radix_sort are not needed, we don't plan to > export the function now. The compiler is smart enough to replace the > indirect calls with direct ones, so there's no change in the resulting > asm code. > > David Sterba (3): > btrfs: heuristic: open code get_num callback of radix sort > btrfs: heuristic: open code copy_call callback of radix sort > btrfs: heuristic: call get4bits directly > > fs/btrfs/compression.c | 42 +++--- > 1 file changed, 11 insertions(+), 31 deletions(-) > > -- > 2.15.1 > Thanks! On whole series: Reviewed-by: Timofey Titovets -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
2017-12-11 8:18 GMT+03:00 Dave : > On Tue, Oct 31, 2017 someone wrote: >> >> >> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted >> > nocow -- it will NOT be snapshotted > > I did exactly this. It servers the purpose of avoiding snapshots. > However, today I saw the following at > https://wiki.archlinux.org/index.php/Btrfs > > Note: From Btrfs Wiki Mount options: within a single file system, it > is not possible to mount some subvolumes with nodatacow and others > with datacow. The mount option of the first mounted subvolume applies > to any other subvolumes. > > That makes me think my nodatacow mount option on $HOME/.cache is not > effective. True? > > (My subjective performance results have not been as good as hoped for > with the tweaks I have tried so far.) > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html True, for magic dirs, that you may want mark as no cow, you need to use chattr, like: rm -rf ~/.cache mkdir ~/.cache chattr +C ~/.cache -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
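For reference, what chattr +C does under the hood is set the per-inode NOCOW flag, so the same thing can be done programmatically through the standard attribute ioctls; a minimal sketch (illustration only, not btrfs-progs or chattr code) is below. As with chattr, the flag only takes effect for files created in the directory afterwards.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	int fd, flags;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		perror("FS_IOC_GETFLAGS");
		return 1;
	}
	flags |= FS_NOCOW_FL;	/* same bit that chattr +C sets */
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
		perror("FS_IOC_SETFLAGS");
		return 1;
	}

	close(fd);
	return 0;
}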
Just curious, whats happened with btrfs & rcu skiplist in 2013?
Subject says it all: I found https://lwn.net/Articles/554885/, https://lwn.net/Articles/553047/ and some other messages (judy, RCU and so on), but nothing after June 2013. Just out of curiosity, does anyone know why that work stalled, or what happened to it in the end? (e.g. the RT guys just said "too bad for us"?) Thanks! -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] Btrfs: heuristic replace heap sort with radix sort
Slowest part of heuristic for now is kernel heap sort() It's can take up to 55% of runtime on sorting bucket items. As sorting will always call on most data sets to get correctly byte_core_set_size, the only way to speed up heuristic, is to speed up sort on bucket. Add a general radix_sort function. Radix sort require 2 buffers, one full size of input array and one for store counters (jump addresses). That increase usage per heuristic workspace +1KiB 8KiB + 1KiB -> 8KiB + 2KiB That is LSD Radix, i use 4 bit as a base for calculating, to make counters array acceptable small (16 elements * 8 byte). That Radix sort implementation have several points to adjust, I added him to make radix sort general usable in kernel, like heap sort, if needed. Performance tested in userspace copy of heuristic code, throughput: - average <-> random data: ~3500 MiB/s - heap sort - average <-> random data: ~6000 MiB/s - radix sort Changes: v1 -> v2: - Tested on Big Endian - Drop most of multiply operations - Separately allocate sort buffer Changes: v2 -> v3: - Fix uint -> u conversion - Reduce stack size, by reduce vars sizes to u32, restrict input array size to u32 Assume that kernel will never try sorting arrays > 2^32 - Drop max_cell arg (precheck - correctly find max value by it self) Signed-off-by: Timofey Titovets --- fs/btrfs/compression.c | 135 ++--- 1 file changed, 128 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 06ef50712acd..9573f4491367 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -33,7 +33,6 @@ #include #include #include -#include #include #include "ctree.h" #include "disk-io.h" @@ -752,6 +751,8 @@ struct heuristic_ws { u32 sample_size; /* Buckets store counters for each byte value */ struct bucket_item *bucket; + /* Sorting buffer */ + struct bucket_item *bucket_b; struct list_head list; }; @@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws) kvfree(workspace->sample); kfree(workspace->bucket); + kfree(workspace->bucket_b); kfree(workspace); } @@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void) if (!ws->bucket) goto fail; + ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL); + if (!ws->bucket_b) + goto fail; + INIT_LIST_HEAD(&ws->list); return &ws->list; fail: @@ -1278,13 +1284,127 @@ static u32 shannon_entropy(struct heuristic_ws *ws) return entropy_sum * 100 / entropy_max; } -/* Compare buckets by size, ascending */ -static int bucket_comp_rev(const void *lv, const void *rv) +#define RADIX_BASE 4 +#define COUNTERS_SIZE (1 << RADIX_BASE) + +static inline u8 get4bits(u64 num, u32 shift) { + u8 low4bits; + num = num >> shift; + /* Reverse order */ + low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE); + return low4bits; +} + +static inline void copy_cell(void *dst, u32 dest_i, void *src, u32 src_i) +{ + struct bucket_item *dstv = (struct bucket_item *) dst; + struct bucket_item *srcv = (struct bucket_item *) src; + dstv[dest_i] = srcv[src_i]; +} + +static inline u64 get_num(const void *a, u32 i) +{ + struct bucket_item *av = (struct bucket_item *) a; + return av[i].count; +} + +/* + * Use 4 bits as radix base + * Use 16 u32 counters for calculating new possition in buf array + * + * @array - array that will be sorted + * @array_buf - buffer array to store sorting results + * must be equal in size to @array + * @num - array size + * @get_num - function to extract number from array + * @copy_cell - function to copy data from array to array_buf + * and vise versa + * 
@get4bits - function to get 4 bits from number at specified offset + */ + +static void radix_sort(void *array, void *array_buf, u32 num, + u64 (*get_num)(const void *, u32 i), + void (*copy_cell)(void *dest, u32 dest_i, +void* src, u32 src_i), + u8 (*get4bits)(u64 num, u32 shift)) { - const struct bucket_item *l = (const struct bucket_item *)lv; - const struct bucket_item *r = (const struct bucket_item *)rv; + u64 max_num; + u64 buf_num; + u32 counters[COUNTERS_SIZE]; + u32 new_addr; + u32 addr; + u32 bitlen; + u32 shift; + int i; + + /* +* Try avoid useless loop iterations +* For small numbers stored in big counters +* example: 48 33 4 ... in 64bit array +*/ + max_num = get_num(array, 0); + for (i = 1; i < num; i++) { +
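For illustration, here is a userspace toy version of the 4-bit LSD radix sort described in the changelog, reduced to sorting a plain array of 32-bit values in descending order. It is a sketch of the algorithm only, not the kernel code: the real version sorts struct bucket_item through the caller-supplied helpers and caps the number of passes based on the maximum value found in a pre-scan.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RADIX_BASE	4
#define COUNTERS_SIZE	(1 << RADIX_BASE)

/* Reverse the digit order so the result comes out descending. */
static uint8_t get4bits(uint32_t num, int shift)
{
	return (COUNTERS_SIZE - 1) - ((num >> shift) % COUNTERS_SIZE);
}

static void radix_sort_desc(uint32_t *array, uint32_t *buf, int num)
{
	int shift;

	for (shift = 0; shift < 32; shift += 2 * RADIX_BASE) {
		uint32_t counters[COUNTERS_SIZE] = { 0 };
		int i;

		/* Counting sort by the first digit of the pair: array -> buf. */
		for (i = 0; i < num; i++)
			counters[get4bits(array[i], shift)]++;
		for (i = 1; i < COUNTERS_SIZE; i++)
			counters[i] += counters[i - 1];
		for (i = num - 1; i >= 0; i--)
			buf[--counters[get4bits(array[i], shift)]] = array[i];

		/* Second digit of the pair: buf -> array. */
		memset(counters, 0, sizeof(counters));
		for (i = 0; i < num; i++)
			counters[get4bits(buf[i], shift + RADIX_BASE)]++;
		for (i = 1; i < COUNTERS_SIZE; i++)
			counters[i] += counters[i - 1];
		for (i = num - 1; i >= 0; i--)
			array[--counters[get4bits(buf[i], shift + RADIX_BASE)]] = buf[i];
	}
}

int main(void)
{
	uint32_t a[] = { 3, 48, 0, 7, 33, 4, 255, 4 };
	uint32_t b[sizeof(a) / sizeof(a[0])];
	int i, n = sizeof(a) / sizeof(a[0]);

	radix_sort_desc(a, b, n);
	for (i = 0; i < n; i++)
		printf("%u ", a[i]);
	printf("\n");
	return 0;
}

Each loop iteration handles a pair of 4-bit digits, so the data moves from the input array into the buffer and back again and always ends up in the input array, which is why a second buffer of the same size is needed.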
Re: [PATCH v2] Btrfs: heuristic replace heap sort with radix sort
2017-12-05 0:24 GMT+03:00 Timofey Titovets : > 2017-12-04 23:47 GMT+03:00 David Sterba : >> On Mon, Dec 04, 2017 at 12:30:33AM +0300, Timofey Titovets wrote: >>> Slowest part of heuristic for now is kernel heap sort() >>> It's can take up to 55% of runtime on sorting bucket items. >>> >>> As sorting will always call on most data sets to get correctly >>> byte_core_set_size, the only way to speed up heuristic, is to >>> speed up sort on bucket. >>> >>> Add a general radix_sort function. >>> Radix sort require 2 buffers, one full size of input array >>> and one for store counters (jump addresses). >>> >>> That increase usage per heuristic workspace +1KiB >>> 8KiB + 1KiB -> 8KiB + 2KiB >>> >>> That is LSD Radix, i use 4 bit as a base for calculating, >>> to make counters array acceptable small (16 elements * 8 byte). >>> >>> That Radix sort implementation have several points to adjust, >>> I added him to make radix sort general usable in kernel, >>> like heap sort, if needed. >>> >>> Performance tested in userspace copy of heuristic code, >>> throughput: >>> - average <-> random data: ~3500 MiB/s - heap sort >>> - average <-> random data: ~6000 MiB/s - radix sort >>> >>> Changes: >>> v1 -> v2: >>> - Tested on Big Endian >>> - Drop most of multiply operations >>> - Separately allocate sort buffer >>> >>> Signed-off-by: Timofey Titovets >>> --- >>> fs/btrfs/compression.c | 147 >>> ++--- >>> 1 file changed, 140 insertions(+), 7 deletions(-) >>> >>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c >>> index ae016699d13e..19b52982deda 100644 >>> --- a/fs/btrfs/compression.c >>> +++ b/fs/btrfs/compression.c >>> @@ -33,7 +33,6 @@ >>> #include >>> #include >>> #include >>> -#include >>> #include >>> #include "ctree.h" >>> #include "disk-io.h" >>> @@ -752,6 +751,8 @@ struct heuristic_ws { >>> u32 sample_size; >>> /* Buckets store counters for each byte value */ >>> struct bucket_item *bucket; >>> + /* Sorting buffer */ >>> + struct bucket_item *bucket_b; >>> struct list_head list; >>> }; >>> >>> @@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws) >>> >>> kvfree(workspace->sample); >>> kfree(workspace->bucket); >>> + kfree(workspace->bucket_b); >>> kfree(workspace); >>> } >>> >>> @@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void) >>> if (!ws->bucket) >>> goto fail; >>> >>> + ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), >>> GFP_KERNEL); >>> + if (!ws->bucket_b) >>> + goto fail; >>> + >>> INIT_LIST_HEAD(&ws->list); >>> return &ws->list; >>> fail: >>> @@ -1278,13 +1284,136 @@ static u32 shannon_entropy(struct heuristic_ws *ws) >>> return entropy_sum * 100 / entropy_max; >>> } >>> >>> -/* Compare buckets by size, ascending */ >>> -static int bucket_comp_rev(const void *lv, const void *rv) >>> +#define RADIX_BASE 4 >>> +#define COUNTERS_SIZE (1 << RADIX_BASE) >>> + >>> +static inline uint8_t get4bits(uint64_t num, int shift) { >>> + uint8_t low4bits; >>> + num = num >> shift; >>> + /* Reverse order */ >>> + low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE); >>> + return low4bits; >>> +} >>> + >>> +static inline void copy_cell(void *dst, int dest_i, void *src, int src_i) >>> { >>> - const struct bucket_item *l = (const struct bucket_item *)lv; >>> - const struct bucket_item *r = (const struct bucket_item *)rv; >>> + struct bucket_item *dstv = (struct bucket_item *) dst; >>> + struct bucket_item *srcv = (struct bucket_item *) src; >>> + dstv[dest_i] = srcv[src_i]; >>> +} >>> >>> - return r->count - l->count; >>> +static inline uint64_t 
get_num(const void *a, int i) >>> +{ >>> + struct bucket_item *av = (struct bucket_ite
Re: [PATCH v2] Btrfs: heuristic replace heap sort with radix sort
2017-12-04 23:47 GMT+03:00 David Sterba : > On Mon, Dec 04, 2017 at 12:30:33AM +0300, Timofey Titovets wrote: >> Slowest part of heuristic for now is kernel heap sort() >> It's can take up to 55% of runtime on sorting bucket items. >> >> As sorting will always call on most data sets to get correctly >> byte_core_set_size, the only way to speed up heuristic, is to >> speed up sort on bucket. >> >> Add a general radix_sort function. >> Radix sort require 2 buffers, one full size of input array >> and one for store counters (jump addresses). >> >> That increase usage per heuristic workspace +1KiB >> 8KiB + 1KiB -> 8KiB + 2KiB >> >> That is LSD Radix, i use 4 bit as a base for calculating, >> to make counters array acceptable small (16 elements * 8 byte). >> >> That Radix sort implementation have several points to adjust, >> I added him to make radix sort general usable in kernel, >> like heap sort, if needed. >> >> Performance tested in userspace copy of heuristic code, >> throughput: >> - average <-> random data: ~3500 MiB/s - heap sort >> - average <-> random data: ~6000 MiB/s - radix sort >> >> Changes: >> v1 -> v2: >> - Tested on Big Endian >> - Drop most of multiply operations >> - Separately allocate sort buffer >> >> Signed-off-by: Timofey Titovets >> --- >> fs/btrfs/compression.c | 147 >> ++--- >> 1 file changed, 140 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c >> index ae016699d13e..19b52982deda 100644 >> --- a/fs/btrfs/compression.c >> +++ b/fs/btrfs/compression.c >> @@ -33,7 +33,6 @@ >> #include >> #include >> #include >> -#include >> #include >> #include "ctree.h" >> #include "disk-io.h" >> @@ -752,6 +751,8 @@ struct heuristic_ws { >> u32 sample_size; >> /* Buckets store counters for each byte value */ >> struct bucket_item *bucket; >> + /* Sorting buffer */ >> + struct bucket_item *bucket_b; >> struct list_head list; >> }; >> >> @@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws) >> >> kvfree(workspace->sample); >> kfree(workspace->bucket); >> + kfree(workspace->bucket_b); >> kfree(workspace); >> } >> >> @@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void) >> if (!ws->bucket) >> goto fail; >> >> + ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL); >> + if (!ws->bucket_b) >> + goto fail; >> + >> INIT_LIST_HEAD(&ws->list); >> return &ws->list; >> fail: >> @@ -1278,13 +1284,136 @@ static u32 shannon_entropy(struct heuristic_ws *ws) >> return entropy_sum * 100 / entropy_max; >> } >> >> -/* Compare buckets by size, ascending */ >> -static int bucket_comp_rev(const void *lv, const void *rv) >> +#define RADIX_BASE 4 >> +#define COUNTERS_SIZE (1 << RADIX_BASE) >> + >> +static inline uint8_t get4bits(uint64_t num, int shift) { >> + uint8_t low4bits; >> + num = num >> shift; >> + /* Reverse order */ >> + low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE); >> + return low4bits; >> +} >> + >> +static inline void copy_cell(void *dst, int dest_i, void *src, int src_i) >> { >> - const struct bucket_item *l = (const struct bucket_item *)lv; >> - const struct bucket_item *r = (const struct bucket_item *)rv; >> + struct bucket_item *dstv = (struct bucket_item *) dst; >> + struct bucket_item *srcv = (struct bucket_item *) src; >> + dstv[dest_i] = srcv[src_i]; >> +} >> >> - return r->count - l->count; >> +static inline uint64_t get_num(const void *a, int i) >> +{ >> + struct bucket_item *av = (struct bucket_item *) a; >> + return av[i].count; >> +} >> + >> +/* >> + * Use 4 bits as radix 
base >> + * Use 16 uint64_t counters for calculating new possition in buf array >> + * >> + * @array - array that will be sorted >> + * @array_buf - buffer array to store sorting results >> + * must be equal in size to @array >> + * @num - array size >> + * @max_cell - Link to element with maximum possible value >> + * that
Re: [PATCH v4] Btrfs: compress_file_range() change page dirty status once
Gentle ping 2017-10-24 1:29 GMT+03:00 Timofey Titovets : > We need to call extent_range_clear_dirty_for_io() > on compression range to prevent application from changing > page content, while pages compressing. > > extent_range_clear_dirty_for_io() run on each loop iteration, > "(end - start)" can be much (up to 1024 times) bigger > then compression range (BTRFS_MAX_UNCOMPRESSED). > > That produce extra calls to page managment code. > > Fix that behaviour by call extent_range_clear_dirty_for_io() > only once. > > v1 -> v2: > - Make that more obviously and more safeprone > > v2 -> v3: > - Rebased on: >Btrfs: compress_file_range() remove dead variable num_bytes > - Update change log > - Add comments > > v3 -> v4: > - Rebased on: kdave for-next > - To avoid dirty bit clear/set behaviour change >call clear_bit once, istead of per compression range > > Signed-off-by: Timofey Titovets > --- > fs/btrfs/inode.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index b93fe05a39c7..5816dd3cb6e6 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -536,8 +536,10 @@ static noinline void compress_file_range(struct inode > *inode, > * If the compression fails for any reason, we set the pages > * dirty again later on. > */ > - extent_range_clear_dirty_for_io(inode, start, end); > - redirty = 1; > + if (!redirty) { > + extent_range_clear_dirty_for_io(inode, start, end); > + redirty = 1; > + } > > /* Compression level is applied here and only here */ > ret = btrfs_compress_pages( > -- > 2.14.2 -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: heuristic replace heap sort with radix sort
Slowest part of heuristic for now is kernel heap sort() It's can take up to 55% of runtime on sorting bucket items. As sorting will always call on most data sets to get correctly byte_core_set_size, the only way to speed up heuristic, is to speed up sort on bucket. Add a general radix_sort function. Radix sort require 2 buffers, one full size of input array and one for store counters (jump addresses). That increase usage per heuristic workspace +1KiB 8KiB + 1KiB -> 8KiB + 2KiB That is LSD Radix, i use 4 bit as a base for calculating, to make counters array acceptable small (16 elements * 8 byte). That Radix sort implementation have several points to adjust, I added him to make radix sort general usable in kernel, like heap sort, if needed. Performance tested in userspace copy of heuristic code, throughput: - average <-> random data: ~3500 MiB/s - heap sort - average <-> random data: ~6000 MiB/s - radix sort Changes: v1 -> v2: - Tested on Big Endian - Drop most of multiply operations - Separately allocate sort buffer Signed-off-by: Timofey Titovets --- fs/btrfs/compression.c | 147 ++--- 1 file changed, 140 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index ae016699d13e..19b52982deda 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -33,7 +33,6 @@ #include #include #include -#include #include #include "ctree.h" #include "disk-io.h" @@ -752,6 +751,8 @@ struct heuristic_ws { u32 sample_size; /* Buckets store counters for each byte value */ struct bucket_item *bucket; + /* Sorting buffer */ + struct bucket_item *bucket_b; struct list_head list; }; @@ -763,6 +764,7 @@ static void free_heuristic_ws(struct list_head *ws) kvfree(workspace->sample); kfree(workspace->bucket); + kfree(workspace->bucket_b); kfree(workspace); } @@ -782,6 +784,10 @@ static struct list_head *alloc_heuristic_ws(void) if (!ws->bucket) goto fail; + ws->bucket_b = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket_b), GFP_KERNEL); + if (!ws->bucket_b) + goto fail; + INIT_LIST_HEAD(&ws->list); return &ws->list; fail: @@ -1278,13 +1284,136 @@ static u32 shannon_entropy(struct heuristic_ws *ws) return entropy_sum * 100 / entropy_max; } -/* Compare buckets by size, ascending */ -static int bucket_comp_rev(const void *lv, const void *rv) +#define RADIX_BASE 4 +#define COUNTERS_SIZE (1 << RADIX_BASE) + +static inline uint8_t get4bits(uint64_t num, int shift) { + uint8_t low4bits; + num = num >> shift; + /* Reverse order */ + low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE); + return low4bits; +} + +static inline void copy_cell(void *dst, int dest_i, void *src, int src_i) { - const struct bucket_item *l = (const struct bucket_item *)lv; - const struct bucket_item *r = (const struct bucket_item *)rv; + struct bucket_item *dstv = (struct bucket_item *) dst; + struct bucket_item *srcv = (struct bucket_item *) src; + dstv[dest_i] = srcv[src_i]; +} - return r->count - l->count; +static inline uint64_t get_num(const void *a, int i) +{ + struct bucket_item *av = (struct bucket_item *) a; + return av[i].count; +} + +/* + * Use 4 bits as radix base + * Use 16 uint64_t counters for calculating new possition in buf array + * + * @array - array that will be sorted + * @array_buf - buffer array to store sorting results + * must be equal in size to @array + * @num - array size + * @max_cell - Link to element with maximum possible value + * that can be used to cap radix sort iterations + * if we know maximum value before call sort + * @get_num - function to extract number from array + * 
@copy_cell - function to copy data from array to array_buf + * and vise versa + * @get4bits - function to get 4 bits from number at specified offset + */ + +static void radix_sort(void *array, void *array_buf, + int num, + const void *max_cell, + uint64_t (*get_num)(const void *, int i), + void (*copy_cell)(void *dest, int dest_i, +void* src, int src_i), + uint8_t (*get4bits)(uint64_t num, int shift)) +{ + u64 max_num; + uint64_t buf_num; + uint64_t counters[COUNTERS_SIZE]; + uint64_t new_addr; + int i; + int addr; + int bitlen; + int shift; + + /* +* Try avoid useless loop iterations +* For small numbers stored in big counters +* example: 48 33 4 ... in 64bit array +*/ + if (
Re: How about adding an ioctl to convert a directory to a subvolume?
2017-11-28 21:48 GMT+03:00 David Sterba : > On Mon, Nov 27, 2017 at 05:41:56PM +0800, Lu Fengqi wrote: >> As we all know, under certain circumstances, it is more appropriate to >> create some subvolumes rather than keep everything in the same >> subvolume. As the condition of demand change, the user may need to >> convert a previous directory to a subvolume. For this reason,how about >> adding an ioctl to convert a directory to a subvolume? > > I'd say too difficult to get everything right in kernel. This is > possible to be done in userspace, with existing tools. > > The problem is that the conversion cannot be done atomically in most > cases, so even if it's just one ioctl call, there are several possible > intermediate states that would exist during the call. Reporting where > did the ioctl fail would need some extended error code semantics. > >> Users can convert by the scripts mentioned in this >> thread(https://www.spinics.net/lists/linux-btrfs/msg33252.html), but is >> it easier to use the off-the-shelf btrfs subcommand? > > Adding a subcommand would work, though I'd rather avoid reimplementing > 'cp -ax' or 'rsync -ax'. We want to copy the files preserving all > attributes, with reflink, and be able to identify partially synced > files, and not cross the mountpoints or subvolumes. > > The middle step with snapshotting the containing subvolume before > syncing the data is also a valid option, but not always necessary. > >> After an initial consideration, our implementation is broadly divided >> into the following steps: >> 1. Freeze the filesystem or set the subvolume above the source directory >> to read-only; > > Freezing the filesystme will freeze all IO, so this would not work, but > I understand what you mean. The file data are synced before the snapshot > is taken, but nothing prevents applications to continue writing data. > > Open and live files is a problem and don't see a nice solution here. > >> 2. Perform a pre-check, for example, check if a cross-device link >> creation during the conversion; > > Cross-device links are not a problem as long as we use 'cp' ie. the > manual creation of files in the target. > >> 3. Perform conversion, such as creating a new subvolume and moving the >> contents of the source directory; >> 4. Thaw the filesystem or restore the subvolume writable property. >> >> In fact, I am not so sure whether this use of freeze is appropriate >> because the source directory the user needs to convert may be located >> at / or /home and this pre-check and conversion process may take a long >> time, which can lead to some shell and graphical application suspended. > > I think the closest operation is a read-only remount, which is not > always possible due to open files and can otherwise considered as quite > intrusive operation to the whole system. And the root filesystem cannot > be easily remounted read-only in the systemd days anyway. > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html My two 2c, Then we talking about 'fast' (i.e. i like the idea where ioctl calls to be fast) conversion of dir to subvolume, can be done like that (sorry if i miss understood something and that a rave or i'm crazy..): For make idea more clear, for userspace that can looks like: 1. Create snapshot of parent subvol for that dir 2. Cleanup all data, except content of dir in snapshot 3. Move content of that dir to snapshot root 4. 
Replace dir with that snapshot/subvol i.e. no copy, no cp, only rename() and garbage collecting. In kernel that in "theory" will looks like: 1. Copy of subvol root inode 2. Replace root inode with target dir inode 3. Replace target dir in old subvol with new subvol 4. GC old dir content from parent subvol, GC all useless content of around dir in new subvol That's may be a fastest way for user, but that will not solve problems with opened files & etc, but that must be fast from user point of view, and all other staff can be simply cleaned in background Thanks -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
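To show that step 1 of the userspace flow above needs nothing new from the kernel, here is a sketch of creating the snapshot through the existing BTRFS_IOC_SNAP_CREATE_V2 ioctl (the same call the btrfs subvolume snapshot command is built on). The helper name is made up, error handling is minimal, and steps 2-4 are ordinary unlink()/rename() work that is left out.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

static int snapshot_subvol(const char *src_subvol, const char *dst_dir,
			   const char *name)
{
	struct btrfs_ioctl_vol_args_v2 args;
	int src_fd, dst_fd, ret;

	src_fd = open(src_subvol, O_RDONLY | O_DIRECTORY);
	dst_fd = open(dst_dir, O_RDONLY | O_DIRECTORY);
	if (src_fd < 0 || dst_fd < 0)
		return -1;

	memset(&args, 0, sizeof(args));
	args.fd = src_fd;			/* subvolume to snapshot */
	strncpy(args.name, name, BTRFS_SUBVOL_NAME_MAX);

	ret = ioctl(dst_fd, BTRFS_IOC_SNAP_CREATE_V2, &args);

	close(src_fd);
	close(dst_fd);
	return ret;
}

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <src subvol> <dst dir> <name>\n",
			argv[0]);
		return 1;
	}
	if (snapshot_subvol(argv[1], argv[2], argv[3]) < 0) {
		perror("snapshot");
		return 1;
	}
	return 0;
}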
Re: zstd compression
2017-11-16 19:32 GMT+03:00 Austin S. Hemmelgarn : > On 2017-11-16 08:43, Duncan wrote: >> >> Austin S. Hemmelgarn posted on Thu, 16 Nov 2017 07:30:47 -0500 as >> excerpted: >> >>> On 2017-11-15 16:31, Duncan wrote: Austin S. Hemmelgarn posted on Wed, 15 Nov 2017 07:57:06 -0500 as excerpted: > The 'compress' and 'compress-force' mount options only impact newly > written data. The compression used is stored with the metadata for > the extents themselves, so any existing data on the volume will be > read just fine with whatever compression method it was written with, > while new data will be written with the specified compression method. > > If you want to convert existing files, you can use the '-c' option to > the defrag command to do so. ... Being aware of course that using defrag to recompress files like that will break 100% of the existing reflinks, effectively (near) doubling data usage if the files are snapshotted, since the snapshot will now share 0% of its extents with the newly compressed files. >>> >>> Good point, I forgot to mention that. (The actual effect shouldn't be quite that bad, as some files are likely to be uncompressed due to not compressing well, and I'm not sure if defrag -c rewrites them or not. Further, if there's multiple snapshots data usage should only double with respect to the latest one, the data delta between it and previous snapshots won't be doubled as well.) >>> >>> I'm pretty sure defrag is equivalent to 'compress-force', not >>> 'compress', but I may be wrong. >> >> >> But... compress-force doesn't actually force compression _all_ the time. >> Rather, it forces btrfs to continue checking whether compression is worth >> it for each "block"[1] of the file, instead of giving up if the first >> quick try at the beginning says that block won't compress. >> >> So what I'm saying is that if the snapshotted data is already compressed, >> think (pre-)compressed tarballs or image files such as jpeg that are >> unlikely to /easily/ compress further and might well actually be _bigger_ >> once the compression algorithm is run over them, defrag -c will likely >> fail to compress them further even if it's the equivalent of compress- >> force, and thus /should/ leave them as-is, not breaking the reflinks of >> the snapshots and thus not doubling the data usage for that file, or more >> exactly, that extent of that file. >> >> Tho come to think of it, is defrag -c that smart, to actually leave the >> data as-is if it doesn't compress further, or does it still rewrite it >> even if it doesn't compress, thus breaking the reflink and doubling the >> usage regardless? > > I'm not certain how compression factors in, but if you aren't compressing > the file, it will only get rewritten if it's fragmented (which is shy > defragmenting the system root directory is usually insanely fast on most > systems, stuff there is almost never fragmented). >> >> >> --- >> [1] Block: I'm not positive it's the usual 4K block in this case. I >> think I read that it's 16K, but I might be confused on that. But >> regardless the size, the point is, with compress-force btrfs won't give >> up like simple compress will if the first "block" doesn't compress, it'll >> keep trying. >> >> Of course the new compression heuristic changes this a bit too, but the >> same general idea holds, compress-force continues to try for the entire >> file, compress will give up much faster. 
> > I'm not actually sure, I would think it checks 128k blocks of data (the > effective block size for compression), but if it doesn't it should be > checking at the filesystem block size (which means 16k on most recently > created filesystems). > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Defragment of data on btrfs, is simply rewrite data if, data doesn't meet some criteria. And only that -c does, it's say which compression method apply for new written data, no more, no less. On write side, FS see long/short data ranges for writing (see compress_file_range()), if compression needed, split data to 128KiB and pass it to compression logic. compression logic give up it self in 2 cases: 1. Compression of 2 (or 3?) first page sized blocks of 128KiB make data bigger -> give up -> write data as is 2. After compression done, if compression not free at least one sector size -> write data as is i.e. If you write 16 KiB at time, btrfs will compress each separate write as 16 KiB. If you write 1 MiB at time, btrfs will split it by 128 KiB. If you write 1025KiB, btrfs will split it by 128 KiB and last 1 KiB will be written as is. JFYI: Only that heuristic logic doing (i.e. compress, not compress-force) is: On every write, kernel check if compression are needed by inode_need_comp
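To put numbers on the description above, here is a toy that mirrors only the arithmetic of the 128KiB splitting and of the two give-up points; the constants and helper names are illustrative, and the real checks live in compress_file_range() and the compression workspaces, where alignment is also taken into account.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SZ		4096
#define SECTORSIZE	4096
#define CHUNK		(128 * 1024)	/* compression granularity */

/* 1) early bail-out: the first couple of pages grew instead of shrinking */
static bool early_give_up(unsigned int in_bytes, unsigned int out_bytes)
{
	return out_bytes > in_bytes;
}

/* 2) final check: must free at least one sector, else store uncompressed */
static bool keep_compressed(unsigned int in_bytes, unsigned int out_bytes)
{
	return out_bytes + SECTORSIZE <= in_bytes;
}

int main(void)
{
	unsigned int write = 1025 * 1024;

	printf("1025KiB write: %u chunks of 128KiB, %u bytes stored as is\n",
	       write / CHUNK, write % CHUNK);
	printf("first 2 pages 8KiB -> 9KiB, give up early? %d\n",
	       early_give_up(2 * PAGE_SZ, 9 * 1024));
	printf("128KiB -> 126KiB, kept compressed? %d\n",
	       keep_compressed(128 * 1024, 126 * 1024));
	printf("128KiB -> 123KiB, kept compressed? %d\n",
	       keep_compressed(128 * 1024, 123 * 1024));
	return 0;
}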
Re: [PATCH 4/4] Btrfs: btrfs_dedupe_file_range() ioctl, remove 16MiB restriction
Sorry, I thought I could test that and send you some feedback, but for now there is no time. I will check it later and try to add the memory reuse. So please ignore the patches for now. Thanks 2017-10-10 20:36 GMT+03:00 David Sterba : > On Tue, Oct 03, 2017 at 06:06:04PM +0300, Timofey Titovets wrote: >> At now btrfs_dedupe_file_range() restricted to 16MiB range for >> limit locking time and memory requirement for dedup ioctl() >> >> For too big input range code silently set range to 16MiB >> >> Let's remove that restriction by do iterating over dedup range. >> That's backward compatible and will not change anything for request >> less then 16MiB. > > This would make the ioctl more pleasant to use. So far I haven't found > any problems to do the iteration. One possible speedup could be done to > avoid the repeated allocations in btrfs_extent_same if we're going to > iterate more than once. > > As this would mean the 16MiB length restriction is gone, this needs to > bubble up to the documentation > (http://man7.org/linux/man-pages/man2/ioctl_fideduperange.2.html) > > Have you tested the behaviour with larger ranges? -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raid1: can't readd removed dev while the fs is mounted
2017-10-28 1:40 GMT+03:00 Julien Muchembled : > Hello, > > I have 2 disks in RAID1, each one having 2 partitions: > - 1 for / (BtrFS) > - 1 for /home (MD/XFS) > > For some reasons, 1 disk was removed and readded. I had no issue at readding > it to the MD array, but for BtrFS, I had to reboot. > > Then, I tried to investigate more using qemu with systemrescuecd (kernel > 4.9.30 and btrfs-progs v4.9.1). From: > > /sys/devices/pci:00/:00:01.1/ata1/host0 > > I use: > > # echo 1 > target0:0:1/0:0:1:0/delete > > to remove sdb and after some changes in the mount point: > > # echo '- - -' > scsi_host/host0/scan > > to readd it. > > Then I executed > > # btrfs scrub start -B -d /mnt/tmp > > to fix things but I only get uncorrectable errors and the dmesg is full of > 'i/o error' lines > > Maybe some command is required so that BtrFS accept to reuse the device. I > tried: > > # btrfs replace start -B -f 2 /dev/sdb /mnt/tmp > ERROR: ioctl(DEV_REPLACE_STATUS) failed on "/mnt/tmp": Inappropriate ioctl > for device > > Julien

AFAIK, no: for now btrfs simply can't reuse the device, at least not while the FS is mounted. AFAIK there are patches for dynamic device states, but they have not been merged.

-- Have a nice day, Timofey.
[PATCH v4] Btrfs: compress_file_range() change page dirty status once
We need to call extent_range_clear_dirty_for_io() on the compression range to prevent the application from changing page content while the pages are being compressed.

extent_range_clear_dirty_for_io() runs on each loop iteration, and "(end - start)" can be much (up to 1024 times) bigger than the compression range (BTRFS_MAX_UNCOMPRESSED). That produces extra calls to page management code.

Fix that behaviour by calling extent_range_clear_dirty_for_io() only once.

v1 -> v2: - Make that more obvious and less error prone
v2 -> v3: - Rebased on: Btrfs: compress_file_range() remove dead variable num_bytes - Update change log - Add comments
v3 -> v4: - Rebased on: kdave for-next - To avoid changing the dirty bit clear/set behaviour, call the clear once instead of per compression range

Signed-off-by: Timofey Titovets --- fs/btrfs/inode.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b93fe05a39c7..5816dd3cb6e6 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -536,8 +536,10 @@ static noinline void compress_file_range(struct inode *inode, * If the compression fails for any reason, we set the pages * dirty again later on. */ - extent_range_clear_dirty_for_io(inode, start, end); - redirty = 1; + if (!redirty) { + extent_range_clear_dirty_for_io(inode, start, end); + redirty = 1; + } /* Compression level is applied here and only here */ ret = btrfs_compress_pages( -- 2.14.2
Re: [PATCH v8 0/6] Btrfs: populate heuristic with code
2017-10-22 16:44 GMT+03:00 Timofey Titovets : > 2017-10-20 16:45 GMT+03:00 David Sterba : >> On Fri, Oct 20, 2017 at 01:48:01AM +0300, Timofey Titovets wrote: >>> 2017-10-19 18:39 GMT+03:00 David Sterba : >>> > On Fri, Sep 29, 2017 at 06:22:00PM +0200, David Sterba wrote: >>> >> On Thu, Sep 28, 2017 at 05:33:35PM +0300, Timofey Titovets wrote: >>> >> > Compile tested, hand tested on live system >>> >> > >>> >> > Change v7 -> v8 >>> >> > - All code moved to compression.c (again) >>> >> > - Heuristic workspaces inmplemented another way >>> >> > i.e. only share logic with compression workspaces >>> >> > - Some style fixes suggested by Devid >>> >> > - Move sampling function from heuristic code >>> >> > (I'm afraid of big functions) >>> >> > - Much more comments and explanations >>> >> >>> >> Thanks for the update, I went through the patches and they looked good >>> >> enough to be put into for-next. I may have more comments about a few >>> >> things, but nothing serious that would hinder testing. >>> > >>> > I did a final pass through the patches and edited comments wehre I was >>> > not able to undrerstand them. Please check the updated patches in [1] if >>> > I did not accidentally change the meaning. >>> >>> I don't see a link [1] in mail, may be you missed it? >> >> Yeah, sorry: >> https://github.com/kdave/btrfs-devel/commits/ext/timofey/heuristic > > I did re-read updated comments, looks ok to me > (i only found one typo, leave a comment). > > > Thanks > -- > Have a nice day, > Timofey.

Can you please try the attached patch? I have been thinking about the performance hit of the heuristic and how to avoid sorting. The patch tries to pre-find the min/max values in the array (before sorting) and uses (max - min) to filter out edge data cases where the byte core set size is < 64 or > 200. It's a bit of a hacky workaround =\, but it shows about the same speedup on my data set as using radix sort (i.e. a 2x speed up).

Thanks.

-- Have a nice day, Timofey.

From fb2a329828e64ad0e224a8cb97dbc17147149629 Mon Sep 17 00:00:00 2001 From: Timofey Titovets Date: Mon, 23 Oct 2017 21:24:29 +0300 Subject: [PATCH] Btrfs: heuristic: try to avoid bucket sorting on edge data cases

The heap sort used in the kernel is too slow and costly, so let's make some statistical assumptions about edge input data cases, based on the observed difference between the min/max values in the bucket.

Signed-off-by: Timofey Titovets --- fs/btrfs/compression.c | 38 ++ 1 file changed, 38 insertions(+) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 0ca16909894e..56b67ec4fb5b 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -1310,8 +1310,46 @@ static int byte_core_set_size(struct heuristic_ws *ws) u32 i; u32 coreset_sum = 0; const u32 core_set_threshold = ws->sample_size * 90 / 100; + struct bucket_item *max, *min; + struct bucket_item tmp; struct bucket_item *bucket = ws->bucket; + + /* Pre-pass: find the min/max values */ + max = &bucket[0]; + min = &bucket[BUCKET_SIZE - 1]; + for (i = 1; i < BUCKET_SIZE - 1; i++) { + if (bucket[i].count > max->count) { + tmp = *max; + *max = bucket[i]; + bucket[i] = tmp; + } + if (bucket[i].count < min->count) { + tmp = *min; + *min = bucket[i]; + bucket[i] = tmp; + } + } + + /* + * Hack to avoid sorting on edge data cases (sorting is too costly), + * i.e. quickly filter out easily compressible + * and badly compressible data, + * based on the observed byte count distribution on different data sets + * + * Assumption 1: for badly compressible data the difference between min/max + * will be less than 0.6% of the sample size + * + * Assumption 2: for well compressible data the difference between min/max + * will be far bigger than 4% of the sample size + */ + + if (max->count - min->count < ws->sample_size * 6 / 1000) + return BYTE_CORE_SET_HIGH + 1; + + if (max->count - min->count > ws->sample_size * 4 / 100) + return BYTE_CORE_SET_LOW - 1; + + /* Sort in reverse order */ sort(bucket, BUCKET_SIZE, sizeof(*bucket), &bucket_comp_rev, NULL); -- 2.14.2
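As a rough illustration of the two assumptions in the patch above, here is a toy userspace sketch (not kernel code) that builds a 256-entry byte histogram of a sample and applies the same 0.6% / 4% (max - min) thresholds; prefilter() and the sample data are invented for the example, and the classifications printed are only the likely outcomes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKET_SIZE 256

/* 1 = likely badly compressible, -1 = likely well compressible, 0 = unsure */
static int prefilter(const unsigned char *sample, size_t len)
{
	unsigned int bucket[BUCKET_SIZE] = { 0 };
	unsigned int min, max;
	size_t i;

	for (i = 0; i < len; i++)
		bucket[sample[i]]++;

	min = max = bucket[0];
	for (i = 1; i < BUCKET_SIZE; i++) {
		if (bucket[i] > max)
			max = bucket[i];
		if (bucket[i] < min)
			min = bucket[i];
	}

	if (max - min < len * 6 / 1000)
		return 1;	/* nearly flat histogram */
	if (max - min > len * 4 / 100)
		return -1;	/* heavily skewed histogram */
	return 0;		/* would fall back to the full bucket sort */
}

int main(void)
{
	unsigned char buf[8192];
	size_t i;

	for (i = 0; i < sizeof(buf); i++)	/* pseudo-random bytes */
		buf[i] = (unsigned char)(rand() & 0xff);
	printf("random data   -> %d\n", prefilter(buf, sizeof(buf)));

	memset(buf, 'A', sizeof(buf));		/* highly repetitive bytes */
	printf("constant data -> %d\n", prefilter(buf, sizeof(buf)));

	return 0;
}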
Re: [PATCH v8 0/6] Btrfs: populate heuristic with code
2017-10-20 16:45 GMT+03:00 David Sterba : > On Fri, Oct 20, 2017 at 01:48:01AM +0300, Timofey Titovets wrote: >> 2017-10-19 18:39 GMT+03:00 David Sterba : >> > On Fri, Sep 29, 2017 at 06:22:00PM +0200, David Sterba wrote: >> >> On Thu, Sep 28, 2017 at 05:33:35PM +0300, Timofey Titovets wrote: >> >> > Compile tested, hand tested on live system >> >> > >> >> > Change v7 -> v8 >> >> > - All code moved to compression.c (again) >> >> > - Heuristic workspaces inmplemented another way >> >> > i.e. only share logic with compression workspaces >> >> > - Some style fixes suggested by Devid >> >> > - Move sampling function from heuristic code >> >> > (I'm afraid of big functions) >> >> > - Much more comments and explanations >> >> >> >> Thanks for the update, I went through the patches and they looked good >> >> enough to be put into for-next. I may have more comments about a few >> >> things, but nothing serious that would hinder testing. >> > >> > I did a final pass through the patches and edited comments wehre I was >> > not able to undrerstand them. Please check the updated patches in [1] if >> > I did not accidentally change the meaning. >> >> I don't see a link [1] in mail, may be you missed it? > > Yeah, sorry: > https://github.com/kdave/btrfs-devel/commits/ext/timofey/heuristic

I re-read the updated comments, they look OK to me (I found only one typo and left a comment).

Thanks -- Have a nice day, Timofey.