> On Nov 13, 2018, at 5:29 AM, Timofey Titovets <nefelim...@gmail.com> wrote:
> 
> Tue, Nov 13, 2018 at 04:52, Nick Terrell <terre...@fb.com>:
>> 
>> 
>> 
>>> On Nov 12, 2018, at 4:33 PM, David Sterba <dste...@suse.cz> wrote:
>>> 
>>> On Wed, Oct 31, 2018 at 11:11:08AM -0700, Nick Terrell wrote:
>>>> From: Jennifer Liu <jenniferliu...@fb.com>
>>>> 
>>>> Adds zstd compression level support to btrfs. Zstd requires
>>>> different amounts of memory for each level, so the design had
>>>> to be modified to allow set_level() to allocate memory. We
>>>> preallocate one workspace of the maximum size to guarantee
>>>> forward progress. This feature is expected to be useful for
>>>> read-mostly filesystems, or when creating images.
>>>> 
>>>> Benchmarks run in qemu on Intel x86 with a single core.
>>>> The benchmark measures the time to copy the Silesia corpus [0] to
>>>> a btrfs filesystem 10 times, then read it back.
>>>> 
>>>> The two important things to note are:
>>>> - The decompression speed and memory remain constant.
>>>>   The memory required to decompress is the same as level 1.
>>>> - The compression speed and ratio will vary based on the source.
>>>> 
>>>> Level   Ratio   Compression   Decompression   Compression Memory
>>>> 1       2.59    153 MB/s      112 MB/s        0.8 MB
>>>> 2       2.67    136 MB/s      113 MB/s        1.0 MB
>>>> 3       2.72    106 MB/s      115 MB/s        1.3 MB
>>>> 4       2.78     86 MB/s      109 MB/s        0.9 MB
>>>> 5       2.83     69 MB/s      109 MB/s        1.4 MB
>>>> 6       2.89     53 MB/s      110 MB/s        1.5 MB
>>>> 7       2.91     40 MB/s      112 MB/s        1.4 MB
>>>> 8       2.92     34 MB/s      110 MB/s        1.8 MB
>>>> 9       2.93     27 MB/s      109 MB/s        1.8 MB
>>>> 10      2.94     22 MB/s      109 MB/s        1.8 MB
>>>> 11      2.95     17 MB/s      114 MB/s        1.8 MB
>>>> 12      2.95     13 MB/s      113 MB/s        1.8 MB
>>>> 13      2.95     10 MB/s      111 MB/s        2.3 MB
>>>> 14      2.99      7 MB/s      110 MB/s        2.6 MB
>>>> 15      3.03      6 MB/s      110 MB/s        2.6 MB
>>>> 
>>>> [0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
>>>> 
>>>> Signed-off-by: Jennifer Liu <jenniferliu...@fb.com>
>>>> Signed-off-by: Nick Terrell <terre...@fb.com>
>>>> Reviewed-by: Omar Sandoval <osan...@fb.com>
>>>> ---
>>>> v1 -> v2:
>>>> - Don't reflow the unchanged line.
>>>> 
>>>> fs/btrfs/compression.c | 169 +++++++++++++++++++++++++----------------
>>>> fs/btrfs/compression.h |  18 +++--
>>>> fs/btrfs/lzo.c         |   5 +-
>>>> fs/btrfs/super.c       |   7 +-
>>>> fs/btrfs/zlib.c        |  33 ++++----
>>>> fs/btrfs/zstd.c        |  74 +++++++++++++-----
>>>> 6 files changed, 202 insertions(+), 104 deletions(-)
>>>> 
>>>> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
>>>> index 2955a4ea2fa8..b46652cb653e 100644
>>>> --- a/fs/btrfs/compression.c
>>>> +++ b/fs/btrfs/compression.c
>>>> @@ -822,9 +822,12 @@ void __init btrfs_init_compress(void)
>>>> 
>>>>  		/*
>>>>  		 * Preallocate one workspace for each compression type so
>>>> -		 * we can guarantee forward progress in the worst case
>>>> +		 * we can guarantee forward progress in the worst case.
>>>> +		 * Provide the maximum compression level to guarantee large
>>>> +		 * enough workspace.
>>>>  		 */
>>>> -		workspace = btrfs_compress_op[i]->alloc_workspace();
>>>> +		workspace = btrfs_compress_op[i]->alloc_workspace(
>>>> +				btrfs_compress_op[i]->max_level);
>> 
>> We provide the max level here, so we have at least one workspace per
>> compression type that is large enough.
>> >>>> if (IS_ERR(workspace)) { >>>> pr_warn("BTRFS: cannot preallocate compression >>>> workspace, will try later\n"); >>>> } else { >>>> @@ -835,23 +838,78 @@ void __init btrfs_init_compress(void) >>>> } >>>> } >>>> >>>> +/* >>>> + * put a workspace struct back on the list or free it if we have enough >>>> + * idle ones sitting around >>>> + */ >>>> +static void __free_workspace(int type, struct list_head *workspace, >>>> + bool heuristic) >>>> +{ >>>> + int idx = type - 1; >>>> + struct list_head *idle_ws; >>>> + spinlock_t *ws_lock; >>>> + atomic_t *total_ws; >>>> + wait_queue_head_t *ws_wait; >>>> + int *free_ws; >>>> + >>>> + if (heuristic) { >>>> + idle_ws = &btrfs_heuristic_ws.idle_ws; >>>> + ws_lock = &btrfs_heuristic_ws.ws_lock; >>>> + total_ws = &btrfs_heuristic_ws.total_ws; >>>> + ws_wait = &btrfs_heuristic_ws.ws_wait; >>>> + free_ws = &btrfs_heuristic_ws.free_ws; >>>> + } else { >>>> + idle_ws = &btrfs_comp_ws[idx].idle_ws; >>>> + ws_lock = &btrfs_comp_ws[idx].ws_lock; >>>> + total_ws = &btrfs_comp_ws[idx].total_ws; >>>> + ws_wait = &btrfs_comp_ws[idx].ws_wait; >>>> + free_ws = &btrfs_comp_ws[idx].free_ws; >>>> + } >>>> + >>>> + spin_lock(ws_lock); >>>> + if (*free_ws <= num_online_cpus()) { >>>> + list_add(workspace, idle_ws); >>>> + (*free_ws)++; >>>> + spin_unlock(ws_lock); >>>> + goto wake; >>>> + } >>>> + spin_unlock(ws_lock); >>>> + >>>> + if (heuristic) >>>> + free_heuristic_ws(workspace); >>>> + else >>>> + btrfs_compress_op[idx]->free_workspace(workspace); >>>> + atomic_dec(total_ws); >>>> +wake: >>>> + cond_wake_up(ws_wait); >>>> +} >>>> + >>>> +static void free_workspace(int type, struct list_head *ws) >>>> +{ >>>> + return __free_workspace(type, ws, false); >>>> +} >>>> + >>>> /* >>>> * This finds an available workspace or allocates a new one. >>>> * If it's not possible to allocate a new one, waits until there's one. >>>> * Preallocation makes a forward progress guarantees and we do not return >>>> * errors. 
>>>> */ >>>> -static struct list_head *__find_workspace(int type, bool heuristic) >>>> +static struct list_head *__find_workspace(unsigned int type_level, >>>> + bool heuristic) >>>> { >>>> struct list_head *workspace; >>>> int cpus = num_online_cpus(); >>>> + int type = type_level & 0xF; >>>> int idx = type - 1; >>>> - unsigned nofs_flag; >>>> + unsigned int level = (type_level & 0xF0) >> 4; >>>> + unsigned int nofs_flag; >>>> struct list_head *idle_ws; >>>> spinlock_t *ws_lock; >>>> atomic_t *total_ws; >>>> wait_queue_head_t *ws_wait; >>>> int *free_ws; >>>> + int ret; >>>> >>>> if (heuristic) { >>>> idle_ws = &btrfs_heuristic_ws.idle_ws; >>>> @@ -874,8 +932,17 @@ static struct list_head *__find_workspace(int type, >>>> bool heuristic) >>>> list_del(workspace); >>>> (*free_ws)--; >>>> spin_unlock(ws_lock); >>>> + if (!heuristic) { >>>> + nofs_flag = memalloc_nofs_save(); >>>> + ret = btrfs_compress_op[idx]->set_level(workspace, >>>> + level); >>>> + memalloc_nofs_restore(nofs_flag); >>>> + if (!ret) { >>>> + free_workspace(type, workspace); >>>> + goto again; >>>> + } >>>> + } >>>> return workspace; >>>> - >>>> } >>>> if (atomic_read(total_ws) > cpus) { >>>> DEFINE_WAIT(wait); >>>> @@ -899,7 +966,8 @@ static struct list_head *__find_workspace(int type, >>>> bool heuristic) >>>> if (heuristic) >>>> workspace = alloc_heuristic_ws(); >>>> else >>>> - workspace = btrfs_compress_op[idx]->alloc_workspace(); >>>> + workspace = btrfs_compress_op[idx]->alloc_workspace(level); >>>> + >>>> memalloc_nofs_restore(nofs_flag); >>>> >>>> if (IS_ERR(workspace)) { >>>> @@ -930,60 +998,22 @@ static struct list_head *__find_workspace(int type, >>>> bool heuristic) >>>> return workspace; >>>> } >>>> >>>> -static struct list_head *find_workspace(int type) >>>> +static struct list_head *find_workspace(unsigned int type_level) >>>> { >>>> - return __find_workspace(type, false); >>>> + return __find_workspace(type_level, false); >>>> } >>>> >>>> -/* >>>> - * put a workspace struct back on the list or free it if we have enough >>>> - * idle ones sitting around >>>> - */ >>>> -static void __free_workspace(int type, struct list_head *workspace, >>>> - bool heuristic) >>>> +static struct list_head *find_decompression_workspace(unsigned int type) >>>> { >>>> - int idx = type - 1; >>>> - struct list_head *idle_ws; >>>> - spinlock_t *ws_lock; >>>> - atomic_t *total_ws; >>>> - wait_queue_head_t *ws_wait; >>>> - int *free_ws; >>>> + /* >>>> + * Use the lowest level for decompression, since we don't need to >>>> + * compress. This can help us save memory when using levels lower than >>>> + * the default level. 
>>>> + */ >>>> + const unsigned int level = 1; >>>> + const unsigned int type_level = (level << 4) | (type & 0xF); >>>> >>>> - if (heuristic) { >>>> - idle_ws = &btrfs_heuristic_ws.idle_ws; >>>> - ws_lock = &btrfs_heuristic_ws.ws_lock; >>>> - total_ws = &btrfs_heuristic_ws.total_ws; >>>> - ws_wait = &btrfs_heuristic_ws.ws_wait; >>>> - free_ws = &btrfs_heuristic_ws.free_ws; >>>> - } else { >>>> - idle_ws = &btrfs_comp_ws[idx].idle_ws; >>>> - ws_lock = &btrfs_comp_ws[idx].ws_lock; >>>> - total_ws = &btrfs_comp_ws[idx].total_ws; >>>> - ws_wait = &btrfs_comp_ws[idx].ws_wait; >>>> - free_ws = &btrfs_comp_ws[idx].free_ws; >>>> - } >>>> - >>>> - spin_lock(ws_lock); >>>> - if (*free_ws <= num_online_cpus()) { >>>> - list_add(workspace, idle_ws); >>>> - (*free_ws)++; >>>> - spin_unlock(ws_lock); >>>> - goto wake; >>>> - } >>>> - spin_unlock(ws_lock); >>>> - >>>> - if (heuristic) >>>> - free_heuristic_ws(workspace); >>>> - else >>>> - btrfs_compress_op[idx]->free_workspace(workspace); >>>> - atomic_dec(total_ws); >>>> -wake: >>>> - cond_wake_up(ws_wait); >>>> -} >>>> - >>>> -static void free_workspace(int type, struct list_head *ws) >>>> -{ >>>> - return __free_workspace(type, ws, false); >>>> + return find_workspace(type_level); >>>> } >>>> >>>> /* >>>> @@ -1044,9 +1074,7 @@ int btrfs_compress_pages(unsigned int type_level, >>>> struct address_space *mapping, >>>> int ret; >>>> int type = type_level & 0xF; >>>> >>>> - workspace = find_workspace(type); >>>> - >>>> - btrfs_compress_op[type - 1]->set_level(workspace, type_level); >>>> + workspace = find_workspace(type_level); >>>> ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping, >>>> start, pages, >>>> out_pages, >>>> @@ -1075,7 +1103,7 @@ static int btrfs_decompress_bio(struct >>>> compressed_bio *cb) >>>> int ret; >>>> int type = cb->compress_type; >>>> >>>> - workspace = find_workspace(type); >>>> + workspace = find_decompression_workspace(type); >>>> ret = btrfs_compress_op[type - 1]->decompress_bio(workspace, cb); >>>> free_workspace(type, workspace); >>>> >>>> @@ -1093,7 +1121,7 @@ int btrfs_decompress(int type, unsigned char >>>> *data_in, struct page *dest_page, >>>> struct list_head *workspace; >>>> int ret; >>>> >>>> - workspace = find_workspace(type); >>>> + workspace = find_decompression_workspace(type); >>>> >>>> ret = btrfs_compress_op[type-1]->decompress(workspace, data_in, >>>> dest_page, start_byte, >>>> @@ -1591,12 +1619,23 @@ int btrfs_compress_heuristic(struct inode *inode, >>>> u64 start, u64 end) >>>> >>>> unsigned int btrfs_compress_str2level(const char *str) >>>> { >>>> - if (strncmp(str, "zlib", 4) != 0) >>>> + int ret; >>>> + int type; >>>> + unsigned int level; >>>> + >>>> + if (strncmp(str, "zlib", 4) == 0) >>>> + type = BTRFS_COMPRESS_ZLIB; >>>> + else if (strncmp(str, "zstd", 4) == 0) >>>> + type = BTRFS_COMPRESS_ZSTD; >>>> + else >>>> return 0; >>>> >>>> - /* Accepted form: zlib:1 up to zlib:9 and nothing left after the >>>> number */ >>>> - if (str[4] == ':' && '1' <= str[5] && str[5] <= '9' && str[6] == 0) >>>> - return str[5] - '0'; >>>> + if (str[4] == ':') { >>>> + ret = kstrtouint(str + 5, 10, &level); >>>> + if (ret == 0 && 0 < level && >>>> + level <= btrfs_compress_op[type-1]->max_level) >>>> + return level; >>>> + } >>>> >>>> - return BTRFS_ZLIB_DEFAULT_LEVEL; >>>> + return btrfs_compress_op[type-1]->default_level; >>>> } >>>> diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h >>>> index ddda9b80bf20..a582a4483077 100644 >>>> --- a/fs/btrfs/compression.h >>>> +++ 
b/fs/btrfs/compression.h >>>> @@ -23,8 +23,6 @@ >>>> /* Maximum size of data before compression */ >>>> #define BTRFS_MAX_UNCOMPRESSED (SZ_128K) >>>> >>>> -#define BTRFS_ZLIB_DEFAULT_LEVEL 3 >>>> - >>>> struct compressed_bio { >>>> /* number of bios pending for this compressed extent */ >>>> refcount_t pending_bios; >>>> @@ -87,7 +85,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode >>>> *inode, u64 start, >>>> blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio >>>> *bio, >>>> int mirror_num, unsigned long bio_flags); >>>> >>>> -unsigned btrfs_compress_str2level(const char *str); >>>> +unsigned int btrfs_compress_str2level(const char *str); >>>> >>>> enum btrfs_compression_type { >>>> BTRFS_COMPRESS_NONE = 0, >>>> @@ -98,7 +96,7 @@ enum btrfs_compression_type { >>>> }; >>>> >>>> struct btrfs_compress_op { >>>> - struct list_head *(*alloc_workspace)(void); >>>> + struct list_head *(*alloc_workspace)(unsigned int level); >>>> >>>> void (*free_workspace)(struct list_head *workspace); >>>> >>>> @@ -119,7 +117,17 @@ struct btrfs_compress_op { >>>> unsigned long start_byte, >>>> size_t srclen, size_t destlen); >>>> >>>> - void (*set_level)(struct list_head *ws, unsigned int type); >>>> + /* >>>> + * Check if memory allocated in workspace is greater than >>>> + * or equal to memory needed to compress with given level. >>>> + * If not, try to reallocate memory for the compression level. >>>> + * Returns true on success. >>>> + */ >>>> + bool (*set_level)(struct list_head *ws, unsigned int level); >>>> + >>>> + unsigned int max_level; >>>> + >>>> + unsigned int default_level; >>>> }; >>>> >>>> extern const struct btrfs_compress_op btrfs_zlib_compress; >>>> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c >>>> index b6a4cc178bee..ed9f0da8ceda 100644 >>>> --- a/fs/btrfs/lzo.c >>>> +++ b/fs/btrfs/lzo.c >>>> @@ -71,7 +71,7 @@ static void lzo_free_workspace(struct list_head *ws) >>>> kfree(workspace); >>>> } >>>> >>>> -static struct list_head *lzo_alloc_workspace(void) >>>> +static struct list_head *lzo_alloc_workspace(unsigned int level) >>>> { >>>> struct workspace *workspace; >>>> >>>> @@ -485,8 +485,9 @@ static int lzo_decompress(struct list_head *ws, >>>> unsigned char *data_in, >>>> return ret; >>>> } >>>> >>>> -static void lzo_set_level(struct list_head *ws, unsigned int type) >>>> +static bool lzo_set_level(struct list_head *ws, unsigned int level) >>>> { >>>> + return true; >>>> } >>>> >>>> const struct btrfs_compress_op btrfs_lzo_compress = { >>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >>>> index b362b45dd757..77ebd69371f1 100644 >>>> --- a/fs/btrfs/super.c >>>> +++ b/fs/btrfs/super.c >>>> @@ -520,7 +520,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, >>>> char *options, >>>> compress_type = "zlib"; >>>> >>>> info->compress_type = BTRFS_COMPRESS_ZLIB; >>>> - info->compress_level = >>>> BTRFS_ZLIB_DEFAULT_LEVEL; >>>> + info->compress_level = >>>> + btrfs_zlib_compress.default_level; >>>> /* >>>> * args[0] contains uninitialized data since >>>> * for these tokens we don't expect any >>>> @@ -542,9 +543,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, >>>> char *options, >>>> btrfs_clear_opt(info->mount_opt, NODATASUM); >>>> btrfs_set_fs_incompat(info, COMPRESS_LZO); >>>> no_compress = 0; >>>> - } else if (strcmp(args[0].from, "zstd") == 0) { >>>> + } else if (strncmp(args[0].from, "zstd", 4) == 0) { >>>> compress_type = "zstd"; >>>> info->compress_type = BTRFS_COMPRESS_ZSTD; >>>> + info->compress_level = >>>> + >>>> 
btrfs_compress_str2level(args[0].from); >>>> btrfs_set_opt(info->mount_opt, COMPRESS); >>>> btrfs_clear_opt(info->mount_opt, NODATACOW); >>>> btrfs_clear_opt(info->mount_opt, NODATASUM); >>>> diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c >>>> index 970ff3e35bb3..4c30a99b80f6 100644 >>>> --- a/fs/btrfs/zlib.c >>>> +++ b/fs/btrfs/zlib.c >>>> @@ -20,6 +20,9 @@ >>>> #include <linux/refcount.h> >>>> #include "compression.h" >>>> >>>> +#define BTRFS_ZLIB_DEFAULT_LEVEL 3 >>>> +#define BTRFS_ZLIB_MAX_LEVEL 9 >>>> + >>>> struct workspace { >>>> z_stream strm; >>>> char *buf; >>>> @@ -36,7 +39,19 @@ static void zlib_free_workspace(struct list_head *ws) >>>> kfree(workspace); >>>> } >>>> >>>> -static struct list_head *zlib_alloc_workspace(void) >>>> +static bool zlib_set_level(struct list_head *ws, unsigned int level) >>>> +{ >>>> + struct workspace *workspace = list_entry(ws, struct workspace, list); >>>> + >>>> + if (level > BTRFS_ZLIB_MAX_LEVEL) >>>> + level = BTRFS_ZLIB_MAX_LEVEL; >>>> + >>>> + workspace->level = level > 0 ? level : BTRFS_ZLIB_DEFAULT_LEVEL; >>>> + >>>> + return true; >>>> +} >>>> + >>>> +static struct list_head *zlib_alloc_workspace(unsigned int level) >>>> { >>>> struct workspace *workspace; >>>> int workspacesize; >>>> @@ -53,6 +68,7 @@ static struct list_head *zlib_alloc_workspace(void) >>>> goto fail; >>>> >>>> INIT_LIST_HEAD(&workspace->list); >>>> + zlib_set_level(&workspace->list, level); >>>> >>>> return &workspace->list; >>>> fail: >>>> @@ -390,22 +406,13 @@ static int zlib_decompress(struct list_head *ws, >>>> unsigned char *data_in, >>>> return ret; >>>> } >>>> >>>> -static void zlib_set_level(struct list_head *ws, unsigned int type) >>>> -{ >>>> - struct workspace *workspace = list_entry(ws, struct workspace, list); >>>> - unsigned level = (type & 0xF0) >> 4; >>>> - >>>> - if (level > 9) >>>> - level = 9; >>>> - >>>> - workspace->level = level > 0 ? 
level : 3; >>>> -} >>>> - >>>> const struct btrfs_compress_op btrfs_zlib_compress = { >>>> .alloc_workspace = zlib_alloc_workspace, >>>> .free_workspace = zlib_free_workspace, >>>> .compress_pages = zlib_compress_pages, >>>> .decompress_bio = zlib_decompress_bio, >>>> .decompress = zlib_decompress, >>>> - .set_level = zlib_set_level, >>>> + .set_level = zlib_set_level, >>>> + .max_level = BTRFS_ZLIB_MAX_LEVEL, >>>> + .default_level = BTRFS_ZLIB_DEFAULT_LEVEL, >>>> }; >>>> diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c >>>> index af6ec59972f5..e5d7c2eae65c 100644 >>>> --- a/fs/btrfs/zstd.c >>>> +++ b/fs/btrfs/zstd.c >>>> @@ -19,12 +19,13 @@ >>>> >>>> #define ZSTD_BTRFS_MAX_WINDOWLOG 17 >>>> #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG) >>>> -#define ZSTD_BTRFS_DEFAULT_LEVEL 3 >>>> +#define BTRFS_ZSTD_DEFAULT_LEVEL 3 >>>> +#define BTRFS_ZSTD_MAX_LEVEL 15 >>>> >>>> -static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len) >>>> +static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len, >>>> + unsigned int level) >>>> { >>>> - ZSTD_parameters params = ZSTD_getParams(ZSTD_BTRFS_DEFAULT_LEVEL, >>>> - src_len, 0); >>>> + ZSTD_parameters params = ZSTD_getParams(level, src_len, 0); >>>> >>>> if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG) >>>> params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG; >>>> @@ -37,10 +38,25 @@ struct workspace { >>>> size_t size; >>>> char *buf; >>>> struct list_head list; >>>> + unsigned int level; >>>> ZSTD_inBuffer in_buf; >>>> ZSTD_outBuffer out_buf; >>>> }; >>>> >>>> +static bool zstd_reallocate_mem(struct workspace *workspace, int size) >>>> +{ >>>> + void *new_mem; >>>> + >>>> + new_mem = kvmalloc(size, GFP_KERNEL); >>>> + if (new_mem) { >>>> + kvfree(workspace->mem); >>>> + workspace->mem = new_mem; >>>> + workspace->size = size; >>>> + return true; >>>> + } >>>> + return false; >>>> +} >>>> + >>>> static void zstd_free_workspace(struct list_head *ws) >>>> { >>>> struct workspace *workspace = list_entry(ws, struct workspace, list); >>>> @@ -50,10 +66,34 @@ static void zstd_free_workspace(struct list_head *ws) >>>> kfree(workspace); >>>> } >>>> >>>> -static struct list_head *zstd_alloc_workspace(void) >>>> +static bool zstd_set_level(struct list_head *ws, unsigned int level) >>>> +{ >>>> + struct workspace *workspace = list_entry(ws, struct workspace, list); >>>> + ZSTD_parameters params; >>>> + int size; >>>> + >>>> + if (level > BTRFS_ZSTD_MAX_LEVEL) >>>> + level = BTRFS_ZSTD_MAX_LEVEL; >>>> + >>>> + if (level == 0) >>>> + level = BTRFS_ZSTD_DEFAULT_LEVEL; >>>> + >>>> + params = ZSTD_getParams(level, ZSTD_BTRFS_MAX_INPUT, 0); >>>> + size = max_t(size_t, >>>> + ZSTD_CStreamWorkspaceBound(params.cParams), >>>> + ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT)); >>>> + if (size > workspace->size) { >>>> + if (!zstd_reallocate_mem(workspace, size)) >>> >>> This can allocate memory and this can appen on the writeout path, ie. >>> one of the reasons for that might be that system needs more memory. >>> >>> By the table above, the size can be up to 2.6MiB, which is a lot in >>> terms of kernel memory as there must be either contiguous unmapped >>> memory, the virtual mappings must be created. Both are scarce resource >>> or should be treated as such. >>> >>> Given that there's no logic that would try to minimize the usage for >>> workspaces, this can allocate many workspaces of that size. 
>>> 
>>> Currently the workspace allocations have been moved to the early module
>>> loading phase so that they don't happen later and we don't have to
>>> allocate memory nor handle the failures. Your patch brings that back.
>> 
>> Before this patch, the compression module initialization preloads one
>> workspace per algorithm. We added the compression level as a parameter
>> to the allocation, and the initialization uses the maximum level, 15. This
>> guarantees forward progress, since each algorithm will have at least one
>> workspace that will work for every compression level.
>> 
>> The only new failure condition is when the compression level changes:
>> we have to reallocate the workspaces the next time we use them. If we run
>> into memory pressure, we might free every workspace but one, reducing
>> throughput but maintaining correctness. This is basically the same scenario
>> as before, but now it can occur when writing after changing compression
>> levels, not just after changing the compression algorithm.
>> 
>> I'm not too worried about increased memory usage. We only preallocate
>> one workspace of the maximum level (1 MB wasted if not used). Then we
>> only up-size the workspaces when needed. I don't expect that high
>> compression levels will be a common use case.
>> 
>> The one potential problem with this solution is if the user rarely mounts
>> with level 15 (say once a day), but commonly uses level 3. Then their
>> workspaces will be sized for level 15 forever. However, this is the same
>> problem as if the user commonly uses no compression, but occasionally
>> uses zstd: the zstd contexts will stick around forever. Ideally we would
>> delete workspaces that go unused for a certain amount of time (leaving
>> the single preallocated workspace). I think this belongs in a separate
>> patch. I'll plan on looking into it as a follow-up.
>> 
>>> The solution I'm currently thinking about can make the levels work but
>>> would be limited in throughput as a trade-off for the memory
>>> consumption.
>>> 
>>> - preallocate one workspace for level 15 per mounted filesystem, using
>>>   get_free_pages_exact or kvmalloc
>>> 
>>> - preallocate workspaces for the default level, the same number as for
>>>   lzo/zlib
>>> 
>>> - add logic to select the zstd:15 workspace last for other compressors,
>>>   i.e. make it almost exclusive for zstd levels > default
>>> 
>>> Further refinement could allocate the workspaces on demand and free them
>>> if not used. Allocation failures would still be handled gracefully at the
>>> cost of waiting for the remaining workspaces; at least one would always
>>> be available.
>> 
>> Nick
> 
> Maybe I will say some crazy things, but I think a simpler solution would
> be to fall back to low compression levels on memory starvation.
> 
> i.e. any memory size that is used in the kernel and can't be handled by
> kmalloc is a huge amount of memory (IMHO).
> 
> And if the user gets into a situation where we don't have enough free
> memory to handle a high compression ratio, it may be better not to try to
> handle that compression level, and just fall back to a level whose
> preallocated memory areas are available.
That makes sense to me. This was actually our first approach, but it was
a bit too complex at the time. However, the patch has evolved quite a bit
since then, so we can do this pretty easily now; a rough sketch of the
fallback is below.

> Thanks!
> 
> --
> Have a nice day,
> Timofey.

Nick
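Here is that sketch, just to make the fallback idea concrete. It is a
userspace-only model and not the actual btrfs code: try_alloc_workspace()
and workspace_size() are invented for illustration, and malloc() stands in
for kvmalloc(). The point is only the shape of the loop: try the requested
level first and step down whenever the larger workspace cannot be
allocated, relying on the preallocated maximum-level workspace to keep at
least one workspace always usable.

/*
 * Rough, userspace-only model of "fall back to a lower compression level
 * when the workspace for the requested level cannot be allocated".
 * malloc() stands in for kvmalloc(); workspace_size() and
 * try_alloc_workspace() are made-up helpers, not btrfs functions.
 */
#include <stdio.h>
#include <stdlib.h>

#define BTRFS_ZSTD_MAX_LEVEL	15

/* Pretend per-level workspace size: higher levels need more memory. */
static size_t workspace_size(unsigned int level)
{
	return (size_t)level * 256 * 1024;
}

/* In the kernel this allocation could fail under memory pressure. */
static void *try_alloc_workspace(unsigned int level)
{
	return malloc(workspace_size(level));
}

/*
 * Try the requested level first and walk down until an allocation
 * succeeds.  In btrfs at least one workspace sized for the maximum level
 * is preallocated, so in the worst case we could always wait for that one
 * instead of failing the write.
 */
static void *alloc_workspace_with_fallback(unsigned int level,
					   unsigned int *level_used)
{
	void *ws;

	for (; level >= 1; level--) {
		ws = try_alloc_workspace(level);
		if (ws) {
			*level_used = level;
			return ws;
		}
	}
	return NULL;	/* would mean even level 1 failed */
}

int main(void)
{
	unsigned int level_used = 0;
	void *ws = alloc_workspace_with_fallback(BTRFS_ZSTD_MAX_LEVEL,
						 &level_used);

	if (ws) {
		printf("compressing at level %u\n", level_used);
		free(ws);
	}
	return 0;
}

In the real patch this would hook into set_level()/find_workspace() rather
than allocating directly, but the fallback loop would look about the same.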