On Tue, Nov 13, 2018 at 01:33:32AM +0100, David Sterba wrote:
> On Wed, Oct 31, 2018 at 11:11:08AM -0700, Nick Terrell wrote:
> > From: Jennifer Liu <jenniferliu...@fb.com>
> >
> > Adds zstd compression level support to btrfs. Zstd requires
> > different amounts of memory for each level, so the design had
> > to be modified to allow set_level() to allocate memory. We
> > preallocate one workspace of the maximum size to guarantee
> > forward progress. This feature is expected to be useful for
> > read-mostly filesystems, or when creating images.
> >
> > Benchmarks run in qemu on Intel x86 with a single core.
> > The benchmark measures the time to copy the Silesia corpus [0] to
> > a btrfs filesystem 10 times, then read it back.
> >
> > The two important things to note are:
> > - The decompression speed and memory remain constant.
> >   The memory required to decompress is the same as level 1.
> > - The compression speed and ratio will vary based on the source.
> >
> > Level   Ratio   Compression   Decompression   Compression Memory
> > 1       2.59    153 MB/s      112 MB/s        0.8 MB
> > 2       2.67    136 MB/s      113 MB/s        1.0 MB
> > 3       2.72    106 MB/s      115 MB/s        1.3 MB
> > 4       2.78     86 MB/s      109 MB/s        0.9 MB
> > 5       2.83     69 MB/s      109 MB/s        1.4 MB
> > 6       2.89     53 MB/s      110 MB/s        1.5 MB
> > 7       2.91     40 MB/s      112 MB/s        1.4 MB
> > 8       2.92     34 MB/s      110 MB/s        1.8 MB
> > 9       2.93     27 MB/s      109 MB/s        1.8 MB
> > 10      2.94     22 MB/s      109 MB/s        1.8 MB
> > 11      2.95     17 MB/s      114 MB/s        1.8 MB
> > 12      2.95     13 MB/s      113 MB/s        1.8 MB
> > 13      2.95     10 MB/s      111 MB/s        2.3 MB
> > 14      2.99      7 MB/s      110 MB/s        2.6 MB
> > 15      3.03      6 MB/s      110 MB/s        2.6 MB
> >
> > [0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
> >
> > Signed-off-by: Jennifer Liu <jenniferliu...@fb.com>
> > Signed-off-by: Nick Terrell <terre...@fb.com>
> > Reviewed-by: Omar Sandoval <osan...@fb.com>
> > ---
> > v1 -> v2:
> > - Don't reflow the unchanged line.
> >
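[Editor's note: for readers following along, the per-level behavior above is selected at mount time. A usage sketch, assuming the compress=zstd:N mount-option syntax this series proposes; the device path, mount point, and levels are illustrative:]

```shell
# Mount a btrfs filesystem with zstd at level 3 (illustrative paths).
# Per the patch, level 0 selects the default level and levels above the
# maximum are clamped.
mount -o compress=zstd:3 /dev/sdb1 /mnt/data

# Remount to switch to a higher level for better ratio at lower speed.
mount -o remount,compress=zstd:7 /mnt/data
```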
[snip]

> > -static struct list_head *zstd_alloc_workspace(void)
> > +static bool zstd_set_level(struct list_head *ws, unsigned int level)
> > +{
> > +	struct workspace *workspace = list_entry(ws, struct workspace, list);
> > +	ZSTD_parameters params;
> > +	int size;
> > +
> > +	if (level > BTRFS_ZSTD_MAX_LEVEL)
> > +		level = BTRFS_ZSTD_MAX_LEVEL;
> > +
> > +	if (level == 0)
> > +		level = BTRFS_ZSTD_DEFAULT_LEVEL;
> > +
> > +	params = ZSTD_getParams(level, ZSTD_BTRFS_MAX_INPUT, 0);
> > +	size = max_t(size_t,
> > +			ZSTD_CStreamWorkspaceBound(params.cParams),
> > +			ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT));
> > +	if (size > workspace->size) {
> > +		if (!zstd_reallocate_mem(workspace, size))
>
> This can allocate memory, and it can happen on the writeout path, i.e.
> one of the reasons for the writeout might be that the system needs more
> memory.
>
> By the table above, the size can be up to 2.6MiB, which is a lot in
> terms of kernel memory: there must either be contiguous unmapped
> memory, or virtual mappings must be created. Both are scarce resources,
> or should be treated as such.
>
> Given that there's no logic that would try to minimize the usage of
> workspaces, this can allocate many workspaces of that size.
>
> Currently the workspace allocations have been moved to the early module
> loading phase so that they don't happen later, and we don't have to
> allocate memory nor handle the failures. Your patch brings that back.

Even before this patch, we may try to allocate a workspace. See
__find_workspace():

https://github.com/kdave/btrfs-devel/blob/fd0f5617a8a2ee92dd461d01cf9c5c37363ccc8d/fs/btrfs/compression.c#L897

We already limit it to one workspace per CPU, and only allocate when
needed; anything beyond that has to wait. Maybe we should improve that
to also include a limit on the total amount of memory allocated?
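[Editor's note: a minimal userspace sketch of the total-memory limit floated above. The names and the 16 MiB cap are invented for illustration; real btrfs code would need locking and would sleep rather than fail when the cap is hit:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap on memory held by all compression workspaces. */
#define WS_TOTAL_MEM_LIMIT ((size_t)16 * 1024 * 1024)

static size_t ws_total_mem; /* bytes currently accounted to workspaces */

/*
 * Try to reserve `size` bytes of workspace memory. Returns true on
 * success; on failure the caller would wait for another workspace to
 * be freed, mirroring the existing one-per-CPU limit.
 */
static bool ws_reserve(size_t size)
{
	if (ws_total_mem + size > WS_TOTAL_MEM_LIMIT)
		return false;
	ws_total_mem += size;
	return true;
}

/* Release a previously reserved workspace's memory. */
static void ws_release(size_t size)
{
	assert(ws_total_mem >= size);
	ws_total_mem -= size;
}
```

This keeps the per-CPU count untouched and only adds a second, global constraint, so a handful of 2.6 MiB level-15 workspaces cannot silently consume tens of megabytes.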
That would be more flexible than your approach below of making the
"> default" case special, and I like it more than Timofey's idea of
falling back to a lower level.

> The solution I'm currently thinking about can make the levels work but
> would be limited in throughput as a trade-off for the memory
> consumption:
>
> - preallocate one workspace for level 15 per mounted filesystem, using
>   get_free_pages_exact or kvmalloc
>
> - preallocate workspaces for the default level, the same number as for
>   lzo/zlib
>
> - add logic to select the zstd:15 workspace last for the other
>   compressors, i.e. make it almost exclusive for zstd levels > default
>
> Further refinement could allocate the workspaces on-demand and free
> them if not used. Allocation failures would still be handled gracefully
> at the cost of waiting for the remaining workspaces; at least one would
> always be available.
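[Editor's note: the selection policy sketched above can be illustrated in userspace C. All names and pool sizes are assumptions; a real implementation would sleep on a waitqueue instead of returning NULL, and would key the big workspace per mounted filesystem:]

```c
#include <stdbool.h>
#include <stddef.h>

#define ZSTD_DEFAULT_LEVEL 3
#define NR_DEFAULT_WS 4 /* assumed to match the lzo/zlib pool size */

struct ws {
	bool busy;
};

/* One preallocated max-level (zstd:15) workspace... */
static struct ws big_ws;
/* ...plus a pool of default-level workspaces. */
static struct ws default_ws[NR_DEFAULT_WS];

/*
 * Hand out a workspace for the requested level. Levels above the
 * default need the big workspace; everything else takes it only as a
 * last resort, keeping it almost exclusive to high zstd levels.
 */
static struct ws *ws_get(unsigned int level)
{
	size_t i;

	if (level > ZSTD_DEFAULT_LEVEL) {
		if (!big_ws.busy) {
			big_ws.busy = true;
			return &big_ws;
		}
		return NULL; /* real code would wait here */
	}

	for (i = 0; i < NR_DEFAULT_WS; i++) {
		if (!default_ws[i].busy) {
			default_ws[i].busy = true;
			return &default_ws[i];
		}
	}

	if (!big_ws.busy) {
		big_ws.busy = true;
		return &big_ws;
	}
	return NULL; /* real code would wait here */
}

static void ws_put(struct ws *w)
{
	w->busy = false;
}
```

Because only one max-level workspace exists, high-level compression throughput is serialized, which is exactly the throughput-for-memory trade-off described above.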