On Tue, Nov 13, 2018 at 01:33:32AM +0100, David Sterba wrote:
> On Wed, Oct 31, 2018 at 11:11:08AM -0700, Nick Terrell wrote:
> > From: Jennifer Liu <jenniferliu...@fb.com>
> > 
> > Adds zstd compression level support to btrfs. Zstd requires
> > different amounts of memory for each level, so the design had
> > to be modified to allow set_level() to allocate memory. We
> > preallocate one workspace of the maximum size to guarantee
> > forward progress. This feature is expected to be useful for
> > read-mostly filesystems, or when creating images.
> > 
> > Benchmarks run in qemu on Intel x86 with a single core.
> > The benchmark measures the time to copy the Silesia corpus [0] to
> > a btrfs filesystem 10 times, then read it back.
> > 
> > The two important things to note are:
> > - The decompression speed and memory remains constant.
> >   The memory required to decompress is the same as level 1.
> > - The compression speed and ratio will vary based on the source.
> > 
> > Level       Ratio   Compression     Decompression   Compression Memory
> > 1           2.59    153 MB/s        112 MB/s        0.8 MB
> > 2           2.67    136 MB/s        113 MB/s        1.0 MB
> > 3           2.72    106 MB/s        115 MB/s        1.3 MB
> > 4           2.78    86  MB/s        109 MB/s        0.9 MB
> > 5           2.83    69  MB/s        109 MB/s        1.4 MB
> > 6           2.89    53  MB/s        110 MB/s        1.5 MB
> > 7           2.91    40  MB/s        112 MB/s        1.4 MB
> > 8           2.92    34  MB/s        110 MB/s        1.8 MB
> > 9           2.93    27  MB/s        109 MB/s        1.8 MB
> > 10          2.94    22  MB/s        109 MB/s        1.8 MB
> > 11          2.95    17  MB/s        114 MB/s        1.8 MB
> > 12          2.95    13  MB/s        113 MB/s        1.8 MB
> > 13          2.95    10  MB/s        111 MB/s        2.3 MB
> > 14          2.99    7   MB/s        110 MB/s        2.6 MB
> > 15          3.03    6   MB/s        110 MB/s        2.6 MB
> > 
> > [0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia
> > 
> > Signed-off-by: Jennifer Liu <jenniferliu...@fb.com>
> > Signed-off-by: Nick Terrell <terre...@fb.com>
> > Reviewed-by: Omar Sandoval <osan...@fb.com>
> > ---
> > v1 -> v2:
> > - Don't reflow the unchanged line.
> > 

[snip]

> > -static struct list_head *zstd_alloc_workspace(void)
> > +static bool zstd_set_level(struct list_head *ws, unsigned int level)
> > +{
> > +   struct workspace *workspace = list_entry(ws, struct workspace, list);
> > +   ZSTD_parameters params;
> > +   int size;
> > +
> > +   if (level > BTRFS_ZSTD_MAX_LEVEL)
> > +           level = BTRFS_ZSTD_MAX_LEVEL;
> > +
> > +   if (level == 0)
> > +           level = BTRFS_ZSTD_DEFAULT_LEVEL;
> > +
> > +   params = ZSTD_getParams(level, ZSTD_BTRFS_MAX_INPUT, 0);
> > +   size = max_t(size_t,
> > +                   ZSTD_CStreamWorkspaceBound(params.cParams),
> > +                   ZSTD_DStreamWorkspaceBound(ZSTD_BTRFS_MAX_INPUT));
> > +   if (size > workspace->size) {
> > +           if (!zstd_reallocate_mem(workspace, size))
> 
> This can allocate memory and this can appen on the writeout path, ie.
> one of the reasons for that might be that system needs more memory.
> 
> By the table above, the size can be up to 2.6MiB, which is a lot in
> terms of kernel memory as there must be either contiguous unmapped
> memory, the virtual mappings must be created. Both are scarce resource
> or should be treated as such.
> 
> Given that there's no logic that would try to minimize the usage for
> workspaces, this can allocate many workspaces of that size.
> 
> Currently the workspace allocations have been moved to the early module
> loading phase so that they don't happen later and we don't have to
> allocate memory nor handle the failures. Your patch brings that back.

Even before this patch, we may try to allocate a workspace. See
__find_workspace():

https://github.com/kdave/btrfs-devel/blob/fd0f5617a8a2ee92dd461d01cf9c5c37363ccc8d/fs/btrfs/compression.c#L897

We already limit it to one per CPU, and only allocate when needed.
Anything greater than that has to wait. Maybe we should improve that to
also include a limit on the total amount of memory allocated? That would
be more flexible than your approach below of making the > default case
special, and I like it more than Timofey's idea of falling back to a
lower level.

> The solution I'm currently thinking about can make the levels work but
> would be limited in throughput as a trade-off for the memory
> consumption.
> 
> - preallocate one workspace for level 15 per mounted filesystem, using
>   get_free_pages_exact or kvmalloc
> 
> - preallocate workspaces for the default level, the number same as for
>   lzo/zlib
> 
> - add logic to select the zstd:15 workspace last for other compressors,
>   ie. make it almost exclusive for zstd levels > default
> 
> Further refinement could allocate the workspaces on-demand and free if
> not used. Allocation failures would still be handled gracefully at the
> cost of waiting for the remainig worspaces, at least one would be always
> available.

Reply via email to