Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

Lasse Collin Sun, 06 Feb 2022 15:46:40 -0800

On 2021-12-31 Sebastian Andrzej Siewior wrote:
> On 2021-12-15 23:33:58 [+0200], Lasse Collin wrote:
> > Yes. It's fairly simple from implementation point of view but is it
> > clear enough for the users, I'm not sure.
> > 
> > I suppose the alternative is having just one limit value and a flag
> > to tell if it is a soft limit (so no limit for single-threaded
> > case) or a hard limit (return LZMA_MEM_ERROR if too low for even
> > single thread). Having separate soft and hard limits instead can
> > achieve the same and a little more, so I think I'll choose the
> > two-value approach and hope it's clear enough for users.  
> 
> The value approach might work. I'm not sure if the term `soft' and
> `hard' are good here. Using `memlimit' and `memlimit_threaded' (or so)
> might make more obvious and easier to understand.
> But then this just some documentation that needs to be read and
> understood so maybe `softlimit' and `hardlimit' will work just fine.


I now plan to use memlimit_threading and memlimit_stop in the lzma_mt
structure. Documentation is still needed but hopefully those are a bit
more obvious.

> > I was hoping to get this finished by Christmas but due to a recent
> > sad event, late January is my target for the next alpha release
> > now.

And I'm late again. :-(

This is more work than I had expected because there unfortunately are a
few problems in the code and fixing them all requires quite significant
changes (and I'm slow). As a bonus, working on this made me notice a few
small bugs in the old liblzma code too (not yet committed).

The following tries to explain some of the problems and what I have
done locally. I don't have code to show yet because it still contains
too many small FIXMEs but, as unbelievable as it might sound, this will
get done. I need a few more days; I have other things I must do too.


The biggest issue is handling of memory usage and threaded vs. direct
mode. The memory usage limiting code makes assumptions that are true
with the most common files but there are situations where these
assumptions fail:

(1) If a non-first Block requires a lot more memory than the first
    Block and so the memory limit would be exceeded in threaded mode,
    the decoder will not switch to direct mode even with
    LZMA_MEMLIMIT_COMPLETE. Instead the decoder proceeds with one
    thread and uses as much memory as that needs.

(2) If a non-first Block lacks size info in its Block Header, the
    decoder won't switch to direct mode. It returns LZMA_PROG_ERROR
    instead.

(3) The per-thread input buffers can grow as bigger Blocks are seen but
    the buffers cannot shrink. This has pros and cons. It's a problem if
    a single Block is very big and others are not.

I thought it's better to first decode the Block Header to
coder->block_options and then, based on the facts from that Block
Header, determine memory usage and how to proceed (including switching
to/from direct mode). This way there is no need to assume or expect
anything. (coder->block_options need to be copied to a thread-specific
structure before initializing the decoder.)

For direct mode, I added separate SEQ states for it. This also helps
making the code more similar to the single-threaded decoder in both
looks and behavior. I hope that with memlimit_threading = 0 the
threaded version can have identical externally-visible behavior as the
original single-threaded version. This way xz doesn't need both
functions (the single-threaded function is still needed if built with
--disable-threads).


Corner cases of the buffer-to-buffer API:

(4) In some use cases there might be long pauses where no new input is
    available (for example, sending a live log file over network with
    compression). It is essential that the decoder will still provide
    all output that is easily possible from the input so far. That is,
    if the decoder was called without providing any new input, it might
    need to be handled specially.

    SEQ_BLOCK_HEADER and SEQ_INDEX return immediately if the application
    isn't providing any new input data, and so eventually lzma_code()
    will return LZMA_BUF_ERROR even when there would be output
    available from the worker threads. try_copy_decoded() could be
    called earlier but there is more to fix (see (5) and (6)).

    (Also remember my comment above that I changed the code so that
    Block Header is decoded first before getting a thread. That adds
    one more SEQ point where waiting for output is needed.)

(5) The decoder must work when the application provides an output
    buffer whose size is exactly the uncompressed size of the file.
    This means that one cannot simply use *out_pos == out_size to
    determine when to return LZMA_OK. Perhaps the decoder hasn't marked
    its lzma_outbuf as finished but no more output will be coming, or
    there is an empty Block (empty Blocks perhaps shouldn't have been
    allowed in the format at all but it's too late for that).

    In short, instead of *out_pos == out_size one has to check
    !lzma_outq_is_empty() && lzma_outq_is_readable() after the call
    to lzma_outq_read() to determine if more output space is truly
    required.

    try_copy_decoded() and SEQ_INDEX have this problem but fixing is not
    complicated.


Race condition in SEQ_BLOCK_HEADER:

        ret = get_thread(coder, allocator);
        if (ret != LZMA_OK)
                return ret;

        if (!coder->thr_write) {
                // No out buffer but making progress?
                if (start_out_pos != *out_pos)
                        return LZMA_OK;

                mythread_mutex_lock(&coder->mutex);
                if (!lzma_outq_is_readable(&coder->outq))
                        ret = wait_cond_progress(coder, &wait_abs);

                mythread_mutex_unlock(&coder->mutex);

The call to get_thread() locks coder->mutex for a while. If getting a
thread fails and no output has been produced, the mutex is locked again
to wait on coder->cond if no output is currently readable from the
output queue.

It is possible that a thread finishes and puts itself to
coder->threads_free after the call to get_thread() fails but before the
mutex is locked again. Thus, it's possible that get_thread() would now
succeed but the code will wait on coder->cond anyway. In practice this
is only a minor issue as there will likely be output fairly soon, but
it may sometimes add a small unneeded delay.

To fix it, it should check for all conditions that would allow
continuing. I did it by creating a function that contains the
functionality of try_copy_decoded() and checking for the conditions to
start a Block, all while keeping coder->mutex locked. This way the
conditions to start a new Block need to be checked in one place only.
This helper function also made it easy to fix (4) and (5).


The sizes from Block Header are untrusted so one has to be paranoid
with them. This means checking for integer overflows when doing
addition. Also, since the sizes are 63-bit (lzma_vli is uint64_t but
valid known values fit into 63 bits), assigning these to a size_t needs
care to avoid truncation on 32-bit systems. These are easy to forget
but forgetting can cause security issues.

I added support for lzma_get_progress(). It is needed to make xz -v
progress indicator useful.

Other news: Gitweb on git.tukaani.org is currently broken. I'll get it
fixed in a few days. The repositories still work.

-- 
Lasse Collin

Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder

Reply via email to