Hello! I haven't made much progress with this still, I'm sorry. :-( Below are comments about a few small details. It's not much but I will (slowly) keep reading and testing.
I applied the outq patch too. The performance numbers you posted looked promising. (1) Segfault due to thr->outbuf == NULL I changed CHUNK_SIZE to 1 to test corner cases. I used good-1-block_header-1.xz as the test file. It can segfault in worker_decoder() on the line calling thr->block_decoder.code(...) because thr->outbuf is NULL (so the problem was introduced in the outq patch). This happens because of "thr->outbuf = NULL;" later in the function. It looks like that it marks the outbuf finished and returns the thread to the pool too early or forgets to set thr->state = THR_IDLE. As a temporary workaround, I added "thr->state = THR_IDLE;" after "thr->outbuf = NULL;". (2) Block decoder must return LZMA_STREAM_END on success Because of end marker and integrity check, the output buffer will be full before the last bytes of input have been processed by the Block decoder. Thus it is not enough to look at the input and output positions to determine when decoding has been finished; only LZMA_STREAM_END should be used to determine that decoding was successful. In theory it is OK to mark the outbuf as finished once the output is full but for simplicity I suggest doing so (and returning the thread to the pool) only after LZMA_STREAM_END. I committed a new test file bad-1-check-crc32-2.xz. The last byte in the Block (last byte of Check) is wrong. Change CHUNK_SIZE to 1 and try "xz -t -T2 file bad-1-check-crc32-2.xz". The file must be detected to be corrupt (LZMA_DATA_ERROR). (3) Bad input where the whole input or output buffer cannot be used In the old single-threaded decoding, lzma_code() will eventually return LZMA_BUF_ERROR if the calls to lzma_code() cannot make any progress, that is, no more input is consumed and no more output is produced. This condition can happen with correct code if the input file is corrupt in a certain way, for example, a truncated .xz file. Since the no-progress detection is centralized in lzma_code(), the internal decoders including Block decoder don't try to detect this situation. Currently this means that worker_decoder() should detect it to catch bad input and prevent hanging on certain malformed Blocks. However, since the Block decoder knows both Compressed Size and Uncompressed Size, I think I will improve Block decoder instead so don't do anything about this for now. I committed two test files, bad-1-lzma2-9.xz and bad-1-lzma2-10.xz. The -9 may make worker_decoder() not notice that the Block is invalid. The -10 makes the decoder hang. Like I said, I might fix these by changing the Block decoder. (4) Usage of partial_update in worker_decoder() Terminology: main mutex means coder->mutex alias thr->coder->mutex. In worker_decoder(), the main mutex is locked every time there is new output available in the worker thread. partial_update is only used to determine when to signal thr->coder->cond. To reduce contention on the main mutex, worker_decoder() could lock it only when - decoding of the Block has been finished (successfully or unsuccessfully, that is, ret != LZMA_OK), or - there is new output available and partial_update is true; if partial_update is false, thr->outbuf->pos is not touched. This way only one worker will be frequently locking the main mutex. However, I haven't tried it and thus don't know how much this affects performance in practice. One possible problem might be that it may introduce a small delay in output availability when the main thread switches reading from the next outbuf in the list. (5) Use of mythread_condtime_set() In the encoder the absolute time is calculated once per lzma_code() call. The comment in wait_for_work() in in stream_encoder_mt.c was wrong. The reason the absolute time is calculated once per lzma_code() call is to ensure that blocking multiple times won't make the timeout ineffective if each blocking takes less than timeout milliseconds. So it should be done similarly in the decoder. (6) Use of lzma_outq_enable_partial_output() It should be safe to call it unconditionally: if (thr->outbuf == coder->outq.head) lzma_outq_enable_partial_output(&coder->outq, thr_do_partial_update); If outq.head is something else, it is either already finished or partial output has already been enabled. In both cases lzma_outq_enable_partial_output() will do nothing. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode