It seems you just sent the same email twice, please check my previous reply.
Thanks, Gao Xiang On 2019/5/16 10:11, Mao, Chenxi wrote: > Hi Xiang: > > I checked the deliver history. > There is only below delivery related with lz4.c > Pull request:616 > 4e3accc Fix Dict Size Test in `LZ4_compress_fast_continue()` > 535636f Don't Attach Very Small Dictionaries > This 2 changes seems like ONLY bug fixes for dictionary fix baesd on 1.8.3 > > Based on current investigation result, I think pick decompress patches is > safe for v1.8.3 > > If there is any misunderstanding or faults, please feel free to let me know. > > Chenxi > > On 5/16/19 9:35 AM, Gao Xiang wrote: >> >> >> On 2019/5/16 7:48, Mao, Chenxi wrote: >>> Hi Yann and Xiang: >>> For this FAST_DEC_LOOP change, I only pick up decompress related patches to >>> current kernel LZ4 implementation(based on 1.8.3). >>> Here is my cherry-pick list: >>> 2589c44 created LZ4_FAST_DEC_LOOP build macro >>> 605d811 enable LZ4_FAST_DEC_LOOP build macro on aarch64/GCC by default >>> 5d7d116 decompress_generic: Limit fastpath to x86 >>> 75fb878 decompress_generic: Add fastpath for small offsets >>> faac110 decompress_generic: Unroll loops a bit more >>> 1fbaf84 decompress_generic: remove msan write >>> 28b8249 decompress_generic: re-add fastpath >>> 232f1e2 decompress_generic: drop partial copy check in fast loop >>> 59332a3 decompress_generic: Optimize literal copies >>> 5dfa7d4 decompress_generic: optimize match copy >>> 28356e0 decompress_generic: Add a loop fastpath >>> 4da3360 decompress_generic: Refactor variable length fields >>> >>> Only cherry-pick these changes to LZ4 1.8.3 would not be introduce risks. >> >> The fact is that ... LZ4_FAST_DEC_LOOP almost influences all decompression >> apis. >> You should keep LZ4_decompress_generic from all known issues. >> >>> >>> @Xiang: >>> Do you prefer to upgrade kernel lz4 to 1.9.1 directly or only pick some >>> optimization patches? >> >> I think you should pick all patches which are related to >> LZ4_decompress_generic() >> and LZ4_FAST_DEC_LOOP since known issues of LZ4_FAST_DEC_LOOP should be >> fixed of course. >> >> Thanks, >> Gao Xiang >> >>> >>> Chenxi >>> >>> >>> On 5/16/19 1:47 AM, Gao Xiang wrote: >>>> Hi Yann, >>>> >>>> On 2019/5/16 1:03, Cyan wrote: >>>>> Re-posted, >>>>> it seems the previous message was rejected by the linux-kernel server >>>>> due to some kind of format limitation (no html). >>>>> >>>>> >>>>> Le mer. 15 mai 2019 à 09:56, Cyan <yann.collet...@gmail.com> a écrit : >>>>>> >>>>>> The v1.9.0 version has a bug which makes it read a few bytes out of >>>>>> bound in certain cases. >>>>>> This was fixed in v1.9.1. >>>>>> Therefore, if you plan upgrading version, skip directly to v1.9.1. >>>>>> >>>>>> Note that, in v1.9.1, LZ4_decompress_fast() is now deprecated. >>>>>> LZ4_decompress_fast() is a security liability, since it's unable to cope >>>>>> with malicious inputs. >>>>>> While this is not a problem when the producer is under direct control >>>>>> (local production and consumption, no intermediate storage), >>>>>> this entry point seems misused in multiple scenarios where input cannot >>>>>> be decently trusted. >>>>>> The deprecation warning is meant to "scare away" such usages. >>>>>> It's still possible to intentionally remove the warning, for users who >>>>>> know what they are doing. >>>>>> >>>>>> Note that, on top of that, the _fast() variant is now slower that >>>>>> LZ4_decompress_safe(), removing its most important positive >>>>>> differentiator. >>>>>> That's because LZ4_decompress_fast() is "blind", it doesn't know the >>>>>> input size. >>>>>> It just guesses it, from the requested output size, and by relying on >>>>>> end-of-block conditions. >>>>>> This effectively limits its capability to copy up to 8 bytes at a time, >>>>>> which proves slower than the 32-bytes at a time used in >>>>>> LZ4_decompress_safe(). >>>>>> >>>>>> LZ4_decompress_fast() still has some possible usage : >>>>>> it doesn't need the compressed size. >>>>>> Therefore, in a scenario where decompressed size is known, but >>>>>> compressed size isn't, >>>>>> it cannot be replaced by `LZ4_decompress_safe()`. >>>>>> This is pretty rare, but can happen in some cases, erofs for example. >>>>>> `LZ4_decompress_safe_partial()` *might* be able to replace it for such >>>>>> scenario, >>>>>> but I haven't thoroughly tested it, so can't yet vouch for it. >>>> >>>> Yes, that is my real concern... >>>> >>>> In fact, we have shipped erofs with LZ4_decompress_safe_partial() of >>>> lz4-1.8.3 >>>> for our HUAWEI P30/P30 Pro and other HUAWEI EMUI 9.1 phones on the >>>> market... >>>> >>>> It seemd lz4-1.8.3 is rather stable since no stability report from our >>>> internal >>>> beta test and real consumers... >>>> >>>> I haven't tested lz4-1.9.0 yet since I am working on the erofs paper [1]... >>>> >>>> But anyway, the patch looks good to me since it has obvious performance >>>> gain... >>>> >>>>>> >>>>>> If it doesn't, then it will be necessary to create a new entry point. >>>>>> The main difference is that this new entry point will need a bound for >>>>>> the input size, in order to guarantee it never reads beyond input buffer. >>>>>> But as said, for the time being, such dedicated entry point doesn't >>>>>> exist. >>>> >>>> I agree, but current erofs still have to choose LZ4_decompress_fast... >>>> I will test these 2 apis later and will report if something is wrong as >>>> well. >>>> >>>> [1] https://www.usenix.org/conference/atc19/presentation/gao >>>> https://www.usenix.org/conference/atc19/technical-sessions >>>> >>>> Thanks, >>>> Gao Xiang >>>> >>>> >>>>>> >>>>>> The new FAST_DEC_LOOP setting works well on x86 and x64. >>>>>> Recently, it has been extended to gcc+arm64 (`dev` branch). >>>>>> The combination of clang+arm64 is less successful though (impacts vary, >>>>>> depending on version and exact hardware), so it's still disabled by >>>>>> default. >>>>>> But anyone can test it by setting FAST_DEC_LOOP definition to 1, thus >>>>>> enabling the new decoder loop. >>>>>> >>>>>> >>>>>> Y. >>>>>> >>>>>> Le mar. 14 mai 2019 à 20:28, Mao, Chenxi <chenxi....@sony.com> a écrit : >>>>>>> >>>>>>> Hi Xiang: >>>>>>> >>>>>>> Thanks for your reply, I will have a stress test on my device later. >>>>>>> I didn't have chance to test LZ4 with clang build because of device >>>>>>> limitation. I think I could do it later. >>>>>>> I guess the clang performance downgrade is caused by some compiler >>>>>>> optimization options. >>>>>>> I will double check it later if I got the device which can build kernel >>>>>>> with clang. >>>>>>> >>>>>>> BTW, the FAST_DEC_LOOP leverage LZ4_wildCopy8 instead of LZ4_wildCopy, >>>>>>> however based on my test, original LZ4_wildCopy has the better >>>>>>> performance. >>>>>>> Original LZ4_wildCopy API still invoked by FAST_DEC_LOOP apis. >>>>>>> That might be a problem for X86 devices for current patch. >>>>>>> >>>>>>> Chenxi >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Gao Xiang [mailto:gaoxian...@huawei.com] >>>>>>> Sent: Wednesday, May 15, 2019 10:21 AM >>>>>>> To: Mao, Chenxi <chenxi....@sony.com> >>>>>>> Cc: a...@linux-foundation.org; linux-kernel@vger.kernel.org; Feng, Roy >>>>>>> <roy.f...@sony.com>; Xu, Yuanli 2 <yuanli...@sony.com>; Alm, Robert 2 >>>>>>> <robert....@sony.com>; Takahashi, Masaya A (Sony Mobile) >>>>>>> <masaya.a.takaha...@sony.com>; Miao Xie <miao...@huawei.com> >>>>>>> Subject: Re: [PATCH 1/1] LZ4: Port LZ4 1.9.x FAST_DEC_LOOP and enable >>>>>>> it on x86 and ARM64 >>>>>>> >>>>>>> Hi Chenxi, >>>>>>> >>>>>>> On 2019/5/15 8:43, Chenxi Mao wrote: >>>>>>>> FAST_DEC_LOOP was introduced from LZ4 1.9. >>>>>>>> This change would be introduce 10% on decompress operation according >>>>>>>> to LZ4 benchmark result on X86 devices. >>>>>>>> Meanwhile, LZ4 with FAST_DEC_LOOP could get improvements, however >>>>>>>> clang compiler has downgrade if FAST_DEC_LOOP enabled. >>>>>>> >>>>>>> I noticed this optimization and lz4-1.9.0 [1] was just released month >>>>>>> ago with this big change therefore I'm a little afraid of its current >>>>>>> stability (especially some rare used decompression apis...) >>>>>>> >>>>>>> Could you Cc more people (e.g. lz4 original author Yann Collet) in >>>>>>> order to get some more ideas about this if possible? >>>>>>> >>>>>>> Anyway, I like this optimization as well since FAST_DEC_LOOP improves >>>>>>> decompression speed a lot :) >>>>>>> >>>>>>> [1] https://github.com/lz4/lz4/releases/tag/v1.9.0 >>>>>>> >>>>>>> Thanks, >>>>>>> Gao Xiang >>>>>>> >>>>>>>> >>>>>>>> So FAST_DEC_LOOP only enabled on X86/X86-64 or ARM64 with GCC build. >>>>>>>> >>>>>>>> Here is the test result on ARM64(cortex-A53) >>>>>>>> >>>>>>>> Benchmark via ZRAM: >>>>>>>> >>>>>>>> Test case: >>>>>>>> fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers \ >>>>>>>> --buffer_compress_percentage=75 \ >>>>>>>> --scramble_buffers=1 --direct=1 --loops=100 --numjobs=8 \ >>>>>>>> --filename=/dev/block/zram0 --name=seq-write --rw=write --stonewall \ >>>>>>>> --name=seq-read --rw=read --stonewall --name=seq-readwrite \ --rw=rw >>>>>>>> --stonewall --name=rand-readwrite --rw=randrw --stonewall >>>>>>>> >>>>>>>> Patched: >>>>>>>> READ: bw=7077MiB/s (7421MB/s) >>>>>>>> Vanilla: >>>>>>>> READ: bw=5134MiB/s (5384MB/s) >>>>>>>> >>>>>>>> Reference: >>>>>>>> 1. https://github.com/lz4/lz4/pull/645 >>>>>>>> 2. https://github.com/lz4/lz4/pull/707 >>>>>>>> >>>>>>>> Signed-off-by: chenxi.mao <chenxi....@sony.com> >>>>>>>> --- >>>>>>>> lib/lz4/lz4_decompress.c | 425 >>>>>>>> +++++++++++++++++++++++++++++++++------ >>>>>>>> 1 file changed, 361 insertions(+), 64 deletions(-) >>>>>>>> >>>>>>>> diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c index >>>>>>>> 0c9d3ad17e0f..a4c87e32b3c0 100644 >>>>>>>> --- a/lib/lz4/lz4_decompress.c >>>>>>>> +++ b/lib/lz4/lz4_decompress.c >>>>>>>> @@ -50,6 +50,131 @@ >>>>>>>> #define assert(condition) ((void)0) >>>>>>>> #endif >>>>>>>> >>>>>>>> +#ifndef LZ4_FAST_DEC_LOOP >>>>>>>> +#if defined(__i386__) || defined(__x86_64__) #define >>>>>>>> +LZ4_FAST_DEC_LOOP 1 #elif defined(__aarch64__) && !defined(__clang__) >>>>>>>> + /* On aarch64, we disable this optimization for clang because on >>>>>>>> certain >>>>>>>> + * mobile chipsets and clang, it reduces performance. For more >>>>>>>> information >>>>>>>> + * refer to https://github.com/lz4/lz4/pull/707. */ #define >>>>>>>> +LZ4_FAST_DEC_LOOP 1 #else #define LZ4_FAST_DEC_LOOP 0 #endif #endif >>>>>>>> + >>>>>>>> +static const unsigned inc32table[8] = { 0, 1, 2, 1, 0, 4, 4, 4 }; >>>>>>>> +static const int dec64table[8] = { 0, 0, 0, -1, -4, 1, 2, 3 }; >>>>>>>> + >>>>>>>> +#if LZ4_FAST_DEC_LOOP >>>>>>>> +#define FASTLOOP_SAFE_DISTANCE 64 >>>>>>>> +FORCE_INLINE void >>>>>>>> +LZ4_memcpy_using_offset_base(BYTE * dstPtr, const BYTE * srcPtr, BYTE >>>>>>>> * dstEnd, >>>>>>>> + const size_t offset) >>>>>>>> +{ >>>>>>>> + if (offset < 8) { >>>>>>>> + dstPtr[0] = srcPtr[0]; >>>>>>>> + >>>>>>>> + dstPtr[1] = srcPtr[1]; >>>>>>>> + dstPtr[2] = srcPtr[2]; >>>>>>>> + dstPtr[3] = srcPtr[3]; >>>>>>>> + srcPtr += inc32table[offset]; >>>>>>>> + memcpy(dstPtr + 4, srcPtr, 4); >>>>>>>> + srcPtr -= dec64table[offset]; >>>>>>>> + dstPtr += 8; >>>>>>>> + } else { >>>>>>>> + memcpy(dstPtr, srcPtr, 8); >>>>>>>> + dstPtr += 8; >>>>>>>> + srcPtr += 8; >>>>>>>> + } >>>>>>>> + >>>>>>>> + LZ4_wildCopy(dstPtr, srcPtr, dstEnd); } >>>>>>>> + >>>>>>>> +/* customized variant of memcpy, which can overwrite up to 32 bytes >>>>>>>> +beyond dstEnd >>>>>>>> + * this version copies two times 16 bytes (instead of one time 32 >>>>>>>> +bytes) >>>>>>>> + * because it must be compatible with offsets >= 16. */ FORCE_INLINE >>>>>>>> +void LZ4_wildCopy32(void *dstPtr, const void *srcPtr, void *dstEnd) { >>>>>>>> + BYTE *d = (BYTE *) dstPtr; >>>>>>>> + const BYTE *s = (const BYTE *)srcPtr; >>>>>>>> + BYTE *const e = (BYTE *) dstEnd; >>>>>>>> + >>>>>>>> + do { >>>>>>>> + memcpy(d, s, 16); >>>>>>>> + memcpy(d + 16, s + 16, 16); >>>>>>>> + d += 32; >>>>>>>> + s += 32; >>>>>>>> + } while (d < e); >>>>>>>> +} >>>>>>>> + >>>>>>>> +FORCE_INLINE void >>>>>>>> +LZ4_memcpy_using_offset(BYTE *dstPtr, const BYTE *srcPtr, BYTE >>>>>>>> *dstEnd, >>>>>>>> + const size_t offset) >>>>>>>> +{ >>>>>>>> + BYTE v[8]; >>>>>>>> + switch (offset) { >>>>>>>> + >>>>>>>> + case 1: >>>>>>>> + memset(v, *srcPtr, 8); >>>>>>>> + goto copy_loop; >>>>>>>> + case 2: >>>>>>>> + memcpy(v, srcPtr, 2); >>>>>>>> + memcpy(&v[2], srcPtr, 2); >>>>>>>> + memcpy(&v[4], &v[0], 4); >>>>>>>> + goto copy_loop; >>>>>>>> + case 4: >>>>>>>> + memcpy(v, srcPtr, 4); >>>>>>>> + memcpy(&v[4], srcPtr, 4); >>>>>>>> + goto copy_loop; >>>>>>>> + default: >>>>>>>> + LZ4_memcpy_using_offset_base(dstPtr, srcPtr, dstEnd, >>>>>>>> offset); >>>>>>>> + return; >>>>>>>> + } >>>>>>>> + >>>>>>>> + copy_loop: >>>>>>>> + memcpy(dstPtr, v, 8); >>>>>>>> + dstPtr += 8; >>>>>>>> + while (dstPtr < dstEnd) { >>>>>>>> + memcpy(dstPtr, v, 8); >>>>>>>> + dstPtr += 8; >>>>>>>> + } >>>>>>>> +} >>>>>>>> + >>>>>>>> +/* Read the variable-length literal or match length. >>>>>>>> + * >>>>>>>> + * ip - pointer to use as input. >>>>>>>> + * lencheck - end ip. Return an error if ip advances >= lencheck. >>>>>>>> + * loop_check - check ip >= lencheck in body of loop. Returns >>>>>>>> loop_error if so. >>>>>>>> + * initial_check - check ip >= lencheck before start of loop. >>>>>>>> Returns initial_error if so. >>>>>>>> + * error (output) - error code. Should be set to 0 before call. >>>>>>>> + */ >>>>>>>> +typedef enum { loop_error = -2, initial_error = -1, ok = 0} >>>>>>>> +variable_length_error; FORCE_INLINE unsigned >>>>>>>> read_variable_length(const BYTE **ip, >>>>>>>> + const BYTE *lencheck, >>>>>>>> + int loop_check, int >>>>>>>> initial_check, >>>>>>>> + variable_length_error *error) >>>>>>>> { >>>>>>>> + unsigned length = 0; >>>>>>>> + unsigned s; >>>>>>>> + if (initial_check && unlikely((*ip) >= lencheck)) { /* >>>>>>>> overflow detection */ >>>>>>>> + *error = initial_error; >>>>>>>> + return length; >>>>>>>> + } >>>>>>>> + do { >>>>>>>> + s = **ip; >>>>>>>> + (*ip)++; >>>>>>>> + length += s; >>>>>>>> + if (loop_check && unlikely((*ip) >= lencheck)) { >>>>>>>> /* overflow detection */ >>>>>>>> + *error = loop_error; >>>>>>>> + return length; >>>>>>>> + } >>>>>>>> + } while (s == 255); >>>>>>>> + >>>>>>>> + return length; >>>>>>>> +} >>>>>>>> +#endif >>>>>>>> + >>>>>>>> /* >>>>>>>> * LZ4_decompress_generic() : >>>>>>>> * This generic decompression function covers all use cases. >>>>>>>> @@ -80,25 +205,28 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> const size_t dictSize >>>>>>>> ) >>>>>>>> { >>>>>>>> - const BYTE *ip = (const BYTE *) src; >>>>>>>> - const BYTE * const iend = ip + srcSize; >>>>>>>> + const BYTE *ip = (const BYTE *)src; >>>>>>>> + const BYTE *const iend = ip + srcSize; >>>>>>>> >>>>>>>> BYTE *op = (BYTE *) dst; >>>>>>>> - BYTE * const oend = op + outputSize; >>>>>>>> + BYTE *const oend = op + outputSize; >>>>>>>> BYTE *cpy; >>>>>>>> >>>>>>>> - const BYTE * const dictEnd = (const BYTE *)dictStart + dictSize; >>>>>>>> - static const unsigned int inc32table[8] = {0, 1, 2, 1, 0, 4, 4, >>>>>>>> 4}; >>>>>>>> - static const int dec64table[8] = {0, 0, 0, -1, -4, 1, 2, 3}; >>>>>>>> + const BYTE *const dictEnd = (const BYTE *)dictStart + dictSize; >>>>>>>> >>>>>>>> const int safeDecode = (endOnInput == endOnInputSize); >>>>>>>> const int checkOffset = ((safeDecode) && (dictSize < (int)(64 * >>>>>>>> KB))); >>>>>>>> >>>>>>>> /* Set up the "end" pointers for the shortcut. */ >>>>>>>> const BYTE *const shortiend = iend - >>>>>>>> - (endOnInput ? 14 : 8) /*maxLL*/ - 2 /*offset*/; >>>>>>>> + (endOnInput ? 14 : 8) /*maxLL*/ - 2 /*offset*/; >>>>>>>> const BYTE *const shortoend = oend - >>>>>>>> - (endOnInput ? 14 : 8) /*maxLL*/ - 18 /*maxML*/; >>>>>>>> + (endOnInput ? 14 : 8) /*maxLL*/ - 18 /*maxML*/; >>>>>>>> + >>>>>>>> + const BYTE *match; >>>>>>>> + size_t offset; >>>>>>>> + unsigned int token; >>>>>>>> + size_t length; >>>>>>>> >>>>>>>> DEBUGLOG(5, "%s (srcSize:%i, dstSize:%i)", __func__, >>>>>>>> srcSize, outputSize); >>>>>>>> @@ -117,15 +245,194 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> if ((endOnInput) && unlikely(srcSize == 0)) >>>>>>>> return -1; >>>>>>>> >>>>>>>> - /* Main Loop : decode sequences */ >>>>>>>> +#if LZ4_FAST_DEC_LOOP >>>>>>>> + if ((oend - op) < FASTLOOP_SAFE_DISTANCE) { >>>>>>>> + DEBUGLOG(6, "skip fast decode loop"); >>>>>>>> + goto safe_decode; >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* Fast loop : decode sequences as long as output < >>>>>>>> +iend-FASTLOOP_SAFE_DISTANCE */ >>>>>>>> while (1) { >>>>>>>> - size_t length; >>>>>>>> - const BYTE *match; >>>>>>>> - size_t offset; >>>>>>>> + /* Main fastloop assertion: We can always wildcopy >>>>>>>> FASTLOOP_SAFE_DISTANCE */ >>>>>>>> + assert(oend - op >= FASTLOOP_SAFE_DISTANCE); >>>>>>>> + if (endOnInput) { >>>>>>>> + assert(ip < iend); >>>>>>>> + } >>>>>>>> + token = *ip++; >>>>>>>> + length = token >> ML_BITS; /* literal length */ >>>>>>>> + >>>>>>>> + assert(!endOnInput || ip <= iend); /* ip < iend >>>>>>>> before the increment */ >>>>>>>> + >>>>>>>> + /* decode literal length */ >>>>>>>> + if (length == RUN_MASK) { >>>>>>>> + variable_length_error error = ok; >>>>>>>> + length += >>>>>>>> + read_variable_length(&ip, iend - RUN_MASK, >>>>>>>> + endOnInput, endOnInput, >>>>>>>> + &error); >>>>>>>> + if (error == initial_error) { >>>>>>>> + goto _output_error; >>>>>>>> + } >>>>>>>> + if ((safeDecode) >>>>>>>> + && unlikely((uptrval) (op) + length < >>>>>>>> + (uptrval) (op))) { >>>>>>>> + goto _output_error; >>>>>>>> + } /* overflow detection */ >>>>>>>> + if ((safeDecode) >>>>>>>> + && unlikely((uptrval) (ip) + length < >>>>>>>> + (uptrval) (ip))) { >>>>>>>> + goto _output_error; >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* overflow detection */ >>>>>>>> + /* copy literals */ >>>>>>>> + cpy = op + length; >>>>>>>> + LZ4_STATIC_ASSERT(MFLIMIT >= WILDCOPYLENGTH); >>>>>>>> + if (endOnInput) { /* LZ4_decompress_safe() >>>>>>>> */ >>>>>>>> + if ((cpy > oend - 32) >>>>>>>> + || (ip + length > iend - 32)) { >>>>>>>> + goto safe_literal_copy; >>>>>>>> + } >>>>>>>> + LZ4_wildCopy32(op, ip, cpy); >>>>>>>> + } else { /* LZ4_decompress_fast() */ >>>>>>>> + if (cpy > oend - 8) { >>>>>>>> + goto safe_literal_copy; >>>>>>>> + } >>>>>>>> + LZ4_wildCopy(op, ip, cpy); >>>>>>>> + /* LZ4_decompress_fast() cannot copy >>>>>>>> more than 8 bytes at a time */ >>>>>>>> + /* it doesn't know input length, and >>>>>>>> only relies on end-of-block */ >>>>>>>> + /* properties */ >>>>>>>> + } >>>>>>>> + ip += length; >>>>>>>> + op = cpy; >>>>>>>> + } else { >>>>>>>> + cpy = op + length; >>>>>>>> + if (endOnInput) { /* LZ4_decompress_safe() >>>>>>>> */ >>>>>>>> + DEBUGLOG(7, >>>>>>>> + "copy %u bytes in a 16-bytes >>>>>>>> stripe", >>>>>>>> + (unsigned)length); >>>>>>>> + /* We don't need to check oend */ >>>>>>>> + /* since we check it once for each loop >>>>>>>> below */ >>>>>>>> + if (ip > iend - (16 + 1)) { /*max >>>>>>>> lit + offset + nextToken */ >>>>>>>> + goto safe_literal_copy; >>>>>>>> + } >>>>>>>> + /* Literals can only be 14, but hope >>>>>>>> compilers optimize */ >>>>>>>> + /*if we copy by a register size */ >>>>>>>> + memcpy(op, ip, 16); >>>>>>>> + } else { >>>>>>>> + /* LZ4_decompress_fast() cannot copy >>>>>>>> more than 8 bytes at a time */ >>>>>>>> + /* it doesn't know input length, and >>>>>>>> relies on end-of-block */ >>>>>>>> + /* properties */ >>>>>>>> + memcpy(op, ip, 8); >>>>>>>> + if (length > 8) { >>>>>>>> + memcpy(op + 8, ip + 8, 8); >>>>>>>> + } >>>>>>>> + } >>>>>>>> + ip += length; >>>>>>>> + op = cpy; >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* get offset */ >>>>>>>> + offset = LZ4_readLE16(ip); >>>>>>>> + ip += 2; /* end-of-block condition violated */ >>>>>>>> + match = op - offset; >>>>>>>> + >>>>>>>> + /* get matchlength */ >>>>>>>> + length = token & ML_MASK; >>>>>>>> >>>>>>>> - /* get literal length */ >>>>>>>> - unsigned int const token = *ip++; >>>>>>>> - length = token>>ML_BITS; >>>>>>>> + if ((checkOffset) && (unlikely(match + dictSize < >>>>>>>> lowPrefix))) { >>>>>>>> + goto _output_error; >>>>>>>> + } >>>>>>>> + /* Error : offset outside buffers */ >>>>>>>> + if (length == ML_MASK) { >>>>>>>> + variable_length_error error = ok; >>>>>>>> + length += >>>>>>>> + read_variable_length(&ip, iend - >>>>>>>> LASTLITERALS + 1, >>>>>>>> + endOnInput, 0, &error); >>>>>>>> + if (error != ok) { >>>>>>>> + goto _output_error; >>>>>>>> + } >>>>>>>> + if ((safeDecode) >>>>>>>> + && unlikely((uptrval) (op) + length < >>>>>>>> (uptrval) op)) { >>>>>>>> + goto _output_error; >>>>>>>> + } /* overflow detection */ >>>>>>>> + length += MINMATCH; >>>>>>>> + if (op + length >= oend - >>>>>>>> FASTLOOP_SAFE_DISTANCE) { >>>>>>>> + goto safe_match_copy; >>>>>>>> + } >>>>>>>> + } else { >>>>>>>> + length += MINMATCH; >>>>>>>> + if (op + length >= oend - >>>>>>>> FASTLOOP_SAFE_DISTANCE) { >>>>>>>> + goto safe_match_copy; >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* Fastpath check: Avoids a branch in >>>>>>>> LZ4_wildCopy32 if true */ >>>>>>>> + if (!(dict == usingExtDict) || (match >= >>>>>>>> lowPrefix)) { >>>>>>>> + if (offset >= 8) { >>>>>>>> + memcpy(op, match, 8); >>>>>>>> + memcpy(op + 8, match + 8, 8); >>>>>>>> + memcpy(op + 16, match + 16, 2); >>>>>>>> + op += length; >>>>>>>> + continue; >>>>>>>> + } >>>>>>>> + } >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* match starting within external dictionary */ >>>>>>>> + if ((dict == usingExtDict) && (match < lowPrefix)) { >>>>>>>> + if (unlikely(op + length > oend - LASTLITERALS)) >>>>>>>> { >>>>>>>> + if (partialDecoding) { >>>>>>>> + /* reach end of buffer */ >>>>>>>> + length = >>>>>>>> + min(length, (size_t) (oend - >>>>>>>> op)); >>>>>>>> + } else { >>>>>>>> + /* end-of-block condition >>>>>>>> violated */ >>>>>>>> + goto _output_error; >>>>>>>> + } >>>>>>>> + } >>>>>>>> + >>>>>>>> + if (length <= (size_t) (lowPrefix - match)) { >>>>>>>> + /* match fits entirely within external >>>>>>>> dictionary : just copy */ >>>>>>>> + memmove(op, dictEnd - (lowPrefix - >>>>>>>> match), length); >>>>>>>> + op += length; >>>>>>>> + } else { >>>>>>>> + /* match stretches into both external >>>>>>>> dict and current block */ >>>>>>>> + size_t const copySize = >>>>>>>> + (size_t) (lowPrefix - match); >>>>>>>> + size_t const restSize = length - >>>>>>>> copySize; >>>>>>>> + memcpy(op, dictEnd - copySize, copySize); >>>>>>>> + op += copySize; >>>>>>>> + if (restSize > (size_t) (op - >>>>>>>> lowPrefix)) { /* overlap copy */ >>>>>>>> + BYTE *const endOfMatch = op + >>>>>>>> restSize; >>>>>>>> + const BYTE *copyFrom = lowPrefix; >>>>>>>> + while (op < endOfMatch) { >>>>>>>> + *op++ = *copyFrom++; >>>>>>>> + } >>>>>>>> + } else { >>>>>>>> + memcpy(op, lowPrefix, restSize); >>>>>>>> + op += restSize; >>>>>>>> + } >>>>>>>> + } >>>>>>>> + continue; >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* copy match within block */ >>>>>>>> + cpy = op + length; >>>>>>>> + >>>>>>>> + assert((op <= oend) && (oend - op >= 32)); >>>>>>>> + if (unlikely(offset < 16)) { >>>>>>>> + LZ4_memcpy_using_offset(op, match, cpy, offset); >>>>>>>> + } else { >>>>>>>> + LZ4_wildCopy32(op, match, cpy); >>>>>>>> + } >>>>>>>> + >>>>>>>> + op = cpy; /* wildcopy correction */ >>>>>>>> + } >>>>>>>> + safe_decode: >>>>>>>> +#endif >>>>>>>> + /* Main Loop : decode sequences */ >>>>>>>> + while (1) { >>>>>>>> + length = token >> ML_BITS; >>>>>>>> >>>>>>>> /* ip < iend before the increment */ >>>>>>>> assert(!endOnInput || ip <= iend); >>>>>>>> @@ -143,26 +450,27 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> * combined check for both stages). >>>>>>>> */ >>>>>>>> if ((endOnInput ? length != RUN_MASK : length <= 8) >>>>>>>> - /* >>>>>>>> - * strictly "less than" on input, to re-enter >>>>>>>> - * the loop with at least one byte >>>>>>>> - */ >>>>>>>> - && likely((endOnInput ? ip < shortiend : 1) & >>>>>>>> - (op <= shortoend))) { >>>>>>>> + /* >>>>>>>> + * strictly "less than" on input, to re-enter >>>>>>>> + * the loop with at least one byte >>>>>>>> + */ >>>>>>>> + && likely((endOnInput ? ip < shortiend : 1) & >>>>>>>> + (op <= shortoend))) { >>>>>>>> /* Copy the literals */ >>>>>>>> memcpy(op, ip, endOnInput ? 16 : 8); >>>>>>>> - op += length; ip += length; >>>>>>>> + op += length; >>>>>>>> + ip += length; >>>>>>>> >>>>>>>> /* >>>>>>>> * The second stage: >>>>>>>> * prepare for match copying, decode full info. >>>>>>>> * If it doesn't work out, the info won't be >>>>>>>> wasted. >>>>>>>> */ >>>>>>>> - length = token & ML_MASK; /* match length */ >>>>>>>> + length = token & ML_MASK; /* match length >>>>>>>> */ >>>>>>>> offset = LZ4_readLE16(ip); >>>>>>>> ip += 2; >>>>>>>> match = op - offset; >>>>>>>> - assert(match <= op); /* check overflow */ >>>>>>>> + assert(match <= op); /* check overflow */ >>>>>>>> >>>>>>>> /* Do not deal with overlapping matches. */ >>>>>>>> if ((length != ML_MASK) && >>>>>>>> @@ -187,28 +495,24 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> >>>>>>>> /* decode literal length */ >>>>>>>> if (length == RUN_MASK) { >>>>>>>> - unsigned int s; >>>>>>>> >>>>>>>> - if (unlikely(endOnInput ? ip >= iend - RUN_MASK >>>>>>>> : 0)) { >>>>>>>> - /* overflow detection */ >>>>>>>> + variable_length_error error = ok; >>>>>>>> + length += >>>>>>>> + read_variable_length(&ip, iend - RUN_MASK, >>>>>>>> + endOnInput, endOnInput, >>>>>>>> + &error); >>>>>>>> + if (error == initial_error) >>>>>>>> goto _output_error; >>>>>>>> - } >>>>>>>> - do { >>>>>>>> - s = *ip++; >>>>>>>> - length += s; >>>>>>>> - } while (likely(endOnInput >>>>>>>> - ? ip < iend - RUN_MASK >>>>>>>> - : 1) & (s == 255)); >>>>>>>> >>>>>>>> if ((safeDecode) >>>>>>>> - && unlikely((uptrval)(op) + >>>>>>>> - length < (uptrval)(op))) { >>>>>>>> + && unlikely((uptrval) (op) + >>>>>>>> + length < (uptrval) (op))) { >>>>>>>> /* overflow detection */ >>>>>>>> goto _output_error; >>>>>>>> } >>>>>>>> if ((safeDecode) >>>>>>>> - && unlikely((uptrval)(ip) + >>>>>>>> - length < (uptrval)(ip))) { >>>>>>>> + && unlikely((uptrval) (ip) + >>>>>>>> + length < (uptrval) (ip))) { >>>>>>>> /* overflow detection */ >>>>>>>> goto _output_error; >>>>>>>> } >>>>>>>> @@ -216,11 +520,15 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> >>>>>>>> /* copy literals */ >>>>>>>> cpy = op + length; >>>>>>>> +#if LZ4_FAST_DEC_LOOP >>>>>>>> + safe_literal_copy: >>>>>>>> +#endif >>>>>>>> LZ4_STATIC_ASSERT(MFLIMIT >= WILDCOPYLENGTH); >>>>>>>> >>>>>>>> if (((endOnInput) && ((cpy > oend - MFLIMIT) >>>>>>>> - || (ip + length > iend - (2 + 1 + >>>>>>>> LASTLITERALS)))) >>>>>>>> - || ((!endOnInput) && (cpy > oend - >>>>>>>> WILDCOPYLENGTH))) { >>>>>>>> + || (ip + length > >>>>>>>> + iend - (2 + 1 + >>>>>>>> LASTLITERALS)))) >>>>>>>> + || ((!endOnInput) && (cpy > oend - WILDCOPYLENGTH))) >>>>>>>> { >>>>>>>> if (partialDecoding) { >>>>>>>> if (cpy > oend) { >>>>>>>> /* >>>>>>>> @@ -231,7 +539,7 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> length = oend - op; >>>>>>>> } >>>>>>>> if ((endOnInput) >>>>>>>> - && (ip + length > iend)) { >>>>>>>> + && (ip + length > iend)) { >>>>>>>> /* >>>>>>>> * Error : >>>>>>>> * read attempt beyond >>>>>>>> @@ -241,7 +549,7 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> } >>>>>>>> } else { >>>>>>>> if ((!endOnInput) >>>>>>>> - && (cpy != oend)) { >>>>>>>> + && (cpy != oend)) { >>>>>>>> /* >>>>>>>> * Error : >>>>>>>> * block decoding must >>>>>>>> @@ -250,7 +558,7 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> goto _output_error; >>>>>>>> } >>>>>>>> if ((endOnInput) >>>>>>>> - && ((ip + length != iend) >>>>>>>> + && ((ip + length != iend) >>>>>>>> || (cpy > oend))) { >>>>>>>> /* >>>>>>>> * Error : >>>>>>>> @@ -288,29 +596,14 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> goto _output_error; >>>>>>>> } >>>>>>>> >>>>>>>> - /* costs ~1%; silence an msan warning when offset == 0 */ >>>>>>>> - /* >>>>>>>> - * note : when partialDecoding, there is no guarantee >>>>>>>> that >>>>>>>> - * at least 4 bytes remain available in output buffer >>>>>>>> - */ >>>>>>>> - if (!partialDecoding) { >>>>>>>> - assert(oend > op); >>>>>>>> - assert(oend - op >= 4); >>>>>>>> - >>>>>>>> - LZ4_write32(op, (U32)offset); >>>>>>>> - } >>>>>>>> - >>>>>>>> if (length == ML_MASK) { >>>>>>>> - unsigned int s; >>>>>>>> - >>>>>>>> - do { >>>>>>>> - s = *ip++; >>>>>>>> - >>>>>>>> - if ((endOnInput) && (ip > iend - >>>>>>>> LASTLITERALS)) >>>>>>>> - goto _output_error; >>>>>>>> >>>>>>>> - length += s; >>>>>>>> - } while (s == 255); >>>>>>>> + variable_length_error error = ok; >>>>>>>> + length += >>>>>>>> + read_variable_length(&ip, iend - >>>>>>>> LASTLITERALS + 1, >>>>>>>> + endOnInput, 0, &error); >>>>>>>> + if (error != ok) >>>>>>>> + goto _output_error; >>>>>>>> >>>>>>>> if ((safeDecode) >>>>>>>> && unlikely( >>>>>>>> @@ -322,6 +615,10 @@ static FORCE_INLINE int LZ4_decompress_generic( >>>>>>>> >>>>>>>> length += MINMATCH; >>>>>>>> >>>>>>>> +#if LZ4_FAST_DEC_LOOP >>>>>>>> +safe_match_copy: >>>>>>>> +#endif >>>>>>>> + >>>>>>>> /* match starting within external dictionary */ >>>>>>>> if ((dict == usingExtDict) && (match < lowPrefix)) { >>>>>>>> if (unlikely(op + length > oend - LASTLITERALS)) >>>>>>>> { >>>>>>>>