Re: [PATCH v4 2/5] lib/lz4: update LZ4 decompressor module
On 2024/5/28 21:28, Jonathan Liu wrote: Hi Jianan, Here are the LZ4 decompression times I got running "time unlz4 ..." on Rock 4 SE with RK3399 CPU. v2024.04: 1.329 seconds v2024.04 with 26c7fdadcb7 ("lib/lz4: update LZ4 decompressor module") reverted: 0.665 seconds v2024.04 with your patch: 1.216 seconds I managed to get better performance by optimizing it myself. v2024.04 with my patch: 0.785 seconds With my patch, it makes no difference if I use __builtin_memcpy or memcpy in lz4.c and lz4_wrapper.c so I just left it using memcpy. It is still slower than reverting the LZ4 update though. I'm not sure what's the old version come from, but from the copyright itself it seems some earlier 2015 version. Note that there is no absolute outperform between old version and new version, old version may be lack of some necessary boundary check or lead to some msan warning or something, like a new commit of year 2016: https://github.com/lz4/lz4/commit/f094f531441140f10fd461ba769f49d10f5cd581 /* costs ~1%; silence an msan warning when offset == 0 */ /* * note : when partialDecoding, there is no guarantee that * at least 4 bytes remain available in output buffer */ if (!partialDecoding) { assert(oend > op); assert(oend - op >= 4); LZ4_write32(op, (U32)offset); } For the example above, you could completely remove the line above if you don't care about such warning, as it claims ~1% performance loss. Also since no u-boot user uses in-place decompression and if you memmove() doesn't behave well, you could update the following line /* * supports overlapping memory regions; only matters * for in-place decompression scenarios */ memmove(op, ip, length); into memcpy() instead. The new lz4 codebase relies on memcpy() / memmove() code optimization more than old version, if memcpy() assembly doesn't generate well on your platform, it might have some issue. Thanks, Gao Xiang My patch: --- a/lib/lz4.c +++ b/lib/lz4.c @@ -41,6 +41,16 @@ static FORCE_INLINE u16 LZ4_readLE16(const void *src) return get_unaligned_le16(src); } +static FORCE_INLINE void LZ4_copy2(void *dst, const void *src) +{ + put_unaligned(get_unaligned((const u16 *)src), (u16 *)dst); +} + +static FORCE_INLINE void LZ4_copy4(void *dst, const void *src) +{ + put_unaligned(get_unaligned((const u32 *)src), (u32 *)dst); +} + static FORCE_INLINE void LZ4_copy8(void *dst, const void *src) { put_unaligned(get_unaligned((const u64 *)src), (u64 *)dst); @@ -215,7 +225,10 @@ static FORCE_INLINE int LZ4_decompress_generic( && likely((endOnInput ? ip < shortiend : 1) & (op <= shortoend))) { /* Copy the literals */ - memcpy(op, ip, endOnInput ? 16 : 8); + LZ4_copy8(op, ip); + if (endOnInput) + LZ4_copy8(op + 8, ip + 8); + op += length; ip += length; /* @@ -234,9 +247,9 @@ static FORCE_INLINE int LZ4_decompress_generic( (offset >= 8) && (dict == withPrefix64k || match >= lowPrefix)) { /* Copy the match. */ - memcpy(op + 0, match + 0, 8); - memcpy(op + 8, match + 8, 8); - memcpy(op + 16, match + 16, 2); + LZ4_copy8(op, match); + LZ4_copy8(op + 8, match + 8); + LZ4_copy2(op + 16, match + 16); op += length + MINMATCH; /* Both stages worked, load the next token. */ continue; @@ -466,7 +479,7 @@ _copy_match: op[2] = match[2]; op[3] = match[3]; match += inc32table[offset]; - memcpy(op + 4, match, 4); + LZ4_copy4(op + 4, match); match -= dec64table[offset]; } else { LZ4_copy8(op, match); Let me know if you have any further suggestions. Thanks. Regards, Jonathan On Sun, 26 May 2024 at 22:18, Jianan Huang wrote: Hi Jonathan, Could you please try the following patch ? It replaces all memcpy() calls in lz4 with __builtin_memcpy(). diff --git a/lib/lz4.c b/lib/lz4.c index d365dc727c..2afe31c1c3 100644 --- a/lib/lz4.c +++ b/lib/lz4.c @@ -34,6 +34,8 @@ #include #include +#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size) + #define FORCE_INLINE inline __attribute__((always_inline)) static FORCE_INLINE u16
Re: [PATCH v4 2/5] lib/lz4: update LZ4 decompressor module
Hi Jianan, Here are the LZ4 decompression times I got running "time unlz4 ..." on Rock 4 SE with RK3399 CPU. v2024.04: 1.329 seconds v2024.04 with 26c7fdadcb7 ("lib/lz4: update LZ4 decompressor module") reverted: 0.665 seconds v2024.04 with your patch: 1.216 seconds I managed to get better performance by optimizing it myself. v2024.04 with my patch: 0.785 seconds With my patch, it makes no difference if I use __builtin_memcpy or memcpy in lz4.c and lz4_wrapper.c so I just left it using memcpy. It is still slower than reverting the LZ4 update though. My patch: --- a/lib/lz4.c +++ b/lib/lz4.c @@ -41,6 +41,16 @@ static FORCE_INLINE u16 LZ4_readLE16(const void *src) return get_unaligned_le16(src); } +static FORCE_INLINE void LZ4_copy2(void *dst, const void *src) +{ + put_unaligned(get_unaligned((const u16 *)src), (u16 *)dst); +} + +static FORCE_INLINE void LZ4_copy4(void *dst, const void *src) +{ + put_unaligned(get_unaligned((const u32 *)src), (u32 *)dst); +} + static FORCE_INLINE void LZ4_copy8(void *dst, const void *src) { put_unaligned(get_unaligned((const u64 *)src), (u64 *)dst); @@ -215,7 +225,10 @@ static FORCE_INLINE int LZ4_decompress_generic( && likely((endOnInput ? ip < shortiend : 1) & (op <= shortoend))) { /* Copy the literals */ - memcpy(op, ip, endOnInput ? 16 : 8); + LZ4_copy8(op, ip); + if (endOnInput) + LZ4_copy8(op + 8, ip + 8); + op += length; ip += length; /* @@ -234,9 +247,9 @@ static FORCE_INLINE int LZ4_decompress_generic( (offset >= 8) && (dict == withPrefix64k || match >= lowPrefix)) { /* Copy the match. */ - memcpy(op + 0, match + 0, 8); - memcpy(op + 8, match + 8, 8); - memcpy(op + 16, match + 16, 2); + LZ4_copy8(op, match); + LZ4_copy8(op + 8, match + 8); + LZ4_copy2(op + 16, match + 16); op += length + MINMATCH; /* Both stages worked, load the next token. */ continue; @@ -466,7 +479,7 @@ _copy_match: op[2] = match[2]; op[3] = match[3]; match += inc32table[offset]; - memcpy(op + 4, match, 4); + LZ4_copy4(op + 4, match); match -= dec64table[offset]; } else { LZ4_copy8(op, match); Let me know if you have any further suggestions. Thanks. Regards, Jonathan On Sun, 26 May 2024 at 22:18, Jianan Huang wrote: > > Hi Jonathan, > > Could you please try the following patch ? It replaces all memcpy() calls in > lz4 with __builtin_memcpy(). > > diff --git a/lib/lz4.c b/lib/lz4.c > index d365dc727c..2afe31c1c3 100644 > --- a/lib/lz4.c > +++ b/lib/lz4.c > @@ -34,6 +34,8 @@ > #include > #include > > +#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size) > + > #define FORCE_INLINE inline __attribute__((always_inline)) > > static FORCE_INLINE u16 LZ4_readLE16(const void *src) > @@ -215,7 +217,7 @@ static FORCE_INLINE int LZ4_decompress_generic( > && likely((endOnInput ? ip < shortiend : 1) & > (op <= shortoend))) { > /* Copy the literals */ > - memcpy(op, ip, endOnInput ? 16 : 8); > + LZ4_memcpy(op, ip, endOnInput ? 16 : 8); > op += length; ip += length; > > /* > @@ -234,9 +236,9 @@ static FORCE_INLINE int LZ4_decompress_generic( > (offset >= 8) && > (dict == withPrefix64k || match >= lowPrefix)) { > /* Copy the match. */ > - memcpy(op + 0, match + 0, 8); > - memcpy(op + 8, match + 8, 8); > - memcpy(op + 16, match + 16, 2); > + LZ4_memcpy(op + 0, match + 0, 8); > + LZ4_memcpy(op + 8, match + 8, 8); > + LZ4_memcpy(op + 16, match + 16, 2); > op += length + MINMATCH; > /* Both stages worked, load the next token. */ > continue; > @@ -416,7 +418,7 @@ _copy_match: > size_t const copySize = (size_t)(lowPrefix - match); > size_t const restSize = length - copySize; > > - memcpy(op, dictEnd - copySize, copySize); > + LZ4_memcpy(op, dictEnd - copySize, copySize); > op += copySize; > if (restSize > (size_t)(op - lowPrefix)) { > /* overlap copy */ > @@ -426,7 +428,7 @@ _copy_match: > while (op < endOfMatch) > *op++ = *copyFrom++; > } else { > - memcpy(op, lowPrefix, restSize); > + LZ4_memcpy(op, lowPrefix, restSize); > op += restSize; > } > } > @@ -452,7 +454,7 @@ _copy_match: > while (op < copyEnd) > *op++ = *match++; > } else { > - memcpy(op, match, mlen); > + LZ4_memcpy(op, match, mlen); > } > op = copyEnd; > if (op == oend) > @@ -466,7 +468,7 @@ _copy_match: > op[2] = match[2]; > op[3] = match[3]; >
Re: [PATCH v4 2/5] lib/lz4: update LZ4 decompressor module
Hi Jonathan, Could you please try the following patch ? It replaces all memcpy() calls in lz4 with __builtin_memcpy(). diff --git a/lib/lz4.c b/lib/lz4.c index d365dc727c..2afe31c1c3 100644 --- a/lib/lz4.c +++ b/lib/lz4.c @@ -34,6 +34,8 @@ #include #include +#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size) + #define FORCE_INLINE inline __attribute__((always_inline)) static FORCE_INLINE u16 LZ4_readLE16(const void *src) @@ -215,7 +217,7 @@ static FORCE_INLINE int LZ4_decompress_generic( && likely((endOnInput ? ip < shortiend : 1) & (op <= shortoend))) { /* Copy the literals */ - memcpy(op, ip, endOnInput ? 16 : 8); + LZ4_memcpy(op, ip, endOnInput ? 16 : 8); op += length; ip += length; /* @@ -234,9 +236,9 @@ static FORCE_INLINE int LZ4_decompress_generic( (offset >= 8) && (dict == withPrefix64k || match >= lowPrefix)) { /* Copy the match. */ - memcpy(op + 0, match + 0, 8); - memcpy(op + 8, match + 8, 8); - memcpy(op + 16, match + 16, 2); + LZ4_memcpy(op + 0, match + 0, 8); + LZ4_memcpy(op + 8, match + 8, 8); + LZ4_memcpy(op + 16, match + 16, 2); op += length + MINMATCH; /* Both stages worked, load the next token. */ continue; @@ -416,7 +418,7 @@ _copy_match: size_t const copySize = (size_t)(lowPrefix - match); size_t const restSize = length - copySize; - memcpy(op, dictEnd - copySize, copySize); + LZ4_memcpy(op, dictEnd - copySize, copySize); op += copySize; if (restSize > (size_t)(op - lowPrefix)) { /* overlap copy */ @@ -426,7 +428,7 @@ _copy_match: while (op < endOfMatch) *op++ = *copyFrom++; } else { - memcpy(op, lowPrefix, restSize); + LZ4_memcpy(op, lowPrefix, restSize); op += restSize; } } @@ -452,7 +454,7 @@ _copy_match: while (op < copyEnd) *op++ = *match++; } else { - memcpy(op, match, mlen); + LZ4_memcpy(op, match, mlen); } op = copyEnd; if (op == oend) @@ -466,7 +468,7 @@ _copy_match: op[2] = match[2]; op[3] = match[3]; match += inc32table[offset]; - memcpy(op + 4, match, 4); + LZ4_memcpy(op + 4, match, 4); match -= dec64table[offset]; } else { LZ4_copy8(op, match); diff --git a/lib/lz4_wrapper.c b/lib/lz4_wrapper.c index 4d48e7b0e8..e09c8d7057 100644 --- a/lib/lz4_wrapper.c +++ b/lib/lz4_wrapper.c @@ -80,7 +80,7 @@ int ulz4fn(const void *src, size_t srcn, void *dst, size_t *dstn) if (block_header & LZ4F_BLOCKUNCOMPRESSED_FLAG) { size_t size = min((ptrdiff_t)block_size, (ptrdiff_t)(end - out)); - memcpy(out, in, size); + LZ4_memcpy(out, in, size); out += size; if (size < block_size) { ret = -ENOBUFS; /* output overrun */ Thanks, Jianan On 2024/5/26 16:06, Jonathan Liu wrote: Hi Gao, On Sat, 25 May 2024 at 02:52, Gao Xiang wrote: Hi, On 2024/5/24 22:26, Jonathan Liu wrote: Hi Jianan, On Sat, 26 Feb 2022 at 18:05, Huang Jianan wrote: Update the LZ4 compression module based on LZ4 v1.8.3 in order to use the newest LZ4_decompress_safe_partial() which can now decode exactly the nb of bytes requested. Signed-off-by: Huang Jianan I noticed after this commit LZ4 decompression is slower. ulz4fn function call takes 1.209670 seconds with this commit. After reverting this commit, the ulz4fn function call takes 0.587032 seconds. I am decompressing a LZ4 compressed kernel (compressed with lz4 v1.9.4 using -9 option for maximum compression) on RK3399. Any ideas why it is slower with this commit and how the performance regression can be fixed? Just the quick glance, I think the issue may be due to memcpy/memmove since it seems the main difference between these two codebases (I'm not sure which LZ4 version the old codebase was based on) and the new version mainly relies on memcpy/memmove instead of its own versions. Would you mind to check the assembly how memcpy/memset is generated on your platform? Here is the assembly (-mcpu=cortex-a72.cortex-a53 -march=armv8-a+crc+crypto): 0028220c : #if !CONFIG_IS_ENABLED(TINY_MEMSET) unsigned long cl = 0; int i; /* do it one word at a time (32 bits or 64 bits) while possible */ if ( ((ulong)s & (sizeof(*sl) - 1)) == 0) { 28220c: f2400803andsx3, x0, #0x7 282210: 540002c1b.ne282268 // b.any for (i = 0; i < sizeof(*sl); i++) { cl <<= 8; cl |= c & 0xff; 282214: 92401c26and x6, x1, #0xff unsigned long cl = 0; 282218: d284mov x4, #0x0// #0 28221c: 52800105mov w5, #0x8// #8 cl |= c & 0xff; 282220: aa0420c4orr x4, x6, x4, lsl #8 for (i = 0; i < sizeof(*sl); i++) { 282224: 710004a5subsw5, w5, #0x1 282228: 54c1b.ne282220 // b.any } while (count >= sizeof(*sl)) { 28222c: cb030045sub x5, x2, x3 282230: f1001cbf
Re: [PATCH v4 2/5] lib/lz4: update LZ4 decompressor module
Hi Gao, On Sat, 25 May 2024 at 02:52, Gao Xiang wrote: > > Hi, > > On 2024/5/24 22:26, Jonathan Liu wrote: > > Hi Jianan, > > > > On Sat, 26 Feb 2022 at 18:05, Huang Jianan wrote: > >> > >> Update the LZ4 compression module based on LZ4 v1.8.3 in order to > >> use the newest LZ4_decompress_safe_partial() which can now decode > >> exactly the nb of bytes requested. > >> > >> Signed-off-by: Huang Jianan > > > > I noticed after this commit LZ4 decompression is slower. > > ulz4fn function call takes 1.209670 seconds with this commit. > > After reverting this commit, the ulz4fn function call takes 0.587032 > > seconds. > > > > I am decompressing a LZ4 compressed kernel (compressed with lz4 v1.9.4 > > using -9 option for maximum compression) on RK3399. > > > > Any ideas why it is slower with this commit and how the performance > > regression can be fixed? > > Just the quick glance, I think the issue may be due to memcpy/memmove > since it seems the main difference between these two codebases > (I'm not sure which LZ4 version the old codebase was based on) and > the new version mainly relies on memcpy/memmove instead of its own > versions. > > Would you mind to check the assembly how memcpy/memset is generated > on your platform? Here is the assembly (-mcpu=cortex-a72.cortex-a53 -march=armv8-a+crc+crypto): 0028220c : #if !CONFIG_IS_ENABLED(TINY_MEMSET) unsigned long cl = 0; int i; /* do it one word at a time (32 bits or 64 bits) while possible */ if ( ((ulong)s & (sizeof(*sl) - 1)) == 0) { 28220c: f2400803andsx3, x0, #0x7 282210: 540002c1b.ne282268 // b.any for (i = 0; i < sizeof(*sl); i++) { cl <<= 8; cl |= c & 0xff; 282214: 92401c26and x6, x1, #0xff unsigned long cl = 0; 282218: d284mov x4, #0x0// #0 28221c: 52800105mov w5, #0x8// #8 cl |= c & 0xff; 282220: aa0420c4orr x4, x6, x4, lsl #8 for (i = 0; i < sizeof(*sl); i++) { 282224: 710004a5subsw5, w5, #0x1 282228: 54c1b.ne282220 // b.any } while (count >= sizeof(*sl)) { 28222c: cb030045sub x5, x2, x3 282230: f1001cbfcmp x5, #0x7 282234: 54000148b.hi28225c // b.pmore 282238: d343fc43lsr x3, x2, #3 28223c: 928000e4mov x4, #0xfff8 // #-8 282240: 9b047c63mul x3, x3, x4 282244: 8b030042add x2, x2, x3 282248: cb030003sub x3, x0, x3 unsigned long *sl = (unsigned long *) s; 28224c: d284mov x4, #0x0// #0 count -= sizeof(*sl); } } #endif /* fill 8 bits at a time */ s8 = (char *)sl; while (count--) 282250: eb04005fcmp x2, x4 282254: 54e1b.ne282270 // b.any *s8++ = c; return s; } 282258: d65f03c0ret *sl++ = cl; 28225c: f8236804str x4, [x0, x3] count -= sizeof(*sl); 282260: 91002063add x3, x3, #0x8 282264: 17f2b 28222c unsigned long *sl = (unsigned long *) s; 282268: aa0003e3mov x3, x0 28226c: 17f8b 28224c *s8++ = c; 282270: 38246861strbw1, [x3, x4] 282274: 91000484add x4, x4, #0x1 282278: 17f6b 282250 0028227c : __used void * memcpy(void *dest, const void *src, size_t count) { unsigned long *dl = (unsigned long *)dest, *sl = (unsigned long *)src; char *d8, *s8; if (src == dest) 28227c: eb01001fcmp x0, x1 282280: 54000100b.eq2822a0 // b.none return dest; /* while all data is aligned (common case), copy a word at a time */ if ( (((ulong)dest | (ulong)src) & (sizeof(*dl) - 1)) == 0) { 282284: aa010003orr x3, x0, x1 282288: f2400863andsx3, x3, #0x7 28228c: 54000120b.eq2822b0 // b.none 282290: aa0003e4mov x4, x0 282294: d283mov x3, #0x0// #0 } } /* copy the reset one byte at a time */ d8 = (char *)dl; s8 = (char *)sl; while (count--) 282298: eb03005fcmp x2, x3 28229c: 540001e1b.ne2822d8 // b.any *d8++ = *s8++; return dest; } 2822a0: d65f03c0ret
Re: [PATCH v4 2/5] lib/lz4: update LZ4 decompressor module
Hi Jianan, On Sat, 26 Feb 2022 at 18:05, Huang Jianan wrote: > > Update the LZ4 compression module based on LZ4 v1.8.3 in order to > use the newest LZ4_decompress_safe_partial() which can now decode > exactly the nb of bytes requested. > > Signed-off-by: Huang Jianan I noticed after this commit LZ4 decompression is slower. ulz4fn function call takes 1.209670 seconds with this commit. After reverting this commit, the ulz4fn function call takes 0.587032 seconds. I am decompressing a LZ4 compressed kernel (compressed with lz4 v1.9.4 using -9 option for maximum compression) on RK3399. Any ideas why it is slower with this commit and how the performance regression can be fixed? Thanks. Regards, Jonathan
Re: [PATCH v4 2/5] lib/lz4: update LZ4 decompressor module
Hi, On 2024/5/24 22:26, Jonathan Liu wrote: Hi Jianan, On Sat, 26 Feb 2022 at 18:05, Huang Jianan wrote: Update the LZ4 compression module based on LZ4 v1.8.3 in order to use the newest LZ4_decompress_safe_partial() which can now decode exactly the nb of bytes requested. Signed-off-by: Huang Jianan I noticed after this commit LZ4 decompression is slower. ulz4fn function call takes 1.209670 seconds with this commit. After reverting this commit, the ulz4fn function call takes 0.587032 seconds. I am decompressing a LZ4 compressed kernel (compressed with lz4 v1.9.4 using -9 option for maximum compression) on RK3399. Any ideas why it is slower with this commit and how the performance regression can be fixed? Just the quick glance, I think the issue may be due to memcpy/memmove since it seems the main difference between these two codebases (I'm not sure which LZ4 version the old codebase was based on) and the new version mainly relies on memcpy/memmove instead of its own versions. Would you mind to check the assembly how memcpy/memset is generated on your platform? Thanks, Gao Xiang Thanks. Regards, Jonathan
Re: [PATCH v4 2/5] lib/lz4: update LZ4 decompressor module
On Sat, Feb 26, 2022 at 03:05:48PM +0800, Huang Jianan wrote: > Update the LZ4 compression module based on LZ4 v1.8.3 in order to > use the newest LZ4_decompress_safe_partial() which can now decode > exactly the nb of bytes requested. > > Signed-off-by: Huang Jianan Applied to u-boot/next, thanks! -- Tom signature.asc Description: PGP signature