On Wed, 2014-05-28 at 10:40 -0400, George Spelvin wrote: > While following a number of tangents in the code (I was figuring out > how to edit lib/Kconfig; don't ask), I came across a table of 256 64-bit > words, all of which had the high half set to zero. > > Since the code depends on both pclmulq and crc32, SSE 4.1 is obviously > present, so it could use pmovzxdq and save 1K of kernel data. > > The following patch obviously lacks the kludges for old binutils, > but should convey the general idea. > > Jan: Is support for SLE10's pre-2.18 binutils still required? > Your PEXTRD fix was only a year ago, so I expect, but I wanted to ask. > > Two other minor additional changes: > > 1. The current code unnecessarily puts the table in the read-write > .data section. Moved to .text. > 2. I'm also not sure why it's necessary to force such large alignment > on K_table. Comments on reducing it? > > Signed-off-by: George Spelvin <li...@horizon.com> > > > diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S > b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S > index dbc4339b..9f885ee4 100644 > --- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S > +++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S > @@ -216,15 +216,11 @@ LABEL crc_ %i > ## 4) Combine three results: > ################################################################ > > - lea (K_table-16)(%rip), bufp # first entry is for idx 1 > + lea (K_table-8)(%rip), bufp # first entry is for idx 1 > shlq $3, %rax # rax *= 8 > - subq %rax, tmp # tmp -= rax*8 > - shlq $1, %rax > - subq %rax, tmp # tmp -= rax*16 > - # (total tmp -= rax*24) > - addq %rax, bufp > - > - movdqa (bufp), %xmm0 # 2 consts: K1:K2 > + pmovzxdq (bufp,%rax), %xmm0 # 2 consts: K1:K2
Changing from the aligned move (movdqa) to unaligned move and zeroing (pmovzxdq), is going to make things slower. If the table is aligned on 8 byte boundary, some of the table can span 2 cache lines, which can slow things further. We are trading speed for only 4096 bytes of memory save, which is likely not a good trade for most systems except for those really constrained of memory. For this kind of non-performance critical system, it may as well use the generic crc32c algorithm and compile out this module. Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/