Thanks for the reply!

> Changing from the aligned move (movdqa) to unaligned move and zeroing
> (pmovzxdq), is going to make things slower. If the table is aligned
> on 8 byte boundary, some of the table can span 2 cache lines, which
> can slow things further.
Um, two notes:

1) This load is performed once per 3072-byte block, which is a minimum
   of 128 cycles just for the crc32q instructions, never mind all the
   pclmulqdq folderol. Is it really more than 2 cycles? Heck, does it
   cost *any* overall time, given that it's preceded by a stretch of
   384 instructions that it's not data-dependent on? I'll do some
   benchmarking to find out.

2) The shrunk table entries are 8 bytes long, and so can't span a
   cache line. Is there any benefit to using a larger alignment, other
   than the very small issue of the full table needing 1 more cache
   line to be fully cached?

> We are trading speed for only 4096 bytes of memory save,
> which is likely not a good trade for most systems except for
> those really constrained of memory. For this kind of non-performance
> critical system, it may as well use the generic crc32c algorithm and
> compile out this module.

I hadn't intended to cause any speed penalty at all. Do you really
think there will be one?