Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
On Wed, 2014-05-28 at 23:26 -0400, George Spelvin wrote: > > Can you do a tcrypt speed measurement with and without your changes? > > Check to see if there's any slowdown. Please make sure you pin > > the frequency of your cpu when running the test. > > > > e.g. > > echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > > I just now re-read your e-mail and noticed you suggested a specific tool. Try to run the standard kernel crypto test with tcrypt. For speed test of crc32c, use test 319: modprobe tcrypt mode=319 Then you will see the output in dmesg (or tail of /var/log/messages). It will give you the cycles you spent for various block sizes. For consistent test numbers, before test, disable turbo mode of cpu in BIOS and pin frequency of all your cpus to max with something like i=0 num_cpus=`cat /proc/cpuinfo| grep "^processor"| wc -l ` while [ $i -lt $num_cpus ] do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor i=`expr $i + 1` done > Oops, I haven't run that yet. I just made up my own in user space. > As I mentioned, since the changes are to the main loop that operates on > aligned buffers in multiples of 24 bytes, I focused my benchmarking there: > > #define BUFFER 6114 > static unsigned char buf[BUFFER] __attribute__ ((aligned(8))); > #define ITER 24 /* Number of test iterations */ > > uint32_t > do_test(uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t)) > { > int i, j; > for (i = 0; i < BUFFER; i += 8) > for (j = i+24; j <= BUFFER; j += 24) > crc = f(buf+i, j-i, crc); > return crc; > } > > uint32_t > time_test(uint64_t *time, uint32_t crc, uint32_t (*f)(void const *, unsigned, > ui > nt32_t)) > { > uint64_t start = rdtsc(); > crc = do_test(crc, f); > *time = rdtsc() - start; > return crc; > } > > The actual test goes in ABBA order to reduce bias: > > for (i = 0; i < ITER; i += 2) { > crc1 = time_test(times[i]+0, crc1, crc_pcl_1); > crc2 = time_test(times[i]+1, crc2, crc_pcl_2); > crc2 = time_test(times[i+1]+1, crc2, crc_pcl_2); > crc1 = time_test(times[i+1]+0, crc1, crc_pcl_1); > } > > crc_pcl_1 is the old code, crc_pcl_2 is my revised version. > > > The results are as follows (the last line is a total): > > Old code New code > 0: 85009953 71812457 (-13197496) > 1: 57408829 63361572 (+5952743) Maybe your cpu has not been pinned to constant frequency? The cycles are much higher in the first few iterations. Likely cpu frequency is going up when governor detect the load on cpu. Please also check that turbo is turned off as this can introduce much variations in your testing. > 2: 52552399 49195266 (-3357133) > 3: 43595130 45988364 (+2393234) > 4: 41541760 39714198 (-1827562) > 5: 36576082 38021344 (+1445262) > 6: 35307854 34150656 (-1157198) > 7: 32182230 33134236 (+952006) > 8: 31341596 31307004 (-34592) > 9: 31340900 31329408 (-11492) > 10: 31344884 31329144 (-15740) > 11: 31334144 31312492 (-21652) > 12: 31338992 31330356 (-8636) > 13: 31343744 31311344 (-32400) > 14: 31339000 31340196 (+1196) > 15: 31337492 31313988 (-23504) > 16: 31341688 31334040 (-7648) > 17: 31341804 31308936 (-32868) > 18: 31339936 31332020 (-7916) > 19: 31323228 31324240 (+1012) > 20: 31339744 31331768 (-7976) > 21: 31321536 31332688 (+11152) > 22: 31340280 31335212 (-5068) > 23: 31332056 31335768 (+3712) Looks encouraging that the time difference is fairly small between the two algorithms. > 24:885575261876586697 (-8988564) > > It doesn't look like a slowdown; more like a 1% speedup. You will need to throw away the first few iterations of the test to account for cache warming effects. Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
> Can you do a tcrypt speed measurement with and without your changes? > Check to see if there's any slowdown. Please make sure you pin > the frequency of your cpu when running the test. > > e.g. > echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor I just now re-read your e-mail and noticed you suggested a specific tool. Oops, I haven't run that yet. I just made up my own in user space. As I mentioned, since the changes are to the main loop that operates on aligned buffers in multiples of 24 bytes, I focused my benchmarking there: #define BUFFER 6114 static unsigned char buf[BUFFER] __attribute__ ((aligned(8))); #define ITER 24 /* Number of test iterations */ uint32_t do_test(uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t)) { int i, j; for (i = 0; i < BUFFER; i += 8) for (j = i+24; j <= BUFFER; j += 24) crc = f(buf+i, j-i, crc); return crc; } uint32_t time_test(uint64_t *time, uint32_t crc, uint32_t (*f)(void const *, unsigned, ui nt32_t)) { uint64_t start = rdtsc(); crc = do_test(crc, f); *time = rdtsc() - start; return crc; } The actual test goes in ABBA order to reduce bias: for (i = 0; i < ITER; i += 2) { crc1 = time_test(times[i]+0, crc1, crc_pcl_1); crc2 = time_test(times[i]+1, crc2, crc_pcl_2); crc2 = time_test(times[i+1]+1, crc2, crc_pcl_2); crc1 = time_test(times[i+1]+0, crc1, crc_pcl_1); } crc_pcl_1 is the old code, crc_pcl_2 is my revised version. The results are as follows (the last line is a total): Old code New code 0: 85009953 71812457 (-13197496) 1: 57408829 63361572 (+5952743) 2: 52552399 49195266 (-3357133) 3: 43595130 45988364 (+2393234) 4: 41541760 39714198 (-1827562) 5: 36576082 38021344 (+1445262) 6: 35307854 34150656 (-1157198) 7: 32182230 33134236 (+952006) 8: 31341596 31307004 (-34592) 9: 31340900 31329408 (-11492) 10: 31344884 31329144 (-15740) 11: 31334144 31312492 (-21652) 12: 31338992 31330356 (-8636) 13: 31343744 31311344 (-32400) 14: 31339000 31340196 (+1196) 15: 31337492 31313988 (-23504) 16: 31341688 31334040 (-7648) 17: 31341804 31308936 (-32868) 18: 31339936 31332020 (-7916) 19: 31323228 31324240 (+1012) 20: 31339744 31331768 (-7976) 21: 31321536 31332688 (+11152) 22: 31340280 31335212 (-5068) 23: 31332056 31335768 (+3712) 24:885575261876586697 (-8988564) I swapped the link order of the two .o files in case cache placement made a difference: 0: 84305981 71483150 (-12822831) 1: 57341376 63129024 (+5787648) 2: 52361618 49240069 (-3121549) 3: 43520576 45822670 (+2302094) 4: 41500104 39684116 (-1815988) 5: 36542864 37940196 (+1397332) 6: 35281570 34144348 (-1137222) 7: 32149420 33088652 (+939232) 8: 31342368 31329056 (-13312) 9: 31338788 31313212 (-25576) 10: 31336324 31335612 (-712) 11: 31341892 31319576 (-22316) 12: 31336224 31322808 (-13416) 13: 31338560 31315084 (-23476) 14: 31338332 31332976 (-5356) 15: 31337300 31315088 (-22212) 16: 31334300 31330884 (-3416) 17: 31318660 31329916 (+11256) 18: 31334984 31327740 (-7244) 19: 31315084 31327768 (+12684) 20: 31334708 31345872 (+11164) 21: 31325988 31330948 (+4960) 22: 31333956 31339800 (+5844) 23: 31322880 31327316 (+4436) 24:884333857875775881 (-8557976) It doesn't look like a slowdown; more like a 1% speedup. I'll figure out tcrypt in a bit. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
> Can you do a tcrypt speed measurement with and without your changes? > Check to see if there's any slowdown. Please make sure you pin > the frequency of your cpu when running the test. Sure thing; I was already inspired to do that based on your concerns. Do you have any particular buffer sizes or alignments you'd suggest? Since I'm changing only the three-part core, I was going to avoid unaligned or short buffers, stick with a single buffer so it stays in L1 D-cache, but vary the length so we use lots of the K_table. It's not the RAM I was worried about, but the D-cache wasted on on the K table. Which doesn't affect the CRC code itself, but the surrounding kernel code. I'm also thinking of some ideas for handling even larger buffer sizes without having to interrupt the 3-way main loop. Pclmulqdq can mutiply up to 4 32-bit values to produce a 128-bit result, which crc32 can efficiently reduce. So if we have three tables, of x^(64*n) x^(4096*n), and x^(262144*n), each for n=0..63, we can multiply them all together to handle up to a 16 MiB chunk. The other option is to schedule the pclmulqdq in parallel with the crc32q iterations and, after arranging a staggered start, have a 4-part main loop, where 3 parts are performing crc32q iterations and the fourth is using SSE to shift itself forward (at which point it gets XORed into the data stream that one other part is working on). I haven't got all the details of that idea worked out in my head, but it seems possible. I have to study the optimization guide in detail to see how many micro-ops the crc32q instruction from memory is (and thus how much of the decoder it requires). As of Nehalem, a small inner loop that fits in the decoded uop cache has the potential to be faster than a hugely unrolled one. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words
On Wed, 2014-05-28 at 18:15 -0400, George Spelvin wrote: > crypto: crc32c-pclmul - Shrink K_table to 32-bit words > > There's no need for the K_table to be made of 64-bit words. For some > reason, the original authors didn't fully reduce the values modulo the > CRC32C polynomial, and so had some 33-bit number in there. They > can all be reduced to 32 bits. > > Doing that cuts the table size in half. Since the code depends on both > pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq > to fetch it in the correct format. > > Two other related fixes: > * K_table is read-only, so belongs in .text, not .data, and > * There's no need for more than 8-byte alignment George, Can you do a tcrypt speed measurement with and without your changes? Check to see if there's any slowdown. Please make sure you pin the frequency of your cpu when running the test. e.g. echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor Thanks. Tim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/