Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words

2014-05-29 Thread Tim Chen
On Wed, 2014-05-28 at 23:26 -0400, George Spelvin wrote:
> > Can you do a tcrypt speed measurement with and without your changes?
> > Check to see if there's any slowdown.  Please make sure you pin
> > the frequency of your cpu when running the test.  
> > 
> > e.g.
> > echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> 
> I just now re-read your e-mail and noticed you suggested a specific tool.

Try to run the standard kernel crypto test with tcrypt.  For speed test
of crc32c, use test 319:

modprobe tcrypt mode=319

Then you will see the output in dmesg (or tail of /var/log/messages).
It will give you the cycles you spent for various block sizes.

For consistent test numbers, before test, 
disable turbo mode of cpu in BIOS and pin 
frequency of all your cpus to max with something like

i=0
num_cpus=`cat /proc/cpuinfo| grep "^processor"| wc -l `
while [ $i -lt $num_cpus ]
do
  echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
  i=`expr $i + 1`
done

> Oops, I haven't run that yet.  I just made up my own in user space.
> As I mentioned, since the changes are to the main loop that operates on
> aligned buffers in multiples of 24 bytes, I focused my benchmarking there:
> 
> #define BUFFER 6114
> static unsigned char buf[BUFFER] __attribute__ ((aligned(8)));
> #define ITER 24 /* Number of test iterations */
> 
> uint32_t
> do_test(uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t))
> {
>   int i, j;
>   for (i = 0; i < BUFFER; i += 8)
>   for (j = i+24; j <= BUFFER; j += 24)
>   crc = f(buf+i, j-i, crc);
>   return crc;
> }
> 
> uint32_t
> time_test(uint64_t *time, uint32_t crc, uint32_t (*f)(void const *, unsigned, 
> ui
> nt32_t))
> {
>   uint64_t start = rdtsc();
>   crc = do_test(crc, f);
>   *time = rdtsc() - start;
>   return crc;
> }
> 
> The actual test goes in ABBA order to reduce bias:
> 
>   for (i = 0; i < ITER; i += 2) {
>   crc1 = time_test(times[i]+0, crc1, crc_pcl_1);
>   crc2 = time_test(times[i]+1, crc2, crc_pcl_2);
>   crc2 = time_test(times[i+1]+1, crc2, crc_pcl_2);
>   crc1 = time_test(times[i+1]+0, crc1, crc_pcl_1);
>   }
> 
> crc_pcl_1 is the old code, crc_pcl_2 is my revised version.
> 
> 
> The results are as follows (the last line is a total):
> 
> Old code New code
>  0: 85009953 71812457 (-13197496)
>  1: 57408829 63361572 (+5952743)

Maybe your cpu has not been pinned to constant frequency?
The cycles are much higher in the first few iterations.  
Likely cpu frequency is going up when governor detect 
the load on cpu. Please also check that turbo is 
turned off as this can introduce much variations
in your testing.

>  2: 52552399 49195266 (-3357133)
>  3: 43595130 45988364 (+2393234)
>  4: 41541760 39714198 (-1827562)
>  5: 36576082 38021344 (+1445262)
>  6: 35307854 34150656 (-1157198)
>  7: 32182230 33134236 (+952006)
>  8: 31341596 31307004 (-34592)
>  9: 31340900 31329408 (-11492)
> 10: 31344884 31329144 (-15740)
> 11: 31334144 31312492 (-21652)
> 12: 31338992 31330356 (-8636)
> 13: 31343744 31311344 (-32400)
> 14: 31339000 31340196 (+1196)
> 15: 31337492 31313988 (-23504)
> 16: 31341688 31334040 (-7648)
> 17: 31341804 31308936 (-32868)
> 18: 31339936 31332020 (-7916)
> 19: 31323228 31324240 (+1012)
> 20: 31339744 31331768 (-7976)
> 21: 31321536 31332688 (+11152)
> 22: 31340280 31335212 (-5068)
> 23: 31332056 31335768 (+3712)

Looks encouraging that the time difference is fairly
small between the two algorithms.

> 24:885575261876586697 (-8988564)

> 
> It doesn't look like a slowdown; more like a 1% speedup.

You will need to throw away the first few iterations of
the test to account for cache warming effects.

Thanks.

Tim

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words

2014-05-28 Thread George Spelvin
> Can you do a tcrypt speed measurement with and without your changes?
> Check to see if there's any slowdown.  Please make sure you pin
> the frequency of your cpu when running the test.  
> 
> e.g.
> echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

I just now re-read your e-mail and noticed you suggested a specific tool.
Oops, I haven't run that yet.  I just made up my own in user space.
As I mentioned, since the changes are to the main loop that operates on
aligned buffers in multiples of 24 bytes, I focused my benchmarking there:

#define BUFFER 6114
static unsigned char buf[BUFFER] __attribute__ ((aligned(8)));
#define ITER 24 /* Number of test iterations */

uint32_t
do_test(uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t))
{
int i, j;
for (i = 0; i < BUFFER; i += 8)
for (j = i+24; j <= BUFFER; j += 24)
crc = f(buf+i, j-i, crc);
return crc;
}

uint32_t
time_test(uint64_t *time, uint32_t crc, uint32_t (*f)(void const *, unsigned, ui
nt32_t))
{
uint64_t start = rdtsc();
crc = do_test(crc, f);
*time = rdtsc() - start;
return crc;
}

The actual test goes in ABBA order to reduce bias:

for (i = 0; i < ITER; i += 2) {
crc1 = time_test(times[i]+0, crc1, crc_pcl_1);
crc2 = time_test(times[i]+1, crc2, crc_pcl_2);
crc2 = time_test(times[i+1]+1, crc2, crc_pcl_2);
crc1 = time_test(times[i+1]+0, crc1, crc_pcl_1);
}

crc_pcl_1 is the old code, crc_pcl_2 is my revised version.


The results are as follows (the last line is a total):

Old code New code
 0: 85009953 71812457 (-13197496)
 1: 57408829 63361572 (+5952743)
 2: 52552399 49195266 (-3357133)
 3: 43595130 45988364 (+2393234)
 4: 41541760 39714198 (-1827562)
 5: 36576082 38021344 (+1445262)
 6: 35307854 34150656 (-1157198)
 7: 32182230 33134236 (+952006)
 8: 31341596 31307004 (-34592)
 9: 31340900 31329408 (-11492)
10: 31344884 31329144 (-15740)
11: 31334144 31312492 (-21652)
12: 31338992 31330356 (-8636)
13: 31343744 31311344 (-32400)
14: 31339000 31340196 (+1196)
15: 31337492 31313988 (-23504)
16: 31341688 31334040 (-7648)
17: 31341804 31308936 (-32868)
18: 31339936 31332020 (-7916)
19: 31323228 31324240 (+1012)
20: 31339744 31331768 (-7976)
21: 31321536 31332688 (+11152)
22: 31340280 31335212 (-5068)
23: 31332056 31335768 (+3712)
24:885575261876586697 (-8988564)

I swapped the link order of the two .o files in case cache
placement made a difference:

 0: 84305981 71483150 (-12822831)
 1: 57341376 63129024 (+5787648)
 2: 52361618 49240069 (-3121549)
 3: 43520576 45822670 (+2302094)
 4: 41500104 39684116 (-1815988)
 5: 36542864 37940196 (+1397332)
 6: 35281570 34144348 (-1137222)
 7: 32149420 33088652 (+939232)
 8: 31342368 31329056 (-13312)
 9: 31338788 31313212 (-25576)
10: 31336324 31335612 (-712)
11: 31341892 31319576 (-22316)
12: 31336224 31322808 (-13416)
13: 31338560 31315084 (-23476)
14: 31338332 31332976 (-5356)
15: 31337300 31315088 (-22212)
16: 31334300 31330884 (-3416)
17: 31318660 31329916 (+11256)
18: 31334984 31327740 (-7244)
19: 31315084 31327768 (+12684)
20: 31334708 31345872 (+11164)
21: 31325988 31330948 (+4960)
22: 31333956 31339800 (+5844)
23: 31322880 31327316 (+4436)
24:884333857875775881 (-8557976)

It doesn't look like a slowdown; more like a 1% speedup.

I'll figure out tcrypt in a bit.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words

2014-05-28 Thread George Spelvin
> Can you do a tcrypt speed measurement with and without your changes?
> Check to see if there's any slowdown.  Please make sure you pin
> the frequency of your cpu when running the test.  

Sure thing; I was already inspired to do that based on your concerns.
Do you have any particular buffer sizes or alignments you'd suggest?

Since I'm changing only the three-part core, I was going to
avoid unaligned or short buffers, stick with a single buffer so
it stays in L1 D-cache, but vary the length so we use lots of
the K_table.

It's not the RAM I was worried about, but the D-cache wasted on
on the K table.  Which doesn't affect the CRC code itself, but the
surrounding kernel code.


I'm also thinking of some ideas for handling even larger buffer sizes
without having to interrupt the 3-way main loop.  Pclmulqdq can
mutiply up to 4 32-bit values to produce a 128-bit result, which
crc32 can efficiently reduce.  So if we have three tables, of
x^(64*n) x^(4096*n), and x^(262144*n), each for n=0..63, we can
multiply them all together to handle up to a 16 MiB chunk.

The other option is to schedule the pclmulqdq in parallel with
the crc32q iterations and, after arranging a staggered start,
have a 4-part main loop, where 3 parts are performing crc32q
iterations and the fourth is using SSE to shift itself
forward (at which point it gets XORed into the data stream
that one other part is working on).

I haven't got all the details of that idea worked out in my head, but
it seems possible.  I have to study the optimization guide in detail to
see how many micro-ops the crc32q instruction from memory is (and thus
how much of the decoder it requires).

As of Nehalem, a small inner loop that fits in the decoded uop cache
has the potential to be faster than a hugely unrolled one.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] crypto: crc32c-pclmul - Shrink K_table to 32-bit words

2014-05-28 Thread Tim Chen
On Wed, 2014-05-28 at 18:15 -0400, George Spelvin wrote:
> crypto: crc32c-pclmul - Shrink K_table to 32-bit words
> 
> There's no need for the K_table to be made of 64-bit words.  For some
> reason, the original authors didn't fully reduce the values modulo the
> CRC32C polynomial, and so had some 33-bit number in there.  They
> can all be reduced to 32 bits.
> 
> Doing that cuts the table size in half.  Since the code depends on both
> pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
> to fetch it in the correct format.
> 
> Two other related fixes:
> * K_table is read-only, so belongs in .text, not .data, and
> * There's no need for more than 8-byte alignment

George,

Can you do a tcrypt speed measurement with and without your changes?
Check to see if there's any slowdown.  Please make sure you pin
the frequency of your cpu when running the test.  

e.g.
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Thanks.

Tim


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/