On Thu, Feb 4, 2016 at 8:51 AM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> On Thu, Feb 4, 2016 at 3:08 AM, David Laight <david.lai...@aculab.com> wrote:
>> From: Tom Herbert
>>> Sent: 03 February 2016 19:19
>> ...
>>> +     /* Main loop */
>>> +50:  adcq    0*8(%rdi),%rax
>>> +     adcq    1*8(%rdi),%rax
>>> +     adcq    2*8(%rdi),%rax
>>> +     adcq    3*8(%rdi),%rax
>>> +     adcq    4*8(%rdi),%rax
>>> +     adcq    5*8(%rdi),%rax
>>> +     adcq    6*8(%rdi),%rax
>>> +     adcq    7*8(%rdi),%rax
>>> +     adcq    8*8(%rdi),%rax
>>> +     adcq    9*8(%rdi),%rax
>>> +     adcq    10*8(%rdi),%rax
>>> +     adcq    11*8(%rdi),%rax
>>> +     adcq    12*8(%rdi),%rax
>>> +     adcq    13*8(%rdi),%rax
>>> +     adcq    14*8(%rdi),%rax
>>> +     adcq    15*8(%rdi),%rax
>>> +     lea     128(%rdi), %rdi
>>> +     loop    50b
>>
>> I'd need convincing that unrolling the loop like that gives any
>> significant gain.
>> You have a dependency chain on the carry flag so have delays between
>> the 'adcq' instructions (these may be more significant than the memory
>> reads from l1 cache).
>>
>> I also don't remember (might be wrong) the 'loop' instruction being
>> executed quickly.
>> If 'loop' is fast then you will probably find that:
>>
>> 10:   adcq 0(%rdi),%rax
>>       lea  8(%rdi),%rdi
>>       loop 10b
>>
>> is just as fast since the three instructions could all be executed in
>> parallel.
>> But I suspect that 'dec %cx; jnz 10b' is actually better (and might
>> execute as a single micro-op).
>> IIRC 'adc' and 'dec' will both have dependencies on the flags register
>> so cannot execute together (which is a shame here).
>>
>> It is also possible that breaking the carry-chain dependency by doing
>> 32bit adds (possibly after 64bit reads) can be made to be faster.
>
> If nothing else reducing the size of this main loop may be desirable.
> I know the newer x86 is supposed to have a loop buffer so that it can
> basically loop on already decoded instructions.  Normally it is only
> something like 64 or 128 bytes in size though.  You might find that
> reducing this loop to that smaller size may improve the performance
> for larger payloads.
>
I saw 128 to be better in my testing. For large packets this loop does
all the work. I see performance dependent on the amount of loop
overhead, i.e. we got it down to two non-adcq instructions but it is
still noticeable. Also, this helps a lot on sizes up to 128 bytes,
since we only need to do a single call in the jump table and no trip
through the loop.

Tom

> - Alex
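
For anyone following the carry-chain point, here is a minimal C sketch of
the two approaches being compared. It is not code from the patch: the
function names (sum_adc, sum_split32, fold64) are made up for illustration,
and it assumes the buffer is a whole number of 64-bit words. sum_adc mirrors
what the adcq chain computes (a 64-bit add with end-around carry), while
sum_split32 follows David's suggestion of doing 32-bit adds after 64-bit
reads, so that no add depends on the carry out of the previous one. Whether
the second form is actually faster would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit ones' complement accumulator down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 33 bits */
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 32 bits */
    sum = (sum & 0xffff) + (sum >> 16);       /* at most 17 bits */
    sum = (sum & 0xffff) + (sum >> 16);       /* at most 16 bits */
    return (uint16_t)sum;
}

/* Serial form: each add depends on the carry of the one before it,
 * which is the dependency chain an adcq sequence has. */
static uint64_t sum_adc(const void *buf, size_t nwords)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint64_t w;
        memcpy(&w, p + 8 * i, sizeof w);  /* unaligned-safe 64-bit read */
        sum += w;
        sum += (sum < w);                 /* end-around carry */
    }
    return sum;
}

/* Split form: 64-bit reads, but the two 32-bit halves go into separate
 * 64-bit accumulators, so no carry has to be propagated in the loop. */
static uint64_t sum_split32(const void *buf, size_t nwords)
{
    const unsigned char *p = buf;
    uint64_t lo = 0, hi = 0;

    assert(nwords < ((size_t)1 << 31));   /* keeps lo + hi from overflowing */
    for (size_t i = 0; i < nwords; i++) {
        uint64_t w;
        memcpy(&w, p + 8 * i, sizeof w);
        lo += (uint32_t)w;
        hi += w >> 32;
    }
    return lo + hi;
}
```

Both functions read words in native byte order, so after fold64 they give
the same 16-bit value for the same buffer; only the dependency structure
of the loop differs.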