RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

David Laight Wed, 10 Feb 2016 07:21:41 -0800

From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10: adcq    0(%rdi,%rcx,8),%rax
> >     inc     %rcx
> >     jnz     10b
> > That loop looks like it will have no overhead on recent cpu.
> 
> Well, it should execute at 1 instruction/cycle.


I presume you do mean 1 adc/cycle.
If it doesn't, unrolling once might help.
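
For reference, a rough C sketch of what that adc loop computes, assuming the
buffer is a whole number of aligned 64-bit words. The name csum_words is made
up for illustration and is not from the patch. The index runs from -nwords up
to zero against a pointer to the end of the buffer, as in the asm, and the
carry is kept in a variable between iterations because 'inc' leaves CF alone.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper (not from the patch): sum nwords 64-bit words,
 * carrying between additions the way a chain of adcq instructions does. */
uint64_t csum_words(const uint64_t *buf, size_t nwords)
{
    const uint64_t *end = buf + nwords;   /* plays the role of %rdi */
    uint64_t sum = 0;                     /* plays the role of %rax */
    unsigned carry = 0;                   /* plays the role of CF */

    /* Negative index counting up to zero, like %rcx in the asm loop. */
    for (ptrdiff_t i = -(ptrdiff_t)nwords; i != 0; i++) {
        uint64_t t = sum + end[i];
        unsigned c1 = t < sum;            /* carry out of sum + word */
        uint64_t t2 = t + carry;
        unsigned c2 = t2 < t;             /* carry out of adding old CF */
        sum = t2;
        carry = c1 | c2;                  /* at most one of them is set */
    }

    /* Fold the last carry back in (end-around carry). */
    sum += carry;
    if (sum < carry)
        sum += 1;
    return sum;
}
```

A compiler will not necessarily turn this into a single adc per word, so it
only documents the arithmetic and says nothing about speed.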

> (No, a scaled offset doesn't take extra time.)
Maybe I'm remembering the 386 book.

> To break that requires ADCX/ADOX:
> 
> 10:   adcxq   0(%rdi,%rcx),%rax
>       adoxq   8(%rdi,%rcx),%rdx
>       leaq    16(%rcx),%rcx
>       jrcxz   11f
>       j       10b
> 11:

Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz is two clocks - so it definitely needs a second adoxq/adcxq pair.
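
The point of the adcx/adox pair is that the two sums do not depend on each
other. A C sketch of that split, again assuming whole 64-bit words and using
made-up names (add_eac, csum_two_chains). Unlike the asm it folds the carry
back in at every step instead of leaving it in CF/OF, so it shows the data
flow only.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: 64-bit add with end-around carry.
 * If a + b wraps, the result is at most 2^64 - 2, so adding 1 cannot wrap. */
static uint64_t add_eac(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return s + (s < a);
}

/* Hypothetical sketch: two independent accumulators, one for even words
 * (the adcx chain) and one for odd words (the adox chain). */
uint64_t csum_two_chains(const uint64_t *buf, size_t nwords)
{
    uint64_t even = 0, odd = 0;
    size_t i = 0;

    for (; i + 1 < nwords; i += 2) {
        even = add_eac(even, buf[i]);      /* chain 1 */
        odd  = add_eac(odd,  buf[i + 1]);  /* chain 2, independent of chain 1 */
    }
    if (i < nwords)                        /* odd word count: one word left */
        even = add_eac(even, buf[i]);

    return add_eac(even, odd);             /* combine the two chains */
}
```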

Experiments would be needed to confirm guesses though.

        David
