RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-10 Thread David Laight
From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10:	adcq	0(%rdi,%rcx,8),%rax
> > inc %rcx
> > jnz 10b
> > That loop looks like it will have no overhead on recent cpu.
> 
> Well, it should execute at 1 instruction/cycle.

I presume you do mean 1 adc/cycle.
If it doesn't, unrolling once might help.

> (No, a scaled offset doesn't take extra time.)
Maybe I'm remembering the 386 book.

> To break that requires ADCX/ADOX:
> 
> 10:	adcxq	0(%rdi,%rcx),%rax
> 	adoxq	8(%rdi,%rcx),%rdx
> 	leaq	16(%rcx),%rcx
> 	jrcxz	11f
> 	jmp	10b
> 11:

Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz is two clocks - so it definitely needs a second adoxq/adcxq pair.

Experiments would be needed to confirm guesses though.
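
For anyone who wants to run that experiment, here is a rough sketch of the
two-chain idea using the ADX intrinsics instead of hand-written asm. It is
purely illustrative: the helper name is made up, nquads is assumed to be even
and non-zero, and whether the compiler really keeps the two chains on CF and
OF via adcx/adox is not guaranteed.

#include <immintrin.h>	/* _addcarry_u64 / _addcarryx_u64; needs -madx */

/* Illustrative only: sum 64-bit words with two independent carry chains,
 * the way an interleaved adcx/adox loop would. */
static unsigned long long sum_two_chains(const unsigned long long *buf,
					 unsigned long long nquads)
{
	unsigned long long s0 = 0, s1 = 0, t;
	unsigned char c0 = 0, c1 = 0, c;

	for (unsigned long long i = 0; i < nquads; i += 2) {
		c0 = _addcarryx_u64(c0, s0, buf[i], &s0);
		c1 = _addcarryx_u64(c1, s1, buf[i + 1], &s1);
	}
	/* Fold the two partial sums and their pending carries together. */
	c = _addcarry_u64(c0, s0, s1, &t);
	c = _addcarry_u64(c, t, c1, &t);
	return t + c;			/* final end-around carry */
}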

David



RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-10 Thread George Spelvin
David Laight wrote:
> Separate renaming allows:
> 1) The value to be tested without waiting for pending updates to complete.
>    Useful for IE and DIR.

I don't quite follow.  It allows the value to be tested without waiting
for pending updates *of other bits* to complete.

Obviously, the update of the bit being tested has to complete!

> I can't see any obvious gain from separating out O or Z (even with
> adcx and adox). You'd need some other instructions that don't set O (or Z)
> but set some other useful flags.
> (A decrement that only set Z for instance.)

I tried to describe the advantages in the previous message.

The problems arise much less often than with the INC/DEC pair, but there are
instructions which write only the O and C flags (ROL, ROR), and only
the Z flag (CMPXCHG8B).

The sign, aux carry, and parity flags are *always* updated as
a group, so they can be renamed as a group.

> While LOOP could be used on Bulldozer+ an equivalently fast loop
> can be done with inc/dec and jnz.
> So you only care about LOOP/JCXZ when ADOX is supported.
> 
> I think the fastest loop is:
> 10:	adc	0(%rdi,%rcx,8),%rax
> 	inc	%rcx
> 	jnz	10b
> but check if any cpu add an extra clock for the 'scaled' offset
> (they might be faster if %rdi is incremented).
> That loop looks like it will have no overhead on recent cpu.

Well, it should execute at 1 instruction/cycle.  (No, a scaled offset
doesn't take extra time.)  To break that requires ADCX/ADOX:

10:	adcxq	0(%rdi,%rcx),%rax
	adoxq	8(%rdi,%rcx),%rdx
	leaq	16(%rcx),%rcx
	jrcxz	11f
	jmp	10b
11:


RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-10 Thread David Laight
From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-kernel@vger.kernel.org; li...@horizon.com; 
> net...@vger.kernel.org;
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register work. I'm sure I remember
> > issues with 'false dependencies' against the flags.
> 
> The issue is with flags register bits that are *not* modified by
> an instruction.  If the register is treated as a monolithic entity,
> then the previous values of those bits must be considered an *input*
> to the instruction, forcing serialization.
> 
> The first step in avoiding this problem is to consider the rarely-modified
> bits (interrupt, direction, trap, etc.) to be a separate logical register
> from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
> which are updated by almost every instruction.
> 
> An arithmetic instruction overwrites the arithmetic flags (so it's only
> a WAW dependency which can be broken by renaming) and doesn't touch the
> status flags (so no dependency).
> 
> However, on x86 even the arithmetic flags aren't updated consistently.
> The biggest offenders are the (very common!) INC/DEC instructions,
> which update all of the arithmetic flags *except* the carry flag.
> 
> Thus, the carry flag is also renamed separately on every superscalar
> x86 implementation I've ever heard of.

Ah, that is the little fact I'd forgotten.
...
> Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
> it was reasonable to commit to always renaming CF and OF separately.

Separate renaming allows:
1) The value to be tested without waiting for pending updates to complete.
   Useful for IE and DIR.
2) Instructions that modify almost all the flags to execute without
   waiting for a previous instruction to complete.
   So separating 'carry' allows inc/dec to execute without waiting
   for previous arithmetic to complete.

The latter should remove the dependency (both ways) between 'adc' and
'dec, jnz' in a checksum loop.
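
Concretely, the sort of loop that benefits from carry being renamed separately
looks like this. A minimal, untested sketch with a made-up helper name, not
code from the patch:

#include <stdint.h>

/* Sum nquads 64-bit words into a running 64-bit sum with end-around carry.
 * decq updates ZF for the branch but leaves CF alone, so the adc carry chain
 * is never broken by the loop control.  Assumes nquads > 0. */
static uint64_t csum_quads(const uint64_t *buf, uint64_t nquads, uint64_t sum)
{
	asm("clc\n"
	    "1:\n\t"
	    "adcq	(%[p]), %[sum]\n\t"
	    "leaq	8(%[p]), %[p]\n\t"	/* lea does not touch flags */
	    "decq	%[n]\n\t"		/* sets ZF, preserves CF */
	    "jnz	1b\n\t"
	    "adcq	$0, %[sum]"		/* fold the final carry back in */
	    : [sum] "+r" (sum), [p] "+r" (buf), [n] "+r" (nquads)
	    :
	    : "cc", "memory");
	return sum;
}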

I can't see any obvious gain from separating out O or Z (even with
adcx and adox). You'd need some other instructions that don't set O (or Z)
but set some other useful flags.
(A decrement that only set Z for instance.)

> > However you still need a loop construct that doesn't modify 'o' or 'c'.
> > Using leal, jcxz, jmp might work.
> > (Unless broadwell actually has a fast 'loop' instruction.)
> 
> According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
> JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
> to K8 and Merom.  The one exception is Prescott.  JCXZ and LOOP are 4
> uops on those processors.  But 64 bit in general sucked on Prescott,
> so how much do we care?
> 
> AMD:  LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
>   JCXZ is acceptable on all of them.
>   LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
> Intel:  LOOP is slow (7+ uops) on all processors up to and including 
> Skylake.
>   JCXZ is 2 uops on everything from P6 to Skylake except for:
>   - Prescott (JCXZ & loop both 4 uops)
>   - 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
>   I can't find any that it's fast on.

While LOOP could be used on Bulldozer+ an equivalently fast loop
can be done with inc/dec and jnz.
So you only care about LOOP/JCXZ when ADOX is supported.

I think the fastest loop is:
10:	adc	0(%rdi,%rcx,8),%rax
	inc	%rcx
	jnz	10b
but check if any cpu add an extra clock for the 'scaled' offset
(they might be faster if %rdi is incremented).
That loop looks like it will have no overhead on recent cpu.

David


RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-09 Thread George Spelvin
David Laight wrote:
> Since adcx and adox must execute in parallel I clearly need to re-remember
> how dependencies against the flags register work. I'm sure I remember
> issues with 'false dependencies' against the flags.

The issue is with flags register bits that are *not* modified by
an instruction.  If the register is treated as a monolithic entity,
then the previous values of those bits must be considered an *input*
to the instruction, forcing serialization.

The first step in avoiding this problem is to consider the rarely-modified
bits (interrupt, direction, trap, etc.) to be a separate logical register
from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
which are updated by almost every instruction.

An arithmetic instruction overwrites the arithmetic flags (so it's only
a WAW dependency which can be broken by renaming) and doesn't touch the
status flags (so no dependency).

However, on x86 even the arithmetic flags aren't updated consistently.
The biggest offenders are the (very common!) INC/DEC instructions,
which update all of the arithmetic flags *except* the carry flag.

Thus, the carry flag is also renamed separately on every superscalar
x86 implementation I've ever heard of.

The bit test instructions (BT, BTC, BTR, BTS) also affect *only*
the carry flag, leaving other flags unmodified.  This is also
handled properly by renaming the carry flag separately.


Here's a brief summary chart of flags updated by common instructions:
http://www.logix.cz/michal/doc/i386/app-c.htm
and the full list with all the corner cases:
http://www.logix.cz/michal/doc/i386/app-b.htm

The other two flags that can be worth separating are the overflow
and zero flags.

The rotate instructions modify *only* the carry and overflow flags.
While overflow is undefined for multi-bit rotates (and thus leaving it
unmodified is a valid implementation), it's defined for single-bit rotates,
so must be written.

There are several less common instructions, notably BSF, BSR, CMPXCHG8B,
and a bunch of 80286 segment instructions that nobody cares about,
which report the result of a test in the zero flag and are defined to
not affect the other flags.


So an aggressive x86 implementation breaks the flags register into five
separately renamed registers:
- CF (carry)
- OF (overflow)
- ZF (zero)
- SF, AF, PF (sign, aux carry, and parity)
- DF, IF, TF, IOPL, etc.

Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
it was reasonable to commit to always renaming CF and OF separately.

> However you still need a loop construct that doesn't modify 'o' or 'c'.
> Using leal, jcxz, jmp might work.
> (Unless broadwell actually has a fast 'loop' instruction.)

According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
to K8 and Merom.  The one exception is Prescott.  JCXZ and LOOP are 4
uops on those processors.  But 64 bit in general sucked on Prescott,
so how much do we care?

AMD:    LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
        JCXZ is acceptable on all of them.
        LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
Intel:  LOOP is slow (7+ uops) on all processors up to and including Skylake.
        JCXZ is 2 uops on everything from P6 to Skylake except for:
        - Prescott (JCXZ & LOOP both 4 uops)
        - 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
        I can't find any that it's fast on.


RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-09 Thread David Laight
From: George Spelvin [mailto:li...@horizon.com]
> Sent: 08 February 2016 20:13
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant 
> > gain.
> > You have a dependency chain on the carry flag so have delays between the 
> > 'adcq'
> > instructions (these may be more significant than the memory reads from l1 
> > cache).
> 
> If the carry chain is a bottleneck, on Broadwell+ (feature flag
> X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
> separate flag bits for their carry chains and so can be interleaved.
> 
> I don't have such a machine to test on, but if someone who does
> would like to do a little benchmarking, that would be an interesting
> data point.
> 
> Unfortunately, that means yet another version of the main loop,
> but if there's a significant benefit...

Well, the only part actually worth writing in assembler is the 'adc' loop.
So run-time substitution of separate versions (as is done for memcpy())
wouldn't be hard.
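
A sketch of what that run-time substitution could look like, in kernel style.
The two implementation names are hypothetical, and the real memcpy() selection
uses alternatives/patching rather than a function pointer; this only shows the
idea of keying off the CPUID feature bit.

#include <linux/init.h>
#include <linux/types.h>
#include <asm/cpufeature.h>

/* Hypothetical: two inner-loop variants, implemented elsewhere. */
extern __wsum csum_partial_adc(const void *buff, int len, __wsum sum);
extern __wsum csum_partial_adx(const void *buff, int len, __wsum sum);

static __wsum (*csum_partial_impl)(const void *, int, __wsum) = csum_partial_adc;

static int __init csum_select_impl(void)
{
	/* X86_FEATURE_ADX: Broadwell+ adcx/adox support */
	if (boot_cpu_has(X86_FEATURE_ADX))
		csum_partial_impl = csum_partial_adx;
	return 0;
}
arch_initcall(csum_select_impl);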

Since adcx and adox must execute in parallel I clearly need to re-remember
how dependencies against the flags register work. I'm sure I remember
issues with 'false dependencies' against the flags.

However you still need a loop construct that doesn't modify 'o' or 'c'.
Using leal, jcxz, jmp might work.
(Unless broadwell actually has a fast 'loop' instruction.)

(I've not got a suitable test cpu.)

David



Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-08 Thread George Spelvin
David Laight wrote:
> I'd need convincing that unrolling the loop like that gives any significant 
> gain.
> You have a dependency chain on the carry flag so have delays between the 
> 'adcq'
> instructions (these may be more significant than the memory reads from l1 
> cache).

If the carry chain is a bottleneck, on Broadwell+ (feature flag
X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
separate flag bits for their carry chains and so can be interleaved.

I don't have such a machine to test on, but if someone who does
would like to do a little benchmarking, that would be an interesting
data point.

Unfortunately, that means yet another version of the main loop,
but if there's a significant benefit...


RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-05 Thread David Laight
From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you 
> manage
> to offload the logic into arithmetics and not affect the control flow then 
> that's
> a big win. The inherent branching will be hidden by executing on massively
> parallel arithmetics units which effectively execute everything fed to them 
> in a
> single cycle.

Not necessarily; in real life they are likely to be a cache miss.

With the parallel execution units you just need to worry about the
register dependency chains.
In this case the 'adc' have dependencies on both the result register
and the flags register.

The best loop might even be:
10:	adc	0(%rdi,%rcx),%rax
	lea	8(%rcx),%rcx
	jcxz	20f
	jmp	10b
20:

You then need to insert just enough copies of the adc so they take as long
as the other instructions.
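
A sketch of that shape (untested, hypothetical helper): four adcs per
iteration, with the buffer indexed from its end so the count in %rcx can run
up towards zero for jrcxz (the 64-bit form of jcxz):

#include <stdint.h>

/* Sketch only: checksum n quads (n > 0, n % 4 == 0).  'end' points one past
 * the data and 'i' starts at -n, so jrcxz can end the loop when %rcx reaches
 * zero without touching the carry flag. */
static uint64_t csum_quads_jrcxz(const uint64_t *end, int64_t i, uint64_t sum)
{
	asm("clc\n"
	    "1:\n\t"
	    "adcq	0(%[p],%[i],8), %[sum]\n\t"
	    "adcq	8(%[p],%[i],8), %[sum]\n\t"
	    "adcq	16(%[p],%[i],8), %[sum]\n\t"
	    "adcq	24(%[p],%[i],8), %[sum]\n\t"
	    "leaq	4(%[i]), %[i]\n\t"	/* no flag updates */
	    "jrcxz	2f\n\t"			/* tests %rcx only */
	    "jmp	1b\n"
	    "2:\n\t"
	    "adcq	$0, %[sum]"		/* fold the last carry */
	    : [sum] "+r" (sum), [i] "+c" (i)
	    : [p] "r" (end)
	    : "cc", "memory");
	return sum;
}

(Called as csum_quads_jrcxz(buf + n, -(int64_t)n, sum); more adcs per
iteration can be added in the same way.)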

David



Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-05 Thread Ingo Molnar
* Tom Herbert  wrote:

> Thanks for the explanation and sample code. Expanding on your example, I 
> added a 
> switch statement to perform to function (code below).

So I think your new switch() based testcase is broken in a subtle way.

The problem is that in your added testcase GCC effectively optimizes out the 
function call, because it is able to recognize that all the (trivial) functions 
are identical. (At least here, with GCC 5.2.1)

So all overhead of the workload goes into just that do_switch() function you 
added:

#
# Total Lost Samples: 0
#
# Samples: 5K of event 'cycles:pp'
# Event count (approx.): 5738959683
#
# Overhead  Command  Shared Object  Symbol 
#   ...  .  ...
#
99.94%  jump-table2  jump-table2        [.] do_switch
 0.02%  jump-table2  [kernel.kallsyms]  [k] native_read_msr_safe   
 0.02%  jump-table2  [kernel.kallsyms]  [k] native_apic_mem_write  
 0.01%  jump-table2  [kernel.kallsyms]  [k] __fput 
 0.00%  jump-table2  [kernel.kallsyms]  [k] change_protection_range
 0.00%  perf [kernel.kallsyms]  [k] strrchr
 0.00%  perf [kernel.kallsyms]  [k] intel_pmu_handle_irq   
 0.00%  perf [kernel.kallsyms]  [k] native_write_msr_safe  


... and per instruction overhead in the do_switch() looks like this:

  :
  :  Disassembly of section .text:
  :
  :  00401100 <do_switch>:
  :  do_switch():
     0.00 :  401100:  mov    0x201f61(%rip),%r8     # 603068
     0.00 :  401107:  test   %r8,%r8
     0.00 :  40110a:  je     401178
     0.00 :  40110c:  mov    0x201f5d(%rip),%rdi    # 603070
     0.00 :  401113:  xor    %esi,%esi
     0.00 :  401115:  xor    %ecx,%ecx
     0.00 :  401117:  nopw   0x0(%rax,%rax,1)
     0.00 :  401120:  mov    0x201f61(%rip),%rax    # 603088
     1.46 :  401127:  xor    %edx,%edx
     0.00 :  401129:  add    $0x1,%rax
     0.00 :  40112d:  mov    %rax,0x201f54(%rip)    # 603088
     0.00 :  401134:  mov    0x201f4d(%rip),%rax    # 603088
    51.68 :  40113b:  div    %rdi
     0.00 :  40113e:  cmp    $0x3f,%rdx
     1.34 :  401142:  ja     40116b
     0.02 :  401144:  mov    0x201f3d(%rip),%r9     # 603088
     0.10 :  40114b:  mov    0x201f36(%rip),%rax    # 603088
     6.91 :  401152:  jmpq   *0x401420(,%rdx,8)
     0.00 :  401159:  nopl   0x0(%rax)
     0.02 :  401160:  xor    %edx,%edx
    35.71 :  401162:  div    %r9
     1.07 :  401165:  add    %rdx,%r9
     1.69 :  401168:  add    %r9,%rsi
     0.00 :  40116b:  add    $0x1,%rcx
     0.00 :  40116f:  cmp    %r8,%rcx
     0.00 :  401172:  jne    401120
     0.00 :  401174:  mov    %rsi,%rax
     0.00 :  401177:  retq
     0.00 :  401178:  xor    %esi,%esi
     0.00 :  40117a:  jmp    401174

Note that as you remarked there _is_ an indirect jump in there:

     6.91 :  401152:  jmpq   *0x401420(,%rdx,8)

But:

> gcc seems to be implementing this switch as a jump table:
> 
>jmpq   *0x400ac8(,%rdx,8)

... the 'jump table' has identical entries:

  40141f:   00 60 11    add    %ah,0x11(%rax)
  401422:   40 00 00    add    %al,(%rax)
  401425:   00 00       add    %al,(%rax)
  401427:   00 60 11    add    %ah,0x11(%rax)
  40142a:   40 00 00    add    %al,(%rax)
  40142d:   00 00       add    %al,(%rax)
  40142f:   00 60 11    add    %ah,0x11(%rax)
  401432:   40 00 00    add    %al,(%rax)
  401435:   00 00       add    %al,(%rax)
  401437:   00 60 11    add    %ah,0x11(%rax)
  40143a:   40 00 00    add    %al,(%rax)

(misaligned dump - jump table starts at 0x401420)

Every jump table entry points to 0x401160 - which is the code after the 
indirect 
jump in do_switch().

So the hardware predictor simply recognizes that it's the same target address 
all 
the time, and thus (understandably) performs much faster - none of the 
functions 
are actually executed.

That is why there are no function call entries in perf report, while in the 
function pointer case we get:

#
# Total Lost Samples: 0
#
# Samples: 4K of event 'cycles:pp'
# Event count (approx.): 5482523153
#
# Overhead  Command  Shared Object  Symbol
#   ...  .  ..
#
13.47%  jump-table2  jump-table2        [.] fun_1
13.22%  jump-table2  jump-table2        [.] fun_6
13.12%  jump-table2 

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-05 Thread Ingo Molnar

* Tom Herbert  wrote:

> [] gcc turns these switch statements into jump tables (not function 
> tables 
> which is what Ingo's example code was using). [...]

So to the extent this still matters, on most x86 microarchitectures that count,
jump tables and function call tables (i.e. virtual functions that C++ uses) are
generally optimized by the same branch predictor hardware mechanism. Indirect
jumps (jump tables) and indirect calls (function pointer tables) are very
similar conceptually. That is why I posted the indirect calls test code.

( The only branching variant that will perform badly even on the latest uarchs
  is indirect returns: modifying the return address on the stack. )

So my narrow performance point stands, if any sort of indirect jump is used.
They should be avoided if possible, because it's pretty hard for the hardware
to get it right.

As Linus noticed, data lookup tables are the intelligent solution: if you
manage to offload the logic into arithmetics and not affect the control flow
then that's a big win. The inherent branching will be hidden by executing on
massively parallel arithmetics units which effectively execute everything fed
to them in a single cycle.

In any case, when submitting such patches, please get into the habit of looking
at and posting perf stat output - it will give us a good idea about the quality
of an implementation.

Thanks,

Ingo


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds
 wrote:
> sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
> return rotate_by8_if_odd(sum, align);

Actually, that last word-sized access to "buff" might be past the end
of the buffer. The code does the right thing if "len" is zero, except
for the possible page fault or address verifier complaint.

So that very last call to "csum_partial_lt8()" either needs to be
conditional (easy enough to add an "if (len)" around that whole
statement) or the thing could be unconditional but the load needs to
use "load_unaligned_zeropad()" so that the exception is harmless.

It's probably immaterial which one you pick. The few possibly useless
ALU operations vs a possible branch mispredict penalty are probably
going to make it a wash. The exception will never happen in practice,
but if DEBUG_PAGEALLOC is enabled, or if something like KASAN is
active, it will complain loudly if it happens to go past the
allocation.
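
In terms of the helpers used in the code being discussed, the conditional
variant would just be (sketch):

	/* Only touch the tail word when there actually is a tail, so the
	 * load can never run past the end of the buffer. */
	if (len)
		sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
	return rotate_by8_if_odd(sum, align);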

   Linus


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds
 wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.

Ok, so I thought some more about this, and the fact is, we don't
actually want to do the byte shifting at all for the first case (the
"length < 8" case), since the address of that one hasn't been shifted.

it's only for the "we're going to align things to 8 bytes" case that
we would want to do it. But then we might as well use the
rotate_by8_if_odd() model, so I suspect the address games are just
entirely pointless.

So here is something that is actually tested (although admittedly not
well), and uses that fairly simple model.

NOTE! I did not do the unrolling of the "adcq" loop in the middle, but
that's a totally trivial thing now. So this isn't very optimized,
because it will do a *lot* of extra "adcq $0" to get rid of the carry
bit. But with that core loop unrolled, you'd get rid of most of them.

  Linus

---
static unsigned long rotate_by8_if_odd(unsigned long sum, unsigned long aligned)
{
	asm("rorq %b1,%0"
		:"=r" (sum)
		:"c" ((aligned & 1) << 3), "0" (sum));
	return sum;
}

static unsigned long csum_partial_lt8(unsigned long val, int len,
				      unsigned long sum)
{
	unsigned long mask = (1ul << len*8)-1;
	val &= mask;
	return add64_with_carry(val, sum);
}

static unsigned long csum_partial_64(const void *buff, unsigned long len,
				     unsigned long sum)
{
	unsigned long align, val;

	// This is the only potentially unaligned access, and it can
	// also theoretically overflow into the next page
	val = load_unaligned_zeropad(buff);
	if (len < 8)
		return csum_partial_lt8(val, len, sum);

	align = 7 & -(unsigned long)buff;
	sum = csum_partial_lt8(val, align, sum);
	buff += align;
	len -= align;

	sum = rotate_by8_if_odd(sum, align);
	while (len >= 8) {
		val = *(unsigned long *) buff;
		sum = add64_with_carry(sum, val);
		buff += 8;
		len -= 8;
	}
	sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
	return rotate_by8_if_odd(sum, align);
}

__wsum csum_partial(const void *buff, unsigned long len, unsigned long sum)
{
	sum = csum_partial_64(buff, len, sum);
	return add32_with_carry(sum, sum >> 32);
}
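
For reference, add32_with_carry() is the existing helper in
arch/x86/include/asm/checksum_64.h; add64_with_carry() is not in the tree, but
presumably means something like this 64-bit counterpart:

static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
{
	/* Assumed 64-bit analogue of add32_with_carry(): add, then fold the
	 * carry flag back into the sum. */
	asm("addq %2,%0\n\t"
	    "adcq $0,%0"
	    : "=r" (a)
	    : "0" (a), "rm" (b));
	return a;
}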


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert  wrote:
>
> The reason I did this in assembly is precisely about your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which uses switch() to handle alignment, excess
> length less than 8 bytes, and the odd number of quads to sum in the
> main loop. gcc turns these switch statements into jump tables (not
> function tables which is what Ingo's example code was using). The
> problem I hit was that for each case I needed to close the carry chain
> in the inline asm so fall through wouldn't have much value and each
> case is expanded. The C version using switch gave a nice performance
> gain, moving to all assembly was somewhat better.

Yeah, so I _think_ that if my approach works, the code generated for
the 0-8 byte case looks just something like

movq %rsi, %rax
andl $1, %eax
subq %rax, %rdi
movq %rdx, %rax
movq (%rdi), %rcx
andq mask(,%rsi,8), %rcx
addq %rcx,%rax
adcq $0,%rax

which is pretty close to optimal (that's with my hopefully fixed version).

That's not with the final narrowing from 64-bit to 32-bit, but it
looks like it should perform well on most x86 cores, and then the only
remaining branches end up being the actual size checking ones.

And yes, you could avoid a few "adc $0" instructions in asm, but quite
frankly, I think that's a tradeoff that is worth it.

My guess is that you can get to within a percent of optimal with the C
approach, and I really think it makes it easier to try different
things (ie things like the above that avoids the switch table
entirely)

> There is also the question of alignment. If we really don't need to worry
> about alignment at all on x86, then we should be able to eliminate the
> complexity of dealing with it.

So most x86 alignment issues are about crossing the cache bank size,
which is usually 16 or 32 bytes. Unaligned accesses *within* one of
those banks should be basically free (there's a whopping big byte lane
shifter, so there's lots of hardware support for that).

Also, even when you do cross a cache bank, it's usually pretty cheap.
It costs extra cycles, but it's generally *less* extra cycles than it would
be to try to align things in software and doing two accesses.

The rule of thumb is that you should never worry about _individual_
unaligned accesses. It's only really worth it aligning things before
biggish loops. So aligning to an 8-byte boundary before you then do a
lot of "adcq" instructions makes sense, but worrying about unaligned
accesses for the beginning/end does generally not.

>>> For example, for the actual "8 bytes or shorter" case, I think
>> something like this might just work fine: [ snip ]
>
> I will look at doing that.

So I already found and fixed one bug in that approach, but I still
think it's a viable model.

But you will almost certainly have to fix a few more of my bugs before
it really works ;)

   Linus


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
 wrote:
> I missed the original email (I don't have net-devel in my mailbox),
> but based on Ingo's quoting have a more fundamental question:
>
> Why wasn't that done with C code instead of asm with odd numerical targets?
>
The reason I did this in assembly is precisely about your point of
having to close the carry chains with adcq $0. I do have a first
implementation in C which uses switch() to handle alignment, excess
length less than 8 bytes, and the odd number of quads to sum in the
main loop. gcc turns these switch statements into jump tables (not
function tables which is what Ingo's example code was using). The
problem I hit was that for each case I needed to close the carry chain
in the inline asm so fall through wouldn't have much value and each
case is expanded. The C version using switch gave a nice performance
gain, moving to all assembly was somewhat better.

There is also the question of alignment. If we really don't need to worry
about alignment at all on x86, then we should be able to eliminate the
complexity of dealing with it.

> It seems likely that the real issue is avoiding the short loops (that
> will cause branch prediction problems) and use a lookup table instead.
>
> But we can probably do better than that asm.
>
> For example, for the actual "8 bytes or shorter" case, I think
> something like this might just work fine:
>
>   unsigned long csum_partial_8orless(const unsigned char *buf,
> unsigned long len, unsigned long sum)
>   {
> static const unsigned long mask[9] = {
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x };
> unsigned long val = load_unaligned_zeropad(buf + (len & 1));
> val &= mask[len];
> asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
> return sum;
>   }
>
I will look at doing that.

Thanks,
Tom

> NOTE! The above is 100% untested. But I _think_ it may handle the
> odd-byte-case correctly, and it should result in just one 8-byte load
> (the "load_unaligned_zeropad()" is just in case that ends up
> overflowing and we have page-alloc-debug triggering a page fault at
> the end). All without looping or any conditional branches that might
> mispredict.
>
> My point is that going to assembly results in pretty hard-to-read
> code, but it's also fairly inflexible. If we stay with C, we can much
> more easily play tricks. So for example, make the above an inline
> function, and now you can do things like this:
>
>   static inline unsigned long csum_partial_64bit(void *buf, unsigned
> long len, unsigned long sum)
>   {
> if (len <= 8)
> return csum_partial_8orless(buf, len, sum);
>
> /* We know it's larger than 8 bytes, so handle alignment */
> align = 7 & -(unsigned long)buf;
> sum = csum_partial_8orless(buf, align, sum);
> buf += align;
>
> /* We need to do the big-endian thing */
> sum = rotate_by8_if_odd(sum, align);
>
> /* main loop for big packets */
> .. do the unrolled inline asm thing that we already do ..
>
> sum = rotate_by8_if_odd(sum, align);
>
> /* Handle the last bytes */
> return csum_partial_8orless(buf, len, sum);
>   }
>
>   /* Fold the 64-bit sum we computed down to 32 bits __wsum */
>   __wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
>   {
> unsigned long sum = csum_partial_64bit(ptr, len, partial);
> asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
> return sum;
>  }
>
> or something like that.
>
> NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
> "big-endian 16-bit add" thing right by doing the odd byte masking in
> the end cases, and by rotating the sum by 8 bits around the
> 8-byte-unrolled-loop, but I didn't test the above. It's literally
> written in my mail client. So I can almost guarantee that it is buggy,
> but it is meant as an *example* of "why not do it this way" rather
> than production code.
>
> I think that writing it in C and trying to be intelligent about it
> like the above would result in more maintainable code, and it is
> possible that it would even be faster.
>
> Yes, writing it in C *does* likely result in a few more cases of "adcq
> $0" in order to finish up the carry calculations. The *only* advantage
> of inline asm is how it allows you to keep the carry flag around. So
> there is downside to the C model, and it might cause a cycle or two of
> extra work, but the upside of C is that you can try to do clever
> things without turning the code completely unreadable.
>
> For example, doing the exception handling (that will never actually
> trigger) 

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
 wrote:
>
> static const unsigned long mask[9] = {
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x };
> unsigned long val = load_unaligned_zeropad(buf + (len & 1));
> val &= mask[len];

Yeah, that was buggy. I knew it was likely buggy,  but that "buf +
(len & 1)" was just stupid.

The "+" should be "-", of course - the point is to shift up the value
by 8 bits for odd cases, and we need to load starting one byte early
for that. The idea is that we use the byte shifter in the load unit to
do some work for us.

And the bitmasks are the wrong way around for the odd cases - it's the
low byte that ends up being bogus for those cases.

So it should probably look something like

static const unsigned long mask[9] = {
0x,
0xff00,
0x,
0xff00,
0x,
0xff00,
0x,
0xff00,
0x };
unsigned long val = load_unaligned_zeropad(buf - (len & 1));
val &= mask[len];

and then it *might* work.

Still entirely and utterly untested, I just decided to look at it a
bit more and noticed my thinko.
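
(The literal mask values did not survive intact in the archive; going by the
description above, where the load starts one byte early so the low byte is the
bogus one for odd lengths, the table presumably works out to the values
computed by this sketch:)

/* Reconstruction of the intended mask[] logic, not the original constants:
 * keep the low 'len' bytes for even len; for odd len the load started one
 * byte early, so keep bytes 1..len and clear the bogus low byte.
 * E.g. len=1 -> 0xff00, len=3 -> 0xffffff00, len=8 -> all ones. */
static unsigned long tail_mask(unsigned long len)	/* 0 <= len <= 8 */
{
	unsigned long keep = len + (len & 1);	/* bytes covered by the load */
	unsigned long mask = keep >= 8 ? ~0ul : (1ul << (keep * 8)) - 1;

	if (len & 1)
		mask &= ~0xfful;		/* drop the extra low byte */
	return mask;
}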

Linus


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting have a more fundamental question:

Why wasn't that done with C code instead of asm with odd numerical targets?

It seems likely that the real issue is avoiding the short loops (that
will cause branch prediction problems) and use a lookup table instead.

But we can probably do better than that asm.

For example, for the actual "8 bytes or shorter" case, I think
something like this might just work fine:

  unsigned long csum_partial_8orless(const unsigned char *buf,
unsigned long len, unsigned long sum)
  {
static const unsigned long mask[9] = {
0x,
0xff00,
0x,
0xff00,
0x,
0xff00,
0x,
0xff00,
0x };
unsigned long val = load_unaligned_zeropad(buf + (len & 1));
val &= mask[len];
asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
return sum;
  }

NOTE! The above is 100% untested. But I _think_ it may handle the
odd-byte-case correctly, and it should result in just one 8-byte load
(the "load_unaligned_zeropad()" is just in case that ends up
overflowing and we have page-alloc-debug triggering a page fault at
the end). All without looping or any conditional branches that might
mispredict.

My point is that going to assembly results in pretty hard-to-read
code, but it's also fairly inflexible. If we stay with C, we can much
more easily play tricks. So for example, make the above an inline
function, and now you can do things like this:

  static inline unsigned long csum_partial_64bit(void *buf, unsigned
long len, unsigned long sum)
  {
if (len <= 8)
return csum_partial_8orless(buf, len, sum);

/* We know it's larger than 8 bytes, so handle alignment */
align = 7 & -(unsigned long)buf;
sum = csum_partial_8orless(buf, align, sum);
buf += align;

/* We need to do the big-endian thing */
sum = rotate_by8_if_odd(sum, align);

/* main loop for big packets */
.. do the unrolled inline asm thing that we already do ..

sum = rotate_by8_if_odd(sum, align);

/* Handle the last bytes */
return csum_partial_8orless(buf, len, sum);
  }

  /* Fold the 64-bit sum we computed down to 32 bits __wsum */
  __wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
  {
unsigned long sum = csum_partial_64bit(ptr, len, partial);
asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
return sum;
 }

or something like that.

NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
"big-endian 16-bit add" thing right by doing the odd byte masking in
the end cases, and by rotating the sum by 8 bits around the
8-byte-unrolled-loop, but I didn't test the above. It's literally
written in my mail client. So I can almost guarantee that it is buggy,
but it is meant as an *example* of "why not do it this way" rather
than production code.

I think that writing it in C and trying to be intelligent about it
like the above would result in more maintainable code, and it is
possible that it would even be faster.

Yes, writing it in C *does* likely result in a few more cases of "adcq
$0" in order to finish up the carry calculations. The *only* advantage
of inline asm is how it allows you to keep the carry flag around. So
there is downside to the C model, and it might cause a cycle or two of
extra work, but the upside of C is that you can try to do clever
things without turning the code completely unreadable.

For example, doing the exception handling (that will never actually
trigger) for the "let's just do an 8-byte load" is just painful in
assembly. In C, we already have the helper function to do it.

Hmm? Would somebody be willing to take the likely very buggy code
above, and test it and make it work? I assume there's a test harness
for the whole csum thing somewhere.

 Linus


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar  wrote:
>
> * Ingo Molnar  wrote:
>
>> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>
>> > +
>> > +	/* Check length */
>> > +10:	cmpl	$8, %esi
>> > +	jg	30f
>> > +	jl	20f
>> > +
>> > +	/* Exactly 8 bytes length */
>> > +	addl	(%rdi), %eax
>> > +	adcl	4(%rdi), %eax
>> > +	RETURN
>> > +
>> > +	/* Less than 8 bytes length */
>> > +20:	clc
>> > +	jmpq	*branch_tbl_len(, %rsi, 8)
>> > +
>> > +	/* Greater than 8 bytes length. Determine number of quads (n). Sum
>> > +	 * over first n % 16 quads
>> > +	 */
>> > +30:	movl	%esi, %ecx
>> > +	shrl	$3, %ecx
>> > +	andl	$0xf, %ecx
>> > +	negq	%rcx
>> > +	lea	40f(, %rcx, 4), %r11
>> > +	clc
>> > +	jmp	*%r11
>>
>> Are you absolutely sure that a jump table is the proper solution here? It
>> defeats branch prediction on most x86 uarchs. Why not label the loop stages 
>> and
>> jump in directly with a binary tree of branches?
>
> So just to expand on this a bit, attached below is a quick & simple & stupid
> testcase that generates a 16 entries call table. (Indirect jumps and indirect
> calls are predicted similarly on most x86 uarchs.) Just built it with:
>
>   gcc -Wall -O2 -o jump-table jump-table.c
>
> Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel 
> CPU
> and also on AMD Opteron. The numbers are from the Intel box.) this gives a 
> high
> branch miss rate with a 16 entries jump table:
>
>  triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16
>  ... using 16 jump table entries.
>  ... using 1 loop iterations.
>  ... result: 10001
>  [...]
>
>  Performance counter stats for './jump-table 16' (10 runs):
>
>1386.131780  task-clock (msec) #1.001 CPUs utilized
> ( +-  0.18% )
> 33  context-switches  #0.024 K/sec
> ( +-  1.71% )
>  0  cpu-migrations#0.000 K/sec
> 52  page-faults   #0.037 K/sec
> ( +-  0.71% )
>  6,247,215,683  cycles#4.507 GHz  
> ( +-  0.18% )
>  3,895,337,877  stalled-cycles-frontend   #   62.35% frontend cycles 
> idle ( +-  0.30% )
>  1,404,014,996  instructions  #0.22  insns per cycle
>   #2.77  stalled cycles 
> per insn  ( +-  0.02% )
>300,820,988  branches  #  217.022 M/sec
> ( +-  0.02% )
> 87,518,741  branch-misses #   29.09% of all branches  
> ( +-  0.01% )
>
>1.385240076 seconds time elapsed   
>( +-  0.21% )
>
> ... as you can see the branch miss rate is very significant, causing a stalled
> decoder and very low instruction throughput.
>
> I have to reduce the jump table to a single entry (!) to get good performance 
> on
> this CPU:
>
>  Performance counter stats for './jump-table 1' (10 runs):
>
> 739.173505  task-clock (msec) #1.001 CPUs utilized
> ( +-  0.26% )
> 37  context-switches  #0.051 K/sec
> ( +- 16.79% )
>  0  cpu-migrations#0.000 K/sec
> 52  page-faults   #0.070 K/sec
> ( +-  0.41% )
>  3,331,328,405  cycles#4.507 GHz  
> ( +-  0.26% )
>  2,012,973,596  stalled-cycles-frontend   #   60.43% frontend cycles 
> idle ( +-  0.47% )
>  1,403,880,792  instructions  #0.42  insns per cycle
>   #1.43  stalled cycles 
> per insn  ( +-  0.05% )
>300,817,064  branches  #  406.964 M/sec
> ( +-  0.05% )
> 12,177  branch-misses #0.00% of all branches  
> ( +- 12.39% )
>
>0.738616356 seconds time elapsed   
>( +-  0.26% )
>
> Note how the runtime got halved: that is because stalls got halved and the
> instructions per cycle throughput doubled.
>
> Even a two entries jump table performs poorly:
>
>  Performance counter stats for './jump-table 2' (10 runs):
>
>1493.790686  task-clock (msec) #1.001 CPUs utilized
> ( +-  0.06% )
> 39  context-switches  #0.026 K/sec
> ( +-  4.73% )
>  0  cpu-migrations#0.000 K/sec
> 52  page-faults   #0.035 K/sec
> ( +-  0.26% )
>  6,732,372,612  cycles#4.507 GHz  
> ( +-  0.06% )
>  4,229,130,302  

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Ingo Molnar

* Ingo Molnar  wrote:

> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> 
> > +
> > +   /* Check length */
> > +10:cmpl$8, %esi
> > +   jg  30f
> > +   jl  20f
> > +
> > +   /* Exactly 8 bytes length */
> > +   addl(%rdi), %eax
> > +   adcl4(%rdi), %eax
> > +   RETURN
> > +
> > +   /* Less than 8 bytes length */
> > +20:clc
> > +   jmpq *branch_tbl_len(, %rsi, 8)
> > +
> > +   /* Greater than 8 bytes length. Determine number of quads (n). Sum
> > +* over first n % 16 quads
> > +*/
> > +30:movl%esi, %ecx
> > +   shrl$3, %ecx
> > +   andl$0xf, %ecx
> > +   negq%rcx
> > +   lea 40f(, %rcx, 4), %r11
> > +   clc
> > +   jmp *%r11
> 
> Are you absolutely sure that a jump table is the proper solution here? It 
> defeats branch prediction on most x86 uarchs. Why not label the loop stages 
> and 
> jump in directly with a binary tree of branches?

So just to expand on this a bit, attached below is a quick & simple & stupid 
testcase that generates a 16 entries call table. (Indirect jumps and indirect 
calls are predicted similarly on most x86 uarchs.) Just built it with:

  gcc -Wall -O2 -o jump-table jump-table.c
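
The jump-table.c testcase itself did not survive the archive. A minimal
sketch of such a benchmark -- an illustrative reconstruction, not Ingo's
actual file -- could look like this: a table of N distinct functions is
called through an indirect pointer in a hot loop, so perf can show how the
branch-miss rate grows with the number of live targets.

  /* jump-table.c (sketch): indirect-call benchmark, N targets, fixed loop count */
  #include <stdio.h>
  #include <stdlib.h>

  static volatile unsigned long result;

  #define H(n) static void handler##n(void) { result += n; }
  H(0) H(1) H(2)  H(3)  H(4)  H(5)  H(6)  H(7)
  H(8) H(9) H(10) H(11) H(12) H(13) H(14) H(15)

  typedef void (*fn_t)(void);

  static const fn_t table[16] = {
          handler0,  handler1,  handler2,  handler3,
          handler4,  handler5,  handler6,  handler7,
          handler8,  handler9,  handler10, handler11,
          handler12, handler13, handler14, handler15,
  };

  int main(int argc, char **argv)
  {
          unsigned long entries = (argc > 1) ? strtoul(argv[1], NULL, 0) : 16;
          unsigned long loops = 100000000;
          unsigned long i;

          if (entries < 1 || entries > 16)
                  entries = 16;

          printf("... using %lu jump table entries.\n", entries);
          printf("... using %lu loop iterations.\n", loops);

          /* Cycle through the first 'entries' targets: with more than one
           * entry the indirect call target changes every iteration and the
           * predictor starts missing. */
          for (i = 0; i < loops; i++)
                  table[i % entries]();

          printf("... result: %lu\n", result);
          return 0;
  }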

Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel 
CPU 
and also on AMD Opteron. The numbers are from the Intel box.) this gives a high 
branch miss rate with a 16 entries jump table:

 triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16
 ... using 16 jump table entries.
 ... using 1 loop iterations.
 ... result: 10001
 [...]

 Performance counter stats for './jump-table 16' (10 runs):

    1386.131780  task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.18% )
             33  context-switches          #    0.024 K/sec                    ( +-  1.71% )
              0  cpu-migrations            #    0.000 K/sec
             52  page-faults               #    0.037 K/sec                    ( +-  0.71% )
  6,247,215,683  cycles                    #    4.507 GHz                      ( +-  0.18% )
  3,895,337,877  stalled-cycles-frontend   #   62.35% frontend cycles idle     ( +-  0.30% )
  1,404,014,996  instructions              #    0.22  insns per cycle
                                           #    2.77  stalled cycles per insn  ( +-  0.02% )
    300,820,988  branches                  #  217.022 M/sec                    ( +-  0.02% )
     87,518,741  branch-misses             #   29.09% of all branches          ( +-  0.01% )

    1.385240076 seconds time elapsed                                           ( +-  0.21% )

... as you can see the branch miss rate is very significant, causing a stalled 
decoder and very low instruction throughput.

I have to reduce the jump table to a single entry (!) to get good performance 
on 
this CPU:

 Performance counter stats for './jump-table 1' (10 runs):

     739.173505  task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.26% )
             37  context-switches          #    0.051 K/sec                    ( +- 16.79% )
              0  cpu-migrations            #    0.000 K/sec
             52  page-faults               #    0.070 K/sec                    ( +-  0.41% )
  3,331,328,405  cycles                    #    4.507 GHz                      ( +-  0.26% )
  2,012,973,596  stalled-cycles-frontend   #   60.43% frontend cycles idle     ( +-  0.47% )
  1,403,880,792  instructions              #    0.42  insns per cycle
                                           #    1.43  stalled cycles per insn  ( +-  0.05% )
    300,817,064  branches                  #  406.964 M/sec                    ( +-  0.05% )
         12,177  branch-misses             #    0.00% of all branches          ( +- 12.39% )

    0.738616356 seconds time elapsed                                           ( +-  0.26% )

Note how the runtime got halved: that is because stalls got halved and the 
instructions per cycle throughput doubled.

Even a two entries jump table performs poorly:

 Performance counter stats for './jump-table 2' (10 runs):

    1493.790686  task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.06% )
             39  context-switches          #    0.026 K/sec                    ( +-  4.73% )
              0  cpu-migrations            #    0.000 K/sec
             52  page-faults               #    0.035 K/sec                    ( +-  0.26% )
  6,732,372,612  cycles                    #    4.507 GHz                      ( +-  0.06% )
  4,229,130,302  stalled-cycles-frontend   #   62.82% frontend cycles idle     ( +-  0.09% )
  1,407,803,145  instructions              #    0.21  insns per cycle

  

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Ingo Molnar

* Tom Herbert  wrote:

> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conversion.
> 
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether
> we need to avoid unaligned accesses. Efficient unaligned accesses
> offer a nice additional speedup as demonstrated in the results
> provided below.
> 
> This implementation is similar to csum_partial implemented in
> checksum_32.S, however since we are dealing with 8 bytes at a time
> there are more cases for small lengths (and alignments)-- for that
> we employ a jump table.
> 
> Testing:
> 
> Correctness:
> 
> Verified correctness by testing arbitrary length buffer filled with
> random data. For each buffer I compared the computed checksum
> using the original algorithm for each possible alignment (0-7 bytes).
> 
> Checksum performance:
> 
> Isolating old and new implementation for some common cases:
> 
>              Old     NewA    NewA %   NewNoA   NewNoA %
> Len/Aln      nsec    nsec    Improv   nsecs    Improve
> ---------+-------+--------+--------+--------+----------
> 1400/0      192.9   175.1     10%     174.9     10%     (Big packet)
> 40/0         13.8     7.7     44%       5.7     58%     (Ipv6 hdr cmn case)
> 8/4           8.4     6.9     18%       2.8     67%     (UDP, VXLAN in IPv4)
> 14/0         10.5     7.3     30%       5.4     48%     (Eth hdr)
> 14/4         10.8     8.7     19%       5.4     50%     (Eth hdr in IPv4)
> 14/3         11.0     9.8     11%       5.6     49%     (Eth with odd align)
> 7/1          10.0     5.8     42%       4.8     52%     (buffer in one quad)
> 
> NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set
> NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
> 
> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
> 
> Also test on these with similar results:
> 
> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
> 
> Branch prediction:
> 
> To test the effects of poor branch prediction in the jump tables I
> tested checksum performance with runs for two combinations of length
> and alignment. As the baseline I performed the test by doing half of
> calls with the first combination, followed by using the second
> combination for the second half. In the test case, I interleave the
> two combinations so that in every call the length and alignment are
> different to defeat the effects of branch prediction. Running several
> cases, I did not see any material performance difference between
> the baseline and the interleaving test case.

Possibly because the jumps are missing branch prediction for all these 
variants...

Please run something like:

 triton:~/tip> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' ~/hackbench 10
 ...

 Performance counter stats for '/home/mingo/hackbench 10' (10 runs):

  1,701,689,802  instructions                                                  ( +-  0.83% )
    305,439,262  branches                                                      ( +-  0.92% )
      2,011,032  branch-misses             #    0.66% of all branches          ( +-  3.00% )

    0.074311070 seconds time elapsed                                           ( +-  2.42% )


... to see the quality of x86 branch prediction and to see the statistical 
quality 
of the measurement (stddev).

The real comparison would be jump table against a hierarchy of open coded 
branches. The latter is much more likely to be correctly branch-predicted.
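
For comparison, an open-coded branch hierarchy for the sub-8-byte tail could
look like the sketch below (illustrative only, not code from the patch).
Each test is an ordinary, well-predicted conditional branch, and because
every chunk starts at an even offset the pieces can simply be added
together for the 16-bit one's complement sum.

  /* Returns the contribution of a 0..7 byte tail; the caller adds it into
   * the running checksum with an add/adc pair.  Assumes the tail starts at
   * an even offset within the buffer being checksummed. */
  static unsigned long csum_tail_0to7(const unsigned char *p, int len)
  {
          unsigned long val = 0;

          if (len & 4) {
                  val += *(const unsigned int *)p;
                  p += 4;
          }
          if (len & 2) {
                  val += *(const unsigned short *)p;
                  p += 2;
          }
          if (len & 1)
                  val += *p;      /* low byte of the final 16-bit word */
          return val;
  }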

I have a couple of (mostly stylistic) comments about the patch:

> Signed-off-by: Tom Herbert 
> ---
>  arch/x86/include/asm/checksum_64.h |   5 +
>  arch/x86/lib/csum-partial_64.S | 277 
> +
>  arch/x86/lib/csum-partial_64.c | 148 
>  3 files changed, 282 insertions(+), 148 deletions(-)
>  create mode 100644 arch/x86/lib/csum-partial_64.S
>  delete mode 100644 arch/x86/lib/csum-partial_64.c
> 
> diff --git a/arch/x86/include/asm/checksum_64.h 
> b/arch/x86/include/asm/checksum_64.h
> index cd00e17..a888f65 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -128,6 +128,11 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, 
> __be32 daddr,
>   */
>  extern __wsum csum_partial(const void *buff, int len, __wsum sum);
>  
> +static inline __sum16 ip_compute_csum(const void *buff, int len)
> +{
> + return csum_fold(csum_partial(buff, len, 0));
> +}
> +
>  #define  _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1
>  #define HAVE_CSUM_COPY_USER 1
>  
> diff --git a/arch/x86/lib/csum-partial_64.S b/arch/x86/lib/csum-partial_64.S
> new file mode 100644
> index 000..520b400
> --- /dev/null
> +++ b/arch/x86/lib/csum-partial_64.S
> @@ -0,0 +1,277 @@
> +/* Copyright 2016 Tom Herbert 
> + *
> + * Checksum partial 

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting have a more fundamental question:

Why wasn't that done with C code instead of asm with odd numerical targets?

It seems likely that the real issue is avoiding the short loops (that
will cause branch prediction problems) and use a lookup table instead.

But we can probably do better than that asm.

For example, for the actual "8 bytes or shorter" case, I think
something like this might just work fine:

  unsigned long csum_partial_8orless(const unsigned char *buf,
unsigned long len, unsigned long sum)
  {
static const unsigned long mask[9] = {
0x,
0xff00,
0x,
0xff00,
0x,
0xff00,
0x,
0xff00,
0x };
unsigned long val = load_unaligned_zeropad(buf + (len & 1));
val &= mask[len];
asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
return sum;
  }

NOTE! The above is 100% untested. But I _think_ it may handle the
odd-byte-case correctly, and it should result in just one 8-byte load
(the "load_unaligned_zeropad()" is just in case that ends up
overflowing and we have page-alloc-debug triggering a page fault at
the end). All without looping or any conditional branches that might
mispredict.

My point is that going to assembly results in pretty hard-to-read
code, but it's also fairly inflexible. If we stay with C, we can much
more easily play tricks. So for example, make the above an inline
function, and now you can do things like this:

  static inline unsigned long csum_partial_64bit(void *buf, unsigned
long len, unsigned long sum)
  {
if (len <= 8)
return csum_partial_8orless(buf, len, sum);

/* We know it's larger than 8 bytes, so handle alignment */
align = 7 & -(unsigned long)buf;
sum = csum_partial_8orless(buf, align, sum);
buf += align;

/* We need to do the big-endian thing */
sum = rotate_by8_if_odd(sum, align);

/* main loop for big packets */
.. do the unrolled inline asm thing that we already do ..

sum = rotate_by8_if_odd(sum, align);

/* Handle the last bytes */
return csum_partial_8orless(buf, len, sum);
  }

  /* Fold the 64-bit sum we computed down to 32 bits __wsum */
  __wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
  {
unsigned long sum = csum_partial_64bit(ptr, len, partial);
asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
return sum;
 }

or something like that.

NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
"big-endian 16-bit add" thing right by doing the odd byte masking in
the end cases, and by rotating the sum by 8 bits around the
8-byte-unrolled-loop, but I didn't test the above. It's literally
written in my mail client. So I can almost guarantee that it is buggy,
but it is meant as an *example* of "why not do it this way" rather
than production code.

I think that writing it in C and trying to be intelligent about it
like the above would result in more maintainable code, and it is
possible that it would even be faster.

Yes, writing it in C *does* likely result in a few more cases of "adcq
$0" in order to finish up the carry calculations. The *only* advantage
of inline asm is how it allows you to keep the carry flag around. So
there is downside to the C model, and it might cause a cycle or two of
extra work, but the upside of C is that you can try to do clever
things without turning the code completely unreadable.

For example, doing the exception handling (that will never actually
trigger) for the "let's just do a 8-byte load" is just painful in
assembly. In C, we already have the helper function to do it.

Hmm? Would somebody be willing to take the likely very buggy code
above, and test it and make it work? I assume there's a test harness
for the whole csum thing somewhere.

 Linus
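
On the test-harness point: a minimal user-space harness in the spirit of
what Tom describes (random data, every alignment 0-7, folded results
compared against a byte-at-a-time reference) might look like the sketch
below. csum_partial() is assumed to be the candidate implementation linked
into the same program; the prototype and the csum_fold() helper here are
assumptions, not the kernel's exact declarations.

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Candidate under test -- assumed to be compiled into this program. */
  extern uint32_t csum_partial(const void *buff, int len, uint32_t sum);

  static uint16_t csum_fold(uint32_t sum)
  {
          sum = (sum & 0xffff) + (sum >> 16);
          sum = (sum & 0xffff) + (sum >> 16);
          return (uint16_t)~sum;
  }

  /* Byte-at-a-time reference: 16-bit one's complement sum, x86 byte order. */
  static uint32_t ref_csum(const uint8_t *p, int len)
  {
          uint64_t sum = 0;
          int i;

          for (i = 0; i + 1 < len; i += 2)
                  sum += (uint32_t)p[i] | ((uint32_t)p[i + 1] << 8);
          if (len & 1)
                  sum += p[len - 1];
          while (sum >> 32)
                  sum = (sum & 0xffffffff) + (sum >> 32);
          return (uint32_t)sum;
  }

  int main(void)
  {
          static uint8_t buf[2048 + 8];
          int len, align, i;

          srand(1);
          for (i = 0; i < (int)sizeof(buf); i++)
                  buf[i] = rand();

          for (len = 0; len <= 2048; len++) {
                  for (align = 0; align < 8; align++) {
                          const uint8_t *p = buf + align;
                          uint16_t want = csum_fold(ref_csum(p, len));
                          uint16_t got  = csum_fold(csum_partial(p, len, 0));

                          if (want != got) {
                                  printf("FAIL len=%d align=%d: %04x != %04x\n",
                                         len, align, got, want);
                                  return 1;
                          }
                  }
          }
          printf("OK\n");
          return 0;
  }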


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
 wrote:
>
> static const unsigned long mask[9] = {
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x };
> unsigned long val = load_unaligned_zeropad(buf + (len & 1));
> val &= mask[len];

Yeah, that was buggy. I knew it was likely buggy,  but that "buf +
(len & 1)" was just stupid.

The "+" should be "-", of course - the point is to shift up the value
by 8 bits for odd cases, and we need to load starting one byte early
for that. The idea is that we use the byte shifter in the load unit to
do some work for us.

And the bitmasks are the wrong way around for the odd cases - it's the
low byte that ends up being bogus for those cases.

So it should probably look something like

static const unsigned long mask[9] = {
0x,
0xff00,
0x,
0xff00,
0x,
0xff00,
0x,
0xff00,
0x };
unsigned long val = load_unaligned_zeropad(buf - (len & 1));
val &= mask[len];

and then it *might* work.

Still entirely and utterly untested, I just decided to look at it a
bit more and noticed my thinko.

Linus
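
A note on the mask[] values: the 16-digit hex constants in both versions
above were collapsed by the list archive, so the tables as shown are not
usable. Reconstructed purely from the logic described here (load 8 bytes
from buf - (len & 1); for even len keep the low len bytes, for odd len
drop the bogus low byte and keep bytes 1..len), the corrected table would
plausibly be:

  /* Reconstruction derived from the described logic, not copied from the
   * original mail -- treat the exact values as an assumption. */
  static const unsigned long mask[9] = {
          0x0000000000000000,     /* len 0: keep nothing */
          0x000000000000ff00,     /* len 1: byte 1       */
          0x000000000000ffff,     /* len 2: bytes 0-1    */
          0x00000000ffffff00,     /* len 3: bytes 1-3    */
          0x00000000ffffffff,     /* len 4: bytes 0-3    */
          0x0000ffffffffff00,     /* len 5: bytes 1-5    */
          0x0000ffffffffffff,     /* len 6: bytes 0-5    */
          0xffffffffffffff00,     /* len 7: bytes 1-7    */
          0xffffffffffffffff,     /* len 8: bytes 0-7    */
  };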


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert  wrote:
>
> The reason I did this in assembly is precisely about your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which uses switch() to handle alignment, excess
> length less than 8 bytes, and the odd number of quads to sum in the
> main loop. gcc turns these switch statements into jump tables (not
> function tables which is what Ingo's example code was using). The
> problem I hit was that for each case I needed to close the carry chain
> in the inline asm so fall through wouldn't have much value and each
> case is expanded. The C version using switch gave a nice performance
> gain, moving to all assembly was somewhat better.

Yeah,. so I _think_ that if my approach works, the code generated for
the 0-8 byte case looks just something like

movq %rsi, %rax
andl $1, %eax
subq %rax, %rdi
movq %rdx, %rax
movq (%rdi), %rcx
andq mask(,%rsi,8), %rcx
addq %rcx,%rax
adcq $0,%rax

which is pretty close to optimal (that's with my hopefully fixed version).

That's not with the final narrowing from 64-bit to 32-bit, but it
looks like it should perform well on most x86 cores, and then the only
remaining branches end up being the actual size checking ones.

And yes, you could avoid a few "adc $0" instructions in asm, but quite
frankly, I think that's a tradeoff that is worth it.

My guess is that you can get to within a percent of optimal with the C
approach, and I really think it makes it easier to try different
things (ie things like the above that avoids the switch table
entirely)

> There is also the question of alignment. If we really don't need to worry
> about alignment at all on x86, then we should be able to eliminate the
> complexity of dealing with it.

So most x86 alignment issues are about crossing the cache bank size,
which is usually 16 or 32 bytes. Unaligned accesses *within* one of
those banks should be basically free (there's a whopping big byte lane
shifter, so there's lots of hardware support for that).

Also, even when you do cross a cache bank, it's usually pretty cheap.
It costs extra cycles, but it's generally *less* extra cycles than it would
be to try to align things in software and doing two accesses.

The rule of thumb is that you should never worry about _individual_
unaligned accesses. It's only really worth it aligning things before
biggish loops. So aligning to an 8-byte boundary before you then do a
lot of "adcq" instructions makes sense, but worrying about unaligned
accesses for the beginning/end does generally not.

>>> For example, for the actual "8 bytes or shorter" case, I think
>> something like this might just work fine: [ snip ]
>
> I will look at doing that.

So I already found and fixed one bug in that approach, but I still
think it's a viable model.

But you will almost certainly have to fix a few more of my bugs before
it really works ;)

   Linus


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
 wrote:
> I missed the original email (I don't have net-devel in my mailbox),
> but based on Ingo's quoting have a more fundamental question:
>
> Why wasn't that done with C code instead of asm with odd numerical targets?
>
The reason I did this in assembly is precisely about your point of
having to close the carry chains with adcq $0. I do have a first
implementation in C which uses switch() to handle alignment, excess
length less than 8 bytes, and the odd number of quads to sum in the
main loop. gcc turns these switch statements into jump tables (not
function tables which is what Ingo's example code was using). The
problem I hit was that for each case I needed to close the carry chain
in the inline asm so fall through wouldn't have much value and each
case is expanded. The C version using switch gave a nice performance
gain, moving to all assembly was somewhat better.
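
To make the carry-chain point concrete: when each switch case is its own
inline-asm statement, the carry flag cannot stay live across the statement
boundary, so every case has to end with an adcq $0 that folds the carry
back into the accumulator. A sketch (illustrative, not code from the
patch):

  static inline unsigned long sum_two_quads(const unsigned long *q,
                                            unsigned long sum)
  {
          asm("addq %1,%0\n\t"
              "adcq %2,%0\n\t"
              "adcq $0,%0"        /* close the carry chain before returning to C */
              : "+r" (sum)
              : "m" (q[0]), "m" (q[1]));
          return sum;
  }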

There is also the question of alignment. If we really don't need to worry
about alignment at all on x86, then we should be able to eliminate the
complexity of dealing with it.

> It seems likely that the real issue is avoiding the short loops (that
> will cause branch prediction problems) and use a lookup table instead.
>
> But we can probably do better than that asm.
>
> For example, for the actual "8 bytes or shorter" case, I think
> something like this might just work fine:
>
>   unsigned long csum_partial_8orless(const unsigned char *buf,
> unsigned long len, unsigned long sum)
>   {
> static const unsigned long mask[9] = {
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x };
> unsigned long val = load_unaligned_zeropad(buf + (len & 1));
> val &= mask[len];
> asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
> return sum;
>   }
>
I will look at doing that.

Thanks,
Tom

> NOTE! The above is 100% untested. But I _think_ it may handle the
> odd-byte-case correctly, and it should result in just one 8-byte load
> (the "load_unaligned_zeropad()" is just in case that ends up
> overflowing and we have page-alloc-debug triggering a page fault at
> the end). All without looping or any conditional branches that might
> mispredict.
>
> My point is that going to assembly results in pretty hard-to-read
> code, but it's also fairly inflexible. If we stay with C, we can much
> more easily play tricks. So for example, make the above an inline
> function, and now you can do things like this:
>
>   static inline unsigned long csum_partial_64bit(void *buf, unsigned
> long len, unsigned long sum)
>   {
> if (len <= 8)
> return csum_partial_8orless(buf, len, sum);
>
> /* We know it's larger than 8 bytes, so handle alignment */
> align = 7 & -(unsigned long)buf;
> sum = csum_partial_8orless(buf, align, sum);
> buf += align;
>
> /* We need to do the big-endian thing */
> sum = rotate_by8_if_odd(sum, align);
>
> /* main loop for big packets */
> .. do the unrolled inline asm thing that we already do ..
>
> sum = rotate_by8_if_odd(sum, align);
>
> /* Handle the last bytes */
> return csum_partial_8orless(buf, len, sum);
>   }
>
>   /* Fold the 64-bit sum we computed down to 32 bits __wsum */
>   __wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
>   {
> unsigned long sum = csum_partial_64bit(ptr, len, partial);
> asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
> return sum;
>  }
>
> or something like that.
>
> NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
> "big-endian 16-bit add" thing right by doing the odd byte masking in
> the end cases, and by rotating the sum by 8 bits around the
> 8-byte-unrolled-loop, but I didn't test the above. It's literally
> written in my mail client. So I can almost guarantee that it is buggy,
> but it is meant as an *example* of "why not do it this way" rather
> than production code.
>
> I think that writing it in C and trying to be intelligent about it
> like the above would result in more maintainable code, and it is
> possible that it would even be faster.
>
> Yes, writing it in C *does* likely result in a few more cases of "adcq
> $0" in order to finish up the carry calculations. The *only* advantage
> of inline asm is how it allows you to keep the carry flag around. So
> there is downside to the C model, and it might cause a cycle or two of
> extra work, but the upside of C is that you can try to do clever
> things without turning the code completely unreadable.
>
> For example, doing the exception handling (that 

Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds
 wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.

Ok, so I thought some more about this, and the fact is, we don't
actually want to do the byte shifting at all for the first case (the
"length < 8" case), since the address of that one hasn't been shifted.

it's only for the "we're going to align things to 8 bytes" case that
we would want to do it. But then we might as well use the
rotate_by8_if_odd() model, so I suspect the address games are just
entirely pointless.

So here is something that is actually tested (although admittedly not
well), and uses that fairly simple model.

NOTE! I did not do the unrolling of the "adcq" loop in the middle, but
that's a totally trivial thing now. So this isn't very optimized,
because it will do a *lot* of extra "adcq $0" to get rid of the carry
bit. But with that core loop unrolled, you'd get rid of most of them.

  Linus

---
static unsigned long rotate_by8_if_odd(unsigned long sum, unsigned long aligned)
{
        asm("rorq %b1,%0"
            :"=r" (sum)
            :"c" ((aligned & 1) << 3), "0" (sum));
        return sum;
}

static unsigned long csum_partial_lt8(unsigned long val, int len,
                                      unsigned long sum)
{
        unsigned long mask = (1ul << len*8)-1;
        val &= mask;
        return add64_with_carry(val, sum);
}

static unsigned long csum_partial_64(const void *buff, unsigned long len,
                                     unsigned long sum)
{
        unsigned long align, val;

        // This is the only potentially unaligned access, and it can
        // also theoretically overflow into the next page
        val = load_unaligned_zeropad(buff);
        if (len < 8)
                return csum_partial_lt8(val, len, sum);

        align = 7 & -(unsigned long)buff;
        sum = csum_partial_lt8(val, align, sum);
        buff += align;
        len -= align;

        sum = rotate_by8_if_odd(sum, align);
        while (len >= 8) {
                val = *(unsigned long *) buff;
                sum = add64_with_carry(sum, val);
                buff += 8;
                len -= 8;
        }
        sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
        return rotate_by8_if_odd(sum, align);
}

__wsum csum_partial(const void *buff, unsigned long len, unsigned long sum)
{
        sum = csum_partial_64(buff, len, sum);
        return add32_with_carry(sum, sum >> 32);
}
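
For readers without the tree at hand: add32_with_carry() and
load_unaligned_zeropad() already exist in the x86 headers
(asm/checksum_64.h and asm/word-at-a-time.h); add64_with_carry() is
assumed here and would be the obvious 64-bit analogue, something like:

  static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
  {
          /* 64-bit add that folds the final carry back in, mirroring the
           * existing 32-bit add32_with_carry() helper. */
          asm("addq %2,%0\n\t"
              "adcq $0,%0"
              : "=r" (a)
              : "0" (a), "rm" (b));
          return a;
  }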


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Linus Torvalds
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds
 wrote:
> sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
> return rotate_by8_if_odd(sum, align);

Actually, that last word-sized access to "buff" might be past the end
of the buffer. The code does the right thing if "len" is zero, except
for the possible page fault or address verifier complaint.

So that very last call to "csum_partial_lt8()" either needs to be
conditional (easy enough to add an "if (len)" around that whole
statement) or the thing could be unconditional but the load needs to
use "load_unaligned_zeropad()" so that the exception is harmless.

It's probably immaterial which one you pick. The few possibly useless
ALU operations vs a possible branch mispredict penalty are probably
going to make it a wash. The exception will never happen in practice,
but if DEBUG_PAGEALLOC is enabled, or if something like KASAN is
active, it will complain loudly if it happens to go past the
allocation.

   Linus


Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64

2016-02-04 Thread Tom Herbert
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar  wrote:
>
> * Ingo Molnar  wrote:
>
>> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>
>> > +
>> > +   /* Check length */
>> > +10:cmpl$8, %esi
>> > +   jg  30f
>> > +   jl  20f
>> > +
>> > +   /* Exactly 8 bytes length */
>> > +   addl(%rdi), %eax
>> > +   adcl4(%rdi), %eax
>> > +   RETURN
>> > +
>> > +   /* Less than 8 bytes length */
>> > +20:clc
>> > +   jmpq *branch_tbl_len(, %rsi, 8)
>> > +
>> > +   /* Greater than 8 bytes length. Determine number of quads (n). Sum
>> > +* over first n % 16 quads
>> > +*/
>> > +30:movl%esi, %ecx
>> > +   shrl$3, %ecx
>> > +   andl$0xf, %ecx
>> > +   negq%rcx
>> > +   lea 40f(, %rcx, 4), %r11
>> > +   clc
>> > +   jmp *%r11
>>
>> Are you absolutely sure that a jump table is the proper solution here? It
>> defeats branch prediction on most x86 uarchs. Why not label the loop stages 
>> and
>> jump in directly with a binary tree of branches?
>
> So just to expand on this a bit, attached below is a quick & simple & stupid
> testcase that generates a 16 entries call table. (Indirect jumps and indirect
> calls are predicted similarly on most x86 uarchs.) Just built it with:
>
>   gcc -Wall -O2 -o jump-table jump-table.c
>
> Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel 
> CPU
> and also on AMD Opteron. The numbers are from the Intel box.) this gives a 
> high
> branch miss rate with a 16 entries jump table:
>
>  triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16
>  ... using 16 jump table entries.
>  ... using 1 loop iterations.
>  ... result: 10001
>  [...]
>
>  Performance counter stats for './jump-table 16' (10 runs):
>
>1386.131780  task-clock (msec) #1.001 CPUs utilized
> ( +-  0.18% )
> 33  context-switches  #0.024 K/sec
> ( +-  1.71% )
>  0  cpu-migrations#0.000 K/sec
> 52  page-faults   #0.037 K/sec
> ( +-  0.71% )
>  6,247,215,683  cycles#4.507 GHz  
> ( +-  0.18% )
>  3,895,337,877  stalled-cycles-frontend   #   62.35% frontend cycles 
> idle ( +-  0.30% )
>  1,404,014,996  instructions  #0.22  insns per cycle
>   #2.77  stalled cycles 
> per insn  ( +-  0.02% )
>300,820,988  branches  #  217.022 M/sec
> ( +-  0.02% )
> 87,518,741  branch-misses #   29.09% of all branches  
> ( +-  0.01% )
>
>1.385240076 seconds time elapsed   
>( +-  0.21% )
>
> ... as you can see the branch miss rate is very significant, causing a stalled
> decoder and very low instruction throughput.
>
> I have to reduce the jump table to a single entry (!) to get good performance 
> on
> this CPU:
>
>  Performance counter stats for './jump-table 1' (10 runs):
>
> 739.173505  task-clock (msec) #1.001 CPUs utilized
> ( +-  0.26% )
> 37  context-switches  #0.051 K/sec
> ( +- 16.79% )
>  0  cpu-migrations#0.000 K/sec
> 52  page-faults   #0.070 K/sec
> ( +-  0.41% )
>  3,331,328,405  cycles#4.507 GHz  
> ( +-  0.26% )
>  2,012,973,596  stalled-cycles-frontend   #   60.43% frontend cycles 
> idle ( +-  0.47% )
>  1,403,880,792  instructions  #0.42  insns per cycle
>   #1.43  stalled cycles 
> per insn  ( +-  0.05% )
>300,817,064  branches  #  406.964 M/sec
> ( +-  0.05% )
> 12,177  branch-misses #0.00% of all branches  
> ( +- 12.39% )
>
>0.738616356 seconds time elapsed   
>( +-  0.26% )
>
> Note how the runtime got halved: that is because stalls got halved and the
> instructions per cycle throughput doubled.
>
> Even a two entries jump table performs poorly:
>
>  Performance counter stats for './jump-table 2' (10 runs):
>
>1493.790686  task-clock (msec) #1.001 CPUs utilized
> ( +-  0.06% )
> 39  context-switches  #0.026 K/sec
> ( +-  4.73% )
>  0  cpu-migrations#0.000 K/sec
> 52  page-faults   #0.035 K/sec
> ( +-  0.26% )
>  6,732,372,612  cycles#4.507 GHz  
> ( +-  0.06% )
>