RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10:	adcq	0(%rdi,%rcx,8),%rax
> >	inc	%rcx
> >	jnz	10b
> > That loop looks like it will have no overhead on recent cpu.
>
> Well, it should execute at 1 instruction/cycle.

I presume you do mean 1 adc/cycle.
If it doesn't, unrolling once might help.

> (No, a scaled offset doesn't take extra time.)

Maybe I'm remembering the 386 book.

> To break that requires ADCX/ADOX:
>
> 10:	adcxq	0(%rdi,%rcx),%rax
> 	adoxq	8(%rdi,%rcx),%rdx
> 	leaq	16(%rcx),%rcx
> 	jrcxz	11f
> 	j	10b
> 11:

Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz is two clocks - so it definitely needs a second adoxq/adcxq pair.

Experiments would be needed to confirm guesses though.

	David
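For anyone who wants to experiment on ADX hardware, the loop above with a second adcx/adox pair could be packaged roughly like this. It is a sketch only, untested on a real CPU; the function name, the requirement that the length be a positive multiple of 32 bytes, and the carry-folding epilogue are my assumptions rather than anything from the thread, and the final fold ignores the astronomically unlikely case of an accumulator sitting at 2^64-1 with a carry still pending. It also needs an assembler that knows adcx/adox.

#include <stdint.h>

/*
 * Unrolled ADCX/ADOX inner loop, two pairs per iteration as suggested
 * above.  Precondition (assumed): len is a positive multiple of 32.
 * Returns a 64-bit partial sum; the 64->32->16 bit folding is left to
 * the caller.  Sketch only.
 */
static uint64_t csum_adx_loop(const void *buf, long len, uint64_t sum)
{
        uint64_t sum2 = 0;
        const char *end = (const char *)buf + len;
        long neg = -len;                        /* counts up towards zero */

        asm("   xorl    %%r8d, %%r8d\n\t"       /* r8 = 0, clears CF and OF */
            "1: adcxq   0(%[end],%[neg]), %[s1]\n\t"
            "   adoxq   8(%[end],%[neg]), %[s2]\n\t"
            "   adcxq   16(%[end],%[neg]), %[s1]\n\t"
            "   adoxq   24(%[end],%[neg]), %[s2]\n\t"
            "   leaq    32(%[neg]), %[neg]\n\t" /* lea leaves CF/OF alone */
            "   jrcxz   2f\n\t"
            "   jmp     1b\n"
            "2: adoxq   %%r8, %[s2]\n\t"        /* fold pending OF-chain carry */
            "   adcxq   %%r8, %[s1]\n\t"        /* fold pending CF-chain carry */
            "   addq    %[s2], %[s1]\n\t"       /* merge the two partial sums */
            "   adcq    $0, %[s1]"              /* and the final carry-out */
            : [s1] "+r" (sum), [s2] "+r" (sum2), [neg] "+c" (neg)
            : [end] "r" (end)
            : "r8", "memory", "cc");
        return sum;
}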
RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
David Laight wrote:
> Separate renaming allows:
> 1) The value to be tested without waiting for pending updates to complete.
>    Useful for IE and DIR.

I don't quite follow.  It allows the value to be tested without waiting
for pending updates *of other bits* to complete.  Obviously, the update
of the bit being tested has to complete!

> I can't see any obvious gain from separating out O or Z (even with
> adcx and adox).  You'd need some other instructions that don't set O (or Z)
> but set some other useful flags.
> (A decrement that only set Z for instance.)

I tried to describe the advantages in the previous message.  The problems
arise much less often than with the INC/DEC pair, but there are instructions
which write only the O and C flags (ROL, ROR) and only the Z flag (CMPXCHG).

The sign, aux carry, and parity flags are *always* updated as a group,
so they can be renamed as a group.

> While LOOP could be used on Bulldozer+ an equivalently fast loop
> can be done with inc/dec and jnz.
> So you only care about LOOP/JCXZ when ADOX is supported.
>
> I think the fastest loop is:
> 10:	adc	%rax,0(%rdi,%rcx,8)
>	inc	%rcx
>	jnz	10b
> but check if any cpu add an extra clock for the 'scaled' offset
> (they might be faster if %rdi is incremented).
> That loop looks like it will have no overhead on recent cpu.

Well, it should execute at 1 instruction/cycle.
(No, a scaled offset doesn't take extra time.)

To break that requires ADCX/ADOX:

10:	adcxq	0(%rdi,%rcx),%rax
	adoxq	8(%rdi,%rcx),%rdx
	leaq	16(%rcx),%rcx
	jrcxz	11f
	j	10b
11:
RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-kernel@vger.kernel.org; li...@horizon.com;
> net...@vger.kernel.org;
>
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register work. I'm sure I remember
> > issues with 'false dependencies' against the flags.
>
> The issue is with flags register bits that are *not* modified by
> an instruction.  If the register is treated as a monolithic entity,
> then the previous values of those bits must be considered an *input*
> to the instruction, forcing serialization.
>
> The first step in avoiding this problem is to consider the rarely-modified
> bits (interrupt, direction, trap, etc.) to be a separate logical register
> from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
> which are updated by almost every instruction.
>
> An arithmetic instruction overwrites the arithmetic flags (so it's only
> a WAW dependency which can be broken by renaming) and doesn't touch the
> status flags (so no dependency).
>
> However, on x86 even the arithmetic flags aren't updated consistently.
> The biggest offenders are the (very common!) INC/DEC instructions,
> which update all of the arithmetic flags *except* the carry flag.
>
> Thus, the carry flag is also renamed separately on every superscalar
> x86 implementation I've ever heard of.

Ah, that is the little fact I'd forgotten.

...
> Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
> it was reasonable to commit to always renaming CF and OF separately.

Separate renaming allows:
1) The value to be tested without waiting for pending updates to complete.
   Useful for IE and DIR.
2) Instructions that modify almost all the flags to execute without
   waiting for a previous instruction to complete.
   So separating 'carry' allows inc/dec to execute without waiting
   for previous arithmetic to complete.

The latter should remove the dependency (both ways) between 'adc'
and 'dec, jnz' in a checksum loop.

I can't see any obvious gain from separating out O or Z (even with
adcx and adox).  You'd need some other instructions that don't set O (or Z)
but set some other useful flags.
(A decrement that only set Z for instance.)

> > However you still need a loop construct that doesn't modify 'o' or 'c'.
> > Using leal, jcxz, jmp might work.
> > (Unless broadwell actually has a fast 'loop' instruction.)
>
> According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
> JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
> to K8 and Merom.  The one exception is Prescott.  JCXZ and LOOP are 4
> uops on those processors.  But 64 bit in general sucked on Prescott,
> so how much do we care?
>
> AMD:	LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
>	JCXZ is acceptable on all of them.
>	LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
> Intel:	LOOP is slow (7+ uops) on all processors up to and including
>	Skylake.
>	JCXZ is 2 uops on everything from P6 to Skylake except for:
>	- Prescott (JCXZ & loop both 4 uops)
>	- 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
>	I can't find any that it's fast on.

While LOOP could be used on Bulldozer+ an equivalently fast loop
can be done with inc/dec and jnz.
So you only care about LOOP/JCXZ when ADOX is supported.

I think the fastest loop is:
10:	adc	%rax,0(%rdi,%rcx,8)
	inc	%rcx
	jnz	10b
but check if any cpu add an extra clock for the 'scaled' offset
(they might be faster if %rdi is incremented).

That loop looks like it will have no overhead on recent cpu.

	David
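For reference, the inc/jnz loop being discussed, written as a self-contained C helper with the adcq accumulating into a register. Because inc does not write CF, the carry out of each adcq is consumed by the adcq of the next iteration and only one adcq $0 is needed at the end. The wrapper, its name and the length precondition are my own assumptions; treat it as a sketch rather than a drop-in replacement.

#include <stdint.h>

/* Assumes len is a non-zero multiple of 8 (assumed precondition). */
static uint64_t csum_adc_loop(const void *buf, long len, uint64_t sum)
{
        long nquads = len / 8;
        const uint64_t *end = (const uint64_t *)buf + nquads;
        long i = -nquads;                       /* counts up towards zero */

        asm("   clc\n"                          /* start with a clean carry */
            "1: adcq    (%[end],%[i],8), %[sum]\n\t"
            "   incq    %[i]\n\t"               /* inc leaves CF untouched */
            "   jnz     1b\n\t"
            "   adcq    $0, %[sum]"             /* fold the final carry */
            : [sum] "+r" (sum), [i] "+r" (i)
            : [end] "r" (end)
            : "memory", "cc");
        return sum;
}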
RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
David Laight wrote:
> Since adcx and adox must execute in parallel I clearly need to re-remember
> how dependencies against the flags register work. I'm sure I remember
> issues with 'false dependencies' against the flags.

The issue is with flags register bits that are *not* modified by
an instruction.  If the register is treated as a monolithic entity,
then the previous values of those bits must be considered an *input*
to the instruction, forcing serialization.

The first step in avoiding this problem is to consider the rarely-modified
bits (interrupt, direction, trap, etc.) to be a separate logical register
from the arithmetic flags (carry, overflow, zero, sign, aux carry and parity)
which are updated by almost every instruction.

An arithmetic instruction overwrites the arithmetic flags (so it's only
a WAW dependency which can be broken by renaming) and doesn't touch the
status flags (so no dependency).

However, on x86 even the arithmetic flags aren't updated consistently.
The biggest offenders are the (very common!) INC/DEC instructions,
which update all of the arithmetic flags *except* the carry flag.

Thus, the carry flag is also renamed separately on every superscalar
x86 implementation I've ever heard of.

The bit test instructions (BT, BTC, BTR, BTS) also affect *only* the
carry flag, leaving other flags unmodified.  This is also handled
properly by renaming the carry flag separately.

Here's a brief summary chart of flags updated by common instructions:
http://www.logix.cz/michal/doc/i386/app-c.htm
and the full list with all the corner cases:
http://www.logix.cz/michal/doc/i386/app-b.htm

The other two flags that can be worth separating are the overflow and
zero flags.

The rotate instructions modify *only* the carry and overflow flags.
While overflow is undefined for multi-bit rotates (and thus leaving it
unmodified is a valid implementation), it's defined for single-bit
rotates, so must be written.

There are several less common instructions, notably BSF, BSR, CMPXCHG8B,
and a bunch of 80286 segment instructions that nobody cares about, which
report the result of a test in the zero flag and are defined to not
affect the other flags.

So an aggressive x86 implementation breaks the flags register into five
separately renamed registers:
- CF (carry)
- OF (overflow)
- ZF (zero)
- SF, AF, PF (sign, aux carry, and parity)
- DF, IF, TF, IOPL, etc.

Anyway, I'm sure that when Intel defined ADCX and ADOX they felt that
it was reasonable to commit to always renaming CF and OF separately.

> However you still need a loop construct that doesn't modify 'o' or 'c'.
> Using leal, jcxz, jmp might work.
> (Unless broadwell actually has a fast 'loop' instruction.)

According to Agner Fog (http://agner.org/optimize/instruction_tables.pdf),
JCXZ is reasonably fast (2 uops) on almost all 64-bit CPUs, right back
to K8 and Merom.  The one exception is Prescott.  JCXZ and LOOP are 4
uops on those processors.  But 64 bit in general sucked on Prescott,
so how much do we care?

AMD:	LOOP is slow (7 uops) on K8, K10, Bobcat and Jaguar.
	JCXZ is acceptable on all of them.
	LOOP and JCXZ are 1 uop on Bulldozer, Piledriver and Steamroller.
Intel:	LOOP is slow (7+ uops) on all processors up to and including
	Skylake.
	JCXZ is 2 uops on everything from P6 to Skylake except for:
	- Prescott (JCXZ & loop both 4 uops)
	- 1st gen Atom (JCXZ 3 uops, LOOP 8 uops)
	I can't find any that it's fast on.
RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
From: George Spelvin [mailto:li...@horizon.com]
> Sent: 08 February 2016 20:13
>
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant
> > gain.
> > You have a dependency chain on the carry flag so have delays between the
> > 'adcq' instructions (these may be more significant than the memory reads
> > from l1 cache).
>
> If the carry chain is a bottleneck, on Broadwell+ (feature flag
> X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
> separate flag bits for their carry chains and so can be interleaved.
>
> I don't have such a machine to test on, but if someone who does
> would like to do a little benchmarking, that would be an interesting
> data point.
>
> Unfortunately, that means yet another version of the main loop,
> but if there's a significant benefit...

Well, the only part actually worth writing in assembler is the 'adc' loop.
So run-time substitution of separate versions (as is done for memcpy())
wouldn't be hard.

Since adcx and adox must execute in parallel I clearly need to re-remember
how dependencies against the flags register work.  I'm sure I remember
issues with 'false dependencies' against the flags.

However you still need a loop construct that doesn't modify 'o' or 'c'.
Using leal, jcxz, jmp might work.
(Unless broadwell actually has a fast 'loop' instruction.)
(I've not got a suitable test cpu.)

	David
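A minimal sketch of that run-time substitution. The csum_partial_adc()/csum_partial_adx() names are hypothetical, and the real memcpy() variants are switched with the alternatives patching mechanism rather than a branch; this only illustrates the idea of picking a loop per CPU.

__wsum csum_partial(const void *buff, int len, __wsum sum)
{
        if (static_cpu_has(X86_FEATURE_ADX))
                return csum_partial_adx(buff, len, sum); /* adcx/adox inner loop */
        return csum_partial_adc(buff, len, sum);         /* plain adc inner loop */
}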
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
David Laight wrote:
> I'd need convincing that unrolling the loop like that gives any significant
> gain.
> You have a dependency chain on the carry flag so have delays between the
> 'adcq' instructions (these may be more significant than the memory reads
> from l1 cache).

If the carry chain is a bottleneck, on Broadwell+ (feature flag
X86_FEATURE_ADX), there are the ADCX and ADOX instructions, which use
separate flag bits for their carry chains and so can be interleaved.

I don't have such a machine to test on, but if someone who does
would like to do a little benchmarking, that would be an interesting
data point.

Unfortunately, that means yet another version of the main loop,
but if there's a significant benefit...
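If someone does want to run that comparison, a minimal user-space harness along these lines would do for a first pass. csum_variant() is a placeholder for whichever loop is under test; for serious numbers, pin the task to one CPU and look at the perf counters, as suggested elsewhere in the thread.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Whichever loop variant is being measured gets plugged in here. */
extern uint64_t csum_variant(const void *buf, long len, uint64_t sum);

int main(void)
{
        static unsigned char buf[1500];
        struct timespec t0, t1;
        uint64_t sum = 0;
        long i, iters = 10 * 1000 * 1000;

        memset(buf, 0xa5, sizeof(buf));
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++)
                sum += csum_variant(buf, sizeof(buf), i);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/call (sum %llu)\n", ns / iters,
               (unsigned long long)sum);
        return 0;
}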
RE: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you
> manage to offload the logic into arithmetics and not affect the control
> flow then that's a big win. The inherent branching will be hidden by
> executing on massively parallel arithmetics units which effectively
> execute everything fed to them in a single cycle.

Not necessarily, in real life they are likely to be a cache miss.

With the parallel execution units you just need to worry about the
register dependency chains.
In this case the 'adc' instructions have dependencies on both the result
register and the flags register.

The best loop might even be:

10:	adc	%rax,0(%rdi,%rcx)
	lea	%rcx,8(%rcx)
	jcxz	20f
	jmp	10b
20:

You then need to insert just enough copies of the adc so they take
as long as the other instructions.

	David
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
* Tom Herbert wrote: > Thanks for the explanation and sample code. Expanding on your example, I > added a > switch statement to perform to function (code below). So I think your new switch() based testcase is broken in a subtle way. The problem is that in your added testcase GCC effectively optimizes out the function call, because it is able to recognize that all the (trivial) functions are identical. (At least here, with GCC 5.2.1) So all overhead of the workload goes into just that do_switch() function you added: # # Total Lost Samples: 0 # # Samples: 5K of event 'cycles:pp' # Event count (approx.): 5738959683 # # Overhead Command Shared Object Symbol # ... . ... # 99.94% jump-table2 jump-table2[.] do_switch 0.02% jump-table2 [kernel.kallsyms] [k] native_read_msr_safe 0.02% jump-table2 [kernel.kallsyms] [k] native_apic_mem_write 0.01% jump-table2 [kernel.kallsyms] [k] __fput 0.00% jump-table2 [kernel.kallsyms] [k] change_protection_range 0.00% perf [kernel.kallsyms] [k] strrchr 0.00% perf [kernel.kallsyms] [k] intel_pmu_handle_irq 0.00% perf [kernel.kallsyms] [k] native_write_msr_safe ... and per instruction overhead in the do_switch() looks like this: : : Disassembly of section .text: : : 00401100 : : do_switch(): 0.00 :401100: mov0x201f61(%rip),%r8# 603068 0.00 :401107: test %r8,%r8 0.00 :40110a: je 401178 0.00 :40110c: mov0x201f5d(%rip),%rdi# 603070 0.00 :401113: xor%esi,%esi 0.00 :401115: xor%ecx,%ecx 0.00 :401117: nopw 0x0(%rax,%rax,1) 0.00 :401120: mov0x201f61(%rip),%rax# 603088 1.46 :401127: xor%edx,%edx 0.00 :401129: add$0x1,%rax 0.00 :40112d: mov%rax,0x201f54(%rip)# 603088 0.00 :401134: mov0x201f4d(%rip),%rax# 603088 51.68 :40113b: div%rdi 0.00 :40113e: cmp$0x3f,%rdx 1.34 :401142: ja 40116b 0.02 :401144: mov0x201f3d(%rip),%r9# 603088 0.10 :40114b: mov0x201f36(%rip),%rax# 603088 6.91 :401152: jmpq *0x401420(,%rdx,8) 0.00 :401159: nopl 0x0(%rax) 0.02 :401160: xor%edx,%edx 35.71 :401162: div%r9 1.07 :401165: add%rdx,%r9 1.69 :401168: add%r9,%rsi 0.00 :40116b: add$0x1,%rcx 0.00 :40116f: cmp%r8,%rcx 0.00 :401172: jne401120 0.00 :401174: mov%rsi,%rax 0.00 :401177: retq 0.00 :401178: xor%esi,%esi 0.00 :40117a: jmp401174 Note that as you remarked there _is_ an indirect jump in there: 6.91 :401152: jmpq *0x401420(,%rdx,8) But: > gcc seems to be implementing this switch as a jump table: > >jmpq *0x400ac8(,%rdx,8) ... the 'jump table' has identical entries: 40141f: 00 60 11add%ah,0x11(%rax) 401422: 40 00 00add%al,(%rax) 401425: 00 00 add%al,(%rax) 401427: 00 60 11add%ah,0x11(%rax) 40142a: 40 00 00add%al,(%rax) 40142d: 00 00 add%al,(%rax) 40142f: 00 60 11add%ah,0x11(%rax) 401432: 40 00 00add%al,(%rax) 401435: 00 00 add%al,(%rax) 401437: 00 60 11add%ah,0x11(%rax) 40143a: 40 00 00add%al,(%rax) (misaligned dump - jump table starts at 0x401420) Every jump table entry points to 0x401160 - which is the code after the indirect jump in do_switch(). So the hardware predictor simply recognizes that it's the same target address all the time, and thus (understandably) performs much faster - none of the functions are actually executed. That is why there are no function call entries in perf report, while in the function pointer case we get: # # Total Lost Samples: 0 # # Samples: 4K of event 'cycles:pp' # Event count (approx.): 5482523153 # # Overhead Command Shared Object Symbol # ... . .. # 13.47% jump-table2 jump-table2[.] fun_1 13.22% jump-table2 jump-table2[.] fun_6 13.12% jump-table2
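One way to rescue a testcase like the one analysed above is to give every case a genuinely different body, so gcc cannot merge the arms and the indirect jump really does have many distinct targets. This is a sketch with placeholder names, not Tom's original code; whether gcc still emits a jump table for it should be checked in the disassembly.

static unsigned long do_switch(const unsigned long *v, unsigned long n)
{
        unsigned long sum = 0;
        unsigned long i;

        for (i = 0; i < n; i++) {
                unsigned long x = v[i];

                switch (x & 15) {
                case 0:  sum += x;              break;
                case 1:  sum += x ^ 0x5555;     break;
                case 2:  sum += x >> 1;         break;
                case 3:  sum += x * 3;          break;
                case 4:  sum -= x;              break;
                case 5:  sum += ~x;             break;
                case 6:  sum += x << 2;         break;
                case 7:  sum += x >> 3;         break;
                /* ...one distinct statement per remaining case... */
                default: sum ^= x;              break;
                }
        }
        return sum;
}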
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
* Tom Herbert wrote:

> [] gcc turns these switch statements into jump tables (not function tables
> which is what Ingo's example code was using). [...]

So to the extent this still matters, on most x86 microarchitectures that
count, jump tables and function call tables (i.e. virtual functions that
C++ uses) are generally optimized by the same branch predictor hardware
mechanism.  Indirect jumps (jump tables) and indirect calls (function
pointer tables) are very similar conceptually.  That is why I posted the
indirect calls test code.

( The only branching variant that will perform badly even on the latest
  uarchs is indirect returns: modifying the return address on the stack. )

So my narrow performance point stands, if any sort of indirect jump is
used.  They should be avoided if possible, because it's pretty hard for
the hardware to get it right.

As Linus noticed, data lookup tables are the intelligent solution: if you
manage to offload the logic into arithmetics and not affect the control
flow then that's a big win.  The inherent branching will be hidden by
executing on massively parallel arithmetics units which effectively
execute everything fed to them in a single cycle.

In any case, when submitting such patches, please get into the habit of
looking at and posting perf stat output - it will give us a good idea
about the quality of an implementation.

Thanks,

	Ingo
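In the checksum context, "offloading into arithmetics" amounts to indexing a small table of byte masks by the number of tail bytes instead of branching on it, which is essentially the direction Linus's mails below take. A sketch with the plain aligned-load masks, i.e. (1 << 8*len) - 1, not the shifted variants used later for the odd-start trick:

#include <stdint.h>

/* Keep the low 'len' bytes of an aligned 8-byte tail load, len in 0..7. */
static inline uint64_t tail_mask(unsigned int len)
{
        static const uint64_t mask[8] = {
                0x0000000000000000ull, 0x00000000000000ffull,
                0x000000000000ffffull, 0x0000000000ffffffull,
                0x00000000ffffffffull, 0x000000ffffffffffull,
                0x0000ffffffffffffull, 0x00ffffffffffffffull,
        };
        return mask[len];
}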
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds wrote:
>
>         sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
>         return rotate_by8_if_odd(sum, align);

Actually, that last word-sized access to "buff" might be past the end
of the buffer.  The code does the right thing if "len" is zero, except
for the possible page fault or address verifier complaint.

So that very last call to "csum_partial_lt8()" either needs to be
conditional (easy enough to add an "if (len)" around that whole
statement) or the thing could be unconditional but the load needs to
use "load_unaligned_zeropad()" so that the exception is harmless.

It's probably immaterial which one you pick.  The few possibly useless
ALU operations vs a possible branch mispredict penalty are probably
going to make it a wash.  The exception will never happen in practice,
but if DEBUG_PAGEALLOC is enabled, or if something like KASAN is
active, it will complain loudly if it happens to go past the
allocation.

            Linus
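The conditional variant described above is a one-line change to the tail of the csum_partial_64() sketch from the quoted mail (shown as a fragment of that function, not standalone code):

        /* ...main loop as before... */
        if (len)
                sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
        return rotate_by8_if_odd(sum, align);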
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.

Ok, so I thought some more about this, and the fact is, we don't
actually want to do the byte shifting at all for the first case (the
"length < 8" case), since the address of that one hasn't been shifted.

It's only for the "we're going to align things to 8 bytes" case that we
would want to do it.  But then we might as well use the
rotate_by8_if_odd() model, so I suspect the address games are just
entirely pointless.

So here is something that is actually tested (although admittedly not
well), and uses that fairly simple model.

NOTE! I did not do the unrolling of the "adcq" loop in the middle, but
that's a totally trivial thing now.  So this isn't very optimized,
because it will do a *lot* of extra "adcq $0" to get rid of the carry
bit.  But with that core loop unrolled, you'd get rid of most of them.

                 Linus

---
static unsigned long rotate_by8_if_odd(unsigned long sum, unsigned long aligned)
{
        asm("rorq %b1,%0"
                :"=r" (sum)
                :"c" ((aligned & 1) << 3), "0" (sum));
        return sum;
}

static unsigned long csum_partial_lt8(unsigned long val, int len, unsigned long sum)
{
        unsigned long mask = (1ul << len*8)-1;
        val &= mask;
        return add64_with_carry(val, sum);
}

static unsigned long csum_partial_64(const void *buff, unsigned long len, unsigned long sum)
{
        unsigned long align, val;

        // This is the only potentially unaligned access, and it can
        // also theoretically overflow into the next page
        val = load_unaligned_zeropad(buff);
        if (len < 8)
                return csum_partial_lt8(val, len, sum);

        align = 7 & -(unsigned long)buff;
        sum = csum_partial_lt8(val, align, sum);
        buff += align;
        len -= align;

        sum = rotate_by8_if_odd(sum, align);
        while (len >= 8) {
                val = *(unsigned long *) buff;
                sum = add64_with_carry(sum, val);
                buff += 8;
                len -= 8;
        }
        sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
        return rotate_by8_if_odd(sum, align);
}

__wsum csum_partial(const void *buff, unsigned long len, __wsum sum)
{
        sum = csum_partial_64(buff, len, sum);
        return add32_with_carry(sum, sum >> 32);
}
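For completeness, the helpers this sketch leans on: add32_with_carry() essentially as it already exists in arch/x86/include/asm/checksum_64.h, plus an add64_with_carry() written here as the obvious 64-bit analogue (an assumption; it may need to be added). load_unaligned_zeropad() lives in <asm/word-at-a-time.h> and relies on exception fixup, so it is not reproduced.

static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
        asm("addl %2,%0\n\t"
            "adcl $0,%0"
            : "=r" (a)
            : "0" (a), "r" (b));
        return a;
}

static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
{
        asm("addq %2,%0\n\t"
            "adcq $0,%0"
            : "=r" (a)
            : "0" (a), "r" (b));
        return a;
}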
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert wrote:
>
> The reason I did this in assembly is precisely about your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which using switch() to handle alignment, excess
> length less than 8 bytes, and the odd number of quads to sum in the
> main loop. gcc turns these switch statements into jump tables (not
> function tables which is what Ingo's example code was using). The
> problem I hit was that for each case I needed to close the carry chain
> in the inline asm so fall through wouldn't have much value and each
> case is expanded. The C version using switch gave a nice performance
> gain, moving to all assembly was somewhat better.

Yeah, so I _think_ that if my approach works, the code generated for
the 0-8 byte case looks just something like

        movq %rsi, %rax
        andl $1, %eax
        subq %rax, %rdi
        movq %rdx, %rax
        movq (%rdi), %rcx
        andq mask(,%rsi,8), %rcx
        addq %rcx,%rax
        adcq $0,%rax

which is pretty close to optimal (that's with my hopefully fixed
version).

That's not with the final narrowing from 64-bit to 32-bit, but it looks
like it should perform well on most x86 cores, and then the only
remaining branches end up being the actual size checking ones.

And yes, you could avoid a few "adc $0" instructions in asm, but quite
frankly, I think that's a tradeoff that is worth it.  My guess is that
you can get to within a percent of optimal with the C approach, and I
really think it makes it easier to try different things (ie things like
the above that avoids the switch table entirely)

> There is also question of alignment. If we really don't need to worry
> about alignment at all on x86, then we should be able to eliminate the
> complexity of dealing with it.

So most x86 alignment issues are about crossing the cache bank size,
which is usually 16 or 32 bytes.  Unaligned accesses *within* one of
those banks should be basically free (there's a whopping big byte lane
shifter, so there's lots of hardware support for that).

Also, even when you do cross a cache bank, it's usually pretty cheap.
It's extra cycles, but it's generally *less* extra cycles than it would
be to try to align things in software and doing two accesses.

The rule of thumb is that you should never worry about _individual_
unaligned accesses.  It's only really worth it aligning things before
biggish loops.  So aligning to an 8-byte boundary before you then do a
lot of "adcq" instructions makes sense, but worrying about unaligned
accesses for the beginning/end does generally not.

> > For example, for the actual "8 bytes or shorter" case, I think
> > something like this might just work fine:

 [ snip ]

> I will look at doing that.

So I already found and fixed one bug in that approach, but I still
think it's a viable model.  But you will almost certainly have to fix
a few more of my bugs before it really works ;)

              Linus
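The "unrolled inline asm thing" referred to here could look roughly like the following inside csum_partial_64(): one asm block per 32-byte chunk, so the carry chain is closed once per block instead of once per quad. A sketch only; the existing kernel code unrolls further than this.

        while (len >= 32) {
                const unsigned long *q = buff;

                asm("addq %1,%0\n\t"
                    "adcq %2,%0\n\t"
                    "adcq %3,%0\n\t"
                    "adcq %4,%0\n\t"
                    "adcq $0,%0"                /* close the carry chain */
                    : "+r" (sum)
                    : "m" (q[0]), "m" (q[1]), "m" (q[2]), "m" (q[3]));
                buff += 32;
                len -= 32;
        }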
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds wrote: > I missed the original email (I don't have net-devel in my mailbox), > but based on Ingo's quoting have a more fundamental question: > > Why wasn't that done with C code instead of asm with odd numerical targets? > The reason I did this in assembly is precisely about the your point of having to close the carry chains with adcq $0. I do have a first implementation in C which using switch() to handle alignment, excess length less than 8 bytes, and the odd number of quads to sum in the main loop. gcc turns these switch statements into jump tables (not function tables which is what Ingo's example code was using). The problem I hit was that for each case I needed to close the carry chain in the inline asm so fall through wouldn't have much value and each case is expanded. The C version using switch gave a nice performance gain, moving to all assembly was somewhat better. There is also question of alignment. I f we really don't need to worry about alignment at all on x86, then we should be able to eliminate the complexity of dealing with it. > It seems likely that the real issue is avoiding the short loops (that > will cause branch prediction problems) and use a lookup table instead. > > But we can probably do better than that asm. > > For example, for the actual "8 bytes or shorter" case, I think > something like this might just work fine: > > unsigned long csum_partial_8orless(const unsigned char *buf, > unsigned long len, unsigned long sum) > { > static const unsigned long mask[9] = { > 0x, > 0xff00, > 0x, > 0xff00, > 0x, > 0xff00, > 0x, > 0xff00, > 0x }; > unsigned long val = load_unaligned_zeropad(buf + (len & 1)); > val &= mask[len]; > asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum)); > return sum; > } > I will look at doing that. Thanks, Tom > NOTE! The above is 100% untested. But I _think_ it may handle the > odd-byte-case correctly, and it should result in just one 8-byte load > (the "load_unaligned_zeropad()" is just in case that ends up > overflowing and we have page-alloc-debug triggering a page fault at > the end). All without looping or any conditional branches that might > mispredict. > > My point is that going to assembly results in pretty hard-to-read > code, but it's also fairly inflexible. If we stay with C, we can much > more easily play tricks. So for example, make the above an inline > function, and now you can do things like this: > > static inline unsigned long csum_partial_64bit(void *buf, unsigned > long len, unsigned long sum) > { > if (len <= 8) > return csum_partial_8orless(buf, len, sum); > > /* We know it's larger than 8 bytes, so handle alignment */ > align = 7 & -(unsigned long)buf; > sum = csum_partial_8orless(buf, align, sum); > buf += align; > > /* We need to do the big-endian thing */ > sum = rotate_by8_if_odd(sum, align); > > /* main loop for big packets */ > .. do the unrolled inline asm thing that we already do .. > > sum = rotate_by8_if_odd(sum, align); > > /* Handle the last bytes */ > return csum_partial_8orless(buf, len, sum); > } > > /* Fold the 64-bit sum we computed down to 32 bits __wsum */ > __wsum int csum_partial(void *buf, unsigned int len, __wsum partial) > { > unsigned long sum = csum_partial_64bit(ptr, len, partial); > asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum)); > return sum; > } > > or something like that. > > NOTE NOTE NOTE! 
I did a half-arsed attempt at getting the whole > "big-endian 16-bit add" thing right by doing the odd byte masking in > the end cases, and by rotating the sum by 8 bits around the > 8-byte-unrolled-loop, but I didn't test the above. It's literally > written in my mail client. So I can almost guarantee that it is buggy, > but it is meant as an *example* of "why not do it this way" rather > than production code. > > I think that writing it in C and trying to be intelligent about it > like the above would result in more maintainable code, and it is > possible that it would even be faster. > > Yes, writing it in C *does* likely result in a few more cases of "adcq > $0" in order to finish up the carry calculations. The *only* advantage > of inline asm is how it allows you to keep the carry flag around. So > there is downside to the C model, and it might cause a cycle or two of > extra work, but the upside of C is that you can try to do clever > things without turning the code completely unreadable. > > For example, doing the exception handling (that will never actually > trigger) for
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds wrote:
>
>         static const unsigned long mask[9] = {
>                 0x,
>                 0xff00,
>                 0x,
>                 0xff00,
>                 0x,
>                 0xff00,
>                 0x,
>                 0xff00,
>                 0x };
>         unsigned long val = load_unaligned_zeropad(buf + (len & 1));
>         val &= mask[len];

Yeah, that was buggy.  I knew it was likely buggy, but that
"buf + (len & 1)" was just stupid.

The "+" should be "-", of course - the point is to shift up the value
by 8 bits for odd cases, and we need to load starting one byte early
for that.  The idea is that we use the byte shifter in the load unit to
do some work for us.

And the bitmasks are the wrong way around for the odd cases - it's the
low byte that ends up being bogus for those cases.

So it should probably look something like

        static const unsigned long mask[9] = {
                0x,
                0xff00,
                0x,
                0xff00,
                0x,
                0xff00,
                0x,
                0xff00,
                0x };
        unsigned long val = load_unaligned_zeropad(buf - (len & 1));
        val &= mask[len];

and then it *might* work.

Still entirely and utterly untested, I just decided to look at it a bit
more and noticed my thinko.

             Linus
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting have a more fundamental question:

Why wasn't that done with C code instead of asm with odd numerical targets?

It seems likely that the real issue is avoiding the short loops (that
will cause branch prediction problems) and use a lookup table instead.

But we can probably do better than that asm.

For example, for the actual "8 bytes or shorter" case, I think
something like this might just work fine:

  unsigned long csum_partial_8orless(const unsigned char *buf,
                                     unsigned long len, unsigned long sum)
  {
        static const unsigned long mask[9] = {
                0x,
                0xff00,
                0x,
                0xff00,
                0x,
                0xff00,
                0x,
                0xff00,
                0x };
        unsigned long val = load_unaligned_zeropad(buf + (len & 1));
        val &= mask[len];
        asm("addq %1,%0 ; adcq $0,%0":"=r" (sum):"r" (val), "0" (sum));
        return sum;
  }

NOTE! The above is 100% untested.  But I _think_ it may handle the
odd-byte-case correctly, and it should result in just one 8-byte load
(the "load_unaligned_zeropad()" is just in case that ends up
overflowing and we have page-alloc-debug triggering a page fault at
the end).  All without looping or any conditional branches that might
mispredict.

My point is that going to assembly results in pretty hard-to-read
code, but it's also fairly inflexible.  If we stay with C, we can much
more easily play tricks.  So for example, make the above an inline
function, and now you can do things like this:

  static inline unsigned long csum_partial_64bit(void *buf,
                unsigned long len, unsigned long sum)
  {
        if (len <= 8)
                return csum_partial_8orless(buf, len, sum);

        /* We know it's larger than 8 bytes, so handle alignment */
        align = 7 & -(unsigned long)buf;
        sum = csum_partial_8orless(buf, align, sum);
        buf += align;

        /* We need to do the big-endian thing */
        sum = rotate_by8_if_odd(sum, align);

        /* main loop for big packets */
        .. do the unrolled inline asm thing that we already do ..

        sum = rotate_by8_if_odd(sum, align);

        /* Handle the last bytes */
        return csum_partial_8orless(buf, len, sum);
  }

  /* Fold the 64-bit sum we computed down to 32 bits __wsum */
  __wsum int csum_partial(void *buf, unsigned int len, __wsum partial)
  {
        unsigned long sum = csum_partial_64bit(ptr, len, partial);
        asm("addl %1,%0 ; adcl $0,%0":"=r" (sum):"r" (sum >> 32), "0" (sum));
        return sum;
  }

or something like that.

NOTE NOTE NOTE! I did a half-arsed attempt at getting the whole
"big-endian 16-bit add" thing right by doing the odd byte masking in
the end cases, and by rotating the sum by 8 bits around the
8-byte-unrolled-loop, but I didn't test the above.  It's literally
written in my mail client.  So I can almost guarantee that it is buggy,
but it is meant as an *example* of "why not do it this way" rather
than production code.

I think that writing it in C and trying to be intelligent about it
like the above would result in more maintainable code, and it is
possible that it would even be faster.

Yes, writing it in C *does* likely result in a few more cases of "adcq
$0" in order to finish up the carry calculations.  The *only* advantage
of inline asm is how it allows you to keep the carry flag around.  So
there is downside to the C model, and it might cause a cycle or two of
extra work, but the upside of C is that you can try to do clever
things without turning the code completely unreadable.

For example, doing the exception handling (that will never actually
trigger) for

Would somebody be willing to take the likely very buggy code above,
and test it and make it work?  I assume there's a test harness for the
whole csum thing somewhere.

                Linus
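A minimal user-space harness for exactly that kind of testing: it compares a candidate csum_partial() against a simple 16-bit reference sum for every short length at every alignment. The candidate's name and signature are placeholders for whatever is being wired in, and the reference loads 16-bit words little-endian so that it matches what the x86 code computes.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Candidate under test - name and signature are placeholders. */
extern uint32_t csum_partial_under_test(const void *buf, unsigned long len,
                                        uint32_t sum);

/* Reference: sum 16-bit words as they sit in (little-endian) memory,
 * trailing odd byte in the low half, then fold the carries.
 */
static uint16_t ref_csum(const uint8_t *buf, size_t len)
{
        uint64_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2)
                sum += buf[i] | (buf[i + 1] << 8);
        if (len & 1)
                sum += buf[len - 1];
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return sum;
}

static uint16_t fold32(uint32_t s)
{
        s = (s & 0xffff) + (s >> 16);
        s = (s & 0xffff) + (s >> 16);
        return s;
}

int main(void)
{
        static uint8_t page[4096];
        size_t len, align, i;

        srand(1);
        for (i = 0; i < sizeof(page); i++)
                page[i] = rand();

        for (len = 0; len <= 256; len++) {
                for (align = 0; align < 8; align++) {
                        const uint8_t *p = page + 64 + align;
                        uint16_t want = ref_csum(p, len);
                        uint16_t got = fold32(csum_partial_under_test(p, len, 0));

                        if (want != got) {
                                printf("FAIL len=%zu align=%zu want=%04x got=%04x\n",
                                       len, align, want, got);
                                return 1;
                        }
                }
        }
        printf("all lengths/alignments OK\n");
        return 0;
}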
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar wrote: > > * Ingo Molnar wrote: > >> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS >> >> > + >> > + /* Check length */ >> > +10:cmpl$8, %esi >> > + jg 30f >> > + jl 20f >> > + >> > + /* Exactly 8 bytes length */ >> > + addl(%rdi), %eax >> > + adcl4(%rdi), %eax >> > + RETURN >> > + >> > + /* Less than 8 bytes length */ >> > +20:clc >> > + jmpq *branch_tbl_len(, %rsi, 8) >> > + >> > + /* Greater than 8 bytes length. Determine number of quads (n). Sum >> > +* over first n % 16 quads >> > +*/ >> > +30:movl%esi, %ecx >> > + shrl$3, %ecx >> > + andl$0xf, %ecx >> > + negq%rcx >> > + lea 40f(, %rcx, 4), %r11 >> > + clc >> > + jmp *%r11 >> >> Are you absolutely sure that a jump table is the proper solution here? It >> defeats branch prediction on most x86 uarchs. Why not label the loop stages >> and >> jump in directly with a binary tree of branches? > > So just to expand on this a bit, attached below is a quick & simple & stupid > testcase that generates a 16 entries call table. (Indirect jumps and indirect > calls are predicted similarly on most x86 uarchs.) Just built it with: > > gcc -Wall -O2 -o jump-table jump-table.c > > Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel > CPU > and also on AMD Opteron. The numbers are from the Intel box.) this gives a > high > branch miss rate with a 16 entries jump table: > > triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16 > ... using 16 jump table entries. > ... using 1 loop iterations. > ... result: 10001 > [...] > > Performance counter stats for './jump-table 16' (10 runs): > >1386.131780 task-clock (msec) #1.001 CPUs utilized > ( +- 0.18% ) > 33 context-switches #0.024 K/sec > ( +- 1.71% ) > 0 cpu-migrations#0.000 K/sec > 52 page-faults #0.037 K/sec > ( +- 0.71% ) > 6,247,215,683 cycles#4.507 GHz > ( +- 0.18% ) > 3,895,337,877 stalled-cycles-frontend # 62.35% frontend cycles > idle ( +- 0.30% ) > 1,404,014,996 instructions #0.22 insns per cycle > #2.77 stalled cycles > per insn ( +- 0.02% ) >300,820,988 branches # 217.022 M/sec > ( +- 0.02% ) > 87,518,741 branch-misses # 29.09% of all branches > ( +- 0.01% ) > >1.385240076 seconds time elapsed >( +- 0.21% ) > > ... as you can see the branch miss rate is very significant, causing a stalled > decoder and very low instruction throughput. > > I have to reduce the jump table to a single entry (!) to get good performance > on > this CPU: > > Performance counter stats for './jump-table 1' (10 runs): > > 739.173505 task-clock (msec) #1.001 CPUs utilized > ( +- 0.26% ) > 37 context-switches #0.051 K/sec > ( +- 16.79% ) > 0 cpu-migrations#0.000 K/sec > 52 page-faults #0.070 K/sec > ( +- 0.41% ) > 3,331,328,405 cycles#4.507 GHz > ( +- 0.26% ) > 2,012,973,596 stalled-cycles-frontend # 60.43% frontend cycles > idle ( +- 0.47% ) > 1,403,880,792 instructions #0.42 insns per cycle > #1.43 stalled cycles > per insn ( +- 0.05% ) >300,817,064 branches # 406.964 M/sec > ( +- 0.05% ) > 12,177 branch-misses #0.00% of all branches > ( +- 12.39% ) > >0.738616356 seconds time elapsed >( +- 0.26% ) > > Note how the runtime got halved: that is because stalls got halved and the > instructions per cycle throughput doubled. 
> > Even a two entries jump table performs poorly: > > Performance counter stats for './jump-table 2' (10 runs): > >1493.790686 task-clock (msec) #1.001 CPUs utilized > ( +- 0.06% ) > 39 context-switches #0.026 K/sec > ( +- 4.73% ) > 0 cpu-migrations#0.000 K/sec > 52 page-faults #0.035 K/sec > ( +- 0.26% ) > 6,732,372,612 cycles#4.507 GHz > ( +- 0.06% ) > 4,229,130,302 stalled-cycl
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
* Ingo Molnar wrote: > s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > > > + > > + /* Check length */ > > +10:cmpl$8, %esi > > + jg 30f > > + jl 20f > > + > > + /* Exactly 8 bytes length */ > > + addl(%rdi), %eax > > + adcl4(%rdi), %eax > > + RETURN > > + > > + /* Less than 8 bytes length */ > > +20:clc > > + jmpq *branch_tbl_len(, %rsi, 8) > > + > > + /* Greater than 8 bytes length. Determine number of quads (n). Sum > > +* over first n % 16 quads > > +*/ > > +30:movl%esi, %ecx > > + shrl$3, %ecx > > + andl$0xf, %ecx > > + negq%rcx > > + lea 40f(, %rcx, 4), %r11 > > + clc > > + jmp *%r11 > > Are you absolutely sure that a jump table is the proper solution here? It > defeats branch prediction on most x86 uarchs. Why not label the loop stages > and > jump in directly with a binary tree of branches? So just to expand on this a bit, attached below is a quick & simple & stupid testcase that generates a 16 entries call table. (Indirect jumps and indirect calls are predicted similarly on most x86 uarchs.) Just built it with: gcc -Wall -O2 -o jump-table jump-table.c Even on relatively modern x86 uarchs (I ran this on a post Nehalem IVB Intel CPU and also on AMD Opteron. The numbers are from the Intel box.) this gives a high branch miss rate with a 16 entries jump table: triton:~> taskset 1 perf stat --repeat 10 ./jump-table 16 ... using 16 jump table entries. ... using 1 loop iterations. ... result: 10001 [...] Performance counter stats for './jump-table 16' (10 runs): 1386.131780 task-clock (msec) #1.001 CPUs utilized ( +- 0.18% ) 33 context-switches #0.024 K/sec ( +- 1.71% ) 0 cpu-migrations#0.000 K/sec 52 page-faults #0.037 K/sec ( +- 0.71% ) 6,247,215,683 cycles#4.507 GHz ( +- 0.18% ) 3,895,337,877 stalled-cycles-frontend # 62.35% frontend cycles idle ( +- 0.30% ) 1,404,014,996 instructions #0.22 insns per cycle #2.77 stalled cycles per insn ( +- 0.02% ) 300,820,988 branches # 217.022 M/sec ( +- 0.02% ) 87,518,741 branch-misses # 29.09% of all branches ( +- 0.01% ) 1.385240076 seconds time elapsed ( +- 0.21% ) ... as you can see the branch miss rate is very significant, causing a stalled decoder and very low instruction throughput. I have to reduce the jump table to a single entry (!) to get good performance on this CPU: Performance counter stats for './jump-table 1' (10 runs): 739.173505 task-clock (msec) #1.001 CPUs utilized ( +- 0.26% ) 37 context-switches #0.051 K/sec ( +- 16.79% ) 0 cpu-migrations#0.000 K/sec 52 page-faults #0.070 K/sec ( +- 0.41% ) 3,331,328,405 cycles#4.507 GHz ( +- 0.26% ) 2,012,973,596 stalled-cycles-frontend # 60.43% frontend cycles idle ( +- 0.47% ) 1,403,880,792 instructions #0.42 insns per cycle #1.43 stalled cycles per insn ( +- 0.05% ) 300,817,064 branches # 406.964 M/sec ( +- 0.05% ) 12,177 branch-misses #0.00% of all branches ( +- 12.39% ) 0.738616356 seconds time elapsed ( +- 0.26% ) Note how the runtime got halved: that is because stalls got halved and the instructions per cycle throughput doubled. Even a two entries jump table performs poorly: Performance counter stats for './jump-table 2' (10 runs): 1493.790686 task-clock (msec) #1.001 CPUs utilized ( +- 0.06% ) 39 context-switches #0.026 K/sec ( +- 4.73% ) 0 cpu-migrations#0.000 K/sec 52 page-faults #0.035 K/sec ( +- 0.26% ) 6,732,372,612 cycles#4.507 GHz ( +- 0.06% ) 4,229,130,302 stalled-cycles-frontend # 62.82% frontend cycles idle ( +- 0.09% ) 1,407,803,145 instructions #0.21 insns per cycle
Re: [PATCH v3 net-next] net: Implement fast csum_partial for x86_64
* Tom Herbert wrote: > Implement assembly routine for csum_partial for 64 bit x86. This > primarily speeds up checksum calculation for smaller lengths such as > those that are present when doing skb_postpull_rcsum when getting > CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY > conversion. > > CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is checked to determine whether > we need to avoid unaligned accesses. Efficient unaligned accesses > offer a nice additional speedup as demonstrated in the results > provided below. > > This implementation is similar to csum_partial implemented in > checksum_32.S, however since we are dealing with 8 bytes at a time > there are more cases for small lengths (and alignments)-- for that > we employ a jump table. > > Testing: > > Correctness: > > Verified correctness by testing arbitrary length buffer filled with > random data. For each buffer I compared the computed checksum > using the original algorithm for each possible alignment (0-7 bytes). > > Checksum performance: > > Isolating old and new implementation for some common cases: > > Old NewA NewA % NewNoA NewNoA % > Len/Aln nsec nsec Improv nsecsImprove > +---++---+---+- > 1400/0192.9175.1 10% 174.9 10% (Big packet) > 40/0 13.8 7.7 44% 5.7 58% (Ipv6 hdr cmn case) > 8/4 8.4 6.9 18% 2.8 67% (UDP, VXLAN in IPv4) > 14/0 10.5 7.3 30% 5.4 48% (Eth hdr) > 14/4 10.8 8.7 19% 5.4 50% (Eth hdr in IPv4) > 14/3 11.0 9.8 11% 5.6 49% (Eth with odd align) > 7/1 10.0 5.8 42% 4.8 52% (buffer in one quad) > > NewA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS not set > NewNoA=>CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set > > Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz > > Also test on these with similar results: > > Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz > Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz > > Branch prediction: > > To test the effects of poor branch prediction in the jump tables I > tested checksum performance with runs for two combinations of length > and alignment. As the baseline I performed the test by doing half of > calls with the first combination, followed by using the second > combination for the second half. In the test case, I interleave the > two combinations so that in every call the length and alignment are > different to defeat the effects of branch prediction. Running several > cases, I did not see any material performance difference between > the baseline and the interleaving test case. Possibly because the jumps are missing branch prediction for all these variants... Please run something like: triton:~/tip> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' ~/hackbench 10 ... Performance counter stats for '/home/mingo/hackbench 10' (10 runs): 1,701,689,802 instructions ( +- 0.83% ) 305,439,262 branches ( +- 0.92% ) 2,011,032 branch-misses#0.66% of all branches ( +- 3.00% ) 0.074311070 seconds time elapsed ( +- 2.42% ) ... to see the quality of x86 branch prediction and to see the statistical quality of the measurement (stddev). The real comparison would be jump table against a hierarchy of open coded branches. The latter is much more likely to be correctly branch-predicted. 
I have a couple of (mostly stylistic) comments about the patch: > Signed-off-by: Tom Herbert > --- > arch/x86/include/asm/checksum_64.h | 5 + > arch/x86/lib/csum-partial_64.S | 277 > + > arch/x86/lib/csum-partial_64.c | 148 > 3 files changed, 282 insertions(+), 148 deletions(-) > create mode 100644 arch/x86/lib/csum-partial_64.S > delete mode 100644 arch/x86/lib/csum-partial_64.c > > diff --git a/arch/x86/include/asm/checksum_64.h > b/arch/x86/include/asm/checksum_64.h > index cd00e17..a888f65 100644 > --- a/arch/x86/include/asm/checksum_64.h > +++ b/arch/x86/include/asm/checksum_64.h > @@ -128,6 +128,11 @@ static inline __sum16 csum_tcpudp_magic(__be32 saddr, > __be32 daddr, > */ > extern __wsum csum_partial(const void *buff, int len, __wsum sum); > > +static inline __sum16 ip_compute_csum(const void *buff, int len) > +{ > + return csum_fold(csum_partial(buff, len, 0)); > +} > + > #define _HAVE_ARCH_COPY_AND_CSUM_FROM_USER 1 > #define HAVE_CSUM_COPY_USER 1 > > diff --git a/arch/x86/lib/csum-partial_64.S b/arch/x86/lib/csum-partial_64.S > new file mode 100644 > index 000..520b400 > --- /dev/null > +++ b/arch/x86/lib/csum-partial_64.S > @@ -0,0 +1,277 @@ > +/* Copyright 2016 Tom Herbert > + * > + * Checksum partial cal