From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10: adcq 0(%rdi,%rcx,8),%rax
> > inc %rcx
> > jnz 10b
> > That loop looks like it will have no overhead on recent cpu.
>
> Well, it should execute at 1 instruction/cycle.
I presume you
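A minimal C sketch of the loop quoted above, wrapped in GNU C inline asm; the wrapper, its name, and the negative-index setup are illustrative assumptions, not code from the thread:

#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch of the three-instruction adc loop quoted above.
 * The buffer is indexed from its end with a negative count so that
 * "incq" can serve as both loop counter and termination test without
 * disturbing CF.  nwords must be non-zero.  Names are illustrative,
 * not from the patch.
 */
static uint64_t adc_loop_sum(const uint64_t *buf, size_t nwords, uint64_t sum)
{
	const uint64_t *end = buf + nwords;	/* one past the last word */
	int64_t idx = -(int64_t)nwords;		/* counts up toward zero  */

	asm("	clc\n\t"			/* start with CF clear          */
	    "1:	adcq 0(%[end],%[idx],8), %[sum]\n\t"
	    "	incq %[idx]\n\t"		/* inc does not write CF        */
	    "	jnz 1b\n\t"
	    "	adcq $0, %[sum]"		/* fold the final carry back in */
	    : [sum] "+r" (sum), [idx] "+r" (idx)
	    : [end] "r" (end)
	    : "cc", "memory");
	return sum;
}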
David Laight wrote:
> Separate renaming allows:
> 1) The value to be tested without waiting for pending updates to complete.
> Useful for IE and DIR.
I don't quite follow. It allows the value to be tested without waiting
for pending updates *of other bits* to complete.
Obviously, the update of th
From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-kernel@vger.kernel.org; li...@horizon.com;
> net...@vger.kernel.org;
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register wo
David Laight wrote:
> Since adcx and adox must execute in parallel I clearly need to re-remember
> how dependencies against the flags register work. I'm sure I remember
> issues with 'false dependencies' against the flags.
The issue is with flags register bits that are *not* modified by
an instruc
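For reference, a hedged sketch of the adcx/adox pairing being discussed: adcx reads and writes only CF, adox only OF, so the two accumulations below form independent flag dependency chains. Needs an ADX-capable CPU and assembler; the function and names are mine, not from the thread:

#include <stdint.h>

/* Two interleaved carry chains: adcx carries through CF, adox through OF. */
static uint64_t two_chain_sum(const uint64_t *p)
{
	uint64_t lo = 0, hi = 0, zero = 0;

	asm("xorl %k[zero], %k[zero]\n\t"	/* zeroes the reg, clears CF and OF */
	    "adcxq 0(%[p]), %[lo]\n\t"		/* chain 1: carries through CF      */
	    "adoxq 8(%[p]), %[hi]\n\t"		/* chain 2: carries through OF      */
	    "adcxq 16(%[p]), %[lo]\n\t"
	    "adoxq 24(%[p]), %[hi]\n\t"
	    "adcxq %[zero], %[lo]\n\t"		/* fold the pending CF carry        */
	    "adoxq %[zero], %[hi]"		/* fold the pending OF carry        */
	    : [lo] "+r" (lo), [hi] "+r" (hi), [zero] "+r" (zero)
	    : [p] "r" (p), "m" (*(const uint64_t (*)[4])p)
	    : "cc");
	return lo + hi;		/* a real csum would fold this carry too */
}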
From: George Spelvin [mailto:li...@horizon.com]
> Sent: 08 February 2016 20:13
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant gain.
> > You have a dependency chain on the carry flag so have delays between the 'adcq'
> > instructions (
David Laight wrote:
> I'd need convincing that unrolling the loop like that gives any significant gain.
> You have a dependency chain on the carry flag so have delays between the 'adcq'
> instructions (these may be more significant than the memory reads from l1 cache).
If the carry chain
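An unrolled sketch of what is being debated: the loop-control instructions disappear, but each adcq still waits for the CF bit produced by the previous one, so the additions remain a single serial chain. Illustrative GNU C only, not code from the thread:

#include <stdint.h>

/* Unrolled by four: no loop overhead, but still one CF dependency chain. */
static uint64_t adc_unrolled4(const uint64_t *p, uint64_t sum)
{
	asm("addq 0(%[p]), %[sum]\n\t"		/* starts the carry chain      */
	    "adcq 8(%[p]), %[sum]\n\t"		/* each adcq waits on prior CF */
	    "adcq 16(%[p]), %[sum]\n\t"
	    "adcq 24(%[p]), %[sum]\n\t"
	    "adcq $0, %[sum]"			/* close the chain             */
	    : [sum] "+r" (sum)
	    : [p] "r" (p), "m" (*(const uint64_t (*)[4])p)
	    : "cc");
	return sum;
}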
From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you manage
> to offload the logic into arithmetics and not affect the control flow then that's
> a big win. The inherent branching will be hidden by executing on massively
> parallel arithmetics unit
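A generic illustration of that idea (my own example, not the thread's code): the number of trailing bytes to keep is decided by a data lookup rather than by control flow:

#include <stdint.h>
#include <stddef.h>

/* Example mask table: keep_mask[n] keeps the low n bytes of a word. */
static const uint64_t keep_mask[9] = {
	0x0000000000000000ULL, 0x00000000000000ffULL,
	0x000000000000ffffULL, 0x0000000000ffffffULL,
	0x00000000ffffffffULL, 0x000000ffffffffffULL,
	0x0000ffffffffffffULL, 0x00ffffffffffffffULL,
	0xffffffffffffffffULL,
};

/* Branch-free: the length decision becomes an indexed load and an AND. */
static uint64_t trailing_word(const uint64_t *p, size_t nbytes)	/* nbytes <= 8 */
{
	return *p & keep_mask[nbytes];
}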
* Tom Herbert wrote:
> Thanks for the explanation and sample code. Expanding on your example, I added a
> switch statement to perform the function (code below).
So I think your new switch() based testcase is broken in a subtle way.
The problem is that in your added testcase GCC effectively o
* Tom Herbert wrote:
> [] gcc turns these switch statements into jump tables (not function tables
> which is what Ingo's example code was using). [...]
So to the extent this still matters, on most x86 microarchitectures that count,
jump tables and function call tables (i.e. virtual fun
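For concreteness, the two shapes being compared (illustrative C only, not from the thread): a dense switch that GCC can lower to a jump table, versus an explicit table of function pointers reached through indirect calls:

#include <stdint.h>

/* Dense switch: GCC can lower this to an indirect jump through a jump table. */
static uint64_t via_switch(unsigned op, uint64_t a, uint64_t b)
{
	switch (op & 3) {
	case 0:  return a + b;
	case 1:  return a ^ b;
	case 2:  return a & b;
	default: return a | b;
	}
}

static uint64_t op_add(uint64_t a, uint64_t b) { return a + b; }
static uint64_t op_xor(uint64_t a, uint64_t b) { return a ^ b; }
static uint64_t op_and(uint64_t a, uint64_t b) { return a & b; }
static uint64_t op_or (uint64_t a, uint64_t b) { return a | b; }

/* Function call table: an indirect call through an array of pointers. */
static uint64_t (*const op_table[4])(uint64_t, uint64_t) = {
	op_add, op_xor, op_and, op_or,
};

static uint64_t via_call_table(unsigned op, uint64_t a, uint64_t b)
{
	return op_table[op & 3](a, b);
}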
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds
wrote:
> sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
> return rotate_by8_if_odd(sum, align);
Actually, that last word-sized access to "buff" might be past the end
of the buffer. The code does the right thing if "len" is
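The helper names above come from the quoted code; the body below is only a guess at the intent of csum_partial_lt8 (keep the low "len" bytes of an already-loaded word and add them in), and it does nothing about the over-wide load itself, which is exactly the concern raised here:

#include <stdint.h>

/* Guessed sketch: mask an 8-byte load down to its low "len" valid bytes. */
static uint64_t csum_partial_lt8_sketch(uint64_t word, int len, uint64_t sum)
{
	uint64_t mask = len ? ~0ULL >> (64 - 8 * len) : 0;	/* len in 0..8 */
	uint64_t v = word & mask;

	sum += v;
	sum += (sum < v);	/* end-around carry, as "adcq $0" would do */
	return sum;
}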
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds
wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.
Ok
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert wrote:
>
> The reason I did this in assembly is precisely about your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which uses switch() to handle alignment, excess
> length less than 8 bytes, and th
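A hedged C sketch of the cost being described: in asm, consecutive adcq instructions keep the carry in CF for free, while portable C has to materialize and re-add it after every 64-bit addition. Names are illustrative, not the patch's code:

#include <stdint.h>

/* Close the carry chain in C: one explicit end-around-carry fold per add. */
static inline uint64_t add64_with_carry(uint64_t a, uint64_t b)
{
	uint64_t r = a + b;

	return r + (r < b);	/* the overflow bit is folded straight back in */
}

static uint64_t csum_words_c(const uint64_t *p, int n, uint64_t sum)
{
	while (n--)
		sum = add64_with_carry(sum, *p++);
	return sum;
}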
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
wrote:
> I missed the original email (I don't have net-devel in my mailbox),
> but based on Ingo's quoting have a more fundamental question:
>
> Why wasn't that done with C code instead of asm with odd numerical targets?
>
The reason I did this in ass
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
wrote:
>
> static const unsigned long mask[9] = {
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
>
I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting have a more fundamental question:
Why wasn't that done with C code instead of asm with odd numerical targets?
It seems likely that the real issue is avoiding the short loops (that
will cause branch pre
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar wrote:
>
> * Ingo Molnar wrote:
>
>> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>
>> > +
>> > + /* Check length */
>> > +10: cmpl $8, %esi
>> > + jg 30f
>> > + jl 20f
>> > +
>> > + /* Exactly 8 bytes length */
>> > + addl
* Ingo Molnar wrote:
> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>
> > +
> > + /* Check length */
> > +10: cmpl $8, %esi
> > + jg 30f
> > + jl 20f
> > +
> > + /* Exactly 8 bytes length */
> > + addl (%rdi), %eax
> > + adcl 4(%rdi), %eax
> > + RETURN
> > +
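A rough C rendering of the quoted exactly-8-bytes path, assuming %rdi is the buffer pointer and %eax the 32-bit running sum; the explicit fold at the end stands in for whatever the RETURN macro does with the last carry. Sketch only, not the patch's code:

#include <stdint.h>

/* Sketch of the 8-byte case: add two 32-bit words, then fold the carries. */
static uint32_t csum8_sketch(const uint32_t *p, uint32_t sum)
{
	uint64_t acc = sum;

	acc += p[0];
	acc += p[1];
	sum = (uint32_t)acc + (uint32_t)(acc >> 32);
	return sum + (sum < (uint32_t)acc);	/* fold any remaining carry */
}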
* Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conve