From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10: adcq 0(%rdi,%rcx,8),%rax
> > inc %rcx
> > jnz 10b
> > That loop looks like it will have no overhead on recent cpu.
>
> Well, it should execute at 1 instruction/cycle.
I presume you
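A minimal C sketch of the loop quoted above, wrapped in GNU C inline asm; the wrapper, its name, and the negative-index setup are illustrative assumptions, not code from the thread:

#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch of the three-instruction adc loop quoted above.
 * The buffer is indexed from its end with a negative count so that
 * "incq" can serve as both loop counter and termination test without
 * disturbing CF.  nwords must be non-zero.  Names are illustrative,
 * not from the patch.
 */
static uint64_t adc_loop_sum(const uint64_t *buf, size_t nwords, uint64_t sum)
{
	const uint64_t *end = buf + nwords;	/* one past the last word */
	int64_t idx = -(int64_t)nwords;		/* counts up toward zero  */

	asm("	clc\n\t"			/* start with CF clear          */
	    "1:	adcq 0(%[end],%[idx],8), %[sum]\n\t"
	    "	incq %[idx]\n\t"		/* inc does not write CF        */
	    "	jnz 1b\n\t"
	    "	adcq $0, %[sum]"		/* fold the final carry back in */
	    : [sum] "+r" (sum), [idx] "+r" (idx)
	    : [end] "r" (end)
	    : "cc", "memory");
	return sum;
}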
David Laight wrote:
> Separate renaming allows:
> 1) The value to be tested without waiting for pending updates to complete.
> Useful for IE and DIR.
I don't quite follow. It allows the value to be tested without waiting
for pending updates *of other bits* to complete.
Obviously, the update of th
From: George Spelvin
> Sent: 10 February 2016 00:54
> To: David Laight; linux-kernel@vger.kernel.org; li...@horizon.com;
> net...@vger.kernel.org;
> David Laight wrote:
> > Since adcx and adox must execute in parallel I clearly need to re-remember
> > how dependencies against the flags register wo
David Laight wrote:
> Since adcx and adox must execute in parallel I clearly need to re-remember
> how dependencies against the flags register work. I'm sure I remember
> issues with 'false dependencies' against the flags.
The issue is with flags register bits that are *not* modified by
an instruc
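For reference, a hedged sketch of the adcx/adox pairing being discussed: adcx reads and writes only CF, adox only OF, so the two accumulations below form independent flag dependency chains. Needs an ADX-capable CPU and assembler; the function and names are mine, not from the thread:

#include <stdint.h>

/* Two interleaved carry chains: adcx carries through CF, adox through OF. */
static uint64_t two_chain_sum(const uint64_t *p)
{
	uint64_t lo = 0, hi = 0, zero = 0;

	asm("xorl %k[zero], %k[zero]\n\t"	/* zeroes the reg, clears CF and OF */
	    "adcxq 0(%[p]), %[lo]\n\t"		/* chain 1: carries through CF      */
	    "adoxq 8(%[p]), %[hi]\n\t"		/* chain 2: carries through OF      */
	    "adcxq 16(%[p]), %[lo]\n\t"
	    "adoxq 24(%[p]), %[hi]\n\t"
	    "adcxq %[zero], %[lo]\n\t"		/* fold the pending CF carry        */
	    "adoxq %[zero], %[hi]"		/* fold the pending OF carry        */
	    : [lo] "+r" (lo), [hi] "+r" (hi), [zero] "+r" (zero)
	    : [p] "r" (p), "m" (*(const uint64_t (*)[4])p)
	    : "cc");
	return lo + hi;		/* a real csum would fold this carry too */
}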
From: George Spelvin [mailto:li...@horizon.com]
> Sent: 08 February 2016 20:13
> David Laight wrote:
> > I'd need convincing that unrolling the loop like that gives any significant gain.
> > You have a dependency chain on the carry flag so have delays between the 'adcq'
> > instructions (
David Laight wrote:
> I'd need convincing that unrolling the loop like that gives any significant gain.
> You have a dependency chain on the carry flag so have delays between the 'adcq'
> instructions (these may be more significant than the memory reads from l1 cache).
If the carry chain
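An unrolled sketch of what is being debated: the loop-control instructions disappear, but each adcq still waits for the CF bit produced by the previous one, so the additions remain a single serial chain. Illustrative GNU C only, not code from the thread:

#include <stdint.h>

/* Unrolled by four: no loop overhead, but still one CF dependency chain. */
static uint64_t adc_unrolled4(const uint64_t *p, uint64_t sum)
{
	asm("addq 0(%[p]), %[sum]\n\t"		/* starts the carry chain      */
	    "adcq 8(%[p]), %[sum]\n\t"		/* each adcq waits on prior CF */
	    "adcq 16(%[p]), %[sum]\n\t"
	    "adcq 24(%[p]), %[sum]\n\t"
	    "adcq $0, %[sum]"			/* close the chain             */
	    : [sum] "+r" (sum)
	    : [p] "r" (p), "m" (*(const uint64_t (*)[4])p)
	    : "cc");
	return sum;
}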
From: Ingo Molnar
...
> As Linus noticed, data lookup tables are the intelligent solution: if you manage
> to offload the logic into arithmetics and not affect the control flow then that's
> a big win. The inherent branching will be hidden by executing on massively
> parallel arithmetics unit
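A generic illustration of that idea (my own example, not the thread's code): the number of trailing bytes to keep is decided by a data lookup rather than by control flow:

#include <stdint.h>
#include <stddef.h>

/* Example mask table: keep_mask[n] keeps the low n bytes of a word. */
static const uint64_t keep_mask[9] = {
	0x0000000000000000ULL, 0x00000000000000ffULL,
	0x000000000000ffffULL, 0x0000000000ffffffULL,
	0x00000000ffffffffULL, 0x000000ffffffffffULL,
	0x0000ffffffffffffULL, 0x00ffffffffffffffULL,
	0xffffffffffffffffULL,
};

/* Branch-free: the length decision becomes an indexed load and an AND. */
static uint64_t trailing_word(const uint64_t *p, size_t nbytes)	/* nbytes <= 8 */
{
	return *p & keep_mask[nbytes];
}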
* Tom Herbert wrote:
> Thanks for the explanation and sample code. Expanding on your example, I added a
> switch statement to perform the function (code below).
So I think your new switch() based testcase is broken in a subtle way.
The problem is that in your added testcase GCC effectively o
* Tom Herbert wrote:
> [] gcc turns these switch statements into jump tables (not function tables
> which is what Ingo's example code was using). [...]
So to the extent this still matters, on most x86 microarchitectures that count,
jump tables and function call tables (i.e. virtual fun
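For concreteness, the two shapes being compared (illustrative C only, not from the thread): a dense switch that GCC can lower to a jump table, versus an explicit table of function pointers reached through indirect calls:

#include <stdint.h>

/* Dense switch: GCC can lower this to an indirect jump through a jump table. */
static uint64_t via_switch(unsigned op, uint64_t a, uint64_t b)
{
	switch (op & 3) {
	case 0:  return a + b;
	case 1:  return a ^ b;
	case 2:  return a & b;
	default: return a | b;
	}
}

static uint64_t op_add(uint64_t a, uint64_t b) { return a + b; }
static uint64_t op_xor(uint64_t a, uint64_t b) { return a ^ b; }
static uint64_t op_and(uint64_t a, uint64_t b) { return a & b; }
static uint64_t op_or (uint64_t a, uint64_t b) { return a | b; }

/* Function call table: an indirect call through an array of pointers. */
static uint64_t (*const op_table[4])(uint64_t, uint64_t) = {
	op_add, op_xor, op_and, op_or,
};

static uint64_t via_call_table(unsigned op, uint64_t a, uint64_t b)
{
	return op_table[op & 3](a, b);
}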
On Thu, Feb 4, 2016 at 5:27 PM, Linus Torvalds
wrote:
> sum = csum_partial_lt8(*(unsigned long *)buff, len, sum);
> return rotate_by8_if_odd(sum, align);
Actually, that last word-sized access to "buff" might be past the end
of the buffer. The code does the right thing if "len" is
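The helper names above come from the quoted code; the body below is only a guess at the intent of csum_partial_lt8 (keep the low "len" bytes of an already-loaded word and add them in), and it does nothing about the over-wide load itself, which is exactly the concern raised here:

#include <stdint.h>

/* Guessed sketch: mask an 8-byte load down to its low "len" valid bytes. */
static uint64_t csum_partial_lt8_sketch(uint64_t word, int len, uint64_t sum)
{
	uint64_t mask = len ? ~0ULL >> (64 - 8 * len) : 0;	/* len in 0..8 */
	uint64_t v = word & mask;

	sum += v;
	sum += (sum < v);	/* end-around carry, as "adcq $0" would do */
	return sum;
}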
On Thu, Feb 4, 2016 at 2:09 PM, Linus Torvalds
wrote:
>
> The "+" should be "-", of course - the point is to shift up the value
> by 8 bits for odd cases, and we need to load starting one byte early
> for that. The idea is that we use the byte shifter in the load unit to
> do some work for us.
Ok
On Thu, Feb 4, 2016 at 2:43 PM, Tom Herbert wrote:
>
> The reason I did this in assembly is precisely about your point of
> having to close the carry chains with adcq $0. I do have a first
> implementation in C which uses switch() to handle alignment, excess
> length less than 8 bytes, and th
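A hedged C sketch of the cost being described: in asm, consecutive adcq instructions keep the carry in CF for free, while portable C has to materialize and re-add it after every 64-bit addition. Names are illustrative, not the patch's code:

#include <stdint.h>

/* Close the carry chain in C: one explicit end-around-carry fold per add. */
static inline uint64_t add64_with_carry(uint64_t a, uint64_t b)
{
	uint64_t r = a + b;

	return r + (r < b);	/* the overflow bit is folded straight back in */
}

static uint64_t csum_words_c(const uint64_t *p, int n, uint64_t sum)
{
	while (n--)
		sum = add64_with_carry(sum, *p++);
	return sum;
}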
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
wrote:
> I missed the original email (I don't have net-devel in my mailbox),
> but based on Ingo's quoting have a more fundamental question:
>
> Why wasn't that done with C code instead of asm with odd numerical targets?
>
The reason I did this in ass
On Thu, Feb 4, 2016 at 1:46 PM, Linus Torvalds
wrote:
>
> static const unsigned long mask[9] = {
> 0x,
> 0xff00,
> 0x,
> 0xff00,
> 0x,
>
I missed the original email (I don't have net-devel in my mailbox),
but based on Ingo's quoting have a more fundamental question:
Why wasn't that done with C code instead of asm with odd numerical targets?
It seems likely that the real issue is avoiding the short loops (that
will cause branch pre
On Thu, Feb 4, 2016 at 2:56 AM, Ingo Molnar wrote:
>
> * Ingo Molnar wrote:
>
>> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>>
>> > +
>> > + /* Check length */
>> > +10: cmpl $8, %esi
>> > + jg 30f
>> > + jl 20f
>> > +
>> > + /* Exactly 8 bytes length */
>> > + addl
* Ingo Molnar wrote:
> s/!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>
> > +
> > + /* Check length */
> > +10: cmpl $8, %esi
> > + jg 30f
> > + jl 20f
> > +
> > + /* Exactly 8 bytes length */
> > + addl (%rdi), %eax
> > + adcl 4(%rdi), %eax
> > + RETURN
> > +
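A rough C rendering of the quoted exactly-8-bytes path, assuming %rdi is the buffer pointer and %eax the 32-bit running sum; the explicit fold at the end stands in for whatever the RETURN macro does with the last carry. Sketch only, not the patch's code:

#include <stdint.h>

/* Sketch of the 8-byte case: add two 32-bit words, then fold the carries. */
static uint32_t csum8_sketch(const uint32_t *p, uint32_t sum)
{
	uint64_t acc = sum;

	acc += p[0];
	acc += p[1];
	sum = (uint32_t)acc + (uint32_t)(acc >> 32);
	return sum + (sum < (uint32_t)acc);	/* fold any remaining carry */
}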
* Tom Herbert wrote:
> Implement assembly routine for csum_partial for 64 bit x86. This
> primarily speeds up checksum calculation for smaller lengths such as
> those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY
> conve