David Miller <da...@davemloft.net> writes:

  These give a modest speedup compared to the T1 routines.  I also added
  missing T3 timings to existing code.

The first thing to try, then, is finding code that runs well on both.
There is a cost in having more variants than we need.
Also, I worked on a copyi/copyd for T3/T4 that uses cache-initializing
stores (basically, if you're going to write a full aligned 64-byte cache
line, you tell the chip by using a special ASI in the stores, and the
cpu will simply clear the entire cache line on the write to the first
word of the cache line, eliminating all the memory traffic).

That will be good, but we need to watch for a few things:

* We must make sure not to use it near the beginning or end of operands.

* What if the code runs on a machine with 128-byte lines; will it then
  risk clobbering the area just outside our operands?  Or is the ASI
  defined to mean "64-byte cache line"?

* We need to watch out for overlapping copies, mpn_copyi(p,p+off,n)
  where off might be any number >= 0.

However, the setup for this has a bit of overhead, as we have to align
the destination to 64 bytes, and I'm not so sure that it's a win
overall in common usage.  Solution:  :-)

	if (n < lim)
	  simple code
	else
	  sophisticated code

The T3 popcount and hamdist timing numbers are awful.  Is the C code
perhaps faster?

About [lr]shift{c,}: I think we should put generic code in the
top-level dir, and make it work OK for T1 too.  That code should not
rely on out-of-order execution.

lshiftc: Why both 'xnor' and 'andn'?  Is there no 'nor' insn?  Then I'd
suggest xnor (or some pseudo insn for 'not') and 'or' for blending.

Have you tried software-pipelined loops, 2-way or 4-way unrolled,
instead of your 2-way non-pipelined loops?

Say we want to write a 2-way pipelined lshift.  I'd start by placing
the ldx insns:

L(top):	ldx	[up+0], %l0
	ldx	[up+8], %l1
	brgz	n, L(top)
	add	n, -2, n

Next, place the insns depending on ldx as far as possible from the ldx.
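Before scheduling anything, it helps to pin down what each result word
must be.  Here is a plain C reference one can test the asm against (my
sketch, assuming GMP's usual mpn_lshift semantics: 64-bit limbs,
1 <= cnt <= 63, shifted-out high bits returned):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb;

/* Reference left shift: each result word blends the left-shifted limb
   with the bits shifted up from the limb below -- exactly the
   sllx/srlx pairs the loop will compute.  Walks high-to-low, so rp may
   overlap up as long as rp >= up. */
static limb ref_lshift(limb *rp, const limb *up, size_t n, unsigned cnt)
{
    unsigned tcnt = 64 - cnt;            /* the asm's tcnt */
    limb retval = up[n - 1] >> tcnt;     /* bits shifted out the top */
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = (up[i] << cnt) | (up[i - 1] >> tcnt);
    rp[0] = up[0] << cnt;
    return retval;
}
```

Besides timing with tune/speed, comparing the asm's output against this
gives a correctness check for free.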
L(top):	srlx	%l0, tcnt, %l4
	sllx	%l0, cnt, %l2
	ldx	[up+0], %l0
	srlx	%l1, tcnt, %l5
	sllx	%l1, cnt, %l3
	ldx	[up+8], %l1
	add	up, 16, up
	brgz	n, L(top)
	add	n, -2, n

Placing dependent insns as far as possible from their corresponding
producers is not always a good idea, since that may create a very deep
sw pipeline.  For loads it may be the best approach, though, unless we
assume an L1 hit.

Next, we place the blending insn; we may use 'or' or 'add', whichever
is generally best for the hardware family.  I expect sllx and srlx to
run in one cycle.  Let's therefore use limited pipelining for these.

L(top):	sllx	%l0, cnt, %l2
	or	%l4, %l3, %g1
	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %g1
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
	add	up, 16, up
	brgz	n, L(top)
	add	n, -2, n

OK, now only the stx insns remain to be placed.  Stores typically
interfere with loads, while hardware store buffers might make them
agnostic to RAW scheduling.  Some experimentation, using tune/speed and
playing with the options -w[N] and -x[M] to watch for alignment-induced
fluctuations, will help.  But we should probably wait a bit for final
store placement, and make the loop work first.

L(top):	sllx	%l0, cnt, %l2
	or	%l4, %l3, %l6
	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	stx	%l6, [rp+0]
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %l7
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
	stx	%l7, [rp+8]
	add	up, 16, up
	add	rp, 16, rp
	brgz	n, L(top)
	add	n, -2, n

This should be a semi-working loop.  We need to fix feed-in and
wind-down, and adjust load and store offsets.

Adding wind-down code is mechanical; just copy the full loop after it,
then start by removing the loads.

	...
	stx	%l7, [rp+8]
	add	up, 16, up
	add	rp, 16, rp
	brgz	n, L(top)
	add	n, -2, n

C wind down phase 1
	sllx	%l0, cnt, %l2
	or	%l4, %l3, %l6
	srlx	%l0, tcnt, %l4
	stx	%l6, [rp+0]
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %l7
	srlx	%l1, tcnt, %l5
	stx	%l7, [rp+8]

We still have juggling clubs in the air, so we need to add a phase 2
wind-down block.
And then we mechanically delete instructions which have no producer.

	...
	stx	%l7, [rp+8]
	add	up, 16, up
	add	rp, 16, rp
	brgz	n, L(top)
	add	n, -2, n

C wind down phase 1
	sllx	%l0, cnt, %l2
	or	%l4, %l3, %l6
	srlx	%l0, tcnt, %l4
	stx	%l6, [rp+0]
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %l7
	srlx	%l1, tcnt, %l5
	stx	%l7, [rp+8]

C wind down phase 2
	or	%l4, %l3, %l6
	stx	%l6, [rp+0]
	or	%l5, %l2, %l7
	stx	%l7, [rp+8]

This should finish the wind-down.

The feed-in work is similar, except that we need (ways) variants,
depending on (n mod ways).  But since our loop is symmetric, we will be
able to enter it either at its top or in the middle, depending on (in
our case) n mod 2.  I always first make the code work for just one n
residue class.  Let's do that!

Analogously with wind-down, we copy the loop, but now we start by
removing the final sw-pipeline insns, i.e., the stores.  We may fall
into the loop, or jump into its loop control.  For now, we do the
latter.

	sllx	%l0, cnt, %l2
	or	%l4, %l3, %l6
	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %l7
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
L(top):	...

Now, we should note that removing the stores caused dead writes by the
'or' insns.  And when we remove those, the first sllx performs a dead
write.  Removing also that insn, we get:

	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	sllx	%l1, cnt, %l3
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
L(top):	...

We're almost done.  We just need to add another feed-in phase, by
copying the existing feed-in block, and then removing insns based on a
data-dependency analysis.

C feed-in phase 1
	ldx	[up+0], %l0
	ldx	[up+8], %l1

C feed-in phase 2
	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	sllx	%l1, cnt, %l3
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
L(top):	...

I'd claim we now have a working code structure (unless I made some
mistake!) for n mod 2 = 0.  We only have to:

1. Adjust ldx and stx offsets.
2. Insert n updates in the feed-in blocks.
3. Add n-dependent conditional jumps after each feed-in block.
4. Compute the return value.
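To sanity-check the structure, here is the n mod 2 = 0 skeleton
transliterated to C: feed-in phases 1 and 2, the 2-way loop, wind-down
phases 1 and 2.  This is a sketch only: the l0..l7 locals mirror
%l0..%l7, the index adjustments (step 1) are my own guesses rather than
final asm offsets, and I handle the low result word and the return
value (step 4) explicitly:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb;

/* Model of the even-n pipelined structure, n >= 4.  Statement order
   within each block preserves the asm's data flow.  Note this
   ascending version is not, unlike real mpn_lshift, safe for
   overlapping operands with rp > up. */
static limb lshift_even(limb *rp, const limb *up, size_t n, unsigned cnt)
{
    unsigned tcnt = 64 - cnt;
    limb l0, l1, l2, l3, l4, l5, l6, l7;
    size_t i = 4;                 /* next limb to load */
    size_t j = 1;                 /* next result word to store */

    rp[0] = up[0] << cnt;         /* low word, outside the pipeline here */

    /* C feed-in phase 1 */
    l0 = up[0];
    l1 = up[1];

    /* C feed-in phase 2 */
    l4 = l0 >> tcnt;
    l0 = up[2];
    l3 = l1 << cnt;
    l5 = l1 >> tcnt;
    l1 = up[3];

    /* L(top): each pass stores two result words and loads two limbs */
    while (i + 1 < n) {
        l2 = l0 << cnt;
        l6 = l4 | l3;
        l4 = l0 >> tcnt;
        l0 = up[i];
        rp[j] = l6;
        l3 = l1 << cnt;
        l7 = l5 | l2;
        l5 = l1 >> tcnt;
        l1 = up[i + 1];
        rp[j + 1] = l7;
        i += 2;
        j += 2;
    }

    /* C wind down phase 1: the loop minus its loads */
    l2 = l0 << cnt;
    l6 = l4 | l3;
    l4 = l0 >> tcnt;
    rp[j] = l6;
    l3 = l1 << cnt;
    l7 = l5 | l2;
    l5 = l1 >> tcnt;
    rp[j + 1] = l7;

    /* C wind down phase 2: only the final blend still has producers;
       l5 now holds the bits shifted out the top, i.e. the return value */
    l6 = l4 | l3;
    rp[j + 2] = l6;

    return l5;
}
```

Diffing this against a simple one-word-at-a-time loop over random
operands is a quick way to catch feed-in/wind-down mistakes before
touching the asm.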
Finally, we need to check n mod 2 and make a separate feed-in path,
where we jump into the middle of the loop.  We can really jump into the
middle, since our loop is quite symmetric.  The resulting execution
paths will be smooth from function entry to function exit.

We might take a more laborious approach in order to work out the
pipeline's issue preferences.  We then start with all required
instructions, where each insn writes a unique register, and all insns
read some fixed dummy register (say %g7).  Assuming we have a CPU with
many registers, we will now have a dependency-free loop (well, each
insn will have a WAW conflict with itself...).  We need to make the
stores semi-correct so that they don't hammer on the same word; i.e.,
we need to make them write to consecutive words, and properly update
the pointer register.

We now proceed with timing experiments (using tune/speed).  The goal is
to find an insn sequence which suits the pipeline.  If we don't reach
the target speed, we may unroll more, or find different instructions.
Once we reach the target speed, we proceed with connecting
instructions, timing after each edit.  If performance drops, we have a
latency-induced slowdown, which can always be resolved with deeper
pipelining.

--
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel