David Miller <da...@davemloft.net> writes:

  These give a modest speedup compared to the T1 routines.  I also added
  missing T3 timings to existing code.

The first thing to try, then, is finding code that runs well on both.
There is a cost in having more variants than we need.
Also, I worked on a copyi/copyd for T3/T4 that uses cache-initializing
stores (basically, if you're going to write a full aligned 64-byte cache
line, you tell the chip by using a special ASI in the stores, and the
cpu will simply clear the entire cache line on the write to the first
word of the cache line, eliminating all the memory traffic).

That will be good, but we need to watch for a few things:

* We must make sure not to use it near the beginning or end of operands.

* What if the code runs on a machine with 128-byte lines; will it then
  risk clobbering the area just outside our operands?  Or is the ASI
  defined to mean "64-byte cache line"?

* We need to watch out for overlapping copies, mpn_copyi(p,p+off,n)
  where off might be any number >= 0.

However, the setup for this has a bit of overhead, as we have to align
the destination to 64 bytes, and I'm not so sure that it's a win
overall in common usage.  Solution:  :-)

	if (n < lim)
	  simple code
	else
	  sophisticated code

The T3 popcount and hamdist timing numbers are awful.  Is the C code
perhaps faster?

About [lr]shift{c,}: I think we should put generic code in the
top-level dir, and make it work OK for T1 too.  That code should not
rely on out-of-order execution.

lshiftc: Why both 'xnor' and 'andn'?  Is there no 'nor' insn?  Then I'd
suggest xnor (or some pseudo insn for 'not') and 'or' for blending.

Have you tried software-pipelined loops, 2-way or 4-way unrolled,
instead of your 2-way non-pipelined loops?

Say we want to write a 2-way pipelined lshift.  I'd start by placing
the ldx insns:

L(top):	ldx	[up+0], %l0
	ldx	[up+8], %l1
	brgz	n, L(top)
	add	n, -2, n

Next, place the insns depending on ldx as far as possible from the ldx.
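Before scheduling anything, it helps to pin down what each result word
must be.  Here is a plain C reference one can test the asm against (my
sketch, assuming GMP's usual mpn_lshift semantics: 64-bit limbs,
1 <= cnt <= 63, shifted-out high bits returned):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb;

/* Reference left shift: each result word blends the left-shifted limb
   with the bits shifted up from the limb below -- exactly the
   sllx/srlx pairs the loop will compute.  Walks high-to-low, so rp may
   overlap up as long as rp >= up. */
static limb ref_lshift(limb *rp, const limb *up, size_t n, unsigned cnt)
{
    unsigned tcnt = 64 - cnt;            /* the asm's tcnt */
    limb retval = up[n - 1] >> tcnt;     /* bits shifted out the top */
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = (up[i] << cnt) | (up[i - 1] >> tcnt);
    rp[0] = up[0] << cnt;
    return retval;
}
```

Besides timing with tune/speed, comparing the asm's output against this
gives a correctness check for free.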
L(top):	srlx	%l0, tcnt, %l4
	sllx	%l0, cnt, %l2
	ldx	[up+0], %l0
	srlx	%l1, tcnt, %l5
	sllx	%l1, cnt, %l3
	ldx	[up+8], %l1
	add	up, 16, up
	brgz	n, L(top)
	add	n, -2, n

Placing dependent insns as far as possible from their corresponding
producers is not always a good idea, since that may create a very deep
sw pipeline.  For loads it may be the best approach, though, unless we
assume an L1 hit.

Next, we place the blending insn; we may use 'or' or 'add', whichever
is generally best for the hardware family.  I expect sllx and srlx to
run in one cycle.  Let's therefore use limited pipelining for these.

L(top):	sllx	%l0, cnt, %l2
	or	%l4, %l3, %g1
	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %g1
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
	add	up, 16, up
	brgz	n, L(top)
	add	n, -2, n

OK, now only the stx insns remain to be placed.  Stores typically
interfere with loads, while hardware store buffers might make them
agnostic to RAW scheduling.  Some experimentation, using tune/speed and
playing with the options -w[N] and -x[M] to watch for alignment-induced
fluctuations, will help.  But we should probably wait a bit for final
store placement, and make the loop work first.

L(top):	sllx	%l0, cnt, %l2
	or	%l4, %l3, %l6
	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	stx	%l6, [rp+0]
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %l7
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
	stx	%l7, [rp+8]
	add	up, 16, up
	add	rp, 16, rp
	brgz	n, L(top)
	add	n, -2, n

This should be a semi-working loop.  We need to fix feed-in and
wind-down, and adjust load and store offsets.

Adding wind-down code is mechanical; just copy the full loop after it,
then start by removing the loads.

	...
	stx	%l7, [rp+8]
	add	up, 16, up
	add	rp, 16, rp
	brgz	n, L(top)
	add	n, -2, n

C wind down phase 1
	sllx	%l0, cnt, %l2
	or	%l4, %l3, %l6
	srlx	%l0, tcnt, %l4
	stx	%l6, [rp+0]
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %l7
	srlx	%l1, tcnt, %l5
	stx	%l7, [rp+8]

We still have juggling clubs in the air, so we need to add a phase 2
wind-down block.
And then we mechanically delete instructions which have no producer.

	...
	stx	%l7, [rp+8]
	add	up, 16, up
	add	rp, 16, rp
	brgz	n, L(top)
	add	n, -2, n

C wind down phase 1
	sllx	%l0, cnt, %l2
	or	%l4, %l3, %l6
	srlx	%l0, tcnt, %l4
	stx	%l6, [rp+0]
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %l7
	srlx	%l1, tcnt, %l5
	stx	%l7, [rp+8]

C wind down phase 2
	or	%l4, %l3, %l6
	stx	%l6, [rp+0]
	or	%l5, %l2, %l7
	stx	%l7, [rp+8]

This should finish the wind-down.

The feed-in work is similar, except that we need (ways) variants,
depending on (n mod ways).  But since our loop is symmetric, we will be
able to enter it either at its top or in the middle, depending on (in
our case) n mod 2.  I always first make the code work for just one n
residue class.  Let's do that!

Analogously with wind-down, we copy the loop, but now we start by
removing the final sw-pipeline insns, i.e., the stores.  We may fall
into the loop, or jump into its loop control.  For now, we do the
latter.

	sllx	%l0, cnt, %l2
	or	%l4, %l3, %l6
	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	sllx	%l1, cnt, %l3
	or	%l5, %l2, %l7
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
L(top):	...

Now, we should note that removing the stores caused dead writes by the
'or' insns.  And when we remove those, the first sllx performs a dead
write.  Removing also that insn, we get:

	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	sllx	%l1, cnt, %l3
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
L(top):	...

We're almost done.  We just need to add another feed-in phase, by
copying the existing feed-in block, and then removing insns based on a
data-dependency analysis.

C feed-in phase 1
	ldx	[up+0], %l0
	ldx	[up+8], %l1

C feed-in phase 2
	srlx	%l0, tcnt, %l4
	ldx	[up+0], %l0
	sllx	%l1, cnt, %l3
	srlx	%l1, tcnt, %l5
	ldx	[up+8], %l1
L(top):	...

I'd claim we now have a working code structure (unless I made some
mistake!) for n mod 2 = 0.  We only have to:

1. Adjust ldx and stx offsets.
2. Insert n updates in the feed-in blocks.
3. Add n-dependent conditional jumps after each feed-in block.
4. Compute the return value.
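To sanity-check the structure, here is the n mod 2 = 0 skeleton
transliterated to C: feed-in phases 1 and 2, the 2-way loop, wind-down
phases 1 and 2.  This is a sketch only: the l0..l7 locals mirror
%l0..%l7, the index adjustments (step 1) are my own guesses rather than
final asm offsets, and I handle the low result word and the return
value (step 4) explicitly:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb;

/* Model of the even-n pipelined structure, n >= 4.  Statement order
   within each block preserves the asm's data flow.  Note this
   ascending version is not, unlike real mpn_lshift, safe for
   overlapping operands with rp > up. */
static limb lshift_even(limb *rp, const limb *up, size_t n, unsigned cnt)
{
    unsigned tcnt = 64 - cnt;
    limb l0, l1, l2, l3, l4, l5, l6, l7;
    size_t i = 4;                 /* next limb to load */
    size_t j = 1;                 /* next result word to store */

    rp[0] = up[0] << cnt;         /* low word, outside the pipeline here */

    /* C feed-in phase 1 */
    l0 = up[0];
    l1 = up[1];

    /* C feed-in phase 2 */
    l4 = l0 >> tcnt;
    l0 = up[2];
    l3 = l1 << cnt;
    l5 = l1 >> tcnt;
    l1 = up[3];

    /* L(top): each pass stores two result words and loads two limbs */
    while (i + 1 < n) {
        l2 = l0 << cnt;
        l6 = l4 | l3;
        l4 = l0 >> tcnt;
        l0 = up[i];
        rp[j] = l6;
        l3 = l1 << cnt;
        l7 = l5 | l2;
        l5 = l1 >> tcnt;
        l1 = up[i + 1];
        rp[j + 1] = l7;
        i += 2;
        j += 2;
    }

    /* C wind down phase 1: the loop minus its loads */
    l2 = l0 << cnt;
    l6 = l4 | l3;
    l4 = l0 >> tcnt;
    rp[j] = l6;
    l3 = l1 << cnt;
    l7 = l5 | l2;
    l5 = l1 >> tcnt;
    rp[j + 1] = l7;

    /* C wind down phase 2: only the final blend still has producers;
       l5 now holds the bits shifted out the top, i.e. the return value */
    l6 = l4 | l3;
    rp[j + 2] = l6;

    return l5;
}
```

Diffing this against a simple one-word-at-a-time loop over random
operands is a quick way to catch feed-in/wind-down mistakes before
touching the asm.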
Finally, we need to check n mod 2 and make a separate feed-in path,
where we jump into the middle of the loop.  We can really jump into the
middle, since our loop is quite symmetric.  The resulting execution
paths will be smooth from function entry to function exit.

We might take a more laborious approach in order to work out the
pipeline's issue preferences.  We then start with all required
instructions, where each insn writes a unique register, and all insns
read some fixed dummy register (say %g7).  Assuming we have a CPU with
many registers, we will now have a dependency-free loop (well, each
insn will have a WAW conflict with itself...).  We need to make the
stores semi-correct so that they don't hammer on the same word; i.e.,
we need to make them write to consecutive words, and properly update
the pointer register.

We now proceed with timing experiments (using tune/speed).  The goal is
to find an insn sequence which suits the pipeline.  If we don't reach
the target speed, we may unroll more, or find different instructions.
Once we reach the target speed, we proceed with connecting
instructions, timing after each edit.  If performance drops, we have a
latency-induced slowdown, which can always be resolved with deeper
pipelining.

--
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel