These give a modest speedup compared to the T1 routines.
I also added missing T3 timings to existing code.
Also, I worked on a copyi/copyd for T3/T4 that uses cache-initializing
stores (basically, if you're going to write a full aligned 64-byte
cache line, you tell the chip by using a special ASI
David Miller writes:
Technically we could use this on some chips that we don't yet
distinguish at a fine enough granularity. For example, we can assume
popc is available on T2 as well as UltraSPARC-IV.
But for now, just T3 and later.
I suppose we should mention this as a comment in the c
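To illustrate the popc gating discussed above, here is a minimal C sketch, not the actual GMP code. It assumes a hypothetical `cpu_has_popc` flag standing in for whatever chip detection is really used, and it uses the GCC/Clang builtin `__builtin_popcountll` as a stand-in for a hardware popc instruction. When the flag is clear, it falls back to a portable bit-twiddling count.

```c
#include <stdint.h>

/* Portable fallback: SWAR (bit-parallel) population count of a 64-bit word. */
static unsigned popcount64_soft(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);            /* 2-bit sums  */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);                /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;            /* 8-bit sums  */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);  /* add bytes   */
}

/* cpu_has_popc is a hypothetical capability flag (an assumption for this
   sketch); it models "T3 and later" style gating, nothing more. */
static unsigned popcount64(uint64_t x, int cpu_has_popc)
{
#if defined(__GNUC__)
    if (cpu_has_popc)
        return (unsigned)__builtin_popcountll(x);  /* may map to a hw insn */
#endif
    return popcount64_soft(x);
}
```

Both paths must return the same count for every input, so widening the set of chips that take the fast path later changes only speed, not results.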
From: Torbjorn Granlund
Date: Mon, 25 Mar 2013 19:45:27 +0100
> I cannot recall which edits I made between these variants. If only the
> checked-in code has fluctuations, then it should be no problem finding
> an edit which avoids them. If both variants have fluctuations, then it
> will be hard
> If you want to play with this, please start with the checked in code
> (you'll need to refresh configure.ac to allow the aormul_2 'multifunc'
> name). The first thing to try is its speed compared to the code you
> timed above.
I'm getting wildly different performance characteristics f