> > - 128 FPU registers (126 usable)
> >  96 of them are rotating (not stacked), which I imagine could be used
> >  to the code's advantage quite well, holding more data in registers
> >  during the FFT
>
> Eh, it would only really help if you wanted to unroll quite a few
> loops... I think that, as can be seen from the RISC processors out
> there, it really doesn't help a _whole_ lot as far as the FFT goes. I
> suppose you could move to a radix-8, but that's about the extent of it.
> Would going past radix-8 help a whole lot?

Like I said, I'm not sure how the FFT code really works, so I couldn't say
how much this would help.

> > - Memory "speculation"
> >  Preload code and/or data...while the FPU is churning away, preload
> >  more data into L2/L1 cache so it's in the high-speed memory by the
> >  time it's needed (data prefetch/lfetch).  That will REALLY help on
> >  these large FFT datasets!
>
> Is this limited to MMX/SIMD data only? The 3DNow (and KNI?) instruction
> sets have prefetch for their SIMD opcodes, but those of course are
> single precision and really kinda useless. :P

The IA-64 lets you preload code/data for ANYTHING, or at least for the FPU,
not just KNI/SIMD stuff.  It's done with the "lfetch" instruction (not sure
if that's the "official" opcode).  The instruction for loading 2 FPU
registers at once (as long as the two values are next to each other in
memory) would also be pretty handy for mem->register loads, even if you
didn't "prefetch" the data.

> > - 64 bit integer ops
> >  Integer unit with 64 bits...need I say more?
>
> Doesn't really help a whole lot. Honest. :) Mainly cuz integer and single
> precision operands are hardly ever used.

Well, I was thinking for the pre-LL trial factoring.  I wonder though if it
wouldn't be helpful at all in doing an integer FFT?  Probably not, if the
FPU is all it's cracked up to be.
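For reference, a rough sketch of what trial factoring a Mersenne number boils down to. It relies on two standard facts: any factor q of 2^p - 1 (p an odd prime) has the form 2kp + 1 and is +1 or -1 mod 8, and q divides 2^p - 1 exactly when 2^p mod q is 1. The function name `trial_factor` and the `max_k` cutoff are made up for illustration; this is not how any real GIMPS client is written.

```python
def trial_factor(p, max_k=100_000):
    """Return the smallest q = 2*k*p + 1 (k <= max_k) dividing 2**p - 1, or None."""
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        # Any factor of 2**p - 1 (p an odd prime) is +1 or -1 mod 8.
        if q % 8 not in (1, 7):
            continue
        # q divides 2**p - 1 exactly when 2**p mod q == 1.
        # pow() with three arguments does this by repeated modular squaring.
        if pow(2, p, q) == 1:
            return q
    return None
```

The work is all integer squaring modulo q, which is the part where a wide integer unit would plausibly matter. Note that with a large enough `max_k` the loop eventually reaches 2^p - 1 itself, so a returned q is only a proper factor if it is smaller than that.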

> > - Bunch of fun parallel arithmetic instructions
> >  Probably useful for large numbers...
>
> Whatever that means...

Adding numbers to multiple registers for instance, or doing multiple
multiplies.  Maybe that sort of stuff could come in handy for the modulo??
Just guessing.
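On the modulo: the textbook integer trick for reducing modulo 2^p - 1 needs no division at all, only shifts, masks and adds, because 2^p is congruent to 1. A minimal sketch follows, with `mod_mersenne` as a made-up name; whether the FFT code does anything like this is an assumption, not something stated above.

```python
def mod_mersenne(x, p):
    """Reduce a non-negative integer x modulo 2**p - 1 using shifts and adds only."""
    m = (1 << p) - 1
    # Since 2**p == 1 (mod m), the high bits can be folded onto the low bits.
    while x > m:
        x = (x & m) + (x >> p)
    # The fold can land exactly on m, which is congruent to 0.
    return 0 if x == m else x
```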

> The EPIC hints are probably the biggest benefit. It's hard to convince the
> CPU to take the right branches and O-O-O execution can really mess up the
> pipeline. (It's hard to tell what the heck the P6 is doing dangit!)

For NT, if you really want to know what the P6 is doing, you can get
CPUMon 2.0 from www.sysinternals.com to view Pentium performance counters
of various sorts.  Not sure if that's what you meant.

At the very least, you could pipeline the instructions in such a way that,
for example, while the FPU is chugging away on something, you prefetch
data into L1 cache for the next bit, do a couple other tidy little jobs,
etc.  Again, since the LL test is pretty serialized, it's more a matter of
keeping it as efficient as possible to eliminate any lags, so the FPU will
*always* have something to do right away.  Beyond that, finding more
efficient ways of using the FPU would be nice.  Keeping more of the data in
the registers will help too, since the registers hold more bits than a
double stored in memory, so you can eliminate some rounding errors that way.
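To make the "pretty serialized" point concrete, here is the Lucas-Lehmer test in its plainest form, using Python's big integers rather than an FFT. The function name `lucas_lehmer` is just for illustration; a real client replaces the squaring with an FFT-based multiply, which this sketch does not attempt.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for an odd prime p."""
    m = (1 << p) - 1
    s = 4
    # Each step needs the previous s, so the p - 2 squarings cannot overlap.
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0
```

Because every iteration depends on the one before it, the only parallelism available is inside a single squaring, which is why keeping the FPU fed during that squaring matters so much.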

> I'm a bit curious about the K7. Just from the minimal specs I've looked
> at. Might be able to squeak a few % more out of it than a similarly
> clocked PIII. What has me wondering is the 3DNow instructions they added
> for DSP instructions. I'm sure they're single precision. But it seems
> kinda wacky they added 'em in the first place.

From what I've seen, the K7 is supposedly as much as 40% faster in FPU
benchmarks than a similarly clocked PIII.  Interesting, if true, but
again, those probably are single-precision numbers...

> Now, if Intel decided to put in some extra silicon and support double
> precision FP ops in the SIMD instruction set (the registers support it,
> the silicon doesn't), then you'd be able to get double the throughput in
> the FFT code, plus I think the latency goes down (from 2 cycles to 1?)
> for multiplies.

The registers can hold it (since it uses the same FPU registers IIRC), but
the microcode would have to be significantly tweaked to handle the
double/extended double data.  Since they are "multimedia" instructions, it
makes sense that they only made them single precision capable, but it still
would have been nice...  Of course, you could always write the FFT to only
use single-precision! :-)

> Has me thinking a bit about an NTT algorithm for doing the FFTs with
> integers instead of doubles and using MMX instructions to speed it up...

Thus my idea about using the 128 64 bit integer registers (and the cool
integer math ops) of the IA-64.
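A small sketch of the NTT idea, for anyone who hasn't seen it: the same butterfly structure as an FFT, but with the complex root of unity replaced by an integer root of unity modulo a prime, so there is no rounding error at all. The modulus 998244353 and primitive root 3 are one common textbook choice picked here for illustration; a real implementation would choose primes to suit the word size, and the names `ntt` and `multiply` are made up.

```python
MOD = 998244353  # prime of the form 119 * 2**23 + 1
ROOT = 3         # a primitive root modulo MOD


def ntt(a, invert=False):
    """Recursive radix-2 NTT of a list whose length is a power of two (unscaled)."""
    n = len(a)
    if n == 1:
        return a[:]
    # w is a primitive n-th root of unity mod MOD (its inverse when inverting).
    w = pow(ROOT, (MOD - 1) // n, MOD)
    if invert:
        w = pow(w, MOD - 2, MOD)
    even = ntt(a[0::2], invert)
    odd = ntt(a[1::2], invert)
    out = [0] * n
    wk = 1
    for k in range(n // 2):
        t = wk * odd[k] % MOD
        out[k] = (even[k] + t) % MOD
        out[k + n // 2] = (even[k] - t) % MOD
        wk = wk * w % MOD
    return out


def multiply(x, y):
    """Return the exact convolution of two coefficient lists (no carry propagation)."""
    n = 1
    while n < len(x) + len(y):
        n *= 2
    fx = ntt(x + [0] * (n - len(x)))
    fy = ntt(y + [0] * (n - len(y)))
    prod = ntt([u * v % MOD for u, v in zip(fx, fy)], invert=True)
    n_inv = pow(n, MOD - 2, MOD)  # the inverse transform needs a 1/n scaling
    return [c * n_inv % MOD for c in prod]
```

Feeding in decimal digits, least significant first, `multiply([3, 2, 1], [5, 4])` gives the digit products of 123 * 45 before carries. The result is only exact while every true coefficient stays below `MOD`, which is the integer counterpart of the rounding-error limit in a floating-point FFT.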

> But then again, I'm working on a totally different algorithm right now
> anyway that _should_ be fast. But then again, I'm probably forgetting
> something, so until I work out some of the details on paper, I'll leave
> that one in hiding. ;)

You can tell me, I'm your brother.

By the way all, Jeremy probably wouldn't have said so, but he's getting
married this June 26th.  Gifts can be made payable to me and I'll make sure
he gets 'em! :-)

Aaron

________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
