> But, you can do it in integer if you have a processor with 1) enough
> integer registers 2) wide registers and 3) fast/pipelined multiply--which
> IA-64 is supposed to have.  The floating point version was a kludge to
> make up for an, uhhh, *interesting* processor architecture.  It shouldn't
> make everyone think that it's always the best way to do things.

That's kind of what I was driving at.  IA-64 gives you 128 64-bit
general-purpose registers, register rotation, etc.  The register rotation
should help by not making you unroll *all* your loops *all* the time.  I'm
sure it'd still help to unroll them anyway, but that's an opinion.
Oh...sure enough, there's an example where they show that you get a speedup
in an unrolled loop, but you save even more cycles in a partially unrolled
loop using register rotation.
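
To make the unrolling point concrete, here is a rough C sketch of the same
summation loop written rolled and unrolled by four.  The function names and
the unroll factor are made up for illustration, and register rotation itself
can't be expressed in C (that part is up to the compiler or hand-written
assembly), so this only shows the manual transformation that rotation is
supposed to let you do less of.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one add per iteration, plus loop overhead every time. */
static uint64_t sum_rolled(const uint64_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Unrolled by 4 (hypothetical factor): four independent accumulators
 * need four registers, but the adds no longer depend on each other. */
static uint64_t sum_unrolled4(const uint64_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    /* Remainder loop for the last 0..3 elements. */
    for (; i < n; i++)
        s0 += v[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum; whether the unrolled one is actually
faster depends on the compiler and the machine, so measure before believing
it.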

I notice that the imul actually uses the FPU, which makes me wonder if imul
would really be any better than fpmul (which can be parallelized - fpmpy).
Fused multiply-add instructions (fma) could help with some code, but that's
a guess on my part.

On the other hand, I do see that IA-64 *does* do a 64-bit * 64-bit = 128-bit
imul, though it does indeed use the FP registers, and it's the FPU core doing
all the work.  "The product of 2 64bit significands is added to the third
64bit significand (zero extended) to produce a 128bit result."
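
For anyone who wants to see what that quoted operation computes, here is a
portable C sketch of a*b + c with a full 128-bit result, built from 32-bit
pieces.  It only models the arithmetic, not how the IA-64 FPU does it; the
type `u128_t` and the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit result as two 64-bit halves (hypothetical type for this sketch). */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} u128_t;

/* Returns a * b + c, with c zero-extended to 128 bits. */
static u128_t mul_add_64x64(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    /* Four 32x32 -> 64 partial products (schoolbook multiplication). */
    uint64_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t p1 = a_lo * b_hi;   /* weight 2^32 */
    uint64_t p2 = a_hi * b_lo;   /* weight 2^32 */
    uint64_t p3 = a_hi * b_hi;   /* weight 2^64 */

    /* Sum of three values below 2^32 each, so it cannot overflow. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    u128_t r;
    r.lo = (mid << 32) | (p0 & 0xFFFFFFFFu);
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);

    /* Add c into the low half and propagate the carry. */
    r.lo += c;
    if (r.lo < c)
        r.hi += 1;
    return r;
}

/* Small sanity check; call it from main() if you want to try the sketch. */
static void mul_add_selfcheck(void)
{
    u128_t r = mul_add_64x64(3, 4, 5);
    assert(r.hi == 0 && r.lo == 17);
}
```

The result never overflows 128 bits: the largest product is
(2^64 - 1)^2 = 2^128 - 2^65 + 1, and adding a 64-bit c still stays below
2^128.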

Additionally, there is support for quad-precision FP "in software" (just
above the microcode I'd guess?  Or do they mean in ASM?).  Certainly
quad-precision (128 bits), if it were fast enough, would be a lot better
than extended double (80 bits).  I wonder about the speed of that though,
and what exactly they mean by "in software".
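
One thing "in software" could look like, purely as an illustration: building
extra precision out of ordinary hardware doubles.  The sketch below uses the
well-known "double-double" trick (Knuth's two-sum), which is NOT IEEE
quad-precision and is not claimed to be what the IA-64 docs mean; the type
and function names are invented for this example.  It assumes plain IEEE
double arithmetic (no -ffast-math, no x87 extended intermediates).

```c
/* A value stored as hi + lo, where lo holds what hi could not represent.
 * Hypothetical type for this sketch. */
typedef struct {
    double hi;
    double lo;
} dd_t;

/* Knuth's two-sum: hi is the rounded sum, lo is the exact rounding error. */
static dd_t two_sum(double a, double b)
{
    dd_t r;
    r.hi = a + b;
    double bb = r.hi - a;
    r.lo = (a - (r.hi - bb)) + (b - bb);
    return r;
}

/* Add a plain double to a double-double value, then renormalize. */
static dd_t dd_add_double(dd_t x, double y)
{
    dd_t s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo);
}
```

With this, adding 1e-20 to 1.0 keeps the small term in `lo` instead of
dropping it, at the cost of several FP operations per add, which is where the
speed worry comes from.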

Lost my train of thought...the power went out where I work for a couple of
hours just now....and I think I'd better end it here! :-)

Aaron

________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
