On Mon, May 4, 2009 at 11:27 AM, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
>
> Hi
>
> I've been playing with some assembler for the Intel Core2 chips and have come
> across this timing oddity which I can't explain. Any ideas?

Maybe it's to do with the branch predictor? Remarks:

1. It seems to me that this starts happening once the loop runs at
least 65 iterations (the first case to be "penalized" is n=261).

2. Swapping the lines "ja case0" and "je case1" seems to kill the
advantage, so it looks like the not-taken "jnc" and the taken
"ja" are somehow linked (they both depend only on %r8).

3. The entry for "case0" seems to be aligned at 32*n+16, and it seems
that using ALIGN(32) kills the advantage.

4. If any load or store (other than the pops, I guess) happens in that
code, the difference disappears.

5. It also seems to be related to the length of "case0": adding
two dummy movs (say, mov %r9,%r9 twice) kills the difference, as if the
way the pipeline is filled *up to* the "ret" mattered.
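
To make remark 3 concrete, here is a minimal Python sketch of the
alignment arithmetic. The helper names and the example address are
hypothetical; the real entry address of "case0" would have to be read
from the assembled object (a disassembly or symbol listing).

```python
def offset_in_block(address, block=32):
    """Return the offset of `address` inside its `block`-byte aligned block."""
    return address % block


def align_up(address, block=32):
    """Round `address` up to the next multiple of `block`, as ALIGN(32) would."""
    return (address + block - 1) // block * block


# Hypothetical entry address of the form 32*n + 16 (here n = 5).
case0_entry = 32 * 5 + 16

# Offset 16 within the 32-byte block: the layout where the advantage shows up.
print(offset_in_block(case0_entry))

# After forcing 32-byte alignment the offset becomes 0.
print(offset_in_block(align_up(case0_entry)))
```

If the effect really depends on the offset being 16 rather than 0,
padding the entry by 16 bytes after an ALIGN(32) should bring it back,
which would be a cheap way to test the idea.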

It seems the speed test is actually calling this function in a very
tight loop, right? Maybe that's involved: calling the function twice
in the tight loop seems to change things a bit.

Could the high "penalty" be due to periodic branch mispredictions in
the loop, which happen to disappear precisely when "case0" is taken?
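
As a rough illustration of what "periodic mispredictions" would mean,
here is a toy model in Python: a single 2-bit saturating counter
predicting the backward branch of a loop that is entered repeatedly.
This is only an assumption for the sake of the example. The actual
Core 2 predictor is considerably more elaborate, and the function names
below are made up.

```python
def count_mispredictions(outcomes, state=2):
    """Run a 2-bit saturating counter over a sequence of branch outcomes.

    `state` ranges over 0..3; values 2 and 3 predict "taken".
    Returns the number of wrong predictions.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


def loop_branch_outcomes(iterations, calls):
    """Outcomes of a loop-closing branch: taken (iterations - 1) times,
    then not taken once on exit, repeated for each call of the function."""
    return ([True] * (iterations - 1) + [False]) * calls


# 65 iterations per call, 10 calls in a tight timing loop.
print(count_mispredictions(loop_branch_outcomes(65, 10)))
```

In this toy model the loop exit is mispredicted once per call, so the
cost is periodic and fixed per call. A predictor with history could in
principle learn the exit for some loop lengths and code layouts and not
others, which is the kind of fragile behaviour I am guessing at above.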

I'd call this an "amazing coincidence" which makes the branch
predictor better than average for this particular case. It seems
too unstable to be of any use...

Gonzalo
