On Mon, May 4, 2009 at 11:27 AM, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
>
> Hi
>
> I've been playing with some assembler for the Intel Core2 chips and have come
> across this timing oddity which I can't explain. Any ideas?

Maybe it's to do with the branch predictor?

Remarks:

1. It seems to me that this starts happening when the loop runs at least 65 iterations (the first case to be "penalized" is n=261).

2. Swapping the lines "ja case0" and "je case1" seems to kill the advantage, so it looks like the not-taken "jnc" and the taken "ja" are somehow linked (they both depend just on %r8).

3. The entry for "case0" seems to be aligned at 32*n+16, and using ALIGN(32) seems to kill the advantage.

4. If any load or store (other than the pops, I guess) happens in that code, the difference disappears...

5. It also seems to be related to the length of "case0": adding two dummy movs (say, mov %r9,%r9 twice) kills the difference, as if the way the pipeline is filled *up to* the "ret" was important...

It seems the speed test is actually calling this function in a very tight loop, right? Maybe that's involved... Calling the function twice in the tight loop seems to change things a bit.

Could the high "penalty" be due to periodic branch mispredictions in the loop, which disappear precisely when "case0" is taken?

I'd call this an "amazing coincidence" which makes the branch predictor better than average for this particular case. It seems too unstable to be of any use...

Gonzalo