On Friday 24 December 2010 08:16:38 Cactus wrote:
> On Dec 24, 5:58 am, Jason <ja...@njkfrudils.plus.com> wrote:
> > Hi
> > 
> > The new mul_1 for nehalem is in trunk; it runs at a measured 3390,
> > i.e. 3.333 c/l. Note it is very sensitive to the feed-in/wind-down
> > code; notice the spurious instructions needed :(
> > 
> > Jason
> > 
> > On Friday 24 December 2010 04:19:11 Jason wrote:
> > > I should have said the below is for the nehalem/westmere only.
> > > 
> > > On Friday 24 December 2010 04:18:26 jason wrote:
> > > > Hi, now that I have more accurate timings, here are the real
> > > > changes made from mpir-2.2 to the upcoming mpir-2.3:
> > > > 
> > > > popcount 1310 to 1066, i.e. 1.25 c/l at 4-way to 1.0 c/l at 6-way
> > > > hamdist  2036 to 2040, i.e. 2.0 c/l at 4-way to 2.0 c/l at 2-way
> > > > mul_1    3779 to 3610, i.e. 3.75 c/l at 4-way to 3.563 c/l at 3-way
> > > > mul_2    7961 to 7172, i.e. 7.9 c/l at 3-way to 7.1 c/l at 3-way
> > > > 
> > > > The popcount and hamdist are as before, but the mul_1,2 are showing
> > > > some bit rot; in light of the better timings I'll give them another
> > > > go.
> > > > 
> > > > Jason
> > > > 
> > > > On Dec 22, 5:57 pm, Jason <ja...@njkfrudils.plus.com> wrote:
> > > > > On Wednesday 22 December 2010 10:15:39 Cactus wrote:
> > > > > > On Dec 22, 9:08 am, Jason <ja...@njkfrudils.plus.com> wrote:
> > > > > > > Hi
> > > > > > > 
> > > > > > > In trunk there is a new mpn_mul_2 for the nehalem/westmere.
> > > > > > > The old one ran at (a measured) 7.59 c/l and the new one at
> > > > > > > 6.84 c/l, about a 10% speed-up; the optimum would be 6.0 c/l
> > > > > > > (bound by add latency), which would give a measured 5.87 c/l.
> > > > > > > I'm going to try adding a cpuid serializing instruction to
> > > > > > > our timing code to see if we can get proper timing for the
> > > > > > > nehalem. Note: this new function is VERY sensitive to the
> > > > > > > exact feed-in and wind-down code; it's a right old PITA. If
> > > > > > > only I could put the pipelines in a known state at the start
> > > > > > > of the function, or time it with the exact feed-in code.
> > > > > > 
> > > > > > Hi Jason,
> > > > > > 
> > > > > > I have added it to the nehalem x64 builds on Windows.
> > > > > > 
> > > > > > Of course, the feed-in/out code is different, so it's quite
> > > > > > possible that this will interfere with the optimisation.
> > > > > > 
> > > > > >     Brian
> > > > > 
> > > > > It seems we already use cpuid to serialize; however, turning off
> > > > > turbo-boost in the BIOS solves it.
> > > > > 
> > > > > with turbo boost
> > > > > 
> > > > > ./speed  -c -s 1000 mpn_add_n
> > > > > overhead 6.00 cycles, precision 1000000 units of 3.75e-10 secs, CPU
> > > > > freq 2664.58 MHz
> > > > > 
> > > > >             mpn_add_n
> > > > > 
> > > > > 1000          1933.00
> > > > > 
> > > > > and with turbo-boost turned off
> > > > > 
> > > > > ./speed  -c -s 1000 mpn_add_n
> > > > > overhead 6.00 cycles, precision 1000000 units of 3.75e-10 secs, CPU
> > > > > freq 2664.58 MHz
> > > > > 
> > > > >             mpn_add_n
> > > > > 
> > > > > 1000          2030.00
> > > > > 
> > > > > Clearly rdtsc counts the base clock, and if one core is boosted
> > > > > rdtsc still counts the base clock, giving impossible answers. I
> > > > > think I'll leave my BIOS with turbo-boost switched off; accurate
> > > > > answers are far more important than a 5% speedup.
> > > > > 
> > > > > Jason
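
For reference, the cpuid serialization mentioned above comes down to
something like the sketch below. This is only a rough C illustration
assuming GCC/Clang inline asm on x86-64, not our actual speed/tune code:

#include <stdint.h>

/* Read the time-stamp counter with a cpuid fence in front of it, so all
   earlier instructions retire before rdtsc executes.  rdtsc ticks at the
   base clock, which is exactly why turbo-boost makes the numbers lie. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "cpuid\n\t"              /* serializing instruction            */
        "rdtsc"                  /* counter into edx:eax               */
        : "=a"(lo), "=d"(hi)
        : "a"(0)                 /* cpuid leaf 0; any leaf serializes  */
        : "rbx", "rcx");         /* cpuid also clobbers ebx and ecx    */
    return ((uint64_t)hi << 32) | lo;
}

A measurement is then two reads around the routine minus the measured
overhead, which is roughly what ./speed reports above.
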
> 
> I have added it to Windows but the feed in/out code is obviously
> different.
> 
> What properties should I be trying to establish in the prologue/
> epilogue code?
> 

Dunno :) I just guess it. What I do is: first find the inner loop, then
copy this loop for the wind-down (sometimes twice, depending on the degree
of pipelining), and in this wind-down delete all instructions that do
nothing or would load/store past the bounds; the same for the feed-in.
Then test it and measure the speed. Hopefully it runs at the speed I want;
if not, put back some of the "harmless" instructions and hope it works.
Science indeed...

What I think is going on is that, depending on the feed-in, the pipeline
state on entry to the loop can be different, and as the schedulers are not
perfect, the loop can run at different speeds. If we could fix the pipeline
state on entry to the loop, things would be much simpler. It's more common
as you approach the CPU bounds, as this makes the schedulers work harder.
Note: the integer pipes on the AMD are (nearly) identical and therefore
this problem is rare there, unlike AMD's MMX pipes, where the problem is
well known. What does puzzle me is that the wind-down code can also have a
big effect.
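
To make the feed-in/wind-down business concrete, here is a rough C-level
sketch of the shape of such a loop: a plain mul_1 (n >= 1), using gcc's
unsigned __int128 for the 64x64->128 product. The real code is hand-written
assembly and unrolled several ways, so this only shows the structure, not
the speed:

#include <stdint.h>

typedef uint64_t limb_t;

/* rp[0..n-1] = up[0..n-1] * v, returning the final carry limb */
limb_t mul_1_sketch(limb_t *rp, const limb_t *up, long n, limb_t v)
{
    unsigned __int128 p;
    limb_t lo, hi, cy = 0;

    /* feed-in: start the first multiply so the loop body always has one
       product already in flight */
    p  = (unsigned __int128)up[0] * v;
    lo = (limb_t)p;
    hi = (limb_t)(p >> 64);

    /* steady-state loop: issue the next multiply while the previous one
       finishes its carry propagation */
    for (long i = 1; i < n; i++) {
        p = (unsigned __int128)up[i] * v;
        rp[i - 1] = lo + cy;
        cy = (limb_t)(lo + cy < lo) + hi;   /* carry out of the add */
        lo = (limb_t)p;
        hi = (limb_t)(p >> 64);
    }

    /* wind-down: retire the product still in flight */
    rp[n - 1] = lo + cy;
    return (limb_t)(lo + cy < lo) + hi;
}

In the assembly the same idea applies, only unrolled 3- or 4-way, which is
why the feed-in has to set up several partial products and the wind-down
has to retire them without loading or storing past the ends of the
operands.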

> Given that the code context will be different anyway, is there any
> point in doing anything here?
> 

Again, I don't know; I'm just hoping for the best. We do test for all
possible memory alignments, and the code should run at the best speed for
all of them (for example, that's why mpn_sumdiff only runs at 2.8 c/l, not
2.5 c/l; I have not yet found a single code path which runs at the best
speed for all memory alignments). I was also thinking of testing the code
for all possible sizes mod the unroll-wayness, as I have noticed some
irregularities, and, like you say, we should test against all possible
starting pipeline combinations; how is the big question. I imagine we
would only do this in a few critical cases, i.e. mul_basecase. Anyway,
it's up to you if you want to give this function a go; how much time have
you got? I may be able to find a faster version, i.e. 3.0 c/l, in a few
months' time!
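
As a sketch of what "test every alignment and every size mod the
unroll-wayness" could look like: the names WAYS, MAX_OFF and the reference
loop below are purely illustrative, not anything in the tree.

#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t limb_t;
typedef limb_t (*mul_1_fn)(limb_t *, const limb_t *, long, limb_t);

#define WAYS    4            /* assumed unroll-wayness of the candidate */
#define MAX_OFF 8            /* limb offsets to try within a cache line */
#define MAX_N   (4 * WAYS)   /* covers every size residue mod WAYS      */

/* simple reference mul_1 to compare the candidate against */
static limb_t ref_mul_1(limb_t *rp, const limb_t *up, long n, limb_t v)
{
    limb_t cy = 0;
    for (long i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v + cy;
        rp[i] = (limb_t)p;
        cy = (limb_t)(p >> 64);
    }
    return cy;
}

/* run the candidate over every src/dst offset and every size residue */
static void check_all_cases(mul_1_fn candidate)
{
    static limb_t up[MAX_N + MAX_OFF], r1[MAX_N + MAX_OFF], r2[MAX_N + MAX_OFF];
    limb_t v = 0x123456789abcdef1ULL;

    for (long i = 0; i < MAX_N + MAX_OFF; i++)
        up[i] = (limb_t)(i + 1) * 0x9e3779b97f4a7c15ULL;  /* arbitrary data */

    for (int so = 0; so < MAX_OFF; so++)            /* source offset       */
        for (int dof = 0; dof < MAX_OFF; dof++)     /* destination offset  */
            for (long n = 1; n <= MAX_N; n++) {     /* every residue mod WAYS */
                limb_t c1 = candidate(r1 + dof, up + so, n, v);
                limb_t c2 = ref_mul_1(r2 + dof, up + so, n, v);
                assert(c1 == c2);
                assert(memcmp(r1 + dof, r2 + dof, n * sizeof(limb_t)) == 0);
            }
}

Timing the same grid of cases per alignment and residue, rather than just
checking correctness, would show up the irregularities I mentioned.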


>     Brian
