On Saturday 29 January 2011 18:18:06 Jason wrote:
> On Saturday 29 January 2011 11:16:08 Jason wrote:
> > On Friday 28 January 2011 14:15:28 Jason wrote:
> > > On Friday 28 January 2011 13:43:56 Jason wrote:
> > > > On Friday 28 January 2011 11:02:07 Jason wrote:
> > > > > On Friday 28 January 2011 10:55:06 jason wrote:
> > > > > > Hi
> > > > > > 
> > > > > > In trunk is a new AMD addmul_1; it runs at the same speed as
> > > > > > the old one but is smaller: the old code was 450 bytes and
> > > > > > the new code is 407 bytes. I've not tested it on a K10 yet,
> > > > > > as skynet is down, but from what I think I know of the AMD
> > > > > > chips it must run at the same speed. The Windows conversion
> > > > > > is only worth doing if the alignments/spacing are placed
> > > > > > carefully, i.e. the loop starts/ends on a 16-byte boundary
> > > > > > and jmp destinations are close enough, as determined by
> > > > > > testing :)
> > > > > > More to follow.
> > > > > > 
> > > > > > Jason
> > > > > 
> > > > > Note: the old addmul_1 also had an alternate entry point for
> > > > > inclsh_n. I don't know why we did this: if the fastest inclsh
> > > > > really is addmul_1 then we should use a macro, and if not
> > > > > (e.g. on Core2) then we should use an alternate entry point
> > > > > (or a new function).
> > > > > Note: the 450-byte count above did not include the inclsh_n
> > > > > part.
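The macro route mentioned above can be sketched in C. This is illustrative reference code, not the tuned assembly: the names `ref_addmul_1`/`ref_inclsh_n` are made up here, and 32-bit limbs are used so the sketch stays portable (the real code uses 64-bit limbs). The point is just that {rp,n} += {xp,n} << cnt is addmul_1 with multiplier 2^cnt.

```c
#include <stdint.h>

typedef uint32_t mp_limb_t;   /* 32-bit limbs for this sketch only */

/* reference addmul_1: {rp,n} += {xp,n} * m, returns the carry-out limb */
static mp_limb_t ref_addmul_1(mp_limb_t *rp, const mp_limb_t *xp,
                              int n, mp_limb_t m)
{
    uint64_t cy = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t = (uint64_t)xp[i] * m + rp[i] + cy;
        rp[i] = (mp_limb_t)t;   /* low limb of the product sum */
        cy = t >> 32;           /* carry into the next limb */
    }
    return (mp_limb_t)cy;
}

/* the macro route: inclsh_n is addmul_1 by 2^cnt (0 < cnt < 32 here) */
#define ref_inclsh_n(rp, xp, n, cnt) \
    ref_addmul_1(rp, xp, n, (mp_limb_t)1 << (cnt))
```

On a chip where the fastest inclsh is not addmul_1 (the Core2 case above), this macro would be replaced by a dedicated entry point or function instead.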
> > > > > 
> > > > > Jason
> > > > 
> > > > Attached is an AMD 4-way addmul_1. The inner loop is the same,
> > > > but instead of four cases to handle the "leftovers" we jump
> > > > into the loop; this saves quite a bit of code: it's 278 bytes.
> > > > The asymptotic speed is the same but the overhead is a bit
> > > > higher. I have not put this in trunk.
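The jump-into-the-loop trick is the same idea as Duff's device in C: compute how many leftovers there are and enter the unrolled body part-way through, instead of writing a separate feed-in case for each count. A minimal sketch, summing an array with a 4-way unrolled loop (`sum4` is an illustrative name, not code from the patch):

```c
#include <stddef.h>

/* 4-way unrolled sum; the n % 4 leftovers are handled by falling into
   the middle of the unrolled body via the switch (Duff's device) */
static long sum4(const long *p, size_t n)
{
    long s = 0;
    if (n == 0)
        return 0;
    size_t iters = (n + 3) / 4;   /* passes through the unrolled body */
    switch (n % 4) {
    case 0: do { s += *p++;
    case 3:      s += *p++;
    case 2:      s += *p++;
    case 1:      s += *p++;
            } while (--iters > 0);
    }
    return s;
}
```

In the assembly version the switch becomes a computed jump to a label inside the loop, which is where the code-size saving over four separate tail cases comes from.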
> > > > 
> > > > Jason
> > > 
> > > I should also have said that I expect I can quite easily shave
> > > some cycles and some space off it.
> > > 
> > > Attached are 3 variants of an AMD addmul_1 7-way unroll. This
> > > runs at 17/7 = 2.428 c/l (the 4-way is 2.5 c/l), a 2.9%
> > > improvement. For the reasons below I don't regard this as
> > > practical, so you will notice that no attempt has been made to
> > > optimize it or clean it up.
> > > 
> > > k8_addmul_1_7way.asm handles the leftovers in the usual way, with
> > > 7 separate cases; the problem is code size.
> > > 
> > > k8_addmul_1_7way_jmpepi.asm uses a small 7-entry jump table to
> > > branch to the 7 cases (as opposed to the above, which uses a
> > > string of cmp's and Jcc's). Code size is still a problem, and the
> > > jump table should be in a separate segment.
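The jump-table dispatch can be sketched in portable C with a table of function pointers (the assembly uses a table of branch targets instead; all names here are illustrative, and only 3 of the 7 cases are written out):

```c
#include <stddef.h>

typedef long (*leftover_fn)(const long *p);

/* one handler per leftover count; the real code has 7 of these */
static long left0(const long *p) { (void)p; return 0; }
static long left1(const long *p) { return p[0]; }
static long left2(const long *p) { return p[0] + p[1]; }

static const leftover_fn leftover_table[] = { left0, left1, left2 };

/* one indexed load plus an indirect branch replaces a whole
   string of cmp/Jcc instructions */
static long handle_leftovers(const long *p, size_t n)
{
    return leftover_table[n % 3](p);   /* n % 7 in the 7-way code */
}
```

The table itself is data rather than code, which is why the mail suggests keeping it in a separate segment.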
> > > 
> > > k8_addmul_1_7way_jmpin.asm uses a jump into the middle of the
> > > loop to handle the leftovers. This saves a lot of space, but we
> > > need to calculate size % 7, which is much easier than a general
> > > division (we could do a Hensel division, i.e. 10 cycles max), or
> > > we can use some shifting; if we assume everything fits in the L1
> > > cache we can limit the size to 4096. For now I've just done a
> > > standard slow division, and the feed-in cases are poor.
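Once the size is bounded (the 4096 limit above), size % 7 needs no divide instruction at all: multiplying by a fixed-point over-approximation of 1/7 gives the exact quotient for every n below 13107, which covers the limit comfortably. A sketch (`mod7` is an illustrative name; the bound holds because 9363/2^16 exceeds 1/7 by 5/458752, so the accumulated error stays under 1/7 while n < 458752/35):

```c
#include <stdint.h>

/* division-free n % 7, exact for all n < 13107 (so fine for n <= 4096):
   9363 = ceil(2^16 / 7), hence (n * 9363) >> 16 == n / 7 in that range */
static unsigned mod7(uint32_t n)
{
    uint32_t q = (n * 9363u) >> 16;   /* floor(n / 7) */
    return n - 7 * q;
}
```

This costs one multiply, one shift, and one multiply-subtract, well under the 10-cycle Hensel-division estimate in the mail.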
> > > 
> > > The inflexibility of the code sequence limits the scheduler and
> > > the pick hardware, so some tricks had to be used to help the chip
> > > out :)
> > > 
> > > It may be possible to improve this speed (if the tricks are good
> > > enough) by going to a larger unroll: 10-way (2.4 c/l) is
> > > possible, and 16-way (2.375 c/l) is next, but there are better
> > > ways.
> > > 
> > > Jason
> > 
> > Hi
> > 
> > Attached is an infinity-way unrolled AMD addmul_1 which runs at
> > 2.333 c/l; asymptotically this is faster than our current addmul_2
> > (which runs at 2.375 c/l). This is really proof-of-concept code at
> > the moment, as many things need to be done. It's meant for
> > mul_basecase etc., where the sizes are limited; if we keep to less
> > than 32x32 muls then it takes 23 bytes of code per limb, plus
> > overhead, plus tables (currently 16 bytes per limb; this can
> > certainly be got down to 9 or 5 bytes). I've included our standard
> > addmul_1 in it for large sizes so I can test it properly.
> > Mul_basecase is very sensitive to overheads, so this may not be an
> > improvement; I'll write a basecase on this current code and, if it
> > seems promising, I'll do it properly (reduce code size, reduce the
> > tables, check speed for all alignments and jump-in points, etc.).
> > 
> > Jason
> 
> Note: lines 313 and 314 read
> 
> adc $0,%r10d
> #adc $0,%r11d
> 
> they should use the full 64-bit registers (the 32-bit %r10d form
> would zero the upper half of the destination limb):
> 
> adc $0,%r10
> #adc $0,%r11
> 
> Jason

Hi

Well, the experimental mul_basecase is about 17% slower than our
current code at 20x20. There is room for improvement, and I expect I
can get it down to 10%, but clearly it won't compete on speed. However,
it can compete on size: the code is 1052 bytes and the data is 608
bytes (the data can be drastically reduced, by 4x say); compare this to
our current code, which is 3550 bytes. The main culprit for the speed
difference is the extreme unrolling I did for our current code, plus
some spurious overhead in the new code. It was a good exercise to
prepare for what I really think will be better.

Jason

-- 
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com.
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en.
