Re: [mpir-devel] Re: New assembler

Jason Sat, 29 Jan 2011 10:18:39 -0800

On Saturday 29 January 2011 11:16:08 Jason wrote:
> On Friday 28 January 2011 14:15:28 Jason wrote:
> > On Friday 28 January 2011 13:43:56 Jason wrote:
> > > On Friday 28 January 2011 11:02:07 Jason wrote:
> > > > On Friday 28 January 2011 10:55:06 jason wrote:
> > > > > Hi
> > > > > 
> > > > > In trunk is a new AMD addmul_1. It runs at the same speed as the
> > > > > old one but is smaller: the old code was 450 bytes and the new
> > > > > code is 407 bytes. I've not tested it on a K10 yet as skynet is
> > > > > down, but from what I think I know of the AMD chips it must run at
> > > > > the same speed. The Windows conversion is only worth doing if the
> > > > > alignments/spacing are placed carefully, i.e. the loop starts/ends
> > > > > on a 16-byte boundary and jmp destinations are close enough, as
> > > > > determined by testing :)
> > > > > More to follow.
> > > > > 
> > > > > Jason
> > > > 
> > > > Note: The old addmul_1 also had an alternate entry point for inclsh_n.
> > > > I don't know why we did this. If the fastest inclsh is really
> > > > addmul_1 then we should use a macro, and if not (i.e. core2) then we
> > > > should use an alternate entry point (or a new fn).
> > > > Note: The 450-byte count above did not include the inclsh_n part.
> > > > 
> > > > Jason
> > > 
> > > Attached is an AMD 4-way addmul_1. The inner loop is the same, but
> > > instead of four cases to handle the "leftovers" we jump into the
> > > loop. This saves quite a bit of code: it's 278 bytes. The
> > > asymptotic speed is the same but the overheads are a bit more. I have
> > > not put this in trunk.
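
For anyone who has not seen the trick, "jumping into the loop" can be
sketched in C as a switch that enters a 4-way unrolled loop part-way
through (Duff's device). This is only an illustration of the idea, not the
attached assembler: the name ref_addmul_1_jumpin is made up, limbs are
assumed to be 64 bits, and unsigned __int128 (a GCC/Clang extension) stands
in for the mul instruction.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t limb_t;   /* assumption: 64-bit limbs */

/* One limb of work: *rp += *up * v, carrying through cy.
   The 128-bit sum cannot overflow: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
#define STEP() do {                                                   \
        unsigned __int128 t = (unsigned __int128)(*up++) * v + *rp + cy; \
        *rp++ = (limb_t)t;                                            \
        cy = (limb_t)(t >> 64);                                       \
    } while (0)

/* Hypothetical reference routine: rp[0..n-1] += up[0..n-1] * v,
   returning the carry-out limb. */
limb_t ref_addmul_1_jumpin(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    size_t passes;

    if (n == 0)
        return 0;

    passes = (n + 3) / 4;          /* the first pass may be a partial one */
    switch (n % 4) {               /* leftovers pick the entry point */
    case 0: do { STEP();
    case 3:      STEP();
    case 2:      STEP();
    case 1:      STEP();
            } while (--passes != 0);
    }
    return cy;
}
```

The single loop body serves every value of n % 4, which is where the code
size goes down; the price is the extra work to pick the entry point before
the first limb is processed.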
> > > 
> > > Jason
> > 
> > I should have also said: I expect I can quite easily shave some cycles
> > off it, and some space.
> > 
> > Attached are 3 variants of an AMD addmul_1 7-way unroll. This runs at
> > 17/7=2.428c/l (4-way is 2.5c/l), a 2.9% improvement. For the reasons
> > below I don't regard this as practical, so you will notice that no attempt
> > has been made to optimize it or clean it up.
> > 
> > k8_addmul_1_7way.asm is the usual way of handling the leftovers by having
> > 7 cases. The problem is code size.
> > 
> > k8_addmul_1_7way_jmpepi.asm uses a small 7-entry jump table to branch to
> > the 7 cases (as opposed to the above, which uses a string of cmp's and
> > Jcc's). Code size is still a problem, and the jump table should be in
> > a separate segment.
> > 
> > k8_addmul_1_7way_jmpin.asm uses the jump-into-the-middle-of-the-loop
> > approach to handle the leftovers. This saves a lot of space, but we
> > need to calculate size%7. This is much easier than a general division
> > (we could do a Hensel division, i.e. 10 cycles max, or some shifting),
> > and if we assume L1 cache then we can limit the size to 4096. I've just
> > done a standard slow division, and the feed-in cases are poor.
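
One way the "some shifting" option could look, sketched in C rather than
assembler. Because 8 = 1 (mod 7), a number is congruent mod 7 to the sum
of its 3-bit digits, so for sizes below 4096 (12 bits) a few shifts, masks
and adds replace the division. The function name mod7_small is made up for
illustration, and this is not necessarily what a tuned version of the
attached file would do.

```c
/* Hypothetical helper: n % 7 without a divide, valid only for n < 4096. */
unsigned mod7_small(unsigned n)
{
    /* Sum the four 3-bit digits of n; the sum is at most 4 * 7 = 28. */
    unsigned s = (n & 7) + ((n >> 3) & 7) + ((n >> 6) & 7) + (n >> 9);

    s = (s & 7) + (s >> 3);    /* fold again: now at most 9 */
    s = (s & 7) + (s >> 3);    /* and again: now at most 7 */

    return (s == 7) ? 0 : s;   /* 7 is congruent to 0 */
}
```

The 4096 limit is what keeps the number of folding steps fixed; a larger
bound would need more digits in the first sum.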
> > 
> > The inflexibility of the code sequence limits the scheduler and pick
> > hardware, so some tricks had to be used to help the chip out :)
> > 
> > It may be possible to improve this speed (if the tricks are good
> > enough) by going to a larger unroll. 10-way (2.4c/l) is possible, and
> > 16-way (2.375c/l) is the next, but there are better ways.
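
The four figures quoted so far (2.5, 2.428, 2.4 and 2.375 c/l for 4-, 7-,
10- and 16-way) are all reproduced by ceil(7n/3)/n for an n-way unroll.
The sketch below just does that arithmetic. The model behind it (7 units
of work per limb, 3 retired per cycle, a whole number of cycles per loop
pass) is an assumption inferred from the numbers, not something stated in
this thread, and the function names are made up.

```c
#include <stdio.h>

/* Assumed model: cycles for one pass of an n-way unrolled loop. */
unsigned loop_cycles(unsigned n_way)
{
    return (7u * n_way + 2u) / 3u;     /* integer form of ceil(7*n_way/3) */
}

/* Cycles per limb under the same assumed model. */
double cycles_per_limb(unsigned n_way)
{
    return (double)loop_cycles(n_way) / (double)n_way;
}

/* Print the unroll factors mentioned in the thread. */
void print_unroll_table(void)
{
    static const unsigned ways[] = { 4, 7, 10, 16 };
    size_t i;

    for (i = 0; i < sizeof ways / sizeof ways[0]; i++)
        printf("%2u-way: %2u cycles per pass, %.3f c/l\n",
               ways[i], loop_cycles(ways[i]), cycles_per_limb(ways[i]));
}
```

Under this model the 7-way loop takes 17 cycles per pass against 10 for
the 4-way, and (2.5 - 17/7)/2.5 is about 2.9%, which agrees with the
improvement quoted. The ratio can never drop below 7/3 = 2.333 c/l.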
> > 
> > Jason
> 
> Hi
> 
> Attached is an AMD addmul_1, infinity-way unrolled, which runs at 2.333c/l.
> Asymptotically this is faster than our current addmul_2 (runs at 2.375c/l).
> This is really proof-of-concept code at the moment as many things need to
> be done. It's meant for mul_basecase etc. where the sizes are limited. If
> we keep to less than 32x32 mul's then it takes 23 bytes of code per limb
> plus overhead plus tables (currently 16 bytes per limb; this can certainly
> come down to 9 or 5 bytes). I've included our standard addmul_1 in it for
> large sizes so I can test it properly. Mul basecase is very sensitive to
> overheads so this may not be an improvement. I'll write a basecase on
> this current code and if it seems promising I'll do it properly (reduce
> code size, reduce tables, check speed for all alignments and jump-in points
> etc.).
> 
> Jason


Note: lines 313 and 314 read

adc $0,%r10d
#adc $0,%r11d

they should be

adc $0,%r10
#adc $0,%r11
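
Why the operand size matters here: on x86-64 an instruction that writes a
32-bit register such as %r10d zero-extends the result into the full 64-bit
register, so the 32-bit adc throws away the upper half of the limb as well
as adding the carry at the wrong width. A small C emulation of the two
forms (the function names are made up for illustration):

```c
#include <stdint.h>

/* Models "adc $0,%r10d": 32-bit add of the carry flag, with the result
   zero-extended into the 64-bit register. */
uint64_t adc0_32(uint64_t r10, unsigned cf)
{
    return (uint32_t)((uint32_t)r10 + cf);
}

/* Models "adc $0,%r10": full 64-bit add of the carry flag. */
uint64_t adc0_64(uint64_t r10, unsigned cf)
{
    return r10 + cf;
}
```

For example, with r10 = 0x100000000 and the carry flag set, the 32-bit
form leaves 1 in the register while the 64-bit form leaves 0x100000001.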

Jason

