On Saturday 29 January 2011 11:16:08 Jason wrote:
> On Friday 28 January 2011 14:15:28 Jason wrote:
> > On Friday 28 January 2011 13:43:56 Jason wrote:
> > > On Friday 28 January 2011 11:02:07 Jason wrote:
> > > > On Friday 28 January 2011 10:55:06 jason wrote:
> > > > > Hi
> > > > >
> > > > > In trunk is a new AMD addmul_1; this runs at the same speed as the
> > > > > old one, but is smaller. The old code was 450 bytes and the new
> > > > > code is 407 bytes. I've not tested it on a K10 yet as skynet is
> > > > > down, but from what I think I know of the AMD chips it must run at
> > > > > the same speed. The Windows conversion is only worth doing if the
> > > > > alignments/spacing are placed carefully, i.e. the loop starts/ends
> > > > > on a 16-byte boundary and jmp destinations are close enough,
> > > > > defined by testing :)
> > > > > More to follow.
> > > > >
> > > > > Jason
> > > >
> > > > Note: the old addmul_1 also had an alternate entry point for
> > > > inclsh_n. I don't know why we did this: if the fastest inclsh really
> > > > is addmul_1 then we should use a macro, and if not (e.g. Core 2)
> > > > then we should use an alternate entry point (or a new function).
> > > > Note: the 450-byte count above did not include the inclsh_n part.
> > > >
> > > > Jason
> > >
> > > Attached is an AMD 4-way addmul_1. The inner loop is the same, but
> > > instead of four cases to handle the "leftovers" we jump into the
> > > loop; this saves quite a bit of code, it's 278 bytes. The asymptotic
> > > speed is the same but the overheads are a bit more. I have not put
> > > this in trunk.
> > >
> > > Jason
> >
> > I should have also said that I expect I can quite easily shave some
> > cycles and some space off it.
> >
> > Attached are 3 variants of an AMD addmul_1 7-way unroll.
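As a point of reference for readers (this is not the attached assembly, just the semantics it implements): addmul_1 computes rp[] += sp[] * v over n limbs and returns the carry-out limb. Assuming inclsh_n means rp[] += sp[] << c, which the addmul_1 connection above suggests, it is just addmul_1 with v = 1 << c, which is why a macro would suffice if addmul_1 really is the fastest route. A minimal portable sketch, assuming a compiler with unsigned __int128 (GCC/Clang on x86-64):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t mp_limb_t;

/* Reference addmul_1: rp[0..n) += sp[0..n) * v, returning the carry-out
   limb. The real MPIR routine is hand-scheduled x86-64 asm; this only
   pins down the semantics. unsigned __int128 supplies the 64x64->128
   multiply. */
static mp_limb_t ref_addmul_1(mp_limb_t *rp, const mp_limb_t *sp,
                              size_t n, mp_limb_t v)
{
    mp_limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)sp[i] * v + rp[i] + cy;
        rp[i] = (mp_limb_t)t;
        cy    = (mp_limb_t)(t >> 64);
    }
    return cy;
}

/* inclsh_n as a macro over addmul_1, per the note above (hypothetical
   name and meaning: rp[] += sp[] << c, for 0 < c < 64). */
#define ref_inclsh_n(rp, sp, n, c) \
    ref_addmul_1((rp), (sp), (n), (mp_limb_t)1 << (c))
```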
> > This runs at 17/7 = 2.428 c/l (the 4-way is 2.5 c/l), a 2.9%
> > improvement. For the reasons below I don't regard this as practical,
> > so you will notice that no attempt has been made to optimize it or
> > clean it up.
> >
> > k8_addmul_1_7way.asm is the usual way of handling the leftovers, with
> > 7 cases; the problem is code size.
> >
> > k8_addmul_1_7way_jmpepi.asm uses a small 7-entry jump table to branch
> > to the 7 cases (as opposed to the above, which uses a string of cmp's
> > and Jcc's); code size is still a problem, and the jump table should be
> > in a separate segment.
> >
> > k8_addmul_1_7way_jmpin.asm uses a jump-into-the-middle-of-the-loop
> > approach to handle the leftovers. This saves a lot of space, but we
> > need to calculate size % 7. This is much easier than a general
> > division (we could do a Hensel div, i.e. 10 cycles max) or some
> > shifting; if we assume L1 cache then we can limit the size to 4096.
> > I've just done a standard slow division, and the feed-in cases are
> > poor.
> >
> > The inflexibility of the code sequence limits the scheduler and pick
> > hardware, so some tricks had to be used to help the chip out :)
> >
> > It may be possible to improve this speed (if the tricks are good
> > enough) by going to a larger unroll: 10-way (2.4 c/l) is possible, and
> > 16-way (2.375 c/l) is the next, but there are better ways.
> >
> > Jason
>
> Hi
>
> Attached is an AMD addmul_1 infinity-way unrolled which runs at
> 2.333 c/l; asymptotically this is faster than our current addmul_2
> (which runs at 2.375 c/l). This is really proof-of-concept code at the
> moment, as many things need to be done. It's meant for mul_basecase
> etc. where the sizes are limited: if we keep to mul's of at most 32x32
> then it takes 23 bytes of code per limb, plus overhead, plus tables
> (currently 16 bytes per limb; we can certainly get this down to 9 or 5
> bytes). I've included our standard addmul_1 in it for large sizes so I
> can test it properly.
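One divide-free way to get size % 7 under the 4096 limit mentioned above is a reciprocal multiply. This is an illustration of the idea, not what the attached asm does (the post names a Hensel div as its candidate): multiply by a rounded-up reciprocal of 7 and subtract off the quotient.

```c
#include <assert.h>

/* n % 7 for 0 <= n < 4096, with no division instruction.
   floor(n/7) == (n * 9363) >> 16 in this range, where
   9363 = ceil(2^16 / 7); the rounding error 5n / (7 * 2^16) stays
   small enough for n < 4096 that the floor is always exact. */
static unsigned mod7_small(unsigned n)
{
    unsigned q = (n * 9363u) >> 16;   /* q == n / 7 for n < 4096 */
    return n - 7u * q;                /* the remainder */
}
```

On K8-era hardware this is a couple of multiplies and a subtract, far cheaper than div; the same construction works for any small constant modulus by adjusting the reciprocal and shift.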
> Mul_basecase is very sensitive to overheads, so this may not be an
> improvement. I'll write a basecase on this current code and, if it
> seems promising, I'll do it properly (reduce code size, reduce tables,
> check speed for all alignments and jump-in points, etc.).
>
> Jason
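In C terms, the jump-into-the-loop feed-in used above is Duff's device. A sketch on a carry-free 4-way vector add (carry-free so the switch fall-through stays legal C; the real addmul_1 must additionally have the carry chain and pointer offsets valid at every entry point, which is what makes the asm feed-in cases fiddly):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 4-way unrolled rp[] += sp[] over n words, entering mid-loop to absorb
   the n % 4 leftovers in one pass (Duff's device). Illustrative only:
   no carry propagation, unlike addmul_1. */
static void add_n_jmpin(uint64_t *rp, const uint64_t *sp, size_t n)
{
    if (n == 0) return;
    size_t iters = (n + 3) / 4;       /* trips through the whole loop */
    switch (n % 4) {                  /* "jump into the middle" */
    case 0: do { *rp++ += *sp++;      /* each case falls through to the */
    case 3:      *rp++ += *sp++;      /* remaining unrolled slots       */
    case 2:      *rp++ += *sp++;
    case 1:      *rp++ += *sp++;
            } while (--iters);
    }
}
```

The space saving is the same as in the asm: one copy of the unrolled body serves all residues, at the price of computing n % 4 (or n % 7 for a 7-way unroll) before entry.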
Note: lines 313,314 read

    adc $0,%r10d
    #adc $0,%r11d

but they should be

    adc $0,%r10
    #adc $0,%r11

Jason

--
You received this message because you are subscribed to the Google Groups
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com.
To unsubscribe from this group, send email to
mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at
http://groups.google.com/group/mpir-devel?hl=en.