On Saturday 29 January 2011 18:18:06 Jason wrote:
> On Saturday 29 January 2011 11:16:08 Jason wrote:
> > On Friday 28 January 2011 14:15:28 Jason wrote:
> > > On Friday 28 January 2011 13:43:56 Jason wrote:
> > > > On Friday 28 January 2011 11:02:07 Jason wrote:
> > > > > On Friday 28 January 2011 10:55:06 jason wrote:
> > > > > > Hi
> > > > > >
> > > > > > In trunk is a new AMD addmul_1; this runs at the same speed as the old one, but is smaller: the old code was 450 bytes and the new code is 407 bytes. I've not tested it on a K10 yet, as skynet is down, but from what I think I know of the AMD chips it must run at the same speed. The Windows conversion is only worth doing if the alignments/spacing are placed carefully, i.e. the loop starts/ends on a 16-byte boundary and jmp destinations are close enough, as determined by testing :)
> > > > > > More to follow.
> > > > > >
> > > > > > Jason
> > > > >
> > > > > Note: the old addmul_1 also had an alternate entry point for inclsh_n. I don't know why we did this; if the fastest inclsh really is addmul_1 then we should use a macro, and if not (as on Core2) then we should use an alternate entry point (or a new function).
> > > > > Note: the 450-byte count above did not include the inclsh_n part.
> > > > >
> > > > > Jason
> > > >
> > > > Attached is an AMD 4-way addmul_1. The inner loop is the same, but instead of four cases to handle the "leftovers" we jump into the loop; this saves quite a bit of code: it's 278 bytes. The asymptotic speed is the same but the overheads are a bit higher. I have not put this in trunk.
> > > >
> > > > Jason
> > >
> > > I should have also said that I expect I can quite easily shave some cycles off it, and some space.
> > >
> > > Attached are 3 variants of an AMD addmul_1 7-way unroll.
> > > This runs at 17/7 = 2.428 c/l (4-way is 2.5 c/l), a 2.9% improvement. For the reasons below I don't regard this as practical, so you will notice that no attempt has been made to optimize it or clean it up.
> > >
> > > k8_addmul_1_7way.asm is the usual way of handling the leftovers, with 7 separate cases; the problem is code size.
> > >
> > > k8_addmul_1_7way_jmpepi.asm uses a small 7-entry jump table to branch to the 7 cases (as opposed to the above, which uses a string of cmp's and Jcc's); code size is still a problem, and the jump table should be in a separate segment.
> > >
> > > k8_addmul_1_7way_jmpin.asm jumps into the middle of the loop to handle the leftovers. This saves a lot of space, but we need to calculate size % 7. This is much easier than a general division (we could do a Hensel division, i.e. 10 cycles max), or with some shifting; and if we assume L1 cache then we can limit the size to 4096. I've just done a standard slow division, and the feed-in cases are poor.
> > >
> > > The inflexibility of the code sequence limits the scheduler and pick hardware, so some tricks had to be used to help the chip out :)
> > >
> > > It may be possible to improve this speed (if the tricks are good enough) by going to a larger unroll; 10-way (2.4 c/l) is possible, and 16-way (2.375 c/l) is next, but there are better ways.
> > >
> > > Jason
> >
> > Hi
> >
> > Attached is an AMD addmul_1 unrolled infinity-way, which runs at 2.333 c/l; asymptotically this is faster than our current addmul_2 (which runs at 2.375 c/l). This is really proof-of-concept code at the moment, as many things need to be done. It's meant for mul_basecase etc., where the sizes are limited; if we keep to mul's of less than 32x32 then it takes 23 bytes of code per limb, plus overhead, plus tables (currently 16 bytes per limb; I can certainly get this down to 9 or 5 bytes).
> > I've included our standard addmul_1 in it for large sizes so I can test it properly. mul_basecase is very sensitive to overheads, so this may not be an improvement; I'll write a basecase on the current code, and if it seems promising I'll do it properly (reduce code size, reduce tables, check speed for all alignments and jump-in points, etc.).
> >
> > Jason
>
> Note: lines 313,314 read
>
> adc $0,%r10d
> #adc $0,%r11d
>
> they should be
>
> adc $0,%r10
> #adc $0,%r11
>
> Jason
Hi

Well, the experimental mul_basecase is about 17% slower than our current code at 20x20. There is room for improvement and I expect I can get it down to 10%, but clearly it won't compete on speed. However, it can compete on size: the code is 1052 bytes and the data is 608 bytes (this can be drastically reduced, say by 4x); compare this to our current code, which is 3550 bytes. The main culprit for the speed difference is the extreme unrolling I did for our current code, plus some spurious overhead in the new code. It was a good exercise to prepare for what I really think will be better.

Jason

--
You received this message because you are subscribed to the Google Groups "mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com.
To unsubscribe from this group, send email to mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en.