Marius Hillenbrand <mhil...@linux.ibm.com> writes:

> z14 introduced "alignment hints" for vector loads, where 8-byte aligned reads have more bandwidth (e.g., "vl %v<dst>,<addr>,3" # 3 for 8-byte alignment, 4 for 16-byte alignment). vlerg does not take these hints. Empirically, I observe a slight advantage for vlerg nonetheless (~5%).

How does z13 interpret these hints? Ignore them, I hope!

> My 8x unrolled addmul_1 with extensive pipelining gets to within ~20% of mlgr throughput. Though, my current implementation only applies for >= 18 limbs (8 lead-in, 8x-unrolled, 2 limbs wrap-up) -- not very useful for addmul_1, besides making the case for going for addmul_2.

~20% is very good, indeed.

I see that the price is deep static instruction scheduling. That's sometimes necessary, but it often adds significant O(1) overhead.

If we're really crazy, we could do what I sometimes refer to as overlapped software pipelining. By that, I mean that the outer loop of e.g. mul_basecase combines the inner loop's wind-down code for outer loop iteration j with the inner loop's feed-in code of iteration j+1.

But code complexity will probably be lower with addmul_2 or some such, as we then have an inner loop with more inherent parallelism.

> I'm looking at the addmul_2/3/4 variants, exploring parameters. Given that mpn_mul requires s1n >= s2n, mul_basecase will always call any variant of addmul_k with n > k (if I read the code correctly). Is that an assumption that addmul_k can make in general?

Yes, it is. Or more correctly, n >= k.

Note a trickiness of mul_k which has bitten me before: It is allowed to have the same source and destination, i.e., mul_2 (ap, ap, n, bp). My mul_2 code therefore preloads from ap in a manner which my addmul_2 does not. Of course, any slight software pipelining tends to take care of this problem automatically. And tests/devel/try knows how to trip code which gets this wrong.

I would suggest that you concentrate on an addmul_2 to see how close you can get to mlgr's throughput without making its software pipeline overly deep. Going to addmul_k for larger k tends to give diminishing returns, and furthermore requires mul_1, mul_2, ..., mul_(k-1) in order to create the *_basecase functions.

I made my addmul_2 actually work for all limb counts. (I think it even unnecessarily handles n = 1.) Code attached.
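
To illustrate the mul_2 (ap, ap, n, bp) overlap issue, here is a minimal sketch in portable C rather than z14 assembly. It is not the attached code and not GMP's implementation. All function names are hypothetical, it assumes 32-bit limbs so that a 64-bit integer holds a full product, and it assumes the convention that mul_2 stores n+1 limbs at rp and returns the top limb. The two-pass variant breaks when rp == ap because its first pass overwrites the source, while the fused variant loads ap[i] before storing rp[i] and carries the previous limb in a local.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] = ap[0..n-1] * b, returns the carry-out limb. */
static uint32_t ref_mul_1(uint32_t *rp, const uint32_t *ap, size_t n, uint32_t b)
{
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)ap[i] * b + cy;
        rp[i] = (uint32_t)p;
        cy = (uint32_t)(p >> 32);
    }
    return cy;
}

/* rp[0..n-1] += ap[0..n-1] * b, returns the carry-out limb. */
static uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *ap, size_t n, uint32_t b)
{
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)ap[i] * b + rp[i] + cy;
        rp[i] = (uint32_t)p;
        cy = (uint32_t)(p >> 32);
    }
    return cy;
}

/* Two passes over ap: correct only if rp and ap do not overlap. */
static uint32_t mul_2_two_pass(uint32_t *rp, const uint32_t *ap, size_t n,
                               const uint32_t *bp)
{
    rp[n] = ref_mul_1(rp, ap, n, bp[0]);
    return ref_addmul_1(rp + 1, ap, n, bp[1]);
}

/* One fused pass: ap[i] is loaded before rp[i] is stored, and the
   previous source limb is kept in a local, so rp == ap is safe. */
static uint32_t mul_2_fused(uint32_t *rp, const uint32_t *ap, size_t n,
                            const uint32_t *bp)
{
    uint32_t b0 = bp[0], b1 = bp[1];
    uint32_t prev = 0;            /* ap[i-1], zero before the first limb */
    uint32_t cy0 = 0, cy1 = 0;    /* carries of the b0 row and the b1 row */
    for (size_t i = 0; i < n; i++) {
        uint32_t a = ap[i];       /* load before the store to rp[i] */
        uint64_t p0 = (uint64_t)a * b0 + cy0;
        cy0 = (uint32_t)(p0 >> 32);
        uint64_t p1 = (uint64_t)prev * b1 + (uint32_t)p0 + cy1;
        cy1 = (uint32_t)(p1 >> 32);
        rp[i] = (uint32_t)p1;
        prev = a;
    }
    uint64_t top = (uint64_t)prev * b1 + cy0 + cy1;
    rp[n] = (uint32_t)top;
    return (uint32_t)(top >> 32);
}

/* Compare the fused in-place result against the two-pass result
   computed into a separate buffer. */
static void check_mul_2_overlap(void)
{
    const uint32_t a[3] = {0xFFFFFFFFu, 0xFFFFFFFFu, 0x12345678u};
    const uint32_t b[2] = {0xFFFFFFFFu, 0xFFFFFFFEu};
    uint32_t want[4], buf[4] = {a[0], a[1], a[2], 0};
    uint32_t want_hi = mul_2_two_pass(want, a, 3, b);
    uint32_t got_hi = mul_2_fused(buf, buf, 3, b);
    assert(got_hi == want_hi);
    for (size_t i = 0; i < 4; i++)
        assert(buf[i] == want[i]);
}
```

Calling check_mul_2_overlap() from a test driver exercises the in-place case. A real implementation would get the same effect from its software pipeline, as noted above, rather than from an explicit local.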
z14-addmul_2.asm
Description: Binary data
-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel