Hi,

On 3/9/21 11:06 PM, Torbjörn Granlund wrote:
> Marius Hillenbrand <mhil...@linux.ibm.com> writes:
[...]
>   That absolutely makes sense. When I wrote my patches initially, it was
>   not yet clear that it is worthwhile to differentiate.
>
> It is not clear, but it does not hurt to make config.guess be accurate,
> and then treat the CPUs the same way. In my experience, people can get
> confused when GMP claims they have CPU foo-k when they actually have
> foo-(k+1).
OK, agreed. I had vlerg/vsterg vs vpdi in mind.

> I tried vlerg on the system here, and it works fine. Very little timing
> difference though, but then again I didn't try very hard.
>
> I am not aware of any timing differences between z13, z14, z15 for the
> L1 cache-hit cases. Are there any? And the only GMP-relevant ISA
> difference of which I am aware is the presence of vlerg in z15.

z14 introduced "alignment hints" for vector loads, where 8-byte-aligned
reads have more bandwidth (e.g., "vl %v<dst>,<addr>,3" -- 3 for 8-byte
alignment, 4 for 16-byte alignment). vlerg does not take these hints.
Empirically, I observe a slight advantage for vlerg nonetheless (~5%).

> How's it going with the various addmul_k variants? My completely
> non-scheduled addmul_2 seems to run 37% slower than the mlgr throughput.
> That's not bad. Some fiddling around with the schedule got me to just
> 25% slower. That was with 2x unrolling. I haven't tried anything
> sophisticated.

That is very good news.

> How far is your best addmul_1 from mlgr's throughput?

My 8x-unrolled addmul_1 with extensive pipelining gets to within ~20% of
mlgr throughput. However, my current implementation only applies for
18 or more limbs (8 lead-in, 8x-unrolled, 2 limbs wrap-up) -- not very
useful for addmul_1, besides making the case for going to addmul_2.

The loop is unrolled for 8 multiplications (4 "limb pairs" of 128 bits
each) and looks like this (simplified):

for (...) {
    LOAD(limb pair 1);
    MULT(0); SECOND_ADD(1); FIRST_ADD(2);
    WRITEBACK(0);   /* from previous iteration or lead-in */
    VLVGP(0);

    LOAD(2);
    MULT(1); SECOND_ADD(2); FIRST_ADD(3);
    WRITEBACK(1);
    VLVGP(1);

    LOAD(3);
    MULT(2); WRITEBACK(2);
    SECOND_ADD(3); FIRST_ADD(0);   /* from mult and vlvgp at top */

    /* ... and so on ... */
}

> I believe it to be possible to get pretty close to mlgr's throughput, if
> not by any other means by going to addmul_k for k > 2.
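For anyone following along, the operation the scheduled assembly above implements can be sketched in portable C. This is only a hedged reference sketch of mpn_addmul_1 semantics (rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb), not GMP's actual code; the function and type names are mine, and it assumes the GCC/Clang unsigned __int128 extension to model the 64x64->128-bit multiply that a single mlgr performs:

```c
#include <stdint.h>

typedef uint64_t limb_t;

/* Reference sketch: rp[0..n-1] += up[0..n-1] * v, returns carry-out.
 * The 128-bit product mirrors what one mlgr computes on z/Architecture;
 * the real loop pipelines 8 of these per iteration. */
static limb_t
addmul_1_ref(limb_t *rp, const limb_t *up, long n, limb_t v)
{
    limb_t carry = 0;
    for (long i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128) up[i] * v;
        p += rp[i];     /* "first add"  */
        p += carry;     /* "second add" */
        rp[i] = (limb_t) p;
        carry = (limb_t) (p >> 64);
    }
    return carry;
}
```

The two adds per limb correspond to the FIRST_ADD/SECOND_ADD slots in the unrolled loop; the scheduling challenge is hiding their latency behind the next iteration's multiplies.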
> I think 8-way
> addmul_1 makes little sense, but I think 2-way or 4-way addmul_2, or
> 2-way addmul_3 or 2-way addmul_4 does make sense if they run close to
> mlgr's throughput.

I'm looking at the addmul_2/3/4 variants, exploring parameters.

Given that mpn_mul requires s1n >= s2n, mul_basecase will always call any
variant of addmul_k with n > k (if I read the code correctly). Is that an
assumption that addmul_k can make in general?

Marius

--
Marius Hillenbrand
Linux on Z development

IBM Deutschland Research & Development GmbH
Vors. des Aufsichtsrats: Gregor Pillen / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel