I created an "ununrolled" version and a 4x unrolled version. I then compared these with some other variants. Here are the results:
        mul_1   addmul_1   addaddmul   result   best variant
  zen1  2       2          2           =        all equal (saturated mul)
  zen2  1.7     2.1        2.8         +        4x unrolled
  zen3  1.5     1.5        3.9         -        1x/2x/4x unrolled
  bwl   1.5     1.7        3.1         +        1x
  skl   1.5     1.7        3.1         +        1x
  rkl   1.1     1.6        2.7         =        1x

The only CPU which sees great improvements is zen2. Both bwl and skl see very minor improvements. I don't see your 20% figure for bwl.

The results on zen3 are really poor. I believe this CPU has quite some cost for indexed memrefs. I think that's also true for bwl and perhaps skl, even if the new code runs OK there. We might want to produce variants which use plainer memory ops, i.e., code which updates base registers with lea. That will require unrolling to at least 4x. (The present addmul_1 code which is used for bwl, skl, rkl, and zen3 is 8x unrolled without indexed memrefs. It was loopmixed for bwl, but runs surprisingly well elsewhere.)

We really should move the present code to the k8 subdir. It is a major slowdown everywhere I have tested except k8 and k10. (I have not tested bd1 through bd4, but they have k8 in their asm paths, which might be good or bad.)

If we want to pursue this more, this is what I would suggest:

1. Move the present code to x86_64/k8.

2. Create a non-indexed variant, perhaps 4x unrolled. I suspect this might
   give some real speedups for bwl, skl, perhaps rkl, and likely zen3.

3. If 2 is successful, commit the code to the bwl subdir and create the
   relevant grabber for the appropriate zen CPUs. If 2 is not successful,
   commit the "ununrolled" code to the bwl subdir and make zen1/zen2 use it
   (but make sure zen3 does not use that variant!).

4. Consider loopmixing.

--
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel