My analysis and reports in this thread had several problems. For example, I had made shared-lib builds of some qemu images used for "user mode" emulation; that does not work unless the host dynlibs are made available in the guest file system.
Trying again: your submul_1 works fine on all tested versions of qemu, which are 4.1.x through 5.2.0. (But I did not test every version in that range.)

The problems I saw with qemu and submul_1 were actually with the existing s390 submul_1, which has been part of GMP for several years. All tested qemu versions exhibit that same problem.

I have access to a s196 system, and there I cannot provoke any error. So, the old submul_1 code seems to work on actual hardware. The code looks good to me. There is a bug in qemu.

I only tested "user mode" qemu for this experiment. Sometimes full system emulation runs instructions correctly when "user mode" emulation does not. Sometimes vice versa.

(I've encountered lots of bugs of this kind in qemu over the years. I have tried to help the qemu project as I have expertise in the areas of CPUs, emulation and arithmetic, but I was discouraged enough by their culture to no longer even try to take the time needed to report my findings.)

I have dealt with qemu's bugginess by keeping many qemu versions installed. I then try to locate the most recent one which works for running GMP in either full system emulation or "user mode" emulation.

> In the meantime, I've been working on a software pipelined variant of
> addmul_1, which improves significantly over my previous patch. That is
> C with inline assembly, which helps somewhat with the increased
> complexity. It needs more tuning and stress testing, though.

Great!

Usually, multiplication insn throughput is the limiting factor for addmul_1 and friends. Therefore, understanding its throughput is a great place to start. Once that is understood, one knows what to aim for.

One usually can get quite close to the multiplication insn throughput in addmul_1 (or in some cases addmul_2, or addmul_k for some small k > 2). But usually, and in particular if that throughput is high, the end performance will be up to 50% worse.
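For readers who want the operation under discussion spelled out, the following is a minimal portable C sketch of what addmul_1 and submul_1 compute. It is not GMP's code and not the s390 assembly discussed in this thread. The names `limb_t`, `ref_addmul_1` and `ref_submul_1` are made up for illustration, and the sketch assumes 64-bit limbs and a compiler that provides `unsigned __int128` (GCC and Clang do). Each loop iteration needs exactly one 64x64->128 multiplication, which is why the throughput of the multiply instruction sets the speed limit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit limb type; not GMP's mp_limb_t. */
typedef uint64_t limb_t;

/* rp[0..n-1] += up[0..n-1] * v.  Returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One multiply per limb.  The sum cannot overflow 128 bits:
           (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 64);
    }
    return carry;
}

/* rp[0..n-1] -= up[0..n-1] * v.  Returns the borrow-out limb. */
static limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        /* Fold the incoming borrow into the product first;
           (2^64-1)^2 + (2^64-1) still fits in 128 bits. */
        unsigned __int128 p = (unsigned __int128)up[i] * v + borrow;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);
        limb_t r = rp[i];
        rp[i] = r - lo;
        /* hi == 2^64-1 implies lo == 0, so this addition cannot wrap. */
        borrow = hi + (r < lo);
    }
    return borrow;
}
```

A reference of this kind can be run side by side with an assembly routine on random operands, under an emulator and on real hardware, to see on which side a discrepancy lies.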
I only know of one CPU where addmul_1 runs at exactly the multiplication insn throughput: Apple M1.

I looked briefly at your code after David sent a link to the s390 ISA manual. If I understand it correctly, you rely on 128-bit addition. I have no idea of the throughput or latency of those 128-bit instructions, but if those numbers are good, I agree that they might be very useful.

An alternative is to stick to plain 64-bit (non-vector) instructions for unrolled addmul_1. We do that for many CPUs already. One will need to run through partial products twice, for s390 using alcg(r). The most significant 64-bit partial product of an unroll group can work as a carry trap.

If we pull out all the stops, perhaps addmul_4 or something like that, combined with the vector instructions, could yield the best performance.

In the end, balancing complexity and performance (and effort!) will decide.

--
Torbjörn
Please encrypt, key id 0xC8601622