Hi,

On 3/2/21 12:30 AM, Torbjörn Granlund wrote:
> Torbjörn Granlund <t...@gmplib.org> writes:
>
> I played a bit with an addmul_1 of my own, with some ideas from your
> code.  I don't plan to do more work on this.  Does this perform well on
> hardware?
>
> I now realise that the instruction sequence of my example is essentially
> the same as in your code, except with more unrolling.  That's a bit
> surprising, as I wrote it by understanding the instruction set, starting
> at the vac* insn which your code made me look at.  I did actually not
> look at your code while writing my code.
Indeed, it is intriguing that we were both led to the same instruction
selection. Further, that same basic sequence has held up in my current
best-performing variant. Let me answer your questions and comments one
by one.

I have run and benchmarked your code, with a few changes. It performs
significantly better than my initial submission (because of the
unrolling) and close to my second-best variant (same unrolling,
different schedule).

Most notably, I changed the order so that the mlgr's are next to each
other. The reason is that decode and dispatch happen in two "groups" of
up to three instructions each, with each group going into one of the two
"issue sides" of the core (since z13, both sides are symmetric and have
the same set of issue ports). For some instructions, grouping is
restricted -- that includes mlgr, which will always be alone in a group.
Thus, placing two mlgr's next to each other ensures that they spread
across both issue sides and exploit both multiply units. (These details
are discussed in the "IBM Z / LinuxONE System Processor Optimization
Primer", publicly available from
https://community.ibm.com/HigherLogic/System/DownloadDocumentFile.ashx?DocumentFileKey=d1cdb394-0159-464c-92a3-3f74f8c545c4)

Your code with adjusted grouping:

L(top):	lg	%r1, 0(%r13, %r3)
	lg	%r7, 8(%r13, %r3)
	lg	%r9, 16(%r13, %r3)
	mlgr	%r0, %r5
	mlgr	%r6, %r5
	lg	%r11, 24(%r13, %r3)
	mlgr	%r8, %r5
	mlgr	%r10, %r5

> (your comment) Remove "idx" and instead decrement n, update up, and rp.

In my experiments, incrementing and comparing a single "idx" turned out
to be beneficial over incrementing the pointers and decrementing n
separately.

Similarly, using 128-bit adds in vector registers performs better than
alcgr + algr. One factor is that alcgr must be alone in a dispatch
group, just like mlgr. Given the number of alcgrs we would need, the
128-bit add wins. For comparison, vacq and vacccq also have a grouping
limitation -- only two of them can be in a group.
However, that means we can fit a 128-bit add with carry in and out into
one dispatch group, instead of just a 64-bit add.

My currently best-performing implementation improves performance notably
(~40% over my initial patch on z15). It maintains the same instruction
sequence (mlgr + vacq for two limbs at a time) as our previous attempts,
yet unrolls to 8 limbs per iteration, with software pipelining of the
different stages (load, multiply, add, and so on). Unrolling even more
did not improve performance. To cope with the complexity, the code is in
C with a mix of GCC vector extensions, Z builtins for vectors, and
inline assembly (to enforce the mlgr scheduling and to trick GCC in some
cases):

/* ... rampup ... */
for ( )
  {
    /* four limb pairs 0 .. 3 in flight */

    /* first limb pair */
    LOAD_LP1 (2 + 8);        /* loading next limb pair */
    MULT_STAGE_0;
    SECOND_ADD_STAGE (1);    /* load and mult in rampup, now add */
    FIRST_ADD_STAGE (2);
    WRITEBACK_STAGE (0, 0);
    TRANSFER_TO_VR_0;
    asm ("");

    /* second limb pair */
    LOAD_LP2 (4 + 8);
    MULT_STAGE_1;
    SECOND_ADD_STAGE (2);
    FIRST_ADD_STAGE (3);
    WRITEBACK_STAGE (1, 2);
    TRANSFER_TO_VR_1;
    asm ("");

    /* third and fourth limb pair ... */
  }
/* wrapup ... */

While this variant helped a lot in debugging and in tweaking parameters
and the schedule, it is hackish and brittle (e.g., the empty asm("")s
help define the instruction scheduling, yet GCC may change how it
handles them over time). Further, I suspect there may be performance
gains left in hand-tweaking the assembly code. So, for integrating this
implementation into GMP, I propose adding both the resulting assembly
variant and that C code, for reference or future improvements.

What do you think?

Marius

-- 
Marius Hillenbrand
Linux on Z development

IBM Deutschland Research & Development GmbH
Vors. des Aufsichtsrats: Gregor Pillen / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht
Stuttgart, HRB 243294

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel