On Fri, 20 Nov 2020, Maciej W. Rozycki wrote:

> Outliers:
>
>    old     new  change  %change  filename
> ----------------------------------------------------
>   2406    2950    +544  +22.610  20111208-1.exe
>   4314    5329   +1015  +23.528  pr39417.exe
>   2235    3055    +820  +36.689  990404-1.exe
>   2631    4213   +1582  +60.129  pr57521.exe
>   3063    5579   +2516  +82.142  20000422-1.exe
So as a matter of interest I have actually looked into the worst offender shown above.

As I have just learnt by chance, GNU `size' in its default BSD mode reports code combined with rodata as text, so I have rerun it in the recently introduced GNU mode with `20000422-1.exe' and the results are worse yet:

   text    data     bss   total filename
-  1985    1466      68    3519 ./20000422-1.exe
+  4501    1466      68    6035 ./20000422-1.exe

However upon actual inspection the code produced looks sound and it is just that the loops the test case contains get unrolled further.  With optimisation for speed this is not necessarily bad.  Mind that the options used for this particular compilation are `-O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions', meaning that, well, ahem, we do want loops to get unrolled and we want speed rather than size.  So all is well.

I have tried to run some benchmarking with this test case, putting all the files involved in tmpfs (i.e. RAM) so as to limit any I/O influence and looping the executable 1000 times, which yielded elapsed times of around 340s, i.e. with a good resolution.  The results are inconclusive however: the execution time oscillates on individual runs around the value quoted, regardless of whether this change has been applied or not.  So I have to conclude that either both variants of the code are virtually equivalent in terms of performance, or that the test environment keeps execution I/O-bound despite the measures I have taken.

Sadly the VAX architecture does not provide a user-accessible cycle count (that I would know of) which could be used for more accurate measurements, and I do not feel like fiddling with the code of the test case right now so as to make it better suited for performance evaluation.

I have to note on this occasion however that this part of the change:

-(define_insn "*cmp<mode>"
-  [(set (cc0)
-	(compare (match_operand:VAXint 0 "nonimmediate_operand" "nrmT,nrmT")
-		 (match_operand:VAXint 1 "general_operand" "I,nrmT")))]
+(define_insn "*cmp<VAXint:mode>_<VAXcc:mode>"
+  [(set (reg:VAXcc VAX_PSL_REGNUM)
+	(compare:VAXcc (match_operand:VAXint 0 "general_operand" "nrmT,nrmT")
+		       (match_operand:VAXint 1 "general_operand" "I,nrmT")))]

which allowed an immediate with operand #0 has improved code generation a little with this test case as well, because rather than this:

	clrl %r0
[...]
	cmpl %r0,%r1
[...]
	cmpl %r0,%r3
[...]

this:

	cmpl $0,%r1
[...]
	cmpl $0,%r3
[...]

is produced, which does not waste a register to hold the value of 0: the constant can be supplied in the literal addressing mode, i.e. within the operand specifier byte itself just as with register operands, and therefore requires no extra space or execution time.

I don't know however why the middle end insists on supplying constant 0 as operand #0 to the comparison operation (or the `cbranch4' insn it has originated from).  While we have machine support for such a comparison, having constant 0 supplied as operand #1 instead would permit the use of the TST instruction, which is one byte shorter.  Of course that would require reversing the condition of any branches using the output of the comparison, but unlike typical RISC ISAs the VAX ISA supports all the conditions, as does our MD.  Oddly enough making constant 0 more expensive in operand #0 than in operand #1 of a comparison operation or COMPARE does not persuade the middle end to try and swap the operands, and making the `cbranch4' insns reject an immediate in operand #0 only makes reload put it back in a register.
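Just to illustrate what the swap would buy us, here is a hand-written sketch rather than actual compiler output, assuming the `j<cond>' branch mnemonics our MD normally emits and a made-up `.L1' branch target, for a branch taken when `%r1' is negative:

	cmpl $0,%r1		# opcode plus two operand specifiers: 3 bytes
	jgtr .L1		# taken when 0 > %r1, i.e. %r1 negative

as opposed to the equivalent with the operands swapped and the branch condition reversed:

	tstl %r1		# opcode plus one operand specifier: 2 bytes
	jlss .L1		# taken when %r1 < 0

The branch is the same size either way, so the whole saving is the one byte on the comparison itself.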
All this despite the COMPARE documentation saying:

"If one of the operands is a constant, it should be placed in the second
operand and the comparison code adjusted as appropriate."

So this looks like a missed optimisation and something to investigate at one point.

Also, interestingly, we have this comment in our MD:

;; The VAX move instructions have space-time tradeoffs.  On a MicroVAX
;; register-register mov instructions take 3 bytes and 2 CPU cycles.  clrl
;; takes 2 bytes and 3 cycles.  mov from constant to register takes 2 cycles
;; if the constant is smaller than 4 bytes, 3 cycles for a longword
;; constant.  movz, mneg, and mcom are as fast as mov, so movzwl is faster
;; than movl for positive constants that fit in 16 bits but not 6 bits.  cvt
;; instructions take 4 cycles.  inc takes 3 cycles.  The machine description
;; is willing to trade 1 byte for 1 cycle (clrl instead of movl $0; cvtwl
;; instead of movl).

which clearly states that the cost of a const 0 rtx varies depending on whether it is implied, as with the CLR instruction, or explicitly encoded, as with MOV.  This must surely be true for the MicroVAX microarchitecture referred to, which I gather however must have been solely microcoded, unlike at least some later implementations such as the NVAX, which used a mixed pipeline/microcode microarchitecture familiar to many from modern x86 processors.  I doubt that with such an implementation an implicit 0 operand would require an extra machine cycle for execution.  Which means we'd better review the costs at some point and make them model-specific.

  Maciej