On Fri, 20 Nov 2020, Maciej W. Rozycki wrote:

> Outliers:
> 
> old     new     change  %change filename
> ----------------------------------------------------
> 2406    2950    +544    +22.610 20111208-1.exe
> 4314    5329    +1015   +23.528 pr39417.exe
> 2235    3055    +820    +36.689 990404-1.exe
> 2631    4213    +1582   +60.129 pr57521.exe
> 3063    5579    +2516   +82.142 20000422-1.exe

 So as a matter of interest I have actually looked into the worst offender 
shown above.  As I have just learnt by chance, GNU `size' in its default 
BSD mode reports code combined with rodata as text, so I have rerun it in 
the recently introduced GNU mode with `20000422-1.exe' and the results are 
worse yet:

       text       data        bss      total filename
-      1985       1466         68       3519 ./20000422-1.exe
+      4501       1466         68       6035 ./20000422-1.exe

However, upon actual inspection the code produced looks sound and it's just 
that the loops in the test case get unrolled further.  With speed 
optimisation this is not necessarily bad.  Mind that the options used for 
this particular compilation are `-O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions' meaning that, well, ahem, we do 
want loops to get unrolled and we want speed rather than size.  So all is 
well.
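
 As a quick cross-check of the arithmetic on the figures quoted above (a 
sketch in Python; the helper name `growth' is made up for illustration), 
the absolute growth is the same 2516 bytes in either mode, and the 
percentage only looks worse in GNU mode because rodata no longer inflates 
the baseline:

```python
def growth(old, new):
    """Return (absolute change, percentage change) between two sizes."""
    delta = new - old
    return delta, 100.0 * delta / old


# Text sizes of 20000422-1.exe as quoted in this message.
bsd_text = growth(3063, 5579)  # BSD mode: code combined with rodata
gnu_text = growth(1985, 4501)  # GNU mode: code only

print("BSD text: %+d bytes, %+.3f%%" % bsd_text)  # +2516 bytes, about +82.14%
print("GNU text: %+d bytes, %+.3f%%" % gnu_text)  # +2516 bytes, about +126.75%
```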

 I have tried to run some benchmarking with this test case, by putting all 
the files involved in tmpfs (i.e. RAM) so as to limit any I/O influence 
and looping the executable 1000 times.  This yielded elapsed times of 
around 340s, i.e. with a good resolution, but the results are inconclusive: 
the execution time oscillates around that value on individual runs, 
regardless of whether this change has been applied or not.
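
 For anyone wanting to repeat this kind of measurement, a minimal sketch 
of such a timing loop follows.  It is not the harness actually used here; 
it assumes a host where Python is available (which may well not be the 
case on the target), and the names `time_runs' and `summarize' as well as 
the path in the usage comment are hypothetical:

```python
import subprocess
import time


def time_runs(cmd, runs=1000):
    """Run cmd `runs` times back to back and return total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def summarize(samples):
    """Return (minimum, mean, spread) of a list of elapsed times."""
    lo, hi = min(samples), max(samples)
    return lo, sum(samples) / len(samples), hi - lo


# Hypothetical usage, with the executable copied to a tmpfs mount:
#   samples = [time_runs(["/tmp/bench/20000422-1.exe"]) for _ in range(5)]
#   print(summarize(samples))
```

Comparing the spread across repeated batches with the difference between 
the two builds gives a rough idea of whether any difference rises above 
the noise.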

 So I have to conclude that either both variants of code are virtually 
equivalent in terms of performance or that the test environment keeps 
execution I/O-bound despite the measures I have taken.  Sadly the VAX 
architecture does not provide a user-accessible cycle counter (that I know 
of) that could be used for more accurate measurements, and I do not feel 
right now like fiddling with the code of the test case any further so as 
to make it more suited for performance evaluation.

 I have to note however on this occasion that this part of the change:

-(define_insn "*cmp<mode>"
-  [(set (cc0)
-       (compare (match_operand:VAXint 0 "nonimmediate_operand" "nrmT,nrmT")
-                (match_operand:VAXint 1 "general_operand" "I,nrmT")))]
+(define_insn "*cmp<VAXint:mode>_<VAXcc:mode>"
+  [(set (reg:VAXcc VAX_PSL_REGNUM)
+       (compare:VAXcc (match_operand:VAXint 0 "general_operand" "nrmT,nrmT")
+                      (match_operand:VAXint 1 "general_operand" "I,nrmT")))]

which allows an immediate in operand #0 has improved code generation a 
little bit with this test case as well, because rather than this:

        clrl %r0
[...]
        cmpl %r0,%r1
[...]
        cmpl %r0,%r3
[...]

this:

        cmpl $0,%r1
[...]
        cmpl $0,%r3
[...]

is produced, which does not waste a register to hold the value of 0.  The 
value can be supplied in the literal addressing mode, i.e. within the 
operand specifier byte itself just like with register operands, and 
therefore does not require extra space or execution time.

 I don't know however why the middle end insists on supplying constant 0 
as operand #0 to the comparison operation (or the `cbranch4' insn it has 
originated from).  While we have machine support for such a comparison, 
having constant 0 supplied as operand #1 would permit the use of the TST 
instruction, one byte shorter.  Of course that would require swapping the 
condition of any branches using the output of the comparison (e.g. LT 
becoming GT), but unlike typical RISC ISAs the VAX ISA supports all the 
conditions, as does our MD.
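
 To illustrate the transformation meant here, a small sketch in Python 
follows.  It is a toy model and not GCC code: registers are modelled as 
strings and constants as integers, only the signed conditions are covered, 
and the names `SWAPPED', `holds' and `canonicalize' are made up for the 
example (the table is similar in spirit to what `swap_condition' does in 
the middle end):

```python
import operator

# Condition that holds for (b, a) exactly when the key holds for (a, b).
SWAPPED = {"EQ": "EQ", "NE": "NE",
           "LT": "GT", "GT": "LT",
           "LE": "GE", "GE": "LE"}

# Signed semantics of each condition, used only to check the table.
SEMANTICS = {"EQ": operator.eq, "NE": operator.ne,
             "LT": operator.lt, "GT": operator.gt,
             "LE": operator.le, "GE": operator.ge}


def holds(code, a, b):
    """Evaluate comparison `code` on two concrete integer values."""
    return SEMANTICS[code](a, b)


def canonicalize(code, op0, op1):
    """If only op0 is a constant, swap the operands and adjust the code."""
    if isinstance(op0, int) and not isinstance(op1, int):
        return SWAPPED[code], op1, op0
    return code, op0, op1


# (LT 0 r1) becomes (GT r1 0), i.e. a comparison of r1 against zero.
print(canonicalize("LT", 0, "r1"))
```

With the constant in operand #1 a `cmpl $0,%r1' testing LT would become a 
`tstl %r1' testing GT.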

 Oddly enough making constant 0 more expensive in operand #0 than in 
operand #1 for comparison operations or COMPARE does not persuade the 
middle end to try and swap the operands, and making the `cbranch4' insns 
reject an immediate in operand #0 only makes reload put it back in a 
register.  All this despite COMPARE documentation saying:

    "If one of the operands is a constant, it should be placed in the
     second operand and the comparison code adjusted as appropriate."

So this looks like a missed optimisation and something to investigate at 
some point.

 Also, interestingly, we have this comment in our MD:

;; The VAX move instructions have space-time tradeoffs.  On a MicroVAX
;; register-register mov instructions take 3 bytes and 2 CPU cycles.  clrl
;; takes 2 bytes and 3 cycles.  mov from constant to register takes 2 cycles
;; if the constant is smaller than 4 bytes, 3 cycles for a longword
;; constant.  movz, mneg, and mcom are as fast as mov, so movzwl is faster
;; than movl for positive constants that fit in 16 bits but not 6 bits.  cvt
;; instructions take 4 cycles.  inc takes 3 cycles.  The machine description
;; is willing to trade 1 byte for 1 cycle (clrl instead of movl $0; cvtwl
;; instead of movl).

which clearly states that the cost of a const 0 rtx varies depending on 
whether it is implied as with the CLR instruction or explicitly encoded as 
with MOV.  This must surely be true for the MicroVAX microarchitecture 
referred to, which I gather must however have been solely microcoded, 
unlike at least some later implementations such as the NVAX, which used a 
mixed pipeline/microcode microarchitecture familiar to many from modern x86 
processors.  I doubt that with such an implementation an implicit 0 operand 
would require an extra machine clock for execution, which means we'd 
better review the costs at some point and make them model-specific.

  Maciej
