Re: [PATCH 30/31] PR target/95294: VAX: Convert backend to MODE_CC representation

2020-12-09 Thread Maciej W. Rozycki
On Fri, 20 Nov 2020, Maciej W. Rozycki wrote:

> Outliers:
> 
> old      new      change   %change  filename
> ---------------------------------------------
> 2406     2950     +544     +22.610  20111208-1.exe
> 4314     5329     +1015    +23.528  pr39417.exe
> 2235     3055     +820     +36.689  990404-1.exe
> 2631     4213     +1582    +60.129  pr57521.exe
> 3063     5579     +2516    +82.142  2422-1.exe

 So as a matter of interest I have actually looked into the worst offender 
shown above.  As I have just learnt by chance, GNU `size' in its default 
BSD mode reports code combined with rodata as text, so I have rerun it in 
the recently introduced GNU mode with `2422-1.exe' and the results are 
worse yet:

      text      data       bss     total filename
-     1985      1466        68      3519 ./2422-1.exe
+     4501      1466        68      6035 ./2422-1.exe

However upon actual inspection the code produced looks sound; it's just
that the loops the test case has get unrolled further.  With speed
optimisation this is not necessarily bad.  Mind that the options used for 
this particular compilation are `-O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions' meaning that, well, ahem, we do 
want loops to get unrolled and we want speed rather than size.  So all is 
well.

 I have tried to run some benchmarking with this test case, by putting all 
the files involved in tmpfs (i.e. RAM) so as to limit any I/O influence 
and looping the executable 1000 times, which yielded elapsed times of
around 340s, i.e. with a good resolution.  The results are inconclusive
however: the execution time oscillates around the value shown on
individual runs, regardless of whether this change has been applied or
not.

 So I have to conclude that either both variants of code are virtually 
equivalent in terms of performance or that the test environment keeps 
execution I/O-bound despite the measures I have taken.  Sadly the VAX 
architecture does not provide a user-accessible cycle count (that I know
of) that could be used for more accurate measurements, and I do not feel 
right now like fiddling with the code of the test case any further so as 
to make it more suited for performance evaluation.

 I have to note however on this occasion that this part of the change:

-(define_insn "*cmp<mode>"
-  [(set (cc0)
-        (compare (match_operand:VAXint 0 "nonimmediate_operand" "nrmT,nrmT")
-                 (match_operand:VAXint 1 "general_operand" "I,nrmT")))]
+(define_insn "*cmp<VAXint:mode>_<VAXcc:mode>"
+  [(set (reg:VAXcc VAX_PSL_REGNUM)
+        (compare:VAXcc (match_operand:VAXint 0 "general_operand" "nrmT,nrmT")
+                       (match_operand:VAXint 1 "general_operand" "I,nrmT")))]

which allowed an immediate with operand #0 has improved code generation a 
little bit with this test case as well, because rather than this:

clrl %r0
[...]
cmpl %r0,%r1
[...]
cmpl %r0,%r3
[...]

this:

cmpl $0,%r1
[...]
cmpl $0,%r3
[...]

is produced, which does not waste a register to hold the value of 0: the
constant can be supplied in the literal addressing mode, i.e. within the
operand specifier byte itself just like with register operands, and
therefore does not require extra space or execution time.

 I don't know however why the middle end insists on supplying constant 0 
as operand #0 to the comparison operation (or the `cbranch<mode>4' insn it has
originated from).  While we have machine support for such a comparison, 
having constant 0 supplied as operand #1 would permit the use of the TST 
instruction, one byte shorter.  Of course that would require swapping the
condition of any branches using the output of the comparison, but unlike 
typical RISC ISAs the VAX ISA supports all the conditions as does our MD.
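
 Purely to illustrate why the operand order matters, here is a minimal
sketch (a made-up name and plain CC mode, not the actual vax.md text),
assuming, as the pattern quoted earlier suggests, that the `I' constraint
accepts the integer constant 0:

;; Sketch only: alternative 0 applies when operand #1 is the constant 0
;; (the `I' constraint) and emits the one-byte-shorter TSTL; with 0 in
;; operand #0 only the generic CMPL alternative can match.
(define_insn "*cmpsi_sketch"
  [(set (reg:CC VAX_PSL_REGNUM)
        (compare:CC (match_operand:SI 0 "general_operand" "nrmT,nrmT")
                    (match_operand:SI 1 "general_operand" "I,nrmT")))]
  ""
  "@
   tstl %0
   cmpl %0,%1")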

 Oddly enough making constant 0 more expensive in operand #0 than in 
operand #1 for comparison operations or COMPARE does not persuade the 
middle end to try and swap the operands, and making the `cbranch<mode>4' insns
reject an immediate in operand #0 only makes reload put it back in a 
register.  All this despite COMPARE documentation saying:

"If one of the operands is a constant, it should be placed in the
 second operand and the comparison code adjusted as appropriate."

So this looks like a missed optimisation and something to investigate at 
some point.

 Also, interestingly, we have this comment in our MD:

;; The VAX move instructions have space-time tradeoffs.  On a MicroVAX
;; register-register mov instructions take 3 bytes and 2 CPU cycles.  clrl
;; takes 2 bytes and 3 cycles.  mov from constant to register takes 2 cycles
;; if the constant is smaller than 4 bytes, 3 cycles for a longword
;; constant.  movz, mneg, and mcom are as fast as mov, so movzwl is faster
;; than movl for positive constants that fit in 16 bits but not 6 bits.  cvt
;; instructions take 4 cycles.  inc takes 3 cycles.  The machine description
;; is willing to trade 1 byte for 1 cycle (clrl instead of movl $0;
;; cvtwl instead of movl).

Re: [PATCH 30/31] PR target/95294: VAX: Convert backend to MODE_CC representation

2020-11-21 Thread Jeff Law via Gcc-patches



On 11/19/20 8:36 PM, Maciej W. Rozycki wrote:
> In the VAX ISA the INSV bitfield insert instruction is the only
> computational operation that keeps the condition codes, held in the PSL
> or Processor Status Longword register, intact.  The instruction is
> flexible enough that it could potentially be used for data moves
> post-reload, but reportedly it is not the best choice performance-wise,
> and we have no addition operation available that would keep the
> condition codes unchanged.
>
> Furthermore, as is usual with a complex CISC ISA, for many operations we
> have several machine instructions or instruction sequences to choose
> from, each setting the condition codes in a different manner.
>
> Use the approach then where the condition codes only get introduced by
> reload, by defining instruction splitters for RTL insns that change
> condition codes in some way, by default considering them clobbered.
>
> Then, to prevent the code generated from regressing too much, provide
> insns that include a `compare' operation setting the condition codes in
> parallel to the main operation.  The manner in which condition codes are
> set by each insn is supposed to be conveyed by whatever the
> SELECT_CC_MODE macro expands to.
>
> Given that individual patterns provided for the same RTL basic operation
> may set the condition codes differently, keeping the information away from
> the insn patterns themselves would cause a maintenance nightmare and
> would be bound to fail in a horrible way sooner or later.  Therefore
> instead let the patterns themselves choose which condition modes they
> support, by having one or more subst iterators applied and then have
> individual comparison operators require the specific condition mode each
> according to the codes used by the operation.
>
> While subst iterators only support one alternative each, there is
> actually no problem with applying multiple ones to a single insn with
> the result as intended, and if the corresponding subst attribute
> supplies an empty NO-SUBST-VALUE, then no mess results even.  Make use
> of this observation.
>
> Add appropriate subst iterators to all the computational patterns then,
> according to the condition codes they usably set, including DImode ones
> and a substitute DImode comparison instruction in the absence of a CMPQ
> machine instruction; however do not provide a `cbranchdi4' named pattern,
> as without further development it regresses code quality by resorting
> to the `__cmpdi2' libcall where a simpler operation would do, e.g. to
> check for negativity the TSTL machine instruction may be executed over
> the upper longword only.  This is good material for further work.
>
> Do not apply subst iterators to the increment- or decrement-and-branch
> patterns at this time; these may yet have to be reviewed, in particular
> whether `*jsobneq_minus_one' is still relevant in the context of the
> recent integer constant cost review.
>
> Also add a couple of peepholes to help eliminate comparisons in some
> problematic cases, such as with the BIT instruction, which performs
> bitwise-AND for the condition codes only and has no direct counterpart
> for the actual calculation, because the BIC instruction, which does do
> bitwise-AND and produce a result, implements the operation with a
> bitwise negation of its input `mask' operand.  Or the FFS instruction,
> which sets the Z condition code according to its `field' input operand
> rather than the result produced.  Or the bitfield comparisons we don't
> have generic middle-end support for.
>
> Code size stats are as follows, obtained from 17640 and 9086 executables
> built in `check-c' and `check-c++' GCC testing respectively:
>
>                       check-c                   check-c++
>              samples  average   median  samples  average   median
>              ----------------------------------------------------
> regressions     1813   0.578%   0.198%      289   0.349%   0.175%
> unchanged      15160   0.000%   0.000%     8662   0.000%   0.000%
> progressions     667  -0.589%  -0.194%      135  -0.944%  -0.191%
>              ----------------------------------------------------
> total          17640   0.037%   0.000%     9086  -0.003%   0.000%
>
> Outliers:
>
> old      new      change   %change  filename
> ---------------------------------------------
> 2406     2950     +544     +22.610  20111208-1.exe
> 4314     5329     +1015    +23.528  pr39417.exe
> 2235     3055     +820     +36.689  990404-1.exe
> 2631     4213     +1582    +60.129  pr57521.exe
> 3063     5579     +2516    +82.142  2422-1.exe
>
> and:
>
> old      new      change   %change  filename
> ---------------------------------------------
> 6317     4845     -1472    -23.302  vector-compare-1.exe
> 6313     4845     -1468    -23.254  vector-compare-1.exe
> 6474     5002     -1472    -22.737  vector-compare-1.exe
> 6470     5002     -1468    -22.689  vector-compare-1.exe
>
> We have some code quality regressions like:
>
> 10861:  9e ef d9 12 movab 11b40 ,r0
> 10865:  00 00 50
>   

[PATCH 30/31] PR target/95294: VAX: Convert backend to MODE_CC representation

2020-11-19 Thread Maciej W. Rozycki
In the VAX ISA the INSV bitfield insert instruction is the only
computational operation that keeps the condition codes, held in the PSL
or Processor Status Longword register, intact.  The instruction is
flexible enough that it could potentially be used for data moves
post-reload, but reportedly it is not the best choice performance-wise,
and we have no addition operation available that would keep the
condition codes unchanged.

Furthermore, as is usual with a complex CISC ISA, for many operations we
have several machine instructions or instruction sequences to choose
from, each setting the condition codes in a different manner.

Use the approach then where the condition codes only get introduced by
reload, by defining instruction splitters for RTL insns that change
condition codes in some way, by default considering them clobbered.
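
For illustration, the arrangement looks roughly like this (a simplified
sketch with a made-up name and plain `g' constraints, not the actual
vax.md patterns):

;; Sketch: before reload the insn carries no condition-code information;
;; the post-reload splitter rewrites it as a parallel that by default
;; marks the PSL, i.e. the condition codes, as clobbered.
(define_insn_and_split "*addsi3_sketch"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=g")
        (plus:SI (match_operand:SI 1 "general_operand" "g")
                 (match_operand:SI 2 "general_operand" "g")))]
  ""
  "#"
  "reload_completed"
  [(parallel
     [(set (match_dup 0)
           (plus:SI (match_dup 1) (match_dup 2)))
      (clobber (reg:CC VAX_PSL_REGNUM))])]
  "")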

Then, to prevent the code generated from regressing too much, provide
insns that include a `compare' operation setting the condition codes in
parallel to the main operation.  The manner in which condition codes are
set by each insn is supposed to be conveyed by whatever the
SELECT_CC_MODE macro expands to.

Given that individual patterns provided for the same RTL basic operation
may set the condition codes differently, keeping the information away from
the insn patterns themselves would cause a maintenance nightmare and
would be bound to fail in a horrible way sooner or later.  Therefore
instead let the patterns themselves choose which condition modes they
support, by having one or more subst iterators applied and then have
individual comparison operators require the specific condition mode each
according to the codes used by the operation.

While subst iterators only support one alternative each, there is
actually no problem with applying multiple ones to a single insn with
the result as intended, and if the corresponding subst attribute
supplies an empty NO-SUBST-VALUE, then no mess results even.  Make use
of this observation.
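
For reference, the machinery assumed here takes roughly this shape (an
illustrative sketch; the names and the choice of CCNZ are made up rather
than taken verbatim from the patch):

;; Sketch of a subst iterator: the input template matches a single-set
;; insn whose parallel clobbers the PSL; the output template replaces
;; the clobber with a compare of the result against zero, set in
;; parallel with the operation itself.
(define_subst "ccnz_subst_sketch"
  [(set (match_operand 0 "" "")
        (match_operand 1 "" ""))
   (clobber (reg:CC VAX_PSL_REGNUM))]
  ""
  [(set (reg:CCNZ VAX_PSL_REGNUM)
        (compare:CCNZ (match_dup 1) (const_int 0)))
   (set (match_dup 0)
        (match_dup 1))])

;; The corresponding attribute yields an empty string for the original
;; variant and a suffix for the substituted one, so several such
;; iterators can be applied to one pattern without clashing.
(define_subst_attr "ccnz_sketch" "ccnz_subst_sketch" "" "_ccnz")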

Add appropriate subst iterators to all the computational patterns then,
according to the condition codes they usably set, including DImode ones
and a substitute DImode comparison instruction in the absence of a CMPQ
machine instruction; however do not provide a `cbranchdi4' named pattern,
as without further development it regresses code quality by resorting
to the `__cmpdi2' libcall where a simpler operation would do, e.g. to
check for negativity the TSTL machine instruction may be executed over
the upper longword only.  This is good material for further work.

Do not apply subst iterators to the increment- or decrement-and-branch
patterns at this time; these may yet have to be reviewed, in particular
whether `*jsobneq_minus_one' is still relevant in the context of the
recent integer constant cost review.

Also add a couple of peepholes to help eliminate comparisons in some
problematic cases, such as with the BIT instruction, which performs
bitwise-AND for the condition codes only and has no direct counterpart
for the actual calculation, because the BIC instruction, which does do
bitwise-AND and produce a result, implements the operation with a
bitwise negation of its input `mask' operand.  Or the FFS instruction,
which sets the Z condition code according to its `field' input operand
rather than the result produced.  Or the bitfield comparisons we don't
have generic middle-end support for.
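
To make the BIT case concrete, such a peephole takes roughly this shape
(a hypothetical sketch rather than the actual peephole added; it assumes
a condition-code-only AND compare pattern emitting BIT exists to match
the resulting insn):

;; Sketch: when the result of an AND is dead after an immediately
;; following compare against zero, fold the two into a single
;; condition-code-only test, which BIT can implement directly without
;; the negated `mask' operand that BIC would require.
(define_peephole2
  [(parallel
     [(set (match_operand:SI 0 "register_operand" "")
           (and:SI (match_operand:SI 1 "general_operand" "")
                   (match_operand:SI 2 "general_operand" "")))
      (clobber (reg:CC VAX_PSL_REGNUM))])
   (set (reg:CCNZ VAX_PSL_REGNUM)
        (compare:CCNZ (match_dup 0) (const_int 0)))]
  "peep2_reg_dead_p (2, operands[0])"
  [(set (reg:CCNZ VAX_PSL_REGNUM)
        (compare:CCNZ (and:SI (match_dup 1) (match_dup 2))
                      (const_int 0)))])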

Code size stats are as follows, obtained from 17640 and 9086 executables
built in `check-c' and `check-c++' GCC testing respectively:

                      check-c                   check-c++
             samples  average   median  samples  average   median
             ----------------------------------------------------
regressions     1813   0.578%   0.198%      289   0.349%   0.175%
unchanged      15160   0.000%   0.000%     8662   0.000%   0.000%
progressions     667  -0.589%  -0.194%      135  -0.944%  -0.191%
             ----------------------------------------------------
total          17640   0.037%   0.000%     9086  -0.003%   0.000%

Outliers:

old      new      change   %change  filename
---------------------------------------------
2406     2950     +544     +22.610  20111208-1.exe
4314     5329     +1015    +23.528  pr39417.exe
2235     3055     +820     +36.689  990404-1.exe
2631     4213     +1582    +60.129  pr57521.exe
3063     5579     +2516    +82.142  2422-1.exe

and:

old      new      change   %change  filename
---------------------------------------------
6317     4845     -1472    -23.302  vector-compare-1.exe
6313     4845     -1468    -23.254  vector-compare-1.exe
6474     5002     -1472    -22.737  vector-compare-1.exe
6470     5002     -1468    -22.689  vector-compare-1.exe

We have some code quality regressions like:

10861:  9e ef d9 12 movab 11b40 ,r0
10865:  00 00 50
10868:  90 a0 03 a0 movb 0x3(r0),0x2(r0)
1086c:  02
1086d:  d1 60 8f 61 cmpl (r0),$0x64646261
10871:  62 64 64
10874:  13 07   beql 1087d 

to:

10861:  9e ef