Re: [PATCH 30/31] PR target/95294: VAX: Convert backend to MODE_CC representation
On Fri, 20 Nov 2020, Maciej W. Rozycki wrote:

> Outliers:
>
>    old     new   change  %change  filename
>
>   2406    2950     +544  +22.610  20111208-1.exe
>   4314    5329    +1015  +23.528  pr39417.exe
>   2235    3055     +820  +36.689  990404-1.exe
>   2631    4213    +1582  +60.129  pr57521.exe
>   3063    5579    +2516  +82.142  2422-1.exe

So as a matter of interest I have actually looked into the worst offender shown above.  As I have just learnt by chance, GNU `size' in its default BSD mode reports code combined with rodata as text, so I have rerun it in the recently introduced GNU mode on `2422-1.exe' and the results are worse yet:

      text    data     bss    total  filename
-     1985    1466      68     3519  ./2422-1.exe
+     4501    1466      68     6035  ./2422-1.exe

However upon actual inspection the code produced looks sound, and it is just that the loops the test case has get unrolled further.  With speed optimisation this is not necessarily bad.  Mind that the options used for this particular compilation are `-O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions', meaning that, well, ahem, we do want loops to get unrolled and we want speed rather than size.  So all is well.

I have tried to run some benchmarking with this test case, by putting all the files involved in tmpfs (i.e. RAM) so as to limit any I/O influence and looping the executable 1000 times, which yielded elapsed times of around 340s, i.e. with a good resolution.  The results are inconclusive however: the execution time oscillates around that value on individual runs regardless of whether this change has been applied or not.  So I have to conclude that either both variants of the code are virtually equivalent in terms of performance, or the test environment keeps execution I/O-bound despite the measures I have taken.  Sadly the VAX architecture does not provide a user-accessible cycle count (that I would know of) that could be used for more accurate measurements, and I do not feel right now like fiddling with the code of the test case any further so as to make it more suited for performance evaluation.

I have to note however on this occasion that this part of the change:

-(define_insn "*cmp"
-  [(set (cc0)
-	(compare (match_operand:VAXint 0 "nonimmediate_operand" "nrmT,nrmT")
-		 (match_operand:VAXint 1 "general_operand" "I,nrmT")))]
+(define_insn "*cmp_"
+  [(set (reg:VAXcc VAX_PSL_REGNUM)
+	(compare:VAXcc (match_operand:VAXint 0 "general_operand" "nrmT,nrmT")
+		       (match_operand:VAXint 1 "general_operand" "I,nrmT")))]

which allowed an immediate with operand #0 has improved code generation a little bit with this test case as well, because rather than this:

	clrl %r0
	[...]
	cmpl %r0,%r1
	[...]
	cmpl %r0,%r3
	[...]

this:

	cmpl $0,%r1
	[...]
	cmpl $0,%r3
	[...]

is produced, which does not waste a register to hold the value of 0.  The constant can be supplied in the literal addressing mode, i.e. within the operand specifier byte itself just as with register operands, and therefore does not require extra space or execution time.

I don't know however why the middle end insists on supplying constant 0 as operand #0 to the comparison operation (or the `cbranch4' insn it has originated from).  While we have machine support for such a comparison, having constant 0 supplied as operand #1 would permit the use of the TST instruction, which is one byte shorter.  Of course that would require reversing the condition of any branches using the output of the comparison, but unlike typical RISC ISAs the VAX ISA supports all the conditions, as does our MD.
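To illustrate, here is a hand-written sketch with hypothetical operands and branch target, not taken from the test case itself; the byte counts follow from the short-literal and register operand specifiers mentioned above:

	cmpl $0,%r1		# 3 bytes: opcode + literal-0 specifier + register specifier
	jlss .L1		# taken if 0 < %r1 (signed)

versus the form TST would allow with the operands the other way round:

	tstl %r1		# 2 bytes: opcode + register specifier
	jgtr .L1		# condition reversed: taken if %r1 > 0 (signed)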
Oddly enough making constant 0 more expensive in operand #0 than in operand #1 for comparison operations or COMPARE does not persuade the middle end to try and swap the operands, and making the `cbranch4' insns reject an immediate in operand #0 only makes reload put it back in a register.  All this despite the COMPARE documentation saying:

"If one of the operands is a constant, it should be placed in the second operand and the comparison code adjusted as appropriate."

So this looks like a missed optimisation and something to investigate at one point; a sketch of what the documented canonical form would amount to follows at the end of this message.

Also, interestingly, we have this comment in our MD:

;; The VAX move instructions have space-time tradeoffs.  On a MicroVAX
;; register-register mov instructions take 3 bytes and 2 CPU cycles.  clrl
;; takes 2 bytes and 3 cycles.  mov from constant to register takes 2 cycles
;; if the constant is smaller than 4 bytes, 3 cycles for a longword
;; constant.  movz, mneg, and mcom are as fast as mov, so movzwl is faster
;; than movl for positive constants that fit in 16 bits but not 6 bits.  cvt
;; instructions take 4 cycles.  inc takes 3 cycles.  The machine description
;; is willing to trade 1 byte for 1 cycle (clrl instead of m
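Going back to the operand-order issue above, the swap the COMPARE documentation calls for would amount to roughly the following.  This is a hand-written RTL sketch rather than actual dump output; the CC mode, register numbers and label are placeholders.  What the middle end currently hands us:

    (set (reg:CC VAX_PSL_REGNUM)
         (compare:CC (const_int 0)
                     (reg:SI 1)))
    (set (pc)
         (if_then_else (lt (reg:CC VAX_PSL_REGNUM)
                           (const_int 0))
                       (label_ref 23)
                       (pc)))               ; branch if 0 < r1

and the documented canonical form, with the constant second and the code adjusted, which is what would map onto TSTL:

    (set (reg:CC VAX_PSL_REGNUM)
         (compare:CC (reg:SI 1)
                     (const_int 0)))
    (set (pc)
         (if_then_else (gt (reg:CC VAX_PSL_REGNUM)
                           (const_int 0))
                       (label_ref 23)
                       (pc)))               ; branch if r1 > 0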
Re: [PATCH 30/31] PR target/95294: VAX: Convert backend to MODE_CC representation
On 11/19/20 8:36 PM, Maciej W. Rozycki wrote:
> In the VAX ISA the INSV bitfield insert instruction is the only computational operation that keeps the condition codes, held in the PSL or Processor Status Longword register, intact.  The instruction is flexible enough that it could potentially be used for data moves post-reload, but reportedly it is not the best choice performance-wise, and in any case we have no addition operation available that would keep the condition codes unchanged.
>
> Furthermore, as usual with a complex CISC ISA, for many operations we have several machine instructions or instruction sequences to choose from that set the condition codes in a different manner.
>
> Use the approach then where the condition codes only get introduced by reload, by defining instruction splitters for RTL insns that change condition codes in some way, by default considering them clobbered.
>
> Then, to prevent the generated code from regressing too much, provide insns that include a `compare' operation setting the condition codes in parallel to the main operation.  The manner in which condition codes are set by each insn is supposed to be provided by whatever the SELECT_CC_MODE macro expands to.
>
> Given that individual patterns provided for the same RTL basic operation may set the condition codes differently, keeping this information away from the insn patterns themselves would cause a maintenance nightmare and would be bound to fail in a horrible way sooner or later.  Therefore instead let the patterns themselves choose which condition modes they support, by having one or more subst iterators applied, and then have the individual comparison operators each require the specific condition mode according to the codes used by the operation.
>
> While subst iterators only support one alternative each, there is actually no problem with applying multiple ones to a single insn with the result as intended, and if the corresponding subst attribute supplies an empty NO-SUBST-VALUE, then no mess results even.  Make use of this observation.
>
> Add appropriate subst iterators to all the computational patterns then, according to the condition codes they usably set, including the DImode ones, along with a substitute DImode comparison instruction in the absence of a CMPQ machine instruction.  However do not provide a `cbranchdi4' named pattern, as without further development it regresses code quality by resorting to the `__cmpdi2' libcall where a simpler operation would do, e.g. to check for negativity the TSTL machine instruction may be executed over the upper longword only.  This is good material for further work.
>
> Do not apply subst iterators to the increment- or decrement-and-branch patterns at this time; these may yet have to be reviewed, in particular whether `*jsobneq_minus_one' is still relevant in the context of the recent integer constant cost review.
>
> Also add a couple of peepholes to help eliminate comparisons in some problematic cases, such as with the BIT instruction, which computes bitwise-AND for the condition codes only and has no direct counterpart for the actual calculation, because the BIC instruction, which does do bitwise-AND and produces a result, implements the operation with a bitwise negation of its input `mask' operand.  Or the FFS instruction, which sets the Z condition code according to its `field' input operand rather than the result produced.  Or the bitfield comparisons we don't have generic middle-end support for.
>
> Code size stats are as follows, obtained from 17640 and 9086 executables built in `check-c' and `check-c++' GCC testing respectively:
>
>                  check-c                 check-c++
>               samples average  median  samples average  median
>  ---
>  regressions     1813  0.578%  0.198%      289  0.349%  0.175%
>  unchanged      15160  0.000%  0.000%     8662  0.000%  0.000%
>  progressions     667 -0.589% -0.194%      135 -0.944% -0.191%
>
>  total          17640  0.037%  0.000%     9086 -0.003%  0.000%
>
> Outliers:
>
>    old     new   change  %change  filename
>
>   2406    2950     +544  +22.610  20111208-1.exe
>   4314    5329    +1015  +23.528  pr39417.exe
>   2235    3055     +820  +36.689  990404-1.exe
>   2631    4213    +1582  +60.129  pr57521.exe
>   3063    5579    +2516  +82.142  2422-1.exe
>
> and:
>
>    old     new   change  %change  filename
>
>   6317    4845    -1472  -23.302  vector-compare-1.exe
>   6313    4845    -1468  -23.254  vector-compare-1.exe
>   6474    5002    -1472  -22.737  vector-compare-1.exe
>   6470    5002    -1468  -22.689  vector-compare-1.exe
>
> We have some code quality regressions like:
>
>    10861:	9e ef d9 12 	movab 11b40 ,r0
>    10865:	00 00 50 
>
[PATCH 30/31] PR target/95294: VAX: Convert backend to MODE_CC representation
In the VAX ISA the INSV bitfield insert instruction is the only computational operation that keeps the condition codes, held in the PSL or Processor Status Longword register, intact.  The instruction is flexible enough that it could potentially be used for data moves post-reload, but reportedly it is not the best choice performance-wise, and in any case we have no addition operation available that would keep the condition codes unchanged.

Furthermore, as usual with a complex CISC ISA, for many operations we have several machine instructions or instruction sequences to choose from that set the condition codes in a different manner.

Use the approach then where the condition codes only get introduced by reload, by defining instruction splitters for RTL insns that change condition codes in some way, by default considering them clobbered.

Then, to prevent the generated code from regressing too much, provide insns that include a `compare' operation setting the condition codes in parallel to the main operation.  The manner in which condition codes are set by each insn is supposed to be provided by whatever the SELECT_CC_MODE macro expands to.

Given that individual patterns provided for the same RTL basic operation may set the condition codes differently, keeping this information away from the insn patterns themselves would cause a maintenance nightmare and would be bound to fail in a horrible way sooner or later.  Therefore instead let the patterns themselves choose which condition modes they support, by having one or more subst iterators applied, and then have the individual comparison operators each require the specific condition mode according to the codes used by the operation.

While subst iterators only support one alternative each, there is actually no problem with applying multiple ones to a single insn with the result as intended, and if the corresponding subst attribute supplies an empty NO-SUBST-VALUE, then no mess results even.  Make use of this observation.

Add appropriate subst iterators to all the computational patterns then, according to the condition codes they usably set, including the DImode ones, along with a substitute DImode comparison instruction in the absence of a CMPQ machine instruction.  However do not provide a `cbranchdi4' named pattern, as without further development it regresses code quality by resorting to the `__cmpdi2' libcall where a simpler operation would do, e.g. to check for negativity the TSTL machine instruction may be executed over the upper longword only.  This is good material for further work.

Do not apply subst iterators to the increment- or decrement-and-branch patterns at this time; these may yet have to be reviewed, in particular whether `*jsobneq_minus_one' is still relevant in the context of the recent integer constant cost review.

Also add a couple of peepholes to help eliminate comparisons in some problematic cases, such as with the BIT instruction, which computes bitwise-AND for the condition codes only and has no direct counterpart for the actual calculation, because the BIC instruction, which does do bitwise-AND and produces a result, implements the operation with a bitwise negation of its input `mask' operand.  Or the FFS instruction, which sets the Z condition code according to its `field' input operand rather than the result produced.  Or the bitfield comparisons we don't have generic middle-end support for.
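To make the subst-iterator approach concrete, here is a minimal hypothetical sketch of the technique in machine-description form; the names used (`ccnz_subst', the CCNZ mode, the simplified add pattern and its constraints) are illustrative only and deliberately do not reproduce the actual vax.md patterns:

;; Substitution that turns a plain single-set pattern into a parallel
;; which also sets the condition codes from the result, here in an
;; assumed N/Z-only CC mode.
(define_subst "ccnz_subst"
  [(set (match_operand:SI 0 "" "")
        (match_operand:SI 1 "" ""))]
  ""
  [(set (reg:CCNZ VAX_PSL_REGNUM)
        (compare:CCNZ (match_dup 1)
                      (const_int 0)))
   (set (match_dup 0)
        (match_dup 1))])

;; Empty NO-SUBST-VALUE; "_ccnz" suffix once the subst has been applied.
(define_subst_attr "ccnz" "ccnz_subst" "" "_ccnz")

;; A simplified add pattern with the iterator applied: this yields both
;; "*addsi3" (condition codes clobbered by a splitter elsewhere) and
;; "*addsi3_ccnz" (condition codes set in parallel, usable when only the
;; N and Z codes are needed by the consumer).
(define_insn "*addsi3<ccnz>"
  [(set (match_operand:SI 0 "nonimmediate_operand" "=g")
        (plus:SI (match_operand:SI 1 "general_operand" "g")
                 (match_operand:SI 2 "general_operand" "g")))]
  ""
  "addl3 %1,%2,%0")

Comparison and branch patterns then only need to ask for a CC mode that guarantees the codes they actually use, as chosen by SELECT_CC_MODE.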
Code size stats are as follows, obtained from 17640 and 9086 executables built in `check-c' and `check-c++' GCC testing respectively:

                 check-c                 check-c++
              samples average  median  samples average  median
 ---
 regressions     1813  0.578%  0.198%      289  0.349%  0.175%
 unchanged      15160  0.000%  0.000%     8662  0.000%  0.000%
 progressions     667 -0.589% -0.194%      135 -0.944% -0.191%

 total          17640  0.037%  0.000%     9086 -0.003%  0.000%

Outliers:

   old     new   change  %change  filename

  2406    2950     +544  +22.610  20111208-1.exe
  4314    5329    +1015  +23.528  pr39417.exe
  2235    3055     +820  +36.689  990404-1.exe
  2631    4213    +1582  +60.129  pr57521.exe
  3063    5579    +2516  +82.142  2422-1.exe

and:

   old     new   change  %change  filename

  6317    4845    -1472  -23.302  vector-compare-1.exe
  6313    4845    -1468  -23.254  vector-compare-1.exe
  6474    5002    -1472  -22.737  vector-compare-1.exe
  6470    5002    -1468  -22.689  vector-compare-1.exe

We have some code quality regressions like:

   10861:	9e ef d9 12 	movab 11b40 ,r0
   10865:	00 00 50 
   10868:	90 a0 03 a0 	movb 0x3(r0),0x2(r0)
   1086c:	02 
   1086d:	d1 60 8f 61 	cmpl (r0),$0x64646261
   10871:	62 64 64 
   10874:	13 07       	beql 1087d 

to:

   10861:	9e ef