[Alan wrote:]
My understanding is that any decision as to whether one or both of y or z is
evaluated (when 'evaluation' involves doing any work,
e.g. a load) has already been encoded into the gimple/tree IR.  Thus, if we are
to evaluate only one of 'y' or 'z' in your example,
the IR will (prior to if-conversion) contain basic blocks and control flow
that jump around the one that's not evaluated.
This appears to be the case in pr61194.c: prior to if-conversion, the IR for
the loop in barX is
  <bb 3>:
  # i_16 = PHI <i_13(7), 0(2)>
  # ivtmp_21 = PHI <ivtmp_20(7), 1024(2)>
  _5 = x[i_16];
  _6 = _5 > 0.0;
  _7 = w[i_16];
  _8 = _7 < 0.0;
  _9 = _6 & _8;
  if (_9 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  iftmp.0_10 = z[i_16];
  goto <bb 6>;

  <bb 5>:
  iftmp.0_11 = y[i_16];
[snip]
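For reference, reconstructed from the dump above (i.e. a paraphrase, not a copy
of the pr61194.c source, and the store target is only a guess since that part
was snipped), the barX loop has roughly this shape:

  float x[1024], w[1024], y[1024], z[1024], r[1024];  /* 'r' is a placeholder name */

  void barX (void)
  {
    int i;
    for (i = 0; i < 1024; i++)
      /* the conditional loads from z[] and y[] under discussion */
      r[i] = ((x[i] > 0.0f) & (w[i] < 0.0f)) ? z[i] : y[i];
  }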
[note: the following section, until the next quoted line, is mostly an
explanation
of if conversion and its trade-offs; please feel free to skip it or to read it
later]
OK, but that's overly conservative in this case for most/all modern high-speed processors:
"z[i]" and "y[i]" are both pure expressions, and
the possible values of 'i' here [given that the hardware doesn't malfunction,
nobody alters the value in a debugger, etc.] can be proven
not to overflow the bounds of the arrays.  With purity established and
bounds-validity known to be OK, all we need to do is weigh
the "cost" of evaluating both expressions against that of inflicting a conditional branch on
the hardware.  I believe this "cost" depends on the
values in the arrays 'x' and 'w': when the probability of "((x[i]>0)
& (w[i]<0))" being true stays very close to
50% for a long time, the branch predictor is going to find it difficult to make
any sense of it, and the CPU is going to speculatively
execute the wrong arm about 50% of the time, until the data arrive that
resolve the speculation in the given iteration.
On most/all modern high-speed processors, doing the loads unconditionally is going to be
better IMO for anything even remotely resembling a "normal"
or "average" workload.  In other words, so long as the result of "((x[i]>0) &
(w[i]<0))" changes frequently and unpredictably as 'i' is incremented,
it should be better to do both loads: the needed data are in cache lines that
are already in cache, possibly even in the L1 data cache, so it's pretty
"cheap" to just load them.  Conditional branches based on difficult-to-predict
data that change frequently [as the arrays are traversed],
OTOH, are likely to cause stalls and to make execution slower than the
fastest that CPU could achieve for the source code in question.
In cases where the data upon which the condition depends lead to mostly-true or
mostly-false results,
the conditional branch should perform well, given enough iterations, on any
CPU with a decent branch predictor,
but those are rare conditions IMO.  I think we should assume random data in the
arrays unless/until we have analysis that tells
us otherwise, and I don't expect much compile-time analysis of the contents of
arrays to become the norm in compilers anytime soon.
Those mostly-true/mostly-false cases are probably extremely infrequent anyway, with the possible
exception of the processing of sparse matrices;
does anybody reading this with a strong background in numerical/scientific
computing have anything to comment on this?
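Returning to the predictability argument above: a tiny, hedged microbenchmark
sketch of the kind of comparison I have in mind is below.  Everything in it
(names, sizes, iteration counts, the timing harness) is made up for
illustration, and the effect is only visible if the compiler actually leaves
the ternary as a conditional branch, which depends on the target and the flags
used.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N 1024
  float x[N], w[N], y[N], z[N], r[N];

  static void
  time_loop (const char *label)
  {
    clock_t t0 = clock ();
    for (int rep = 0; rep < 100000; rep++)
      for (int i = 0; i < N; i++)
        r[i] = ((x[i] > 0.0f) & (w[i] < 0.0f)) ? z[i] : y[i];
    printf ("%s: %.3f s\n", label,
            (double) (clock () - t0) / CLOCKS_PER_SEC);
  }

  int
  main (void)
  {
    for (int i = 0; i < N; i++)
      {
        y[i] = 1.0f;
        z[i] = 2.0f;
      }

    /* ~50%-taken, hard-to-predict condition.  */
    for (int i = 0; i < N; i++)
      {
        x[i] = (rand () & 1) ? 1.0f : -1.0f;
        w[i] = (rand () & 1) ? 1.0f : -1.0f;
      }
    time_loop ("unpredictable condition");

    /* Always-true condition: easy work for the branch predictor.  */
    for (int i = 0; i < N; i++)
      {
        x[i] = 1.0f;
        w[i] = -1.0f;
      }
    time_loop ("predictable condition");

    return 0;
  }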
Predication of instructions can help to remove the burden of the conditional
branch, but is not available on all relevant architectures.
In some architectures that are typically implemented in modern high-speed
processors -- i.e. with high core frequency, caches, speculative execution,
etc. --
there is not full support for predication [as there is, for example, in 32-bit ARM], but there _is_
support for "conditional move" [hereinafter "cmove"].
If/when the ISA in question supports cmove to/from main memory, perhaps it
would be profitable to use two cmoves back-to-back with opposite conditions
and the same register [destination for load, source for store] to implement e.g. "temp = c ?
X[x] : Y[y]" and/or "temp = C[c] ? X[x] : Y[y]" etc.
Even without cmove to/from main memory, two cmoves back-to-back with opposite
conditions could still be used, e.g. [not for a real-world ISA]:
  load X[x] -> reg1
  load Y[y] -> reg2
  cmove c ? reg1 -> reg3
  cmove (!c) ? reg2 -> reg3

Or even better if the ISA can support something like:

  load X[x] -> reg1
  load Y[y] -> reg2
  cmove (c ? reg1 : reg2) -> reg3
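At the C level, the same idea -- issue both loads unconditionally and replace
the control dependence with a data dependence -- might look like the sketch
below.  Whether a given back end then turns the final select into a cmove [or
a vector blend] is entirely up to that back end; this is illustrative only,
not a claim about what GCC emits today.

  /* Both loads happen regardless of 'c'; the selection is a pure data
     operation, matching the "cmove (c ? reg1 : reg2) -> reg3" pattern.  */
  static inline float
  select_load (int c, const float *X, long x, const float *Y, long y)
  {
    float reg1 = X[x];      /* load X[x] -> reg1 */
    float reg2 = Y[y];      /* load Y[y] -> reg2 */
    return c ? reg1 : reg2; /* cmove-style select of the result */
  }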
However, this is a code-gen issue from my POV, and at the present time all the
if-conversion work is occurring at the GIMPLE level.
If anybody reading this knows how I could make the if converter generate GIMPLE
that leads to better code-gen for at
least one ISA without negatively impacting code-gen for
any other ISA, I'd be glad to read about it.
> Without -ftree-loop-if-convert-stores, if-conversion leaves this alone, and
> vectorization fails (i.e. the vectorizer bails out because the loop has >2
> basic blocks).  With -ftree-loop-if-convert-stores, if-conversion produces
> [snip]
> where I have commented the conditional loads that have become unconditional.
> (Hence, "-ftree-loop-if-convert-stores" looks misnamed - it affects how the
> if-conversion phase converts loads, too - please correct me if I misunderstand
> (Richard?) ?!)  This contains no control flow, and so is vectorizable.
> [snip]
> (This is all without your scratchpad patch, of course.)
I think the preceding is quite correct.
Sebastian and I have discussed [at some length, on my part ;-)] what to do
about the flags. We definitely want them to change.
We might wind up with a different number of flags, i.e. not just rename one or
both of the two we currently have [-ftree-loop-if-convert and -ftree-loop-if-convert-stores].
Does anybody know of any large codebase that depends heavily on the current
flags?
AFAIK they have been used infrequently, if ever, in production code so far.
Regards,
Abe