[Alan wrote:]
My understanding is that any decision as to whether one or both of y or z is
evaluated (when 'evaluation' involves doing any work,
e.g. a load) has already been encoded into the gimple/tree IR.  Thus, if we are
to evaluate only one of 'y' or 'z' in your example,
the IR will (prior to if-conversion) contain basic blocks and control flow
that jump around the one that's not evaluated.
This appears to be the case in pr61194.c: prior to if-conversion, the IR for
the loop in barX is
  <bb 3>:
  # i_16 = PHI <i_13(7), 0(2)>
  # ivtmp_21 = PHI <ivtmp_20(7), 1024(2)>
  _5 = x[i_16];
  _6 = _5 > 0.0;
  _7 = w[i_16];
  _8 = _7 < 0.0;
  _9 = _6 & _8;
  if (_9 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  iftmp.0_10 = z[i_16];
  goto <bb 6>;

  <bb 5>:
  iftmp.0_11 = y[i_16];
[snip]
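For reference, reconstructed from the dump above (i.e. a paraphrase, not a copy
of the pr61194.c source, and the store target is only a guess since that part
was snipped), the barX loop has roughly this shape:

  float x[1024], w[1024], y[1024], z[1024], r[1024];  /* 'r' is a placeholder name */

  void barX (void)
  {
    int i;
    for (i = 0; i < 1024; i++)
      /* the conditional loads from z[] and y[] under discussion */
      r[i] = ((x[i] > 0.0f) & (w[i] < 0.0f)) ? z[i] : y[i];
  }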
[note: the following section, until the next quoted line, is mostly an
explanation
of if conversion and its trade-offs; please feel free to skip it or to read it
later]
OK, but that's overly conservative in this case for most/all modern high-speed processors:
"z[i]" and "y[i]" are both pure expressions, and
the possible values of 'i' here [given that the hardware doesn't malfunction,
nobody alters the value in a debugger, etc.] can be proven
not to overflow the bounds of the arrays.  With purity established and
bounds-validity known to be OK, all we need to do is weigh
the "cost" of evaluating both expressions against that of inflicting a conditional branch on
the hardware.  I believe this "cost" depends on the
values in the arrays 'x' and 'w': when the probability of "((x[i]>0)
& (w[i]<0))" being true stays very close to
50% for a long time, the branch predictor is going to find it difficult to make
any sense of it, and the CPU is going to speculatively
execute the wrong arm about 50% of the time, until the data arrive that
resolve the speculation in the given iteration.
On most/all modern high-speed processors, doing the loads unconditionally is going to be
better IMO for anything even remotely resembling a "normal"
or "average" workload.  In other words, so long as the result of "((x[i]>0) &
(w[i]<0))" changes frequently and unpredictably as 'i' is incremented,
it should be better to do both loads: the needed data are in cache lines that
are already in cache, possibly even in the L1 data cache, so it's pretty
"cheap" to just load them.  Conditional branches based on difficult-to-predict
data that change frequently [as the arrays are traversed],
OTOH, are likely to cause stalls and to make execution slower than the
fastest that CPU could achieve for the source code in question.
In cases where the data upon which the condition depends lead to mostly-true or
mostly-false results,
the conditional branch should perform well, given enough iterations, on any
CPU with a decent branch predictor,
but those are rare conditions IMO.  I think we should assume random data in the
arrays unless/until we have analysis that tells
us otherwise, and I don't expect much compile-time analysis of the contents of
arrays to become the norm in compilers anytime soon.
Those mostly-true/mostly-false cases are probably extremely infrequent anyway, with the possible
exception of the processing of sparse matrices;
does anybody reading this with a strong background in numerical/scientific
computing have anything to comment on this?
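Returning to the predictability argument above: a tiny, hedged microbenchmark
sketch of the kind of comparison I have in mind is below.  Everything in it
(names, sizes, iteration counts, the timing harness) is made up for
illustration, and the effect is only visible if the compiler actually leaves
the ternary as a conditional branch, which depends on the target and the flags
used.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N 1024
  float x[N], w[N], y[N], z[N], r[N];

  static void
  time_loop (const char *label)
  {
    clock_t t0 = clock ();
    for (int rep = 0; rep < 100000; rep++)
      for (int i = 0; i < N; i++)
        r[i] = ((x[i] > 0.0f) & (w[i] < 0.0f)) ? z[i] : y[i];
    printf ("%s: %.3f s\n", label,
            (double) (clock () - t0) / CLOCKS_PER_SEC);
  }

  int
  main (void)
  {
    for (int i = 0; i < N; i++)
      {
        y[i] = 1.0f;
        z[i] = 2.0f;
      }

    /* ~50%-taken, hard-to-predict condition.  */
    for (int i = 0; i < N; i++)
      {
        x[i] = (rand () & 1) ? 1.0f : -1.0f;
        w[i] = (rand () & 1) ? 1.0f : -1.0f;
      }
    time_loop ("unpredictable condition");

    /* Always-true condition: easy work for the branch predictor.  */
    for (int i = 0; i < N; i++)
      {
        x[i] = 1.0f;
        w[i] = -1.0f;
      }
    time_loop ("predictable condition");

    return 0;
  }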
Predication of instructions can help to remove the burden of the conditional
branch, but is not available on all relevant architectures.
In some architectures that are typically implemented in modern high-speed
processors -- i.e. with high core frequency, caches, speculative execution,
etc. --
there is not full support for predication [as there is, for example, in 32-bit ARM], but there _is_
support for "conditional move" [hereinafter "cmove"].
If/when the ISA in question supports cmove to/from main memory, perhaps it
would be profitable to use two cmoves back-to-back with opposite conditions
and the same register [destination for load, source for store] to implement e.g. "temp = c ?
X[x] : Y[y]" and/or "temp = C[c] ? X[x] : Y[y]" etc.
Even without cmove to/from main memory, two cmoves back-to-back with opposite
conditions could still be used, e.g. [not for a real-world ISA]:
  load X[x] -> reg1
  load Y[y] -> reg2
  cmove c ? reg1 -> reg3
  cmove (!c) ? reg2 -> reg3

Or even better if the ISA can support something like:

  load X[x] -> reg1
  load Y[y] -> reg2
  cmove (c ? reg1 : reg2) -> reg3
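At the C level, the same idea -- issue both loads unconditionally and replace
the control dependence with a data dependence -- might look like the sketch
below.  Whether a given back end then turns the final select into a cmove [or
a vector blend] is entirely up to that back end; this is illustrative only,
not a claim about what GCC emits today.

  /* Both loads happen regardless of 'c'; the selection is a pure data
     operation, matching the "cmove (c ? reg1 : reg2) -> reg3" pattern.  */
  static inline float
  select_load (int c, const float *X, long x, const float *Y, long y)
  {
    float reg1 = X[x];      /* load X[x] -> reg1 */
    float reg2 = Y[y];      /* load Y[y] -> reg2 */
    return c ? reg1 : reg2; /* cmove-style select of the result */
  }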
However, this is a code-gen issue from my POV, and at the present time all the
if-conversion work is occurring at the GIMPLE level.
If anybody reading this knows how I could make the if converter generate GIMPLE
that leads to better code-gen for at
least one ISA without negatively impacting code-gen for
any other ISA, I'd be glad to read about it.
> Without -ftree-loop-if-convert-stores, if-conversion leaves this alone, and
> vectorization fails (i.e. the vectorizer bails out because the loop has >2
> basic blocks).  With -ftree-loop-if-convert-stores, if-conversion produces
> [snip]
> where I have commented the conditional loads that have become unconditional.
> (Hence, "-ftree-loop-if-convert-stores" looks misnamed - it affects how the
> if-conversion phase converts loads, too - please correct me if I misunderstand
> (Richard?) ?!)  This contains no control flow, and so is vectorizable.
> [snip]
> (This is all without your scratchpad patch, of course.)
I think the preceding is quite correct.
Sebastian and I have discussed [at some length, on my part ;-)] what to do
about the flags. We definitely want them to change.
We might wind up with a different number of flags, i.e. not just rename one or
both of the two we currently have [-ftree-loop-if-convert and -ftree-loop-if-convert-stores].
Does anybody know of any large codebase that depends heavily on the current
flags?
AFAIK they have been used infrequently, if ever, in production code so far.
Regards,
Abe