Hi Andrew,
Given your latest patch set, and based on my understanding of LLVM's
sink-common-insts transformation, I wonder whether there are two possible
directions worth considering here.
The first would be a more general common-operation sinking/factoring pass
in GCC, perhaps placed in or near the existing sink / phiopt machinery.
Such a pass could try to match identical operations across control-flow
joins, build the required PHIs for differing operands, and then recreate
the common operation after the merge. This would be closer in spirit to
LLVM's approach of expression matching plus a profitability model.
I think some heuristic cost model would be important here, since sinking
is not always profitable and can sometimes hurt performance, for example
by extending live ranges or by moving computation to a less favorable
place in the CFG.
A second, more targeted direction might be to do a sinking/factoring step
specifically before the "No-Conditional Execution" loop conversion in
ifcvt. Since that path is already preparing loops for if-conversion and
eventual vectorization, a local pre-pass there could expose more cases to
the existing ifcvt factor_* logic before the conditional structure is
flattened.
This second approach seems especially relevant for the repeated or chained
factoring cases discussed above: if the current ifcvt machinery can already
handle simple matching-based sinking, then an earlier local sinking step
might make more complex chains visible in a form that ifcvt can consume.
So I was wondering whether the community would prefer to explore:
  * a more general common-operation sinking/factoring pass in the middle end,
    likely requiring an explicit profitability model, or
  * a narrower pre-ifcvt sinking step for ifcvt'able loops, aimed at improving
    the existing No-Conditional Execution conversion path.
Thanks again for the detailed explanation and pointers.
Best regards,
Guoce Feng

---- Replied Message ----
From    Guoce Feng<[email protected]><mailto:[email protected]>
Date    5/19/2026 12:47
To      Andrew 
Pinski<[email protected]><mailto:[email protected]>
Cc      [email protected]<[email protected]>,
<mailto:[email protected]>Richard Biener<[email protected]>,
<mailto:[email protected]>Andrew Pinski<[email protected]>,
<mailto:[email protected]>Jeff Law<[email protected]>,
<mailto:[email protected]>Kewen 
Lin<[email protected]><mailto:[email protected]>
Subject Re: [RFC] Factoring common pure operations across PHIs / branches to 
reduce duplicated sqrt calls
Hi Andrew,

Thanks for checking the previous testcase directly.

That clarifies things a lot. With your PR123113 patch set, the example I
sent is handled as desired under

-O3 -fno-math-errno -fno-trapping-math

and the ifcvt result with a single __builtin_sqrt after the conditional
selection is exactly the shape I was hoping for.

So, for that testcase, your patch series already addresses the issue.

More generally, it seems that the current ifcvt factoring machinery can
already handle some relatively simple matching-and-sinking cases.
I am now more interested in the broader extension you mentioned for the
factor_* logic in ifcvt.

As I understand it, the current implementation is mainly able to factor
operations when the matching structure is relatively simple, for example
when one operand already matches across the alternatives. A more general
form would be to recognize the same operation even when two or three
operands differ, and then materialize the corresponding 2/3 PHIs before
rebuilding the operation.

Conceptually, for example,

op1 (op2(a1, b1),c1)
op1 (op2(a2, b2),c2)

could become

a = PHI <a1, a2>
b = PHI <b1, b2>
c = PHI <c1,c2>
op1(op2 (a, b),c)

and similarly for operations with three varying inputs.

It also looks like there may still be a gap for repeated or chained
sinking/factoring opportunities. For example, this reduced case does not
appear to be fully handled:

https://godbolt.org/z/n3ezj1MPq

So, from these examples, my current impression is that ifcvt can already
perform simple matching-based sinking/factoring merges, but there may still
be limitations for multi-step chained sinking/factoring cases.

Is there community interest in extending the ifcvt factor_* machinery in
that direction, namely:

* matching identical operations with multiple differing operands,
* producing the required 2/3 PHIs, and
* handling repeated or chained factoring opportunities when profitable?

This seems related to the more general expression-matching direction you
mentioned, rather than only the current limited factoring cases.

Thanks again for the clarification and the pointers.

Best regards,
Guoce Feng
---- Replied Message ----
From    Andrew 
Pinski<[email protected]><mailto:[email protected]>
Date    5/19/2026 00:54
To      Guoce Feng<[email protected]><mailto:[email protected]>
Cc      [email protected]<[email protected]>,
<mailto:[email protected]>Richard Biener<[email protected]>,
<mailto:[email protected]>Andrew Pinski<[email protected]>,
<mailto:[email protected]>Jeff 
Law<[email protected]><mailto:[email protected]>
Subject Re: [RFC] Factoring common pure operations across PHIs / branches to 
reduce duplicated sqrt calls
On Mon, May 18, 2026 at 2:24 AM Guoce Feng 
<[email protected]<mailto:[email protected]>> wrote:

Hi Andrew,

Thanks for the detailed explanation and for pointing me to PR123113 and
the related patch series.

I checked the scope of PR123113 more carefully. My understanding is that
those patches mainly relax existing restrictions in phiopt / cselim /
factor_out_conditional_operation around merge blocks with more than two
predecessors. That certainly seems relevant to part of the problem. 
(Inbox<https://inbox.sourceware.org/gcc-bugs/?t=20251215070155&utm_source=chatgpt.com>)

However, I do not think those patches fully cover the motivating case I
had in mind.

The vectorized example you showed has the form

sqrt(expr1[i])
sqrt(expr2[i])
sqrt(expr3[i])


where expr1/expr2/expr3 are already available values, so ifcvt can form

select(select(...))
sqrt(selected_value)


quite naturally.

In my motivating kernel, the operands of sqrt are themselves computed
inside the individual control-flow paths. In simplified form, the three
arms look more like

sqrt(complex_expr1)
sqrt(complex_expr2)
sqrt(complex_expr3)


where the third arm also contains extra intermediate computations before
the final squared-distance expression is formed.

Here is a reduced Compiler Explorer example that still does not get
converted into the desired “blend operands first, then perform one sqrt”
shape:

https://godbolt.org/z/savsxzaWz

With my patch set with the above testcase with -O3  -fno-math-errno  
-fno-trapping-math, we get in ifcvt:
```
  _3 = _66 ? _21 : _16;
  _65 = _122 ? _11 : _3;
  distance_24 = __builtin_sqrt (_65);
  total_49 = distance_24 + total_53;
 ```
So it is working like you want it to work. There is only one sqrt in the 
vectorized code.

As for the other expressions that could/should be factor out we definitely need 
something better than just this.
LLVM has a pass which matching up the expressions and then doing some cost 
model.

The factor_* code in ifcvt could be improved to see if it is the same operation 
and produce 2/3 phis.
Right now the factor code is only designed to match if one of the operands of 
the operation matches because that was easiest at the time of writing it.
We could do this as a "pre"-pass for the ifcvt'able loops too.

Thanks,
Andrea





So the transformation I am interested in is still closer to:

PHI <sqrt(a), sqrt(b), sqrt(c)>
    =>
sqrt(PHI <a, b, c>)


in cases where a/b/c may be branch-local computed expressions, not just
already-materialized values.

Put differently:

PR123113 improves factoring around multi-pred PHIs,
but does GCC still lack a stronger common pure-operation
factoring/sinking transformation for branch-local computed operands?


I would be interested in your thoughts on whether this stronger form is
considered worthwhile in GCC, and if so, whether you would expect it to
belong in phiopt, ifcvt, or some other middle-end transformation.

Thanks again for the clarification.

Best regards,
Guoce Feng

---- Replied Message ----
From    Andrew 
Pinski<[email protected]><mailto:[email protected]>
Date    5/18/2026 16:49
To      Guoce Feng<[email protected]><mailto:[email protected]>
Cc      [email protected]<[email protected]>,
<mailto:[email protected]>Richard Biener<[email protected]>,
<mailto:[email protected]>Andrew Pinski<[email protected]>,
<mailto:[email protected]>Jeff 
Law<[email protected]><mailto:[email protected]>
Subject Re: [RFC] Factoring common pure operations across PHIs / branches to 
reduce duplicated sqrt calls

On Mon, May 18, 2026 at 1:16 AM Guoce Feng via Gcc 
<[email protected]<mailto:[email protected]>> wrote:

 Hi,

 I am investigating an optimization opportunity in GCC related to
 factoring common pure operations out of control-flow joins.

 Consider code of this general form:

 if (c1 <= 0.0)
   distance = sqrt(expr1);
 else if (c2 <= c1)
   distance = sqrt(expr2);
 else
   distance = sqrt(expr3);


 After control-flow simplification / if-conversion, this can effectively
 lead to a form resembling:

 distance = select(cond1,
                   sqrt(expr1),
                   select(cond2, sqrt(expr2), sqrt(expr3)));


 For vectorized code, this may become "three vector sqrt operations plus
 blends", while a more profitable form would be:

 merged_expr = select(cond1,
                      expr1,
                      select(cond2, expr2, expr3));
 distance = sqrt(merged_expr);


 That is, transform a conceptual pattern like

 phi(sqrt(a), sqrt(b), sqrt(c))


 into

 sqrt(phi(a, b, c))


 when the operation is safe to factor out and doing so is profitable.

 I noticed that LLVM has a related transformation in SimplifyCFG under
 the sink-common-insts machinery. In GCC, I am trying to understand what
 the most appropriate place would be for a similar optimization.

 Two possible implementation directions seem plausible:

   1.  Extend the existing PHI factoring logic in
 tree-ssa-phiopt.cc, especially around
 factor_out_conditional_operation, to also handle suitable
 side-effect-free builtins / calls such as sqrt under the
 appropriate semantic restrictions.

So part of the problem with the above case at -Ofast is really
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123113 .
I have a patch set which is mostly approved but I need to double check
on the order of basic-blocks being looked at.
The patches are:
https://inbox.sourceware.org/gcc-patches/[email protected]/
https://inbox.sourceware.org/gcc-patches/[email protected]/
https://patchwork.sourceware.org/project/gcc/patch/[email protected]/

In that order. (I can't seen to find the 3rd patch on inbox for some reason).
With those patches we get:
```
  <bb 2> [local count: 1073741824]:
  if (c1_2(D) <= 0.0)
    goto <bb 3>; [41.00%]
  else
    goto <bb 4>; [59.00%]

  <bb 3> [local count: 440234144]:
  distance_7 = __builtin_sqrt (expr1_6(D)); [tail call]
  goto <bb 7>; [100.00%]

  <bb 4> [local count: 633507680]:
  if (c1_2(D) >= c2_3(D))
    goto <bb 5>; [50.00%]
  else
    goto <bb 6>; [50.00%]

  <bb 5> [local count: 316753840]:

  <bb 6> [local count: 633507680]:
  # _8 = PHI <expr3_4(D)(4), expr2_5(D)(5)>
  distance_10 = __builtin_sqrt (_8); [tail call]

  <bb 7> [local count: 1073741824]:
  # distance_1 = PHI <distance_10(6), distance_7(3)>
  return distance_1;
```

Extending factor_* to handle the above could be done but I am not sure
phiopt is the correct place.
Note there is code in the ifcvt that does a similar thing as factor_*
in phiopt which then does handle the rest of it and for:
```
double f(double *expr1, double *expr2, double *expr3, double *c1, double *c2)
{
double t = 0.0;
for(int i =0 ;i < 1024; i++)
{
  double distance = 0.0;
if (c1[i] <= 0.0)
  distance = __builtin_sqrt(expr1[i]);
else if (c2[i] <= c1[i])
  distance = __builtin_sqrt(expr2[i]);
else
  distance = __builtin_sqrt(expr3[i]);
t+=distance;
}
  return t;
}

```

With -mavx512f (since I was too lazy to figure out what expr* was and
too lazy to test non-x86_64 [since that is what I have built]) we get:
```
  _7 = _34 ? _14 : _12;
  _63 = _42 ? _6 : _7;
  distance_17 = __builtin_sqrt (_63);
  t_25 = distance_17 + t_29;
```
In ifcvt.

Which I assume is exactly what you were expecting.



   2.  Add a more general "common instruction sinking" transformation to
 tree-ssa-sink.cc, analogous in spirit to the existing common-store
 sinking support, but for a restricted class of pure scalar
 computations.

 My current intuition is that the PHIOPT direction may be a better fit
 for this specific pattern, because the optimization is fundamentally
 about factoring a common operation across PHI/merge structure rather
 than ordinary code sinking. However, I would appreciate guidance from
 the GCC community on the preferred design.

 I am especially interested in feedback on:

   *   Whether tree-ssa-phiopt.cc is indeed the right place for this form of
 transformation.

   *   Whether handling builtins such as sqrt is acceptable under the
 relevant floating-point / errno / exception semantics.

   *   Whether this should be restricted to cases where the operation is
 known to be side-effect-free and safe to move.

   *   Whether profitability should be guided primarily by code size,
 vectorization opportunities, or target-independent cost heuristics.

   *   Whether there is existing work or a PR in this area that I should
 build upon.

See above.


 A motivating reduced example comes from a geometric distance kernel in
 which three branch-local sqrt computations could ideally be turned into
 one sqrt after value merging, improving the shape of the vectorized code.
See above.

Thanks,
Andrea


 If this direction seems reasonable, I am interested in preparing a
 prototype patch and test cases.

 Thanks,
 Guoce Feng

Reply via email to