Hi Andrew, Thanks for the detailed explanation and for pointing me to PR123113 and the related patch series.
I checked the scope of PR123113 more carefully. My understanding is that those patches mainly relax existing restrictions in phiopt / cselim / factor_out_conditional_operation around merge blocks with more than two predecessors. That certainly seems relevant to part of the problem. (Inbox<https://inbox.sourceware.org/gcc-bugs/?t=20251215070155&utm_source=chatgpt.com>) However, I do not think those patches fully cover the motivating case I had in mind. The vectorized example you showed has the form sqrt(expr1[i]) sqrt(expr2[i]) sqrt(expr3[i]) where expr1/expr2/expr3 are already available values, so ifcvt can form select(select(...)) sqrt(selected_value) quite naturally. In my motivating kernel, the operands of sqrt are themselves computed inside the individual control-flow paths. In simplified form, the three arms look more like sqrt(complex_expr1) sqrt(complex_expr2) sqrt(complex_expr3) where the third arm also contains extra intermediate computations before the final squared-distance expression is formed. Here is a reduced Compiler Explorer example that still does not get converted into the desired “blend operands first, then perform one sqrt” shape: https://godbolt.org/z/savsxzaWz So the transformation I am interested in is still closer to: PHI <sqrt(a), sqrt(b), sqrt(c)> => sqrt(PHI <a, b, c>) in cases where a/b/c may be branch-local computed expressions, not just already-materialized values. Put differently: PR123113 improves factoring around multi-pred PHIs, but does GCC still lack a stronger common pure-operation factoring/sinking transformation for branch-local computed operands? I would be interested in your thoughts on whether this stronger form is considered worthwhile in GCC, and if so, whether you would expect it to belong in phiopt, ifcvt, or some other middle-end transformation. Thanks again for the clarification. Best regards, Guoce Feng ---- Replied Message ---- From Andrew Pinski<[email protected]><mailto:[email protected]> Date 5/18/2026 16:49 To Guoce Feng<[email protected]><mailto:[email protected]> Cc [email protected]<[email protected]>, <mailto:[email protected]>Richard Biener<[email protected]>, <mailto:[email protected]>Andrew Pinski<[email protected]>, <mailto:[email protected]>Jeff Law<[email protected]><mailto:[email protected]> Subject Re: [RFC] Factoring common pure operations across PHIs / branches to reduce duplicated sqrt calls On Mon, May 18, 2026 at 1:16 AM Guoce Feng via Gcc <[email protected]> wrote: Hi, I am investigating an optimization opportunity in GCC related to factoring common pure operations out of control-flow joins. Consider code of this general form: if (c1 <= 0.0) distance = sqrt(expr1); else if (c2 <= c1) distance = sqrt(expr2); else distance = sqrt(expr3); After control-flow simplification / if-conversion, this can effectively lead to a form resembling: distance = select(cond1, sqrt(expr1), select(cond2, sqrt(expr2), sqrt(expr3))); For vectorized code, this may become "three vector sqrt operations plus blends", while a more profitable form would be: merged_expr = select(cond1, expr1, select(cond2, expr2, expr3)); distance = sqrt(merged_expr); That is, transform a conceptual pattern like phi(sqrt(a), sqrt(b), sqrt(c)) into sqrt(phi(a, b, c)) when the operation is safe to factor out and doing so is profitable. I noticed that LLVM has a related transformation in SimplifyCFG under the sink-common-insts machinery. In GCC, I am trying to understand what the most appropriate place would be for a similar optimization. Two possible implementation directions seem plausible: 1. Extend the existing PHI factoring logic in tree-ssa-phiopt.cc, especially around factor_out_conditional_operation, to also handle suitable side-effect-free builtins / calls such as sqrt under the appropriate semantic restrictions. So part of the problem with the above case at -Ofast is really https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123113 . I have a patch set which is mostly approved but I need to double check on the order of basic-blocks being looked at. The patches are: https://inbox.sourceware.org/gcc-patches/[email protected]/ https://inbox.sourceware.org/gcc-patches/[email protected]/ https://patchwork.sourceware.org/project/gcc/patch/[email protected]/ In that order. (I can't seen to find the 3rd patch on inbox for some reason). With those patches we get: ``` <bb 2> [local count: 1073741824]: if (c1_2(D) <= 0.0) goto <bb 3>; [41.00%] else goto <bb 4>; [59.00%] <bb 3> [local count: 440234144]: distance_7 = __builtin_sqrt (expr1_6(D)); [tail call] goto <bb 7>; [100.00%] <bb 4> [local count: 633507680]: if (c1_2(D) >= c2_3(D)) goto <bb 5>; [50.00%] else goto <bb 6>; [50.00%] <bb 5> [local count: 316753840]: <bb 6> [local count: 633507680]: # _8 = PHI <expr3_4(D)(4), expr2_5(D)(5)> distance_10 = __builtin_sqrt (_8); [tail call] <bb 7> [local count: 1073741824]: # distance_1 = PHI <distance_10(6), distance_7(3)> return distance_1; ``` Extending factor_* to handle the above could be done but I am not sure phiopt is the correct place. Note there is code in the ifcvt that does a similar thing as factor_* in phiopt which then does handle the rest of it and for: ``` double f(double *expr1, double *expr2, double *expr3, double *c1, double *c2) { double t = 0.0; for(int i =0 ;i < 1024; i++) { double distance = 0.0; if (c1[i] <= 0.0) distance = __builtin_sqrt(expr1[i]); else if (c2[i] <= c1[i]) distance = __builtin_sqrt(expr2[i]); else distance = __builtin_sqrt(expr3[i]); t+=distance; } return t; } ``` With -mavx512f (since I was too lazy to figure out what expr* was and too lazy to test non-x86_64 [since that is what I have built]) we get: ``` _7 = _34 ? _14 : _12; _63 = _42 ? _6 : _7; distance_17 = __builtin_sqrt (_63); t_25 = distance_17 + t_29; ``` In ifcvt. Which I assume is exactly what you were expecting. 2. Add a more general "common instruction sinking" transformation to tree-ssa-sink.cc, analogous in spirit to the existing common-store sinking support, but for a restricted class of pure scalar computations. My current intuition is that the PHIOPT direction may be a better fit for this specific pattern, because the optimization is fundamentally about factoring a common operation across PHI/merge structure rather than ordinary code sinking. However, I would appreciate guidance from the GCC community on the preferred design. I am especially interested in feedback on: * Whether tree-ssa-phiopt.cc is indeed the right place for this form of transformation. * Whether handling builtins such as sqrt is acceptable under the relevant floating-point / errno / exception semantics. * Whether this should be restricted to cases where the operation is known to be side-effect-free and safe to move. * Whether profitability should be guided primarily by code size, vectorization opportunities, or target-independent cost heuristics. * Whether there is existing work or a PR in this area that I should build upon. See above. A motivating reduced example comes from a geometric distance kernel in which three branch-local sqrt computations could ideally be turned into one sqrt after value merging, improving the shape of the vectorized code. See above. Thanks, Andrea If this direction seems reasonable, I am interested in preparing a prototype patch and test cases. Thanks, Guoce Feng
