On Tue, Oct 21, 2025 at 11:20:38AM +0200, Richard Biener wrote:
> On Mon, Oct 20, 2025 at 4:47 PM Robin Dapp <[email protected]> wrote:
> >
> > Hi,
> >
> > When niter runs after the copy-header pass it sometimes fails to
> > simplify assumptions in a ctz loop.
> > As the assumption is a simple nonzero test here and we
> > know that a val >> 1 != 0 (or val << 1 != 0) if val & 1 != 0 we
> > can elide the shift when simplifying the assumption.
>
> Hmm, if val == 1 then val >> 1 == 0. So I'm not sure I follow
> your logic here?
I agree - this logic looks bogus to me. The relaxed assumption requires that
we've already checked whether any of the "ignored" bits are nonzero, but
there's no new code to verify this before modifying the assumption.
You mention range queries in your code comment - I think that approach would be
the correct way to fix this.
Incidentally, the vectoriser does use ranger in this context, which allows it
to determine the exact number of iterations already. With SVE enabled on
aarch64, this leads to codegen that computes the exact number of iterations
with CTZ, but then uses that value to look up the result in a vectorised
counter (which is probably the most expensive "nop" I've ever seen GCC
produce).
>
> I think the assumption tries to defend against the UB value
> of .C[LT]Z, but then we might as well use the ability to specify
> the value at zero (though I always fail to find documentation
> about our internal functions). That said, we build .CL[TZ] (src)
> so it's fair and correct to test src against zero.
If the assumption (in the existing source) doesn't hold, then that exit
condition never triggers (so in a single exit case this would be an infinite
loop). It's nothing to do with the definedness or otherwise of .C[LT]Z (0).
Alice
> Richard.
>
> >
> > This helps recognize a ctz loop in 502.gcc's compute_transp.
> >
> > Regtested on x86, power10, and rv64gcbv_zvl512b.
> >
> > Regards
> > Robin
> >
> > PR/tree-optimization 122207
> >
> > gcc/ChangeLog:
> >
> > * tree-ssa-loop-niter.cc (number_of_iterations_cltz): Use
> > unshifted src.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/tree-ssa/ctz-char.c: Remove -fno-tree-ch.
> > * gcc.dg/tree-ssa/ctz-int.c: Ditto.
> > * gcc.dg/tree-ssa/ctz-long-long.c: Ditto.
> > * gcc.dg/tree-ssa/ctz-long.c: Ditto.
> > * gcc.dg/tree-ssa/ctz-ch.c: New test.
> > ---
> > gcc/testsuite/gcc.dg/tree-ssa/ctz-ch.c | 23 ++++++++++
> > gcc/testsuite/gcc.dg/tree-ssa/ctz-char.c | 2 +-
> > gcc/testsuite/gcc.dg/tree-ssa/ctz-int.c | 2 +-
> > gcc/testsuite/gcc.dg/tree-ssa/ctz-long-long.c | 2 +-
> > gcc/testsuite/gcc.dg/tree-ssa/ctz-long.c | 2 +-
> > gcc/tree-ssa-loop-niter.cc | 45 ++++++++++++++++++-
> > 6 files changed, 70 insertions(+), 6 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ctz-ch.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ctz-ch.c
> > b/gcc/testsuite/gcc.dg/tree-ssa/ctz-ch.c
> > new file mode 100644
> > index 00000000000..5d725979971
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ctz-ch.c
> > @@ -0,0 +1,23 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-optimized" } */
> > +
> > +typedef unsigned long BITMAP_WORD;
> > +
> > +bool
> > +bmp_iter_set (BITMAP_WORD bits, unsigned *bit_no)
> > +{
> > + /* If our current word is nonzero, it contains the bit we want. */
> > + if (bits)
> > + {
> > + while (!(bits & 1))
> > + {
> > + bits >>= 1;
> > + *bit_no += 1;
> > + }
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "__builtin_ctz|\\.CTZ" 1 "optimized"
> > } } */
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ctz-char.c
> > b/gcc/testsuite/gcc.dg/tree-ssa/ctz-char.c
> > index 3cd166acbd4..fa8b7f39de4 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/ctz-char.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ctz-char.c
> > @@ -1,6 +1,6 @@
> > /* { dg-do run } */
> > /* { dg-require-effective-target ctz } */
> > -/* { dg-options "-O2 -fno-tree-ch -fdump-tree-optimized" } */
> > +/* { dg-options "-O2 -fdump-tree-optimized" } */
> >
> > #define PREC (__CHAR_BIT__)
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ctz-int.c
> > b/gcc/testsuite/gcc.dg/tree-ssa/ctz-int.c
> > index 7f63493eb73..2bf3ae69b93 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/ctz-int.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ctz-int.c
> > @@ -1,6 +1,6 @@
> > /* { dg-do run } */
> > /* { dg-require-effective-target ctz } */
> > -/* { dg-options "-O2 -fno-tree-ch -fdump-tree-optimized" } */
> > +/* { dg-options "-O2 -fdump-tree-optimized" } */
> >
> > #define PREC (__CHAR_BIT__ * __SIZEOF_INT__)
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ctz-long-long.c
> > b/gcc/testsuite/gcc.dg/tree-ssa/ctz-long-long.c
> > index 924f61b76f0..2e159485cb9 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/ctz-long-long.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ctz-long-long.c
> > @@ -1,6 +1,6 @@
> > /* { dg-do run } */
> > /* { dg-require-effective-target ctzll } */
> > -/* { dg-options "-O2 -fno-tree-ch -fdump-tree-optimized" } */
> > +/* { dg-options "-O2 -fdump-tree-optimized" } */
> >
> > #define PREC (__CHAR_BIT__ * __SIZEOF_LONG_LONG__)
> >
> > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ctz-long.c
> > b/gcc/testsuite/gcc.dg/tree-ssa/ctz-long.c
> > index 178945daa8a..2e3be652a0b 100644
> > --- a/gcc/testsuite/gcc.dg/tree-ssa/ctz-long.c
> > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ctz-long.c
> > @@ -1,6 +1,6 @@
> > /* { dg-do run } */
> > /* { dg-require-effective-target ctzl } */
> > -/* { dg-options "-O2 -fno-tree-ch -fdump-tree-optimized" } */
> > +/* { dg-options "-O2 -fdump-tree-optimized" } */
> >
> > #define PREC (__CHAR_BIT__ * __SIZEOF_LONG__)
> >
> > diff --git a/gcc/tree-ssa-loop-niter.cc b/gcc/tree-ssa-loop-niter.cc
> > index 6e130862549..bda583e9e2d 100644
> > --- a/gcc/tree-ssa-loop-niter.cc
> > +++ b/gcc/tree-ssa-loop-niter.cc
> > @@ -2438,6 +2438,9 @@ number_of_iterations_cltz (loop_p loop, edge exit,
> > tree src = gimple_phi_arg_def (phi, loop_preheader_edge
> > (loop)->dest_idx);
> > int src_precision = TYPE_PRECISION (TREE_TYPE (src));
> >
> > + /* Save the non-preprocessed SRC. */
> > + tree unshifted_src = src;
> > +
> > /* Apply any needed preprocessing to src. */
> > int num_ignored_bits;
> > if (left_shift)
> > @@ -2463,8 +2466,46 @@ number_of_iterations_cltz (loop_p loop, edge exit,
> >
> > expr = fold_convert (unsigned_type_node, expr);
> >
> > - tree assumptions = fold_build2 (NE_EXPR, boolean_type_node, src,
> > - build_zero_cst (TREE_TYPE (src)));
> > + /* If the copy-header (ch) pass peeled one iteration we're shifting
> > + SRC by preprocessing it above.
> > +
> > + A loop like
> > + if (bits)
> > + {
> > + while (!(bits & 1))
> > + {
> > + bits >>= 1;
> > + cnt += 1;
> > + }
> > + return cnt;
> > + }
> > + ch (roughly) transforms into:
> > + if (bits)
> > + {
> > + if (!(bits & 1)
> > + {
> > + do
> > + {
> > + bits >>= 1;
> > + cnt += 1;
> > + } while (!(bits & 1));
> > + }
> > + else
> > + cnt = 1;
> > + return cnt;
> > + }
> > +
> > + Then, our preprocessed SRC (that is used for c[tl]z computation)
> > + will be bits >> 1, the assumption bits >> 1 != 0.
> > + In simplify_using_initial_conditions we walk dominating conditions
> > + to check if the assumption is unconditional.
> > + As bits >>/<< 1 != 0 is true when bits != 0 we can use the
> > + unshifted src to simplify assumptions instead.
> > +
> > + A range query would work as well but since the boundary conditions
> > + are very restricted this simpler approach is sufficient. */
> > + tree assumptions = fold_build2 (NE_EXPR, boolean_type_node,
> > unshifted_src,
> > + build_zero_cst (TREE_TYPE
> > (unshifted_src)));
> >
> > niter->assumptions = simplify_using_initial_conditions (loop,
> > assumptions);
> > niter->may_be_zero = boolean_false_node;
> > --
> > 2.51.0
> >