[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

acoplan at gcc dot gnu.org via Gcc-bugs Fri, 02 Aug 2024 06:50:41 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140


--- Comment #6 from Alex Coplan <acoplan at gcc dot gnu.org> ---
Just to give an update on this, the following testcase shows why adding:

#pragma GCC unroll 4

in libstdc++ doesn't immediately seem to help.  The testcase is:

$ cat lambda.cc
template<typename Iter, typename Pred>
inline Iter
my_find(Iter first, Iter last, Pred pred)
{
#pragma GCC unroll 4
    while (first != last && !pred(*first))
        ++first;
    return first;
}

short *use_find(short *p)
{
    auto pred = [](short x) { return x == 42; };
    return my_find(p, p + 1024, pred);
}

compiling, we get:

$ /xgcc -B . -c lambda.cc -S -o /dev/null
lambda.cc: In function ‘Iter my_find(Iter, Iter, Pred) [with Iter = short int*;
Pred = use_find(short int*)::<lambda(short int)>]’:
lambda.cc:6:5: warning: ignoring loop annotation
    6 |     while (first != last && !pred(*first))
      |     ^~~~~

so the #pragma is indeed getting dropped.  This warning comes from
tree-cfg.cc:replace_loop_annotate.  The exiting basic block here is:

<bb 8> :
D.4524 = .ANNOTATE (iftmp.1, 1, 4);
retval.0 = D.4524;
if (retval.0 != 0)
  goto <bb 3>; [INV]
else
  goto <bb 9>; [INV]

and the code in replace_loop_annotate_in_block (which looks for the .ANNOTATE
ifn call) iterates backwards over the gimple in that block, skipping over the
gcond, but it then expects to find any .ANNOTATE calls immediately before the
gcond.  In this case the copy retval.0 = D.4524 sits between the .ANNOTATE call
and the gcond, so we end up dropping the .ANNOTATE call on the floor and
emitting the warning (and not unrolling).
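
As a rough illustration of that backwards scan, here is a toy model.  It is
not GCC code: the Kind enum and the annotations_found and check_blocks helpers
are hypothetical names, and a block is reduced to a list of statement kinds.
It only shows why a statement sitting between the .ANNOTATE call and the gcond
hides the annotation from a scan that stops at the first non-annotation.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for the statement kinds in the dumps; not GCC's gimple.
enum class Kind { Annotate, Copy, Cond };

// Hypothetical helper: scan backwards from the end of a block, skip the
// trailing condition, and count the annotations found directly before it.
// The scan stops at the first statement that is not an annotation.
int annotations_found(const std::vector<Kind>& block)
{
    if (block.empty() || block.back() != Kind::Cond)
        return 0;
    int found = 0;
    for (auto it = block.rbegin() + 1; it != block.rend(); ++it) {
        if (*it != Kind::Annotate)
            break;
        ++found;
    }
    return found;
}

// Usage: the two exiting-block shapes discussed in this comment.
void check_blocks()
{
    // D.4524 = .ANNOTATE (...); retval.0 = D.4524; if (retval.0 != 0)
    const std::vector<Kind> with_lambda{Kind::Annotate, Kind::Copy, Kind::Cond};
    // D.4460 = .ANNOTATE (...); if (D.4460 != 0)
    const std::vector<Kind> no_lambda{Kind::Annotate, Kind::Cond};

    assert(annotations_found(with_lambda) == 0);  // annotation is missed
    assert(annotations_found(no_lambda) == 1);    // annotation is found
}
```

Calling check_blocks() from a main function exercises both shapes.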

Consider the simpler testcase without the lambda:

template<typename Iter>
inline Iter
find_nolambda(Iter first, Iter last)
{
#pragma GCC unroll 4
    while (first != last && *first != 42)
        ++first;
    return first;
}

short *use_nolambda(short *p)
{
    return find_nolambda (p, p + 1024);
}

for this testcase, we don't get the warning, and indeed the exiting block for
this loop is just:

<bb 8> :
D.4460 = .ANNOTATE (iftmp.0, 1, 4);
if (D.4460 != 0)
  goto <bb 3>; [INV]
else
  goto <bb 9>; [INV]

i.e. the .ANNOTATE comes immediately before the gcond.  To see what is really
going on we can look at -fdump-tree-original.  For the problematic testcase we
have:

if (<<cleanup_point ANNOTATE_EXPR <first != last &&
!use_find(short int*)::<lambda(short int)>::operator() (&pred, *first),
unroll 4>>>) goto <D.4518>; else goto <D.4516>;

and the simpler testcase without the lambda has:

if (ANNOTATE_EXPR <first != last && *first != 42, unroll 4>) goto <D.4457>;
else goto <D.4455>;

so I think the problem is the CLEANUP_POINT_EXPR wrapping the ANNOTATE_EXPR in
the lambda case.  The following fixes that:

diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc
index a9abf32e01f..b2c29fbb028 100644
--- a/gcc/cp/semantics.cc
+++ b/gcc/cp/semantics.cc
@@ -966,6 +966,16 @@ maybe_convert_cond (tree cond)
   if (type_dependent_expression_p (cond))
     return cond;

+  /* If the condition has an ANNOTATE_EXPR, that must remain the outermost
+     expression of the condition.  Strip it off and re-apply it after the
+     conversion to maintain this invariant.  */
+  tree annotate = NULL_TREE;
+  if (TREE_CODE (cond) == ANNOTATE_EXPR)
+    {
+      annotate = cond;
+      cond = TREE_OPERAND (cond, 0);
+    }
+
   /* For structured binding used in condition, the conversion needs to be
      evaluated before the individual variables are initialized in the
      std::tuple_{size,elemenet} case.  cp_finish_decomp saved the conversion
@@ -983,7 +993,15 @@ maybe_convert_cond (tree cond)

   /* Do the conversion.  */
   cond = convert_from_reference (cond);
-  return condition_conversion (cond);
+  cond = condition_conversion (cond);
+
+  /* Restore the ANNOTATE_EXPR, if there was one.  */
+  if (annotate)
+    {
+      TREE_OPERAND (annotate, 0) = cond;
+      cond = annotate;
+    }
+  return cond;
 }

 /* Finish an expression-statement, whose EXPRESSION is as indicated.  */

where the CLEANUP_POINT_EXPR was getting added in condition_conversion.
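
The strip-and-reapply idea in the patch can be sketched on a toy expression
tree.  This is not GCC's tree API: Node, make, convert and
convert_keeping_annotation are hypothetical names, and convert merely stands in
for a conversion that wraps its argument in a new outer node.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy expression node; the code strings mirror names seen in the dumps,
// but this is not GCC's tree representation.
struct Node {
    std::string code;               // e.g. "ANNOTATE_EXPR"
    std::unique_ptr<Node> operand;  // single wrapped operand, may be null
};

std::unique_ptr<Node> make(std::string code,
                           std::unique_ptr<Node> operand = nullptr)
{
    auto n = std::make_unique<Node>();
    n->code = std::move(code);
    n->operand = std::move(operand);
    return n;
}

// Stand-in for a conversion that wraps its argument in a new outer node.
std::unique_ptr<Node> convert(std::unique_ptr<Node> cond)
{
    return make("CLEANUP_POINT_EXPR", std::move(cond));
}

// Strip an outermost annotation, convert the inner condition, then put
// the annotation back so that it stays outermost.
std::unique_ptr<Node> convert_keeping_annotation(std::unique_ptr<Node> cond)
{
    std::unique_ptr<Node> annotate;
    if (cond->code == "ANNOTATE_EXPR") {
        annotate = std::move(cond);
        cond = std::move(annotate->operand);
    }
    cond = convert(std::move(cond));
    if (annotate) {
        annotate->operand = std::move(cond);
        cond = std::move(annotate);
    }
    return cond;
}
```

In this model an annotated condition comes out as ANNOTATE_EXPR wrapping
CLEANUP_POINT_EXPR, rather than the other way round.
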
That passes bootstrap on aarch64.  With that patch, adding:

#pragma GCC unroll 4

above the __find_if loop in stl_algobase.h, we get unrolled std::find
again.  E.g. for the following testcase I get:

#include <algorithm>
long *f(long *p)
{
  return std::find (p, p + 1024, 42);
}

_Z1fPl:
.LFB675:
        .cfi_startproc
        mov     x1, x0
        add     x0, x0, 8192
        .p2align 5,,15
.L3:
        ldr     x2, [x1]
        cmp     x2, 42
        beq     .L4
        ldr     x2, [x1, 8]
        add     x1, x1, 8
        mov     x3, x1
        cmp     x2, 42
        beq     .L4
        ldr     x2, [x1, 8]!
        cmp     x2, 42
        beq     .L4
        ldr     x2, [x3, 16]
        add     x1, x3, 16
        cmp     x2, 42
        beq     .L4
        add     x1, x3, 24
        cmp     x0, x1
        bne     .L3
        ret

at -O2.  But importantly this version should still be vectorizable
further down the line (unlike the hand-unrolled version).
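
For comparison, here is a sketch of the two loop shapes being contrasted.
Both functions are hypothetical helpers written for illustration, not the
libstdc++ source: find_hand_unrolled assumes a four-way manual unrolling with a
scalar remainder loop, and find_pragma keeps the single rolled loop and leaves
the unrolling to the compiler.  Compilers that do not know the pragma will
simply ignore it (possibly with a warning).

```cpp
#include <cassert>
#include <cstddef>

// Hand-unrolled by four (hypothetical helper, not the libstdc++ code).
inline const long* find_hand_unrolled(const long* first, const long* last,
                                      long value)
{
    for (std::ptrdiff_t trips = (last - first) / 4; trips > 0; --trips) {
        if (*first == value) return first;
        ++first;
        if (*first == value) return first;
        ++first;
        if (*first == value) return first;
        ++first;
        if (*first == value) return first;
        ++first;
    }
    // Fewer than four elements remain; finish one at a time.
    while (first != last && *first != value)
        ++first;
    return first;
}

// Single rolled loop; the unrolling is requested from the compiler.
inline const long* find_pragma(const long* first, const long* last,
                               long value)
{
#pragma GCC unroll 4
    while (first != last && *first != value)
        ++first;
    return first;
}
```

Both return the same iterator for any input; the difference is only in what
the optimizers are given to work with.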

Now for xalancbmk this seems to give back about 4.8% on Neoverse V1
_without_ LTO.  Unfortunately for some reason there is no difference in
the relevant hot function _with_ LTO, so that needs debugging (I'm
looking into that).
