[Bug libstdc++/110287] _M_check_len is expensive

2023-11-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287
Bug 110287 depends on bug 112653, which changed state.

Bug 112653 Summary: PTA should handle correctly escape information of values 
returned by a function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287
Bug 110287 depends on bug 112706, which changed state.

Bug 112706 Summary: missed simplification in FRE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112706

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287
Bug 110287 depends on bug 110377, which changed state.

Bug 110377 Summary: Early VRP and IPA-PROP should work out value ranges from 
__builtin_unreachable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110377

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-21 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #11 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:1d82fc2e6824bf83159389729c31a942f7b91b04

commit r14-5679-g1d82fc2e6824bf83159389729c31a942f7b91b04
Author: Jan Hubicka 
Date:   Tue Nov 21 15:17:16 2023 +0100

optimize std::vector::push_back

this patch speeds up the push_back at -O3 significantly by making the
reallocation to be inlined by default.  _M_realloc_insert is general
insertion that takes iterator pointing to location where the value
should be inserted.  As such it contains code to move other entries around
that is quite large.

Since appending to the end of array is common operation, I think we should
have specialized code for that.  Sadly it is really hard to work out this
from IPA passes, since we basically care whether the iterator points to
the same place as the end pointer, which are both passed by reference.
This is inter-procedural value numbering that is quite out of reach.

I also added extra check making it clear that the new length of the vector
is non-zero.  This saves extra conditionals.  Again it is quite hard case
since _M_check_len seem to be able to return 0 if its parameter is 0.
This never happens here, but we are not able to propagate this early nor
at IPA stage.

libstdc++-v3/ChangeLog:

PR libstdc++/110287
PR middle-end/109811
PR middle-end/109849
* include/bits/stl_vector.h (_M_realloc_append): New member
function.
(push_back): Use it.
* include/bits/vector.tcc: (emplace_back): Use it.
(_M_realloc_insert): Let compiler know that new vector size is
non-zero.
(_M_realloc_append): New member function.

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

Jan Hubicka  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2023-11-21

--- Comment #10 from Jan Hubicka  ---
We now produce reasonable code for _M_check_len and propagate value range of
the return value. This helps us to notice that later allocator call will not
throw exception on invalid size, so we are down from 3 throw calls to one.

Current code is:

size_type std::vector::_M_check_len (const struct vector * const this,
size_type __n, const char * __s)
{
  const size_type __len;
  long unsigned int _1;
  long unsigned int __n.3_2;
  size_type iftmp.4_3;
  long unsigned int _4;
  long unsigned int _7;
  long unsigned int _8;
  long int _9;
  long int _11;
  struct pair_t * _12;
  struct pair_t * _13;

   [local count: 1073741824]:
  _13 = this_6(D)->D.26060._M_impl.D.25361._M_finish;
  _12 = this_6(D)->D.26060._M_impl.D.25361._M_start;
  _11 = _13 - _12;
  _9 = _11 /[ex] 8;
  _7 = (long unsigned int) _9;
  _1 = 1152921504606846975 - _7;
  __n.3_2 = __n;
  if (_1 < __n.3_2)
goto ; [0.00%]
  else
goto ; [100.00%]

   [count: 0]:
  std::__throw_length_error (__s_14(D));

   [local count: 1073741824]:
  _8 = MAX_EXPR <__n.3_2, _7>;
  __len_10 = _7 + _8;
  if (_7 > __len_10)
goto ; [35.00%]
  else
goto ; [65.00%]

   [local count: 697932184]:
  _4 = MIN_EXPR <__len_10, 1152921504606846975>;

   [local count: 1073741824]:
  # iftmp.4_3 = PHI <1152921504606846975(4), _4(5)>
  return iftmp.4_3;

}
I still think we could play games with the 2^63 being too large for the
standard allocator and turn __throw_length_error into __builtin_unreachable for
that case. This would help early inliner to inline this function and save some
throw calls in real code.

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #9 from Jan Hubicka  ---
This is _M_realloc insert at release_ssa time:

eleased 63 names, 165.79%, removed 63 holes
void std::vector::_M_realloc_insert (struct vector *
const this, struct iterator __position, const struct pair_t & __args#0)
{ 
  struct pair_t * const __position;
  struct pair_t * __new_finish;
  struct pair_t * __old_finish;
  struct pair_t * __old_start;
  long unsigned int _1; 
  struct pair_t * _2;
  struct pair_t * _3;
  long int _4;
  long unsigned int _5;
  struct pair_t * _6;
  const size_type _10;
  long int _13;
  struct pair_t * iftmp.5_15;
  struct pair_t * _17; 
  struct _Vector_impl * _18;
  long unsigned int _22;
  long int _23;
  long unsigned int _24;
  long unsigned int _25;
  struct pair_t * _26;
  long unsigned int _36;

   [local count: 1073741824]:
  __position_27 = MEM[(struct __normal_iterator *)&__position];
  _10 = std::vector::_M_check_len (this_8(D), 1,
"vector::_M_realloc_insert");
  __old_start_11 = this_8(D)->D.25975._M_impl.D.25282._M_start;
  __old_finish_12 = this_8(D)->D.25975._M_impl.D.25282._M_finish;
  _13 = __position_27 - __old_start_11;
  if (_10 != 0)
goto ; [54.67%]
  else
goto ; [45.33%]

   [local count: 587014656]:
  _18 = [(struct _Vector_base *)this_8(D)]._M_impl;
  _17 = std::__new_allocator::allocate (_18, _10, 0B);

   [local count: 1073741824]:
  # iftmp.5_15 = PHI <0B(2), _17(3)>
  _1 = (long unsigned int) _13;
  _2 = iftmp.5_15 + _1;
  *_2 = *__args#0_14(D);
  if (_13 > 0)
goto ; [41.48%]
  else
goto ; [58.52%]

   [local count: 445388112]:
  __builtin_memmove (iftmp.5_15, __old_start_11, _1);

   [local count: 1073741824]:
  _36 = _1 + 8;
  __new_finish_16 = iftmp.5_15 + _36;
  _23 = __old_finish_12 - __position_27;
  if (_23 > 0)
goto ; [41.48%]
  else
goto ; [58.52%]

   [local count: 445388112]:
  _24 = (long unsigned int) _23;
  __builtin_memcpy (__new_finish_16, __position_27, _24);

   [local count: 1073741824]:
  _25 = (long unsigned int) _23;
  _26 = __new_finish_16 + _25;
  _3 = this_8(D)->D.25975._M_impl.D.25282._M_end_of_storage;
  _4 = _3 - __old_start_11;
  if (__old_start_11 != 0B)
goto ; [53.47%]
  else
goto ; [46.53%]

   [local count: 574129752]:
  _22 = (long unsigned int) _4;
  operator delete (__old_start_11, _22);

   [local count: 1073741824]:
  this_8(D)->D.25975._M_impl.D.25282._M_start = iftmp.5_15;
  this_8(D)->D.25975._M_impl.D.25282._M_finish = _26;
  _5 = _10 * 8;
  _6 = iftmp.5_15 + _5;
  this_8(D)->D.25975._M_impl.D.25282._M_end_of_storage = _6;
  return;

}

First it is not clear to me why we need memmove at all?

So first issue is:
   [local count: 1073741824]:
  __position_27 = MEM[(struct __normal_iterator *)&__position];
  _10 = std::vector::_M_check_len (this_8(D), 1,
"vector::_M_realloc_insert");
  __old_start_11 = this_8(D)->D.25975._M_impl.D.25282._M_start;
  __old_finish_12 = this_8(D)->D.25975._M_impl.D.25282._M_finish;
  _13 = __position_27 - __old_start_11;
  if (_10 != 0)
goto ; [54.67%]
  else
goto ; [45.33%]

Without inlining _M_check_len early we can not work out return value range,
since we need to know that paramter 2 is 1 and not 0.
Adding __builtin_unreachable check after helps to reduce

if (_10 != 0)

but I need to do something about inliner accounting the conditional to function
body size.

   [local count: 1073741824]:
  # iftmp.5_15 = PHI <0B(2), _17(3)>
  _1 = (long unsigned int) _13;
  _2 = iftmp.5_15 + _1;
  *_2 = *__args#0_14(D);
  if (_13 > 0)
goto ; [41.48%]
  else
goto ; [58.52%]

   [local count: 445388112]:
  __builtin_memmove (iftmp.5_15, __old_start_11, _1);

Is this code about inserting value to the middle?  Since push_back always
initializes iterator to point to the end, this seems quite sily to do.
Can't we do somehting like _M_realloc_append?

[Bug libstdc++/110287] _M_check_len is expensive

2023-11-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #8 from Jan Hubicka  ---
With return value range propagation
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637265.html
reduces --param max-inline-insns-auto needed for _M_realloc_insert to be
inlined on my testcase from 39 to 35.

This is done by eliminating two unnecesary trow calls by propagating fact that
check_len does not return incredibly large values.

Default inline limit at -O3 is 30, so we are not that far and I think we really
ought to solve this for next release since push_back is such a common case.

Is it known that check_len can not return 0 in this situation? Adding 
if (ret <= 0)
  __builtin_unreachable
saves another 2 instructions because _M_realloc_insert otherwise contain a code
path for case that vector gets increased to 0 elements.

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-23 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287
Bug 110287 depends on bug 110289, which changed state.

Bug 110289 Summary: Phiprop may be good idea in early opts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110289

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-19 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #7 from Jan Hubicka  ---
> 
> There is no guarantee that std::vector::max_size() is PTRDIFF_MAX. It
> depends on the Allocator type, A. A user-defined allocator could have
> max_size() == 100.

If inliner we see path to the throw functions, it will not determine
_M_check_len as early inlinable.
Perhaps we can __builtin_constant_p it as well and check that
max_size () * sizeof ()
is close to ptrdiff_max.  

Thanks for the comments on the patches.  I will try to update the patch.

I was wondering about the allocators. As shown in the mail, optimiznig
_M_check_len still leaves two independent throws for insanely large
ops.  Since allocator is user replaceable, I guess we can not add new
member function for safe_allocate or so.

We can use __builtin_unreachable to set the value range on the return
value.  For that to work during early optimizations we need 

 1) extend early VRP to retrofit the value determined by
__builtin_unreachable to the SSA name defned earlier based on fact
that the execution can not legally terminate in between
 2) teaching inliner to ignore conditionals guaring __builtin_unreacable
 3) add support for return functions to propagate the value range from
_M_check_len to _M_reallocate_insert.
so it is correctly propagated to allocator call.

This is not very easy, but can be generally useful elsewhere.

Re: [Bug libstdc++/110287] _M_check_len is expensive

2023-06-19 Thread Jan Hubicka via Gcc-bugs
> 
> There is no guarantee that std::vector::max_size() is PTRDIFF_MAX. It
> depends on the Allocator type, A. A user-defined allocator could have
> max_size() == 100.

If inliner we see path to the throw functions, it will not determine
_M_check_len as early inlinable.
Perhaps we can __builtin_constant_p it as well and check that
max_size () * sizeof ()
is close to ptrdiff_max.  

Thanks for the comments on the patches.  I will try to update the patch.

I was wondering about the allocators. As shown in the mail, optimiznig
_M_check_len still leaves two independent throws for insanely large
ops.  Since allocator is user replaceable, I guess we can not add new
member function for safe_allocate or so.

We can use __builtin_unreachable to set the value range on the return
value.  For that to work during early optimizations we need 

 1) extend early VRP to retrofit the value determined by
__builtin_unreachable to the SSA name defned earlier based on fact
that the execution can not legally terminate in between
 2) teaching inliner to ignore conditionals guaring __builtin_unreacable
 3) add support for return functions to propagate the value range from
_M_check_len to _M_reallocate_insert.
so it is correctly propagated to allocator call.

This is not very easy, but can be generally useful elsewhere.


[Bug libstdc++/110287] _M_check_len is expensive

2023-06-19 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #6 from Jonathan Wakely  ---
(In reply to Jan Hubicka from comment #5)
> > Do you mean something like this?
> I sent my own version, but yours looks nicer.
> > 
> > diff --git a/libstdc++-v3/include/bits/stl_vector.h
> > b/libstdc++-v3/include/bits/stl_vector.h
> > index 70ced3d101f..a4dbfeb8b5b 100644
> > --- a/libstdc++-v3/include/bits/stl_vector.h
> > +++ b/libstdc++-v3/include/bits/stl_vector.h
> > @@ -1902,6 +1902,21 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
> > return (__len < size() || __len > max_size()) ? max_size() : __len;
> >}
> > 
> > +  // Called by _M_insert_aux etc.
> > +  _GLIBCXX20_CONSTEXPR
> > +  size_type
> > +  _M_check_len_1(const char* __s) const
> > +  {
> > +   if (__builtin_constant_p(size()))
> Perhaps ruling this out for 32bit ports? Or can we assume that half of
> address space can never be allocated in single block?

No, I don't think we can assume that.

> > + {
> > +   if (size() == 0)
> > + return 1;
> > +   else if (size() < max_size() / 2)
> I think even this conditional is safe to be assumed to be true,
> since we can not allocate half of 64bit address space.  

There is no guarantee that std::vector::max_size() is PTRDIFF_MAX. It
depends on the Allocator type, A. A user-defined allocator could have
max_size() == 100.


> In general it is importnat to not fall through to _M_check_len.
> 
> As I noticed, we may want to use assume attribute to make clear that
> retval <= max_size ()
> to avoid other unnecesary throws downstream.
> 
> Honza

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-18 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #5 from Jan Hubicka  ---
> Do you mean something like this?
I sent my own version, but yours looks nicer.
> 
> diff --git a/libstdc++-v3/include/bits/stl_vector.h
> b/libstdc++-v3/include/bits/stl_vector.h
> index 70ced3d101f..a4dbfeb8b5b 100644
> --- a/libstdc++-v3/include/bits/stl_vector.h
> +++ b/libstdc++-v3/include/bits/stl_vector.h
> @@ -1902,6 +1902,21 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
> return (__len < size() || __len > max_size()) ? max_size() : __len;
>}
> 
> +  // Called by _M_insert_aux etc.
> +  _GLIBCXX20_CONSTEXPR
> +  size_type
> +  _M_check_len_1(const char* __s) const
> +  {
> +   if (__builtin_constant_p(size()))
Perhaps ruling this out for 32bit ports? Or can we assume that half of
address space can never be allocated in single block?
> + {
> +   if (size() == 0)
> + return 1;
> +   else if (size() < max_size() / 2)
I think even this conditional is safe to be assumed to be true,
since we can not allocate half of 64bit address space.  
In general it is importnat to not fall through to _M_check_len.

As I noticed, we may want to use assume attribute to make clear that
retval <= max_size ()
to avoid other unnecesary throws downstream.

Honza

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-17 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #4 from Jonathan Wakely  ---
(In reply to Jonathan Wakely from comment #3)
> @@ -946,7 +945,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
>else
> {
>   const size_type __len =
> -   _M_check_len(size_type(1), "vector::_M_insert_aux");
> +   _M_check_len_1("vector::_M_insert_aux");
>   _Bit_pointer __q = this->_M_allocate(__len);
>   iterator __start(std::__addressof(*__q), 0);
>   iterator __i = _M_copy_aligned(begin(), __position, __start);

Oops, this hunk is for vector which is different. Ignore this part.

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-17 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #3 from Jonathan Wakely  ---
Do you mean something like this?

diff --git a/libstdc++-v3/include/bits/stl_vector.h
b/libstdc++-v3/include/bits/stl_vector.h
index 70ced3d101f..a4dbfeb8b5b 100644
--- a/libstdc++-v3/include/bits/stl_vector.h
+++ b/libstdc++-v3/include/bits/stl_vector.h
@@ -1902,6 +1902,21 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
return (__len < size() || __len > max_size()) ? max_size() : __len;
   }

+  // Called by _M_insert_aux etc.
+  _GLIBCXX20_CONSTEXPR
+  size_type
+  _M_check_len_1(const char* __s) const
+  {
+   if (__builtin_constant_p(size()))
+ {
+   if (size() == 0)
+ return 1;
+   else if (size() < max_size() / 2)
+ return size() * 2;
+ }
+   return _M_check_len(1, __s);
+  }
+
   // Called by constructors to check initial size.
   static _GLIBCXX20_CONSTEXPR size_type
   _S_check_init_len(size_type __n, const allocator_type& __a)
diff --git a/libstdc++-v3/include/bits/vector.tcc
b/libstdc++-v3/include/bits/vector.tcc
index acd11e2dc68..1f0b3123c3b 100644
--- a/libstdc++-v3/include/bits/vector.tcc
+++ b/libstdc++-v3/include/bits/vector.tcc
@@ -458,8 +458,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
 _M_realloc_insert(iterator __position, const _Tp& __x)
 #endif
 {
-  const size_type __len =
-   _M_check_len(size_type(1), "vector::_M_realloc_insert");
+  const size_type __len = _M_check_len_1("vector::_M_realloc_insert");
   pointer __old_start = this->_M_impl._M_start;
   pointer __old_finish = this->_M_impl._M_finish;
   const size_type __elems_before = __position - begin();
@@ -946,7 +945,7 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
   else
{
  const size_type __len =
-   _M_check_len(size_type(1), "vector::_M_insert_aux");
+   _M_check_len_1("vector::_M_insert_aux");
  _Bit_pointer __q = this->_M_allocate(__len);
  iterator __start(std::__addressof(*__q), 0);
  iterator __i = _M_copy_aligned(begin(), __position, __start);

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #2 from Jan Hubicka  ---
With patch in PR110289 to optimize the std::max int MAX_EXPR and the throw
commented out I get:

size_type std::vector >::_M_check_len
(const struct vector * const this, size_type __n, const char * __s)
{
  const size_type __len;
  size_type iftmp.2_1;
  long unsigned int _2;
  long unsigned int _5;
  long unsigned int _7;
  struct pair * _8;
  struct pair * _9;
  long int _10;
  long int _11;
  long unsigned int _12;

   [local count: 1073741824]:
  _8 = this_4(D)->D.26707._M_impl.D.26014._M_finish;
  _9 = this_4(D)->D.26707._M_impl.D.26014._M_start;
  _10 = _8 - _9;
  _11 = _10 /[ex] 8;
  _12 = (long unsigned int) _11;
  _7 = __n;
  _5 = MAX_EXPR <_7, _12>;
  __len_6 = _5 + _12;
  if (__len_6 < _12)
goto ; [35.00%]
  else
goto ; [65.00%]

   [local count: 697932185]:
  _2 = MIN_EXPR <__len_6, 1152921504606846975>;

   [local count: 1073741824]:
  # iftmp.2_1 = PHI <1152921504606846975(2), _2(3)>
  return iftmp.2_1;

}

which fits early inlining limits. So modifying libstdc++ to avoid the throw
would be great.

Again optimized _M_check_len could avoid:
   [local count: 697932185]:
  _2 = MIN_EXPR <__len_6, 1152921504606846975>;

   [local count: 1073741824]:
  # iftmp.2_1 = PHI <1152921504606846975(2), _2(3)>
  return iftmp.2_1;

since __len can not get that large without running out of memory. Which would
help further inlining.

Testcase from PR109849 now runs in 0.39s while without these two changes it
runs in 0.528s, clang's build runs in 0.057s, so still a long way to go.

[Bug libstdc++/110287] _M_check_len is expensive

2023-06-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287

--- Comment #1 from Jan Hubicka  ---
Another problem is:

  D.27747 = _8;
  if (__n.3_2 > _8)
goto ; [34.00%]
  else
goto ; [66.00%]

   [local count: 364926196]:

   [local count: 1073312330]:
  # _18 = PHI <(4), &__n(5)>
  _3 = *_18;

which is simply computation of maximum but done in a way not understood until
phiprop