[Bug ipa/67051] symtab_node::equal_address_to too conservative?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67051 --- Comment #2 from Jan Hubicka --- I believe there was some discussion on this in the past. I would be quite happy to change the predicate to be more aggressive. The current code basically duplicates what the original fold-const.c did. One problem is that we have no way to declare in a header that one symbol is an alias of another while being defined in a different translation unit.

jan@localhost:/tmp> cat t.c
extern int a;
extern int b __attribute ((alias("a")));
jan@localhost:/tmp> gcc t.c
t.c:2:12: error: ‘b’ aliased to undefined symbol ‘a’
    2 | extern int b __attribute ((alias("a")));
      |            ^
jan@localhost:/tmp> clang t.c
t.c:2:28: error: alias must point to a defined variable or function
    2 | extern int b __attribute ((alias("a")));
      |                            ^
t.c:2:28: note: the function or variable specified in an alias must refer to its mangled name
1 error generated.

So if one wants to use aliases intentionally (to do something smart about superposing), basically the only valid testcases would be ones where translation units never use both names together. Also, folding is done early, when the alias may not be declared yet, but that can be solved by a check for the symtab state.
[Bug middle-end/115277] [13/14/15 regression] ICF needs to match loop bound estimates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115277 Jan Hubicka changed: What|Removed |Added Summary|ICF needs to match loop |[13/14/15 regression] ICF |bound estimates |needs to match loop bound ||estimates --- Comment #1 from Jan Hubicka --- Reproduces on 14 and trunk. GCC 12 is not able to determine the loop bound during early optimizations.
[Bug middle-end/115277] New: ICF needs to match loop bound estimates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115277 Bug ID: 115277 Summary: ICF needs to match loop bound estimates Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: ---

jan@localhost:/tmp> cat tt.c
int array[1000];
void test (int a)
{
  if (__builtin_expect (a > 3, 1))
    return;
  for (int i = 0; i < a; i++)
    array[i]=i;
}
void test2 (int a)
{
  if (__builtin_expect (a > 10, 1))
    return;
  for (int i = 0; i < a; i++)
    array[i]=i;
}
int main()
{
  test(1);
  test(2);
  test(3);
  test2(10);
  if (array[9] != 9)
    __builtin_abort ();
  return 0;
}
jan@localhost:/tmp> gcc -O2 tt.c ; ./a.out
jan@localhost:/tmp> gcc -O3 tt.c ; ./a.out
Aborted (core dumped)

The problem here is that we do not match value ranges and thus we can end up with different estimates on the number of iterations.
[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787 Jan Hubicka changed: What|Removed |Added Summary|[12/13/14/15 Regression]|[12/13/14 Regression] Wrong |Wrong code at -O with |code at -O with ipa-modref |ipa-modref on aarch64 |on aarch64 --- Comment #22 from Jan Hubicka --- Fixed on trunk so far
[Bug libstdc++/109442] Dead local copy of std::vector not removed from function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109442 --- Comment #19 from Jan Hubicka --- Note that the testcase from PR115037 also shows that we are not able to optimize out dead stores to the vector, which is another quite noticeable problem.

void test()
{
  std::vector<int> test;
  test.push_back (1);
}

We allocate the block, store 1 and immediately delete it.

void test ()
{
  int * test$D25839$_M_impl$D25146$_M_start;
  struct vector test;
  int * _61;

  [local count: 1073741824]:
  _61 = operator new (4);

  [local count: 1063439392]:
  *_61 = 1;
  operator delete (_61, 4);
  test ={v} {CLOBBER};
  test ={v} {CLOBBER(eol)};
  return;

  [count: 0]:
  test ={v} {CLOBBER};
  resx 2
}

So my understanding is that we decided not to optimize away the dead stores since the particular operator delete does not pass this test:

  /* If the call is to a replaceable operator delete and results
     from a delete expression as opposed to a direct call to
     such operator, then we can treat it as free.  */
  if (fndecl
      && DECL_IS_OPERATOR_DELETE_P (fndecl)
      && DECL_IS_REPLACEABLE_OPERATOR (fndecl)
      && gimple_call_from_new_or_delete (stmt))
    return ". o ";

This is because we believe that operator delete may be implemented in an insane way that inspects the values stored in the block being freed. I can sort of see that one can write standard-conforming code that allocates some POD data and inspects it in the destructor. However, for std::vector this argument is not really applicable. The standard does specify that new/delete is used to allocate/deallocate the memory, but it does not say how the memory is organized or what happens before deallocation (i.e. it is probably valid for std::vector to memset the block just before deallocating it). A similar argument can IMO be used for eliding unused memory allocations. It is kind of up to the std::vector implementation how many allocations/deallocations it does, right? So we need a way to annotate the new/delete calls in the standard library as safe for such optimizations (i.e. implement clang's __builtin_operator_new/delete?). How does clang manage to optimize this out without additional hinting?
[Bug middle-end/115037] Unused std::vector is not optimized away.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115037 Jan Hubicka changed: What|Removed |Added CC||jason at redhat dot com, ||jwakely at redhat dot com --- Comment #2 from Jan Hubicka --- I tried to look for duplicates, but did not find one. However, I think the first problem is that we do not optimize away the store of 1 to the vector while clang does. I think this is because we do not believe we can trust that the delete operator is safe? We get:

void test ()
{
  int * test$D25839$_M_impl$D25146$_M_start;
  struct vector test;
  int * _61;

  [local count: 1073741824]:
  _61 = operator new (4);

  [local count: 1063439392]:
  *_61 = 1;
  operator delete (_61, 4);
  test ={v} {CLOBBER};
  test ={v} {CLOBBER(eol)};
  return;

  [count: 0]:
  test ={v} {CLOBBER};
  resx 2
}

If we cannot trust that operator delete is well behaved, perhaps we can arrange an explicit clobber before calling it? I think it is up to std::vector to decide what it does with the stored array, so in this case even an insane operator delete has no right to expect that the data in the vector will be sane :)
[Bug middle-end/115037] New: Unused std::vector is not optimized away.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115037 Bug ID: 115037 Summary: Unused std::vector is not optimized away. Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Compiling

#include <vector>
void test()
{
  std::vector<int> test;
  test.push_back (1);
}

leads to

_Z4testv:
.LFB1253:
	.cfi_startproc
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movl	$4, %edi
	call	_Znwm
	movl	$4, %esi
	movl	$1, (%rax)
	movq	%rax, %rdi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	_ZdlPvm

while clang optimizes to:

_Z4testv:                               # @_Z4testv
	.cfi_startproc
# %bb.0:
	retq
[Bug middle-end/115036] New: division is not shortened based on value range
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115036 Bug ID: 115036 Summary: division is not shortened based on value range Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- For

long test(long a, long b)
{
  if (a > 65535 || a < 0)
    __builtin_unreachable ();
  if (b > 65535 || b < 0)
    __builtin_unreachable ();
  return a/b;
}

we produce

test:
.LFB0:
	.cfi_startproc
	movq	%rdi, %rax
	cqto
	idivq	%rsi
	ret

while clang does:

test:                                   # @test
	.cfi_startproc
# %bb.0:
	movq	%rdi, %rax
                                        # kill: def $ax killed $ax killed $rax
	xorl	%edx, %edx
	divw	%si
	movzwl	%ax, %eax
	retq

clang also by default adds a 32-bit divide path even when the value range is not known:

long test(long a, long b)
{
  return a/b;
}

compiles as

test:                                   # @test
	.cfi_startproc
# %bb.0:
	movq	%rdi, %rax
	movq	%rdi, %rcx
	orq	%rsi, %rcx
	shrq	$32, %rcx
	je	.LBB0_1
# %bb.2:
	cqto
	idivq	%rsi
	retq
[Bug ipa/114985] [15 regression] internal compiler error: in discriminator_fail during stage2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114985 --- Comment #14 from Jan Hubicka --- So this is a problem in ipa_value_range_from_jfunc? It is Martin's code; I hope he will know why the types are wrong here. One can get type compatibility problems on mismatched declarations and LTO, but it seems that this testcase is single-file. So indeed this looks like a bug either in jump function construction or even earlier...
[Bug middle-end/114852] New: jpegxl 10.0.1 is faster with clang18 than with gcc14
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114852 Bug ID: 114852 Summary: jpegxl 10.0.1 is faster with clang18 than with gcc14 Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3 reports about an 8% difference. I can measure 13% on zen3. The code has changed and is no longer bound by push_back but runs the AVX2 version of the inner loops. The hottest loops look comparable. GCC:

 0.00 │266:┌─→vmovaps      (%r14,%rax,4),%ymm0
 0.11 │   │  vmulps       (%rcx,%rax,4),%ymm7,%ymm2
 1.18 │   │  vfnmadd213ps (%rsi,%rax,4),%ymm11,%ymm0
 0.25 │   │  vmulps       %ymm2,%ymm0,%ymm0
 5.94 │   │  vroundps     $0x8,%ymm0,%ymm2
 0.35 │   │  vsubps       %ymm2,%ymm0,%ymm0
 1.05 │   │  vmulps       (%rdx,%rax,4),%ymm0,%ymm0
 3.19 │   │  vmovaps      %ymm0,0x0(%r13,%rax,4)
 0.15 │   │  vandps       %ymm10,%ymm2,%ymm0
 0.03 │   │  add          $0x8,%rax
 0.03 │   │  vcmpeqps     %ymm8,%ymm0,%ymm2
 0.09 │   │  vsqrtps      %ymm0,%ymm0
27.25 │   │  vaddps       %ymm0,%ymm6,%ymm6
 0.35 │   │  vandnps      %ymm9,%ymm2,%ymm0
 0.12 │   │  vaddps       %ymm0,%ymm5,%ymm5
 0.05 │   ├──cmp          %r12,%rax
 0.02 │   └──jb           266

and clang:

 0.00 │ c90:┌─→vmulps       (%r9,%rdx,4),%ymm0,%ymm2
 0.97 │     │  vmovaps      (%r15,%rdx,4),%ymm1
 0.36 │     │  vsubps       %ymm2,%ymm1,%ymm1
 4.24 │     │  vmulps       (%rcx,%rdx,4),%ymm4,%ymm2
 1.92 │     │  vmulps       %ymm2,%ymm1,%ymm1
 0.65 │     │  vroundps     $0x8,%ymm1,%ymm2
 0.06 │     │  vsubps       %ymm2,%ymm1,%ymm1
 1.11 │     │  vmulps       (%rax,%rdx,4),%ymm1,%ymm1
 3.53 │     │  vmovaps      %ymm1,(%rsi,%rdx,4)
 0.68 │     │  vandps       %ymm6,%ymm2,%ymm1
 0.23 │     │  vcmpneqps    %ymm5,%ymm2,%ymm2
 3.64 │     │  add          $0x8,%rdx
 0.24 │     │  vsqrtps      %ymm1,%ymm1
22.16 │     │  vaddps       %ymm1,%ymm8,%ymm8
 0.25 │     │  vbroadcastss 0x31eba5(%rip),%ymm1        # 34f840
 0.05 │     │  vandps       %ymm1,%ymm2,%ymm1
 0.04 │     │  vaddps       %ymm1,%ymm7,%ymm7
 0.11 │     ├──cmp          %rdi,%rdx
 0.07 │     └──jb           c90

GCC profile:

10.78% cjxl libjxl.so.0.10.1 [.] jxl::N_AVX2::EstimateEntropy(jxl::AcStrategy const&, float, unsigned long, unsigned long, jxl::ACSConfig const&, float con
 7.02% cjxl libjxl.so.0.10.1 [.] jxl::N_AVX2::FindBestMultiplier(float const*, float const*, unsigned long, float, float, bool) [clone .part.0]
 4.50% cjxl libjxl.so.0.10.1 [.] void jxl::N_AVX2::Symmetric5Row(jxl::Plane const&, jxl::RectT const&, long, jxl:
 4.47% cjxl libjxl.so.0.10.1 [.] jxl::N_AVX2::(anonymous namespace)::TransformFromPixels(jxl::AcStrategy::Type, float const*, unsigned long, float*, float*
 4.31% cjxl libjxl.so.0.10.1 [.] jxl::N_AVX2::(anonymous namespace)::TransformToPixels(jxl::AcStrategy::Type, float*, float*, unsigned long, float*)
 4.00% cjxl libjxl.so.0.10.1 [.] jxl::ThreadPool::RunCallState const&, int const* restrict*, jxl::AcStra
 3.56% cjxl libm.so.6        [.] __ieee754_pow_fma
 3.49% cjxl libjxl.so.0.10.1 [.] jxl::N_AVX2::(anonymous namespace)::IDCT1DImpl<8ul, 8ul>::operator()(float const*, unsigned long, float*, unsigned long, f
 3.43% cjxl libjxl.so.0.10.1 [.] jxl::N_AVX2::(anonymous namespace)::AdaptiveQuantizationImpl::ComputeTile(float, float, jxl::Image3 const&, jxl::Re
 3.27% cjxl libjxl.so.0.10.1 [.] void jxl::N_AVX2::(anonymous namespace)::DCT1DWrapper<32ul, 0ul, jxl::N_AVX2::(anonymous namespace)::DCTFrom, jxl::N_AVX2:
 3.16% cjxl libjxl.so.0.10.1 [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<8ul, 8ul>::operator()(float*, float*) [clone .isra.0]
 2.87% cjxl libjxl.so.0.10.1 [.] void jxl::N_AVX2::(anonymous namespace)::ComputeScaledIDCT<4ul, 8ul>::operator()::operator()::operator() const&, jxl::RectT const&, jxl::DequantMatrices const&, jxl::AcStrategyImage const*, jxl::Plane const*, jxl::Quantizer const*, jxl::Rect
 5.03% cjxl libjxl.so.0.10.1 [.] jxl::ThreadPool::RunCallState const&, jxl::RectT const&, jxl::WeightsSymmetric5 const&, jxl::ThreadPool*, jxl::Pla
 4.66% cjxl libjxl.so.0.10.1 [.] jxl::N_AVX2::(anonymous namespace)::DCT1DImpl<16ul, 8ul>::operator()(float*, float*)
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 --- Comment #9 from Jan Hubicka --- Phoronix still claims the difference https://www.phoronix.com/review/gcc14-clang18-amd-zen4/2
[Bug target/113236] WebP benchmark is 20% slower vs. Clang on AMD Zen 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113236 --- Comment #3 from Jan Hubicka --- It seems this performance difference is still there on zen4: https://www.phoronix.com/review/gcc14-clang18-amd-zen4/3
[Bug tree-optimization/114787] [13 Regression] wrong code at -O1 on x86_64-linux-gnu (the generated code hangs)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114787 --- Comment #18 from Jan Hubicka --- predict.cc queries the number of iterations using number_of_iterations_exit and loop_niter_by_eval, and finally using estimated_stmt_executions. The first two queries do not update the upper-bounds data structure, which is why we get away without computing them in some cases. I guess we can just drop the dumping here. We now dump the recorded estimates elsewhere, so this is somewhat redundant.
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #13 from Jan Hubicka --- Thanks a lot, looks great! Do we still auto-detect memmove when the copy constructor turns out to be memcpy equivalent after optimization?
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #9 from Jan Hubicka --- Your patch gives me an error compiling the testcase:

jh@ryzen3:/tmp> ~/trunk-install/bin/g++ -O3 ~/t.C
In file included from /home/jh/trunk-install/include/c++/14.0.1/vector:65,
                 from /home/jh/t.C:1:
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h: In instantiation of ‘_ForwardIterator std::__relocate_a(_InputIterator, _InputIterator, _ForwardIterator, _Allocator&) [with _InputIterator = const pair*; _ForwardIterator = pair*; _Allocator = allocator >; _Traits = allocator_traits > >]’:
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1127:31: required from ‘_Tp* std::__relocate_a(_Tp*, _Tp*, _Tp*, allocator<_T2>&) [with _Tp = pair; _Up = pair]’
 1127 |       return std::__relocate_a(__cfirst, __clast, __result, __alloc);
      |              ~^~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:509:26: required from ‘static std::vector<_Tp, _Alloc>::pointer std::vector<_Tp, _Alloc>::_S_relocate(pointer, pointer, pointer, _Tp_alloc_type&) [with _Tp = std::pair; _Alloc = std::allocator >; pointer = std::pair*; _Tp_alloc_type = std::vector >::_Tp_alloc_type]’
  509 |         return std::__relocate_a(__first, __last, __result, __alloc);
      |                ~^~~~~
/home/jh/trunk-install/include/c++/14.0.1/bits/vector.tcc:647:32: required from ‘void std::vector<_Tp, _Alloc>::_M_realloc_append(_Args&& ...) [with _Args = {const std::pair&}; _Tp = std::pair; _Alloc = std::allocator >]’
  647 |           __new_finish = _S_relocate(__old_start, __old_finish,
      |                          ~~~^~~
  648 |                                      __new_start, _M_get_Tp_allocator());
      |                                      ~~~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_vector.h:1294:21: required from ‘void std::vector<_Tp, _Alloc>::push_back(const value_type&) [with _Tp = std::pair; _Alloc = std::allocator >; value_type = std::pair]’
 1294 |           _M_realloc_append(__x);
      |           ~^
/home/jh/t.C:8:25: required from here
    8 |     stack.push_back (pair);
      |                         ^~
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56: error: use of deleted function ‘const _Tp* std::addressof(const _Tp&&) [with _Tp = pair]’
 1084 |             std::addressof(std::move(*__first
      |             ~~^
In file included from /home/jh/trunk-install/include/c++/14.0.1/bits/stl_pair.h:61,
                 from /home/jh/trunk-install/include/c++/14.0.1/bits/stl_algobase.h:64,
                 from /home/jh/trunk-install/include/c++/14.0.1/vector:62:
/home/jh/trunk-install/include/c++/14.0.1/bits/move.h:168:16: note: declared here
  168 |     const _Tp* addressof(const _Tp&&) = delete;
      |                ^
/home/jh/trunk-install/include/c++/14.0.1/bits/stl_uninitialized.h:1084:56: note: use ‘-fdiagnostics-all-candidates’ to display considered candidates
 1084 |             std::addressof(std::move(*__first
      |             ~~^

It is easy to check if the conversion happens - just compile it and see if there is a memcpy or memmove in the optimized dump file (or final assembly).
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #8 from Jan Hubicka --- I had a wrong noexcept specifier. This version works, but I still need to inline relocate_object_a into the loop.

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..f02d4fb878f 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1100,8 +1100,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  "relocation is only possible for values of the same type");
       _ForwardIterator __cur = __result;
       for (; __first != __last; ++__first, (void)++__cur)
-	std::__relocate_object_a(std::__addressof(*__cur),
-				 std::__addressof(*__first), __alloc);
+	{
+	  typedef std::allocator_traits<_Allocator> __traits;
+	  __traits::construct(__alloc, std::__addressof(*__cur), std::move(*std::__addressof(*__first)));
+	  __traits::destroy(__alloc, std::__addressof(*std::__addressof(*__first)));
+	}
       return __cur;
     }

@@ -1109,8 +1112,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template
     _GLIBCXX20_CONSTEXPR
     inline __enable_if_t::value, _Tp*>
-    __relocate_a_1(_Tp* __first, _Tp* __last,
-		   _Tp* __result,
+    __relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+		   _Tp* __restrict __result,
 		   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
     {
       ptrdiff_t __count = __last - __first;
@@ -1147,6 +1150,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 				  std::__niter_base(__result), __alloc);
     }

+  template
+    _GLIBCXX20_CONSTEXPR
+    inline _Tp*
+    __relocate_a(_Tp* __restrict __first, _Tp* __last,
+		 _Tp* __restrict __result,
+		 allocator<_Up>& __alloc)
+    noexcept(noexcept(__relocate_a_1(__first, __last, __result, __alloc)))
+    {
+      return std::__relocate_a_1(__first, __last, __result, __alloc);
+    }
+
   /// @endcond
 #endif // C++11
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #6 from Jan Hubicka --- Thanks. I thought __relocate_a only cares about whether the pointed-to type can be bitwise copied. It would be nice to produce memcpy early from libstdc++ for std::pair, so the second patch makes sense to me (I did not test whether it works). I think it would still be nice to tell GCC that the copy loop never gets overlapping memory locations, so the cases which are not optimized early to memcpy can still be optimized later (or vectorized if the loop really does something non-trivial). So I tried your second patch, fixed so it compiles:

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..0d2e588ae5e 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1109,8 +1109,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template
     _GLIBCXX20_CONSTEXPR
     inline __enable_if_t::value, _Tp*>
-    __relocate_a_1(_Tp* __first, _Tp* __last,
-		   _Tp* __result,
+    __relocate_a_1(_Tp* __restrict __first, _Tp* __last,
+		   _Tp* __restrict __result,
 		   [[__maybe_unused__]] allocator<_Up>& __alloc) noexcept
     {
       ptrdiff_t __count = __last - __first;
@@ -1147,6 +1147,17 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 				  std::__niter_base(__result), __alloc);
     }

+  template
+    _GLIBCXX20_CONSTEXPR
+    inline _Tp*
+    __relocate_a(_Tp* __restrict __first, _Tp* __last,
+		 _Tp* __restrict __result,
+		 allocator<_Up>& __alloc)
+    noexcept(std::__is_bitwise_relocatable<_Tp>::value)
+    {
+      return std::__relocate_a_1(__first, __last, __result, __alloc);
+    }
+
   /// @endcond
 #endif // C++11

It does not make ldist trigger, so the restrict info is still lost. I think the problem is that if you call relocate_object the restrict reduces scope, so we only know that the elements are pairwise disjoint, not that the vectors are. This is because restrict is interpreted early, pre-inlining, but that is really Richard's area. It seems that the patch makes us go through __uninitialized_copy_a instead of __uninit_copy. I am not even sure how these are different, so I need to stare at the code a bit more to make sense of it :)
[Bug middle-end/114822] New: ldist should produce memcpy/memset/memmove histograms based on loop information converted
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114822 Bug ID: 114822 Summary: ldist should produce memcpy/memset/memmove histograms based on loop information converted Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- When a loop is converted to a string builtin we lose information about its size. This means that we won't expand it inline when the block size is expected to be small. This causes a performance problem e.g. on std::vector and the testcase from PR114821, which at least with profile feedback runs significantly slower than the variant where memcpy is produced early.

#include <vector>
typedef unsigned int uint32_t;
int pair;
void test()
{
  std::vector<int> stack;
  stack.push_back (pair);
  while (!stack.empty())
    {
      int cur = stack.back();
      stack.pop_back();
      if (true)
        {
          cur++;
          stack.push_back (cur);
          stack.push_back (cur);
        }
      if (cur > 1)
        break;
    }
}
int main()
{
  for (int i = 0; i < 1; i++)
    test();
}
[Bug libstdc++/114821] _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 --- Comment #2 from Jan Hubicka --- What I am shooting for is to optimize it later in loop distribution. We can recognize a memcpy loop if we can figure out that the source and destination memory are different. We can help here with restrict, but I was a bit lost in how to get them done. This seems to do the trick, but for some reason I get memmove:

diff --git a/libstdc++-v3/include/bits/stl_uninitialized.h b/libstdc++-v3/include/bits/stl_uninitialized.h
index 7f84da31578..1a6223ea892 100644
--- a/libstdc++-v3/include/bits/stl_uninitialized.h
+++ b/libstdc++-v3/include/bits/stl_uninitialized.h
@@ -1130,7 +1130,58 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	}
       return __result + __count;
     }
+
+  template
+    _GLIBCXX20_CONSTEXPR
+    inline __enable_if_t::value, _Tp*>
+    __relocate_a(_Tp * __restrict __first, _Tp *__last,
+		 _Tp * __restrict __result, _Allocator& __alloc) noexcept
+    {
+      ptrdiff_t __count = __last - __first;
+      if (__count > 0)
+	{
+#ifdef __cpp_lib_is_constant_evaluated
+	  if (std::is_constant_evaluated())
+	    {
+	      for (; __first != __last; ++__first, (void)++__result)
+		{
+		  // manually inline relocate_object_a to not lose restrict qualifiers
+		  typedef std::allocator_traits<_Allocator> __traits;
+		  __traits::construct(__alloc, __result, std::move(*__first));
+		  __traits::destroy(__alloc, std::__addressof(*__first));
+		}
+	      return __result;
+	    }
+#endif
+	  __builtin_memcpy(__result, __first, __count * sizeof(_Tp));
+	}
+      return __result + __count;
+    }
+#endif
+
+  template
+    _GLIBCXX20_CONSTEXPR
+#if _GLIBCXX_HOSTED
+    inline __enable_if_t::value, _Tp*>
+#else
+    inline _Tp *
+#endif
+    __relocate_a(_Tp * __restrict __first, _Tp *__last,
+		 _Tp * __restrict __result, _Allocator& __alloc)
+    noexcept(noexcept(std::allocator_traits<_Allocator>::construct(__alloc,
+			 __result, std::move(*__first)))
+	     && noexcept(std::allocator_traits<_Allocator>::destroy(
+			    __alloc, std::__addressof(*__first))))
+    {
+      for (; __first != __last; ++__first, (void)++__result)
+	{
+	  // manually inline relocate_object_a to not lose restrict qualifiers
+	  typedef std::allocator_traits<_Allocator> __traits;
+	  __traits::construct(__alloc, __result, std::move(*__first));
+	  __traits::destroy(__alloc, std::__addressof(*__first));
+	}
+      return __result;
+    }

   template
[Bug libstdc++/114821] New: _M_realloc_append should use memcpy instead of loop to copy data when possible
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114821 Bug ID: 114821 Summary: _M_realloc_append should use memcpy instead of loop to copy data when possible Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- In the testcase

#include <vector>
typedef unsigned int uint32_t;
std::pair<bool, uint32_t> pair;
void test()
{
  std::vector<std::pair<bool, uint32_t>> stack;
  stack.push_back (pair);
  while (!stack.empty())
    {
      std::pair<bool, uint32_t> cur = stack.back();
      stack.pop_back();
      if (!cur.first)
        {
          cur.second++;
          stack.push_back (cur);
          stack.push_back (cur);
        }
      if (cur.second > 1)
        break;
    }
}
int main()
{
  for (int i = 0; i < 1; i++)
    test();
}

we produce _M_realloc_append which uses a loop to copy the data instead of memcpy. This is bigger and slower. The reason why __relocate_a does not use memcpy seems to be the fact that the pair has a copy constructor. The loop still can be pattern matched by ldist, but that fails with:

(compute_affine_dependence
  ref_a: *__first_1, stmt_a: *__cur_37 = *__first_1;
  ref_b: *__cur_37, stmt_b: *__cur_37 = *__first_1;
) -> dependence analysis failed

So we cannot disambiguate the old and new vector memory and prove that the loop is indeed a memcpy loop. I think this is valid since operator new is not required to return new memory, but I think adding __restrict should solve this. Problem is that I got lost on where to add them, since __relocate_a uses iterators instead of pointers.
[Bug tree-optimization/114787] [13/14 Regression] wrong code at -O1 on x86_64-linux-gnu (the generated code hangs)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114787 --- Comment #13 from Jan Hubicka --- -fdump-tree-all-all changing the generated code is also bad. We probably should avoid dumping loop bounds when they are not recorded. I added dumping of loop bounds and this may be an unexpected side effect. Will take a look.
[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008 --- Comment #8 from Jan Hubicka --- Note that the cold attribute is also quite strong since it turns on optimize_size codegen, which is often a lot slower. Reading the discussion again, I don't think we have a way to make the inline keyword ignored by the inliner. We could add a not_really_inline attribute (a better name would be welcome).
[Bug tree-optimization/114779] __builtin_constant_p does not work in inline functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114779 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #7 from Jan Hubicka --- Note that the test for side effects also makes it impossible to test for constantness of values passed to a function by reference, which could also be useful. A workaround is to load the value into a temporary so the side effect is not seen. So that early folding to 0 never made too much sense to me. I agree that it is a can of worms and it is not clear whether changing the behaviour would break things...
[Bug middle-end/114774] Missed DSE in simple code due to interleaving stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774 Jan Hubicka changed: What|Removed |Added Summary|Missed DSE in simple code |Missed DSE in simple code |due to other stores being |due to interleaving stores |conditional | --- Comment #1 from Jan Hubicka --- The other store being conditional is not the core issue. Here we miss DSE too:

#include <stdio.h>
int a;
short p,q;
void test (int b)
{
  a=1;
  if (b)
    p++;
  else
    q++;
  a=2;
}

The problem in DSE seems to be that instead of recursively walking the memory-SSA graph it insists that the graph forms a chain. Now SRA leaves stores to scalarized variables and even removes the corresponding clobbers, so this is a relatively common scenario in non-trivial C++ code.
[Bug middle-end/114774] New: Missed DSE in simple code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114774 Bug ID: 114774 Summary: Missed DSE in simple code Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- In the following

#include <stdio.h>
int a;
short *p;
void test (int b)
{
  a=1;
  if (b)
    {
      (*p)++;
      a=2;
      printf ("1\n");
    }
  else
    {
      (*p)++;
      a=3;
      printf ("2\n");
    }
}

we are not able to optimize out "a=1". This is a simplified real-world scenario where SRA does not remove the definition of SRAed variables. Note that clang does a conditional move here:

test:                                   # @test
	.cfi_startproc
# %bb.0:
	movq	p(%rip), %rax
	incw	(%rax)
	xorl	%eax, %eax
	testl	%edi, %edi
	leaq	.Lstr(%rip), %rcx
	leaq	.Lstr.2(%rip), %rdi
	cmoveq	%rcx, %rdi
	sete	%al
	orl	$2, %eax
	movl	%eax, a(%rip)
	jmp	puts@PLT                        # TAILCALL
[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596 --- Comment #19 from Jan Hubicka --- I looked into the remaining exit/nonexit rename discussed here earlier before the PR was closed. The following patch would restore the code to do the same calls as before my patch:

	PR tree-optimization/109596
	* tree-ssa-loop-ch.c (ch_base::copy_headers): Fix use of
	exit/nonexit edges.

diff --git a/gcc/tree-ssa-loop-ch.cc b/gcc/tree-ssa-loop-ch.cc
index b7ef485c4cc..cd5f6bc3c2a 100644
--- a/gcc/tree-ssa-loop-ch.cc
+++ b/gcc/tree-ssa-loop-ch.cc
@@ -952,13 +952,13 @@ ch_base::copy_headers (function *fun)
       if (!single_pred_p (nonexit->dest))
 	{
 	  header = split_edge (nonexit);
-	  exit = single_pred_edge (header);
+	  nonexit = single_pred_edge (header);
 	}

       edge entry = loop_preheader_edge (loop);

       propagate_threaded_block_debug_into (nonexit->dest, entry->dest);
-      if (!gimple_duplicate_seme_region (entry, exit, bbs, n_bbs, copied_bbs,
+      if (!gimple_duplicate_seme_region (entry, nonexit, bbs, n_bbs, copied_bbs,
 					 true))
 	{
 	  delete candidate.static_exits;

I however convinced myself this is a noop: both the exit and nonexit edges have the same source basic block. propagate_threaded_block_debug_into walks predecessors of its first parameter and moves debug statements to the second parameter, so it does the same job, since the split BB is empty. gimple_duplicate_seme_region uses the parameter to update the loop header, but it does not do that correctly for loop header copying and we re-do it in tree-ssa-loop-ch. Still, the code as it is now in trunk is very confusing, so perhaps we should update it?
[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208 --- Comment #28 from Jan Hubicka --- So the main problem is that in t2 we have

_ZN6vectorI12QualityValueEC1ERKS1_/7 (vector<_Tp>::vector(const vector<_Tp>&) [with _Tp = QualityValue])
  Type: function definition analyzed alias cpp_implicit_alias
  Visibility: semantic_interposition public weak comdat comdat_group:_ZN6vectorI12QualityValueEC5ERKS1_ one_only
  Same comdat group as: _ZN6vectorI12QualityValueEC2ERKS1_/6
  References: _ZN6vectorI12QualityValueEC2ERKS1_/6 (alias)
  Referring:
  Function flags:
  Called by: _Z41__static_initialization_and_destruction_0v/8 (can throw external)
  Calls:

and in t1 we have

_ZN6vectorI12QualityValueEC1ERKS1_/2 (constexpr vector<_Tp>::vector(const vector<_Tp>&) [with _Tp = QualityValue])
  Type: function definition
  Visibility: semantic_interposition external public weak comdat comdat_group:_ZN6vectorI12QualityValueEC1ERKS1_ one_only
  References:
  Referring:
  Function flags:
  Called by:
  Calls:

This is the same symbol name but in two different comdat groups (C1 compared to C5). With -O0 both seem to get the C5 group. I can silence the ICE by making aliases undefined during symbol merging (which is kind of a hack but should make the sanity checks happy), but I am still lost as to how this is supposed to work in valid code.
[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208 --- Comment #27 from Jan Hubicka --- OK, but the problem is the same. Having comdats with the same key define different sets of public symbols is IMO not a good situation for both non-LTO and LTO builds. Unless the additional alias is never used by valid code (which would make it useless, and then we probably should not generate it), it should be possible to produce a scenario where the linker picks the wrong version of the comdat and we get an undefined symbol in non-LTO builds...
[Bug lto/113208] [14 Regression] lto1: error: Alias and target's comdat groups differs since r14-5979-g99d114c15523e0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113208 --- Comment #25 from Jan Hubicka --- So we have comdat groups that diverge in t1.o and t2.o. In one object the group has an alias in it, while in the other it does not:

Merging nodes for _ZN6vectorI12QualityValueEC2ERKS1_. Candidates:

_ZN6vectorI12QualityValueEC2ERKS1_/1 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition prevailing_def_ironly public weak comdat comdat_group:_ZN6vectorI12QualityValueEC2ERKS1_ one_only
  next sharing asm name: 19
  References:
  Referring:
  Read from file: t1.o
  Unit id: 1
  Function flags: count:1073741824 (estimated locally)
  Called by: _Z1n1k/6 (1073741824 (estimated locally),1.00 per call) (can throw external)
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/10 (1073741824 (estimated locally),1.00 per call) (can throw external) _ZNK12_Vector_baseI12QualityValueE1gEv/9 (1073741824 (estimated locally),1.00 per call) (can throw external)

_ZN6vectorI12QualityValueEC2ERKS1_/19 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition preempted_ir public weak comdat comdat_group:_ZN6vectorI12QualityValueEC5ERKS1_ one_only
  Same comdat group as: _ZN6vectorI12QualityValueEC1ERKS1_/20
  previous sharing asm name: 1
  References:
  Referring: _ZN6vectorI12QualityValueEC1ERKS1_/20 (alias)
  Read from file: t2.o
  Unit id: 2
  Function flags: count:1073741824 (estimated locally)
  Called by:
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/23 (1073741824 (estimated locally),1.00 per call) (can throw external) _ZNK12_Vector_baseI12QualityValueE1gEv/24 (1073741824 (estimated locally),1.00 per call) (can throw external)

After resolution:

_ZN6vectorI12QualityValueEC2ERKS1_/1 (__ct_base )
  Type: function definition analyzed
  Visibility: externally_visible semantic_interposition prevailing_def_ironly public weak comdat comdat_group:_ZN6vectorI12QualityValueEC2ERKS1_ one_only
  next sharing asm name: 19
  References:
  Referring:
  Read from file: t1.o
  Unit id: 1
  Function flags: count:1073741824 (estimated locally)
  Called by: _Z1n1k/6 (1073741824 (estimated locally),1.00 per call) (can throw external)
  Calls: _ZN12_Vector_baseI12QualityValueEC2Eii/10 (1073741824 (estimated locally),1.00 per call) (can throw external) _ZNK12_Vector_baseI12QualityValueE1gEv/9 (1073741824 (estimated locally),1.00 per call) (can throw external)

We opt for the version without the alias and later ICE in a sanity check verifying that aliases have the same comdat group as their targets. I wonder how this is ice-on-valid code, since with normal linking the aliased symbol may or may not appear in the winning comdat group, so using the alias has to break. If constexpr changes how the constructor is generated, isn't this a violation of the ODR? We probably can go and reset every node in the losing comdat group to silence the ICE, getting an undefined symbol instead.
[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291 --- Comment #8 from Jan Hubicka --- I am not sure this ought to be P1:
- the compilation is technically finite, but does not finish in reasonable time
- it is possible to adjust the testcase (do the early inlining manually) and get the same infinite build on release branches
- if you ask for an inline bomb, you get it.

But after some more testing, I do not see a reasonably easy way to get better diagnostics. So I will retest the patch from comment #6 and go ahead with it.
[Bug ipa/113359] [13/14 Regression] LTO miscompilation of ceph on aarch64 and x86_64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359 --- Comment #23 from Jan Hubicka --- The patch looks reasonable. We probably could hash the padding vectors at summary generation time to reduce the WPA overhead, but that can be done incrementally next stage1. I however wonder if we really guarantee to copy the padding everywhere else than in the total scalarization part (i.e. on all paths through RTL expansion)?
[Bug ipa/109817] internal error in ICF pass on Ada interfaces
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109817 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #5 from Jan Hubicka --- That check was added to verify that we do not lose the thunk annotations. Now that the datastructure is stable, I think we can simply drop it, if that makes Ada work.
[Bug gcov-profile/113765] [14 Regression] ICE: autofdo: val-profiler-threads-1.c compilation, error: probability of edge from entry block not initialized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113765 --- Comment #6 from Jan Hubicka --- Running auto-FDO without guessing branch probabilities is a somewhat odd idea in general. I suppose we can indeed just avoid setting the full_profile flag. The optimization passes are not that well tested with non-full profiles, though, so there is some risk that the resulting code will be worse than without auto-FDO.
[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #7 from Jan Hubicka --- Found it, probably. I renamed exit to nonexit (since the name was misleading) and then forgot to update

  propagate_threaded_block_debug_into (exit->dest, entry->dest);

I will check this after teaching (which I have in 10 mins).
[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596 --- Comment #6 from Jan Hubicka --- On this testcase trunk gets the same dump as gcc13 for the pass just before ch2; with ch2 we get:

@@ -192,9 +236,8 @@
   # DEBUG BEGIN_STMT
   goto ; [100.00%]

-  [local count: 954449105]:
+  [local count: 954449104]:
   # j_15 = PHI
-  # DEBUG j => j_15
   # DEBUG BEGIN_STMT
   a[b_14][j_15] = 0;
   # DEBUG BEGIN_STMT
@@ -203,29 +246,30 @@
   # DEBUG j => j_9
   # DEBUG BEGIN_STMT
   if (j_9 <= 7)
-    goto ; [88.89%]
+    goto ; [87.50%]
   else
-    goto ; [11.11%]
+    goto ; [12.50%]

   [local count: 119292720]:
+  # DEBUG j => 0
   # DEBUG BEGIN_STMT
   b_7 = b_14 + 1;
   # DEBUG b => b_7
   # DEBUG b => b_7
   # DEBUG BEGIN_STMT
   if (b_7 <= 6)
-    goto ; [87.50%]
+    goto ; [85.71%]
   else
-    goto ; [12.50%]
+    goto ; [14.29%]

   [local count: 119292720]:
   # b_14 = PHI
-  # DEBUG b => b_14
   # DEBUG j => 0
   # DEBUG BEGIN_STMT
   goto ; [100.00%]

   [local count: 17041817]:
+  # DEBUG b => 0
   # DEBUG BEGIN_STMT
   optimize_me_not ();
   # DEBUG BEGIN_STMT

So in addition to updating the BB profile, we indeed end up moving debug statements around. The change in the ch2 dump itself is:

+  Analyzing: if (b_1 <= 6)
+    Will eliminate peeled conditional in bb 6.
+    May duplicate bb 6
+  Not duplicating bb 8: it is single succ.
+  Analyzing: if (j_2 <= 7)
+    Will eliminate peeled conditional in bb 4.
+    May duplicate bb 4
+  Not duplicating bb 3: it is single succ.
 Loop 2 is not do-while loop: latch is not empty.
+Duplicating header BB to obtain do-while loop
 Copying headers of loop 1
 Will duplicate bb 6
-  Not duplicating bb 8: it is single succ.
-Duplicating header of the loop 1 up to edge 6->8, 2 insns.
+Duplicating header of the loop 1 up to edge 6->7
 Loop 1 is do-while loop
 Loop 1 is now do-while loop.
+Exit count: 17041817 (estimated locally)
+Entry count: 17041817 (estimated locally)
+Peeled all exits: decreased number of iterations of loop 1 by 1.
 Copying headers of loop 2
 Will duplicate bb 4
-  Not duplicating bb 3: it is single succ.
-Duplicating header of the loop 2 up to edge 4->3, 2 insns.
+Duplicating header of the loop 2 up to edge 4->5
 Loop 2 is do-while loop
 Loop 2 is now do-while loop.
+Exit count: 119292720 (estimated locally)
+Entry count: 119292720 (estimated locally)
+Peeled all exits: decreased number of iterations of loop 2 by 1.

The dumps moved around, but we do the same duplications as before (BB6 and BB4, to eliminate the conditionals).

  [local count: 1073741824]:
  # j_2 = PHI <0(8), j_9(3)>
  # DEBUG j => j_2
  # DEBUG BEGIN_STMT
  if (j_2 <= 7)
    goto ; [88.89%]
  else
    goto ; [11.11%]

  [local count: 136334537]:
  # b_1 = PHI <0(2), b_7(5)>
  # DEBUG b => b_1
  # DEBUG BEGIN_STMT
  if (b_1 <= 6)
    goto ; [87.50%]
  else
    goto ; [12.50%]
[Bug testsuite/109596] [14 Regression] Lots of guality testcase fails on x86_64 after r14-162-gcda246f8b421ba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109596 --- Comment #4 from Jan Hubicka --- The change makes loop iteration estimates more realistic, but does not introduce any new code that actually changes the IL, so it seems it makes an existing problem more visible. I will try to debug what happens.
[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #59 from Jan Hubicka --- Just to explain what happens in the testcase: there are test and testb. They are almost the same:

int testb(void)
{
  struct bar *fp;
  test2 ((void *)&fp);
  fp = NULL;
  (*ptr)++;
  test3 ((void *)&fp);
}

The difference is in the alias set of fp. In one case it aliases with the (*ptr)++ while in the other it does not. This makes one function have a jump function specifying an aggregate value of 0 for *fp, while the other does not. Now with LTO both struct bar and struct foo become compatible for TBAA, so the functions get merged, and the winning variant has the jump function specifying aggregate 0, which is wrong in the context the code is invoked in.
[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #58 from Jan Hubicka --- Created attachment 57702 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57702=edit Compare value ranges in jump functions

This patch implements the jump function compare; however, it is not good enough. Here is another wrong code testcase:

jh@ryzen3:~/gcc/build/stage1-gcc> cat a.c
#include
#include
__attribute__((used)) int val,val2 = 1;
struct foo {int a;};
struct foo **ptr;
__attribute__ ((noipa))
int test2 (void *a)
{
  ptr = (struct foo **)a;
}
int test3 (void *a);
int test(void)
{
  struct foo *fp;
  test2 ((void *)&fp);
  fp = NULL;
  (*ptr)++;
  test3 ((void *)&fp);
}
int testb (void);
int main()
{
  for (int i = 0; i < val2; i++)
    if (val)
      testb ();
    else
      test();
}

jh@ryzen3:~/gcc/build/stage1-gcc> cat b.c
#include
struct bar {int a;};
struct foo {int a;};
struct barp {struct bar *f; struct bar *g;};
extern struct foo **ptr;
int test2 (void *);
int test3 (void *);
int testb(void)
{
  struct bar *fp;
  test2 ((void *)&fp);
  fp = NULL;
  (*ptr)++;
  test3 ((void *)&fp);
}

jh@ryzen3:~/gcc/build/stage1-gcc> cat c.c
#include
__attribute__ ((noinline))
int test3 (void *a)
{
  if (!*(void **)a)
    abort ();
  return 0;
}

jh@ryzen3:~/gcc/build/stage1-gcc> ./xgcc -B ./ -O3 a.c b.c -flto -c ; ./xgcc -B ./ -O3 c.c -flto -fno-strict-aliasing -c ; ./xgcc -B ./ b.o a.o c.o ; ./a.out
Aborted (core dumped)
jh@ryzen3:~/gcc/build/stage1-gcc> ./xgcc -B ./ -O3 a.c b.c -flto -c ; ./xgcc -B ./ -O3 c.c -flto -fno-strict-aliasing -c ; ./xgcc -B ./ b.o a.o c.o --disable-ipa-icf ; ./a.out
lto1: note: disable pass ipa-icf for functions in the range of [0, 4294967295]
lto1: note: disable pass ipa-icf for functions in the range of [0, 4294967295]
[Bug ipa/113907] [11/12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #55 from Jan Hubicka ---
> Anyway, can we in the spot my patch changed just walk all
> source->node->callees cgraph_edges, for each of them find the corresponding
> cgraph_edge in the alias and for each walk all the jump_functions recorded
> and union their m_vr?
> Or is that something that can't be done in LTO for some reason?

That was my first idea too, but the problem is that ICF has (very limited) support for matching functions which differ in the order of their basic blocks: it computes a hash of every basic block and orders the blocks by that hash prior to comparing. This seems half-finished, since e.g. the order of edges in PHIs has to match exactly. Callee lists are officially randomly ordered, but in practice they follow the order of basic blocks (as they are built that way). However, since the BB orders can differ, just walking both callee sequences and comparing pairwise does not work. This also makes merging the information harder, since we no longer have the BB map at the time we decide to merge. It is, however, not hard to match the jump functions while walking the gimple bodies and comparing statements, which is backportable and localized. I am still waiting for my statistics to converge and will send the patch soon.
[Bug ipa/106716] Identical Code Folding (-fipa-icf) confuses between functions with different [[likely]] attributes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106716 --- Comment #6 from Jan Hubicka --- The reason why GIMPLE_PREDICT is ignored is that it is never used after ipa-icf and gets removed at the very beginning of late optimizations. GIMPLE_PREDICT is consumed by the profile_generate pass, which is run before ipa-icf. The reason why GIMPLE_PREDICT statements are not stripped before ICF is early inlining: if we early inline, we throw away the inlined function's profile and estimate it again (in the context of the function it was inlined into), and for that it is a good idea to keep the predicts. There is no convenient place to remove them after early inlining and before the IPA passes, and that is the only reason why they are still around. We may revisit that, since streaming them into LTO bytecode is probably more harmful than adding an extra pass after early opts to strip them. ICF doesn't have code to compare edge profiles and stmt histograms. It knows how to merge them (so the resulting BB profile is consistent with the merging), but I suppose we may want some threshold so that we do not merge functions with very different branch probabilities in the hot parts of their bodies...
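As a minimal sketch of the situation described (my own reduction, not the PR's testcase): two functions that are semantically identical and differ only in their __builtin_expect hints. The hints become GIMPLE_PREDICT statements, which ICF does not compare:

```c
static int slow_path (int a) { return a * 2; }

/* expects the branch to be taken */
int f_hot (int a)
{
  if (__builtin_expect (a > 0, 1))
    return 1;
  return slow_path (a);
}

/* identical code, opposite expectation */
int f_cold (int a)
{
  if (__builtin_expect (a > 0, 0))
    return 1;
  return slow_path (a);
}
```

Merging the two is semantically safe, but the surviving body carries only one set of predictions, which is exactly the profile-quality concern raised above.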
[Bug lto/114241] False-positive -Wodr warning when using -flto and -fno-semantic-interposition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114241 Jan Hubicka changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #3 from Jan Hubicka --- Mine. Will debug why the tables diverge.
[Bug debug/92387] [11/12/13 Regression] gcc generates wrong debug information at -O1 since r10-1907-ga20f263ba1a76a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92387 --- Comment #5 from Jan Hubicka --- The revision is changing inlining decisions, so it would probably be possible to reproduce the problem without that change with the right always_inline and noinline attributes.
[Bug tree-optimization/114207] [12/13/14 Regression] modref gets confused by vecotorized code ` -O3 -fno-tree-forwprop` since r12-5439
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114207 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #3 from Jan Hubicka --- Mine. The summary is:

  loads:
      Base 0: alias set 1
        Ref 0: alias set 1
          access: Parm 0 param offset:4 offset:0 size:64 max_size:64
  stores:
      Base 0: alias set 1
        Ref 0: alias set 1
          access: Parm 0 param offset:0 offset:0 size:64 max_size:64

while with fwprop we get:

  loads:
      Base 0: alias set 1
        Ref 0: alias set 1
          access: Parm 0 param offset:0 offset:0 size:64 max_size:64
  stores:
      Base 0: alias set 1
        Ref 0: alias set 1
          access: Parm 0 param offset:0 offset:0 size:64 max_size:64

So it seems the offset is misaccounted.
[Bug lto/85432] Wodr can be more verbose for C code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85432 Jan Hubicka changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |WORKSFORME --- Comment #1 from Jan Hubicka --- This should have been solved for a long time: we recognize ODR types by mangled names, which are produced only by the C++ frontend. I checked that GCC 12, 13 and trunk do not produce the warning.
[Bug tree-optimization/114052] [11/12/13/14 Regression] Wrong code at -O2 for well-defined infinite loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114052 --- Comment #5 from Jan Hubicka --- So if I understand it right, you want to determine the property that if the loop header is executed, then the BB containing the undefined behavior will be executed in that iteration, too. modref tracks whether a function will always return, and if it can not determine that, it sets the side_effect flag, so you can check for that in the modref summary. It uses finite_function_p, which was originally done for pure/const detection and is implemented by looking at the loop nest to see if all loops are known to be finite, and also by checking for irreducible loops. In your setup you probably also want to check for volatile asms, which are also possibly infinite; in modref we get around that by considering them to be side effects anyway. There is also determine_unlikely_bbs, which tries to set the profile_count to zero for as many basic blocks as possible by propagating backward and forward from basic blocks containing undefined behaviour or cold noreturn calls. The backward walk can be used to determine the property that executing the header implies UB. It stops on all loops, though; in this case it would be nice to walk through loops known to be finite...
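For illustration (a hypothetical loop, not the PR's testcase): the kind of data-dependent loop that finite_function_p cannot prove terminating, yet which is well defined whenever its precondition holds:

```c
/* returns the index of key in a[]; well defined as long as key occurs
   somewhere in the array, but no analysis of the function alone can
   prove the search finite */
unsigned find_key (const unsigned *a, unsigned key)
{
  unsigned i = 0;
  while (a[i] != key)
    i++;
  return i;
}
```

For such a loop, the backward propagation discussed above must stop at the loop header, because the UB-containing block after the loop is not guaranteed to be reached.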
[Bug ipa/108802] [11/12/13/14 Regression] missed inlining of call via pointer to member function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108802 --- Comment #5 from Jan Hubicka --- I don't think we can reasonably expect every caller of a lambda function to be early inlined, so we need to extend ipa-prop to understand the obfuscated code. I discussed this with Martin some time ago - I think this is quite a common problem with modern C++, so we will need to pattern match it, which is quite unfortunate.
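A minimal sketch of the pattern in question (my own reduction; the PR's testcase is the real reference): a call through a pointer to member function, where ipa-prop would have to track the member-pointer value through the call for the target to be known and inlined:

```cpp
struct S
{
  int v;
  int get () const { return v; }
};

// the (s.*pm)() call is the "obfuscated code": unless pm is known to be
// &S::get at this call site, the compiler cannot devirtualize or inline it
int call_pm (const S &s, int (S::*pm) () const)
{
  return (s.*pm) ();
}
```

When the caller passes a compile-time-constant member pointer, only cross-procedure propagation of that constant (or full early inlining of the caller) exposes the direct call.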
[Bug ipa/111960] [14 Regression] ICE: during GIMPLE pass: rebuild_frequencies: SIGSEGV (Invalid read of size 4) with -fdump-tree-rebuild_frequencies-all
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111960 --- Comment #5 from Jan Hubicka --- hmm. cfg.cc:815 for me is: fputs (", maybe hot", outf); which seems quite safe. The problem does not seem to reproduce for me: jh@ryzen3:~/gcc/build/gcc> ./xgcc -B ./ tt.c -O --param=max-inline-recursive-depth=100 -fdump-tree-rebuild_frequencies-all -wrapper valgrind ==25618== Memcheck, a memory error detector ==25618== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==25618== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info ==25618== Command: ./cc1 -quiet -iprefix /home/jh/gcc/build/gcc/../lib64/gcc/x86_64-pc-linux-gnu/14.0.1/ -isystem ./include -isystem ./include-fixed tt.c -quiet -dumpdir a- -dumpbase tt.c -dumpbase-ext .c -mtune=generic -march=x86-64 -O -fdump-tree-rebuild_frequencies-all --param=max-inline-recursive-depth=100 -o /tmp/ccpkfjdK.s ==25618== ==25618== ==25618== HEAP SUMMARY: ==25618== in use at exit: 1,818,714 bytes in 1,175 blocks ==25618== total heap usage: 39,645 allocs, 38,470 frees, 12,699,874 bytes allocated ==25618== ==25618== LEAK SUMMARY: ==25618==definitely lost: 0 bytes in 0 blocks ==25618==indirectly lost: 0 bytes in 0 blocks ==25618== possibly lost: 8,032 bytes in 1 blocks ==25618==still reachable: 1,810,682 bytes in 1,174 blocks ==25618== suppressed: 0 bytes in 0 blocks ==25618== Rerun with --leak-check=full to see details of leaked memory ==25618== ==25618== For lists of detected and suppressed errors, rerun with: -s ==25618== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ==25627== Memcheck, a memory error detector ==25627== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==25627== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info ==25627== Command: ./as --64 -o /tmp/ccp5TNme.o /tmp/ccpkfjdK.s ==25627== ==25637== Memcheck, a memory error detector ==25637== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. 
==25637== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info ==25637== Command: ./collect2 -plugin ./liblto_plugin.so -plugin-opt=./lto-wrapper -plugin-opt=-fresolution=/tmp/cclWZD7F.res -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s -plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /lib/../lib64/crt1.o /lib/../lib64/crti.o ./crtbegin.o -L. -L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccp5TNme.o -lgcc --push-state --as-needed -lgcc_s --pop-state -lc -lgcc --push-state --as-needed -lgcc_s --pop-state ./crtend.o /lib/../lib64/crtn.o ==25637== /usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: /lib/../lib64/crt1.o: in function `_start': /home/abuild/rpmbuild/BUILD/glibc-2.38/csu/../sysdeps/x86_64/start.S:103:(.text+0x2b): undefined reference to `main' collect2: error: ld returned 1 exit status ==25637== ==25637== HEAP SUMMARY: ==25637== in use at exit: 89,760 bytes in 39 blocks ==25637== total heap usage: 175 allocs, 136 frees, 106,565 bytes allocated ==25637== ==25637== LEAK SUMMARY: ==25637==definitely lost: 0 bytes in 0 blocks ==25637==indirectly lost: 0 bytes in 0 blocks ==25637== possibly lost: 0 bytes in 0 blocks ==25637==still reachable: 89,760 bytes in 39 blocks ==25637== of which reachable via heuristic: ==25637== newarray : 1,544 bytes in 1 blocks ==25637== suppressed: 0 bytes in 0 blocks ==25637== Rerun with --leak-check=full to see details of leaked memory ==25637== ==25637== For lists of detected and suppressed errors, rerun with: -s ==25637== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
[Bug middle-end/113907] [12/13/14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 Jan Hubicka changed: What|Removed |Added Summary|[14 regression] ICU |[12/13/14 regression] ICU |miscompiled since on x86|miscompiled since on x86 |since |since |r14-5109-ga291237b628f41|r14-5109-ga291237b628f41 --- Comment #41 from Jan Hubicka --- OK, the reason why this does not work is that ranger ignores earlier value ranges on everything but default defs and PHIs:

// This is where the ranger picks up global info to seed initial
// requests.  It is a slightly restricted version of
// get_range_global() above.
//
// The reason for the difference is that we can always pick the
// default definition of an SSA with no adverse effects, but for other
// SSAs, if we pick things up too early, we may prematurely eliminate
// builtin_unreachables.
//
// Without this restriction, the test in g++.dg/tree-ssa/pr61034.C has
// all of its unreachable calls removed too early.
//
// See discussion here:
// https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571709.html

void
gimple_range_global (vrange &r, tree name, struct function *fun)
{
  tree type = TREE_TYPE (name);
  gcc_checking_assert (TREE_CODE (name) == SSA_NAME);

  if (SSA_NAME_IS_DEFAULT_DEF (name) || (fun && fun->after_inlining)
      || is_a <gphi *> (SSA_NAME_DEF_STMT (name)))
    {
      get_range_global (r, name, fun);
      return;
    }
  r.set_varying (type);
}

This makes ipa-prop ignore the earlier known value range and masks the bug. However, adding a PHI makes the problem reproduce:

#include
#include
int data[100];
int c;
static __attribute__((noinline))
int bar (int d, unsigned int d2)
{
  if (d2 > 30)
    c++;
  return d + d2;
}
static int test2 (unsigned int i)
{
  if (i > 100)
    __builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
    return data[i];
  for (int j = 0; j < 100; j++)
    data[i] += bar (data[j], i&1 ? i+17 : i + 16);
  return data[i];
}
static int test (unsigned int i)
{
  if (i > 10)
    __builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
    return data[i];
  for (int j = 0; j < 100; j++)
    data[i] += bar (data[j], i&1 ? i+17 : i + 16);
  return data[i];
}
int main ()
{
  int ret = test (1) + test (2) + test (3) + test2 (4) + test2 (30);
  if (!c)
    abort ();
  return ret;
}

This fails with trunk, gcc12 and gcc13, and also with Jakub's patch.
[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #39 from Jan Hubicka --- This testcase

#include <stdio.h>
int data[100];
__attribute__((noinline))
int bar (int d, unsigned int d2)
{
  if (d2 > 10)
    printf ("Bingo\n");
  return d + d2;
}
int test2 (unsigned int i)
{
  if (i > 10)
    __builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
    return data[i];
  printf ("%i\n",i);
  for (int j = 0; j < 100; j++)
    data[i] += bar (data[j], i+17);
  return data[i];
}
int test (unsigned int i)
{
  if (i > 100)
    __builtin_unreachable ();
  if (__builtin_expect (data[i] != 0, 1))
    return data[i];
  printf ("%i\n",i);
  for (int j = 0; j < 100; j++)
    data[i] += bar (data[j], i+17);
  return data[i];
}
int main ()
{
  test (1);
  test (2);
  test (3);
  test2 (4);
  test2 (100);
  return 0;
}

gets me most of what I want to reproduce the ipa-prop problem. Functions test and test2 are split with different value ranges visible in the fnsplit dump. However, curiously enough, the ipa-prop analysis seems to ignore the value ranges and does not attach them to the jump function, which is odd...
[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 --- Comment #31 from Jan Hubicka --- Having a testcase is great; I was just playing with crafting one. I am still concerned about value ranges in ipa-prop's jump functions. Let me see if I can modify the testcase to also trigger the problem with value ranges in ipa-prop jump functions. Not streaming the value ranges is an omission on my side (I mistakenly assumed we do stream them). We ought to stream them, since otherwise we will lose propagated return value ranges in partitioned programs, which is a pity.
[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291 --- Comment #6 from Jan Hubicka --- Created attachment 57427 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57427=edit patch

The patch makes compilation finish in reasonable time. I ended up needing to drop DISREGARD_INLINE_LIMITS in late inlining for functions with self-recursive always_inlines, since these grow large quickly and even non-recursive inlining is too slow. We also end up with quite ugly diagnostics of the form:

tt.c:13:1: error: inlining failed in call to ‘always_inline’ ‘f1’: --param max-inline-insns-auto limit reached
   13 | f1 (void)
      | ^~
tt.c:17:3: note: called from here
   17 |   f1 ();
      |   ^
tt.c:6:1: error: inlining failed in call to ‘always_inline’ ‘f0’: --param max-inline-insns-auto limit reached
    6 | f0 (void)
      | ^~
tt.c:16:3: note: called from here
   16 |   f0 ();
      |   ^
tt.c:13:1: error: inlining failed in call to ‘always_inline’ ‘f1’: --param max-inline-insns-auto limit reached
   13 | f1 (void)
      | ^~
tt.c:15:3: note: called from here
   15 |   f1 ();
      |   ^
In function ‘f1’,
    inlined from ‘f0’ at tt.c:8:3,

which is quite large, so I can not add it to a testsuite. I will see if I can reduce this even more.
[Bug middle-end/111054] [14 Regression] ICE: in to_sreal, at profile-count.cc:472 with -O3 -fno-guess-branch-probability since r14-2967
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111054 Jan Hubicka changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #7 from Jan Hubicka --- Fixed.
[Bug ipa/113291] [14 Regression] compilation never (?) finishes with recursive always_inline functions at -O and above since r14-2172
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113291 --- Comment #5 from Jan Hubicka --- There is a cap in want_inline_self_recursive_call_p which gives up on inlining after reaching the max recursive inlining depth of 8. The problem is that the tree here is too wide: after early inlining, f0 contains 4 calls to f1 and 3 calls to f0, and similarly for f1, so we have something like (9+3*9)^8 as a cap on the number of inlines, which takes a while to converge. One may want to limit the number of copies of function A within function B rather than the depth, but that number can be large even for sane code. I am making a patch to make the inliner ignore always_inline in all self-recursive inline decisions.
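The growth bound quoted above can be spelled out numerically (a back-of-the-envelope sketch of the arithmetic in the comment, nothing more):

```c
/* each level exposes roughly 9 + 3*9 = 36 inline candidates, and the
   recursion-depth cap in want_inline_self_recursive_call_p is 8 */
unsigned long long inline_bound (void)
{
  unsigned long long bound = 1;
  for (int depth = 0; depth < 8; depth++)
    bound *= 9 + 3 * 9;
  return bound; /* 36^8 = 2821109907456, on the order of 10^12 copies */
}
```

So even though the inlining process is finite, the cap is far too large to be reached in practical compile time.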
[Bug middle-end/113907] [14 regression] ICU miscompiled since on x86 since r14-5109-ga291237b628f41
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113907 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #29 from Jan Hubicka --- The safest fix is to make equals_p reject merging functions with different value ranges assigned to corresponding SSA names. I would hope that, since early opts are still mostly local, this does not lead to a very large degradation. It is lame, of course. If we go for smarter merging, we need to also handle ipa-prop jump functions. In that case I think equals_p needs to check whether the value ranges in SSA_NAMEs and jump functions differ, and if so, keep that noted so the merging code can do the corresponding update. I will check how hard it is to implement this. (The equality handling is Martin Liska's code, but if I recall right, each equivalence class has a leader, and we can keep track of whether there are some differences WRT that leader; I do not recall how subdivision of equivalence classes is handled.)
[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787 --- Comment #13 from Jan Hubicka --- So my understanding is that ivopts does something like

  offset = base2 - base1

and then translates

  val = base2[i]

to

  val = *((base1 + i) + offset)

where (base1 + i) is then an IV. I wonder if we consider a memory reference whose base has been changed via an offset a valid transformation. Is there a way to tell when this happens? A quick fix would be to run IPA modref before ivopts, but I do not see how such a transformation can work with the rest of the alias analysis (PTA etc.)
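The rewrite described can be sketched in plain C (my illustration of the described transformation, not GCC output); note the load no longer mentions base2 at all, which is what confuses a base-sensitive analysis like modref:

```c
#include <stdint.h>

/* precompute the byte distance between the two bases, then express the
   base2[i] access relative to the base1 induction variable */
int load_rewritten (int *base1, int *base2, unsigned i)
{
  uintptr_t offset = (uintptr_t) base2 - (uintptr_t) base1;
  return *(int *) ((uintptr_t) (base1 + i) + offset); /* == base2[i] */
}
```

The value loaded is the same, but any analysis that tracks which parameter a memory access is based on now attributes the load to base1.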
[Bug tree-optimization/113787] [12/13/14 Regression] Wrong code at -O with ipa-modref on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113787 --- Comment #8 from Jan Hubicka --- I will take a look. Modref only reuses the code detecting erroneous paths from ssa-split-paths, so that code will get confused, too. It makes sense for ivopts to compute the difference of two memory allocations, but I wonder if that won't also confuse PTA and other stuff, so perhaps we need a way to explicitly tag memory references where such an optimization happened (to make it clear that the original base is lost, or to keep track of it)?
[Bug ipa/113359] [13 Regression] LTO miscompilation of ceph on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113359 --- Comment #11 from Jan Hubicka --- If there are two ODR types with the same ODR name, one with an integer and the other with a pointer type as the third field, then indeed we should get an ODR warning and give up on handling them as ODR types for type merging. So dumping their assembler names would be a useful starting point. Of course, if you have two ODR types with different names but you mix them up in a COMDAT function of the same name, then the warning will not trigger, so this might be some missing type compatibility check in the ipa-sra or ipa-prop summaries, too.
[Bug ipa/97119] Top level option to disable creation of IPA symbols such as .localalias is desired
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97119 --- Comment #7 from Jan Hubicka --- Local aliases are created by the ipa-visibility pass. The most common case is that a function is declared inline but ELF interposition rules say that the symbol can be overwritten by a different library. Since GCC knows that all implementations must be equivalent, it can force calls within the DSO to be direct. I am not quite sure how this confuses stack unwinding on Solaris? For live patching, if you want to patch an inline function, one definitely needs to look for the places it has been inlined to. However, in the situation where the function got offlined, I think live patching should just work, since it will place the jump at the beginning of the function body. The logic for creating local aliases is in ipa-visibility.cc. Adding a command line option to control it is not hard. There are other transformations we do there, like breaking up comdat groups and other things. The .part aliases are controlled by -fno-partial-inlining, isra by -fno-ipa-sra. There is also ipa-cp, controlled by -fno-ipa-cp. We also produce aliases as part of OpenMP offloading and LTO partitioning that are kind of mandatory (there is no way to produce correct code without them).
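A hedged single-TU sketch of the local-alias trick (symbol names are mine; ipa-visibility emits names like speed.localalias.0 automatically): "speed" is exported and thus interposable under ELF rules, but the static alias resolves locally, so calls through it inside the DSO stay direct even if another library provides its own "speed".

```c
/* Exported, interposable entry point.  */
int speed (void)
{
  return 42;
}

/* Local alias: same code address, but not subject to interposition.  */
static int speed_localalias (void) __attribute__ ((alias ("speed")));

/* Intra-DSO callers are redirected to the local alias.  */
int call_within_dso (void)
{
  return speed_localalias ();
}
```

Compiling with -fno-semantic-interposition has a similar effect without emitting the alias symbol at all.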
[Bug ipa/113422] Missed optimizations in the presence of pointer chains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113422 --- Comment #2 from Jan Hubicka --- Cycling read-only var discovery would be quite expensive, since you would need to interleave it with early opts each round. I wonder how LLVM handles this? I think there is more hope in IPA-PTA getting a scalable version at -O2 and possibly being able to solve this.
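A hedged sketch of the pointer-chain pattern at issue (my own example): every variable below is effectively read-only, but proving that the load through pp yields a constant requires first proving p read-only, then value, so a single round of read-only discovery stops one level short of folding the whole chain.

```c
/* A two-level pointer chain; nothing is ever written after init.  */
static int value = 7;
static int *p = &value;   /* provably read-only in round 1 */
static int **pp = &p;     /* folding needs the round-1 result for p */

int read_through_chain (void)
{
  return **pp;            /* ideally folds to the constant 7 */
}
```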
[Bug ipa/113520] ICE with mismatched types with LTO (tree check: expected array_type, have integer_type in array_ref_low_bound)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113520 --- Comment #8 from Jan Hubicka --- I think the ipa-cp summaries should be used only when the types match. At least Martin added type streaming for all the jump functions. So are we missing some check?
[Bug tree-optimization/110852] [14 Regression] ICE: in get_predictor_value, at predict.cc:2695 with -O -fno-tree-fre and __builtin_expect() since r14-2219-geab57b825bcc35
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110852 Jan Hubicka changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #16 from Jan Hubicka --- Fixed.
[Bug c++/109753] [13/14 Regression] pragma GCC target causes std::vector not to compile (always_inline on constructor)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109753 --- Comment #12 from Jan Hubicka --- I think this is a problem with the two meanings of always_inline. One is "it must be inlined or otherwise we will not be able to generate code"; the other is "disregard inline limits". I guess a practical solution here would be to ignore always_inline for functions called from static construction wrappers (since they only optimize around the array of function pointers). The question is how to communicate this down from the FE to ipa-inline...
[Bug middle-end/79704] [meta-bug] Phoronix Test Suite compiler performance issues
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79704 Bug 79704 depends on bug 109811, which changed state. Bug 109811 Summary: libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811 Jan Hubicka changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #19 from Jan Hubicka --- I think we can declare this one fixed.
[Bug target/113236] WebP benchmark is 20% slower vs. Clang on AMD Zen 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113236 Jan Hubicka changed: What|Removed |Added Ever confirmed|0 |1 Last reconfirmed||2024-01-05 CC||hubicka at gcc dot gnu.org Status|UNCONFIRMED |NEW --- Comment #2 from Jan Hubicka --- On zen3 I get 0.75 MP/s for GCC and 0.80 MP/s for clang, so only 6.6%, but it seems reproducible. The profiles look comparable:

gcc:
  30.96%  cwebp  libwebp.so.7.1.5  [.] GetCombinedEntropyUnre
  26.19%  cwebp  libwebp.so.7.1.5  [.] VP8LHashChainFill
   3.34%  cwebp  libwebp.so.7.1.5  [.] CalculateBestCacheSize
   3.30%  cwebp  libwebp.so.7.1.5  [.] CombinedShannonEntropy
   3.21%  cwebp  libwebp.so.7.1.5  [.] CollectColorBlueTransf

clang:
  34.06%  cwebp  libwebp.so.7.1.5  [.] GetCombinedEntropy
  28.95%  cwebp  libwebp.so.7.1.5  [.] VP8LHashChainFill
   5.37%  cwebp  libwebp.so.7.1.5  [.] VP8LGetBackwardReferences
   4.39%  cwebp  libwebp.so.7.1.5  [.] CombinedShannonEntropy_SS
   4.28%  cwebp  libwebp.so.7.1.5  [.] CollectColorBlueTransform

In the first loop clang seems to if-convert while GCC doesn't:

  0.59 │      lea    kSLog2Table,%rdi
  3.69 │      vmovss (%rdi,%rax,4),%xmm0
  0.98 │ 6f:  vcvtsi2ss %edx,%xmm2,%xmm1
  0.63 │      vfnmadd213ss 0x0(%r13),%xmm0,%xmm1
 38.16 │      vmovss %xmm1,0x0(%r13)
  5.48 │      cmp    %r12d,0xc(%r13)
  0.06 │    ↓ jae    89
       │      mov    %r12d,0xc(%r13)
  0.99 │ 89:  mov    0x4(%r13),%edi
  0.96 │ 8d:  xor    %eax,%eax
  0.40 │      test   %r12d,%r12d
  0.60 │      setne  %al
       │      vcvtsd2ss %xmm0,%xmm0,%xmm1
  0.02 │362:  mov    %r15d,%eax
  0.57 │      imul   %r12d,%eax
  0.00 │      cmp    %r12d,%r9d
  0.03 │      cmovbe %r12d,%r9d
  0.02 │      vmovd  %eax,%xmm0
  0.08 │      vpinsrd $0x1,%r15d,%xmm0,%xmm0
  1.50 │      vpaddd %xmm0,%xmm4,%xmm4
  1.08 │      vcvtsi2ss %r15d,%xmm5,%xmm0
  0.87 │      vfnmadd231ss %xmm0,%xmm1,%xmm3
  5.40 │      vmovaps %xmm3,%xmm0
  0.02 │38c:  xor    %eax,%eax
  0.16 │      cmp    $0x4,%r15d
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 --- Comment #6 from Jan Hubicka --- The internal loops are:

static const unsigned keccakf_rotc[24] = {
  1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14,
  27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44
};
static const unsigned keccakf_piln[24] = {
  10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
  15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1
};

static void keccakf(ulong64 s[25])
{
  int i, j, round;
  ulong64 t, bc[5];

  for (round = 0; round < SHA3_KECCAK_ROUNDS; round++) {
    /* Theta */
    for (i = 0; i < 5; i++)
      bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];
    for (i = 0; i < 5; i++) {
      t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1);
      for (j = 0; j < 25; j += 5)
        s[j + i] ^= t;
    }
    /* Rho Pi */
    t = s[1];
    for (i = 0; i < 24; i++) {
      j = keccakf_piln[i];
      bc[0] = s[j];
      s[j] = ROL64(t, keccakf_rotc[i]);
      t = bc[0];
    }
    /* Chi */
    for (j = 0; j < 25; j += 5) {
      for (i = 0; i < 5; i++)
        bc[i] = s[j + i];
      for (i = 0; i < 5; i++)
        s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
    }
    s[0] ^= keccakf_rndc[round];
  }
}

I suppose with complete unrolling this will propagate, partly stay in registers and fold. I think increasing the default limits, especially at -O3, may make sense. The value of 16 has been there for a very long time (I think since the initial implementation).
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 Jan Hubicka changed: What|Removed |Added Summary|SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark |is almost 40% slower vs.|is almost 40% slower vs. |Clang |Clang (not enough complete ||loop peeling) --- Comment #5 from Jan Hubicka --- On my zen3 machine a default build gets me 180 MB/s, -O3 -flto -funroll-all-loops gets me 193 MB/s, and -O3 -flto --param max-completely-peel-times=30 gets me 382 MB/s. The speedup is gone with --param max-completely-peel-times=20; the default is 16.
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #4 from Jan Hubicka --- I keep mentioning to Larabel that he should use -fno-semantic-interposition, but he doesn't. The profile is very simple:

  96.75%  SMHasher  [.] keccakf.lto_priv.0

All of it goes to a simple loop. On Zen3 with gcc 13 -march=native -Ofast -flto I get:

  3.85 │330:  mov    %r8,%rdi
  7.68 │      movslq (%rsi,%r9,1),%rcx
  3.85 │      lea    (%rax,%rcx,8),%r10
  3.86 │      mov    (%rdx,%r9,1),%ecx
  3.83 │      add    $0x4,%r9
  3.86 │      mov    (%r10),%r8
  7.37 │      rol    %cl,%rdi
  7.37 │      mov    %rdi,(%r10)
  4.76 │      cmp    $0x60,%r9
  0.00 │    ↑ jne    330

Clang seems to unroll it:

  0.25 │ d0:  mov    -0x48(%rsp),%rdx
  0.25 │      xor    %r12,%rcx
  0.25 │      mov    %r13,%r12
  0.25 │      mov    %r13,0x10(%rsp)
  0.25 │      mov    %rax,%r13
  0.26 │      xor    %r15,%r13
  0.23 │      mov    %r11,-0x70(%rsp)
  0.25 │      mov    %r8,0x8(%rsp)
  0.25 │      mov    %r15,-0x40(%rsp)
  0.25 │      mov    %r10,%r15
  0.26 │      mov    %r10,(%rsp)
  0.26 │      mov    %r14,%r10
  0.25 │      xor    %r12,%r10
  0.26 │      xor    %rsi,%r15
  0.24 │      mov    %rbp,-0x80(%rsp)
  0.25 │      xor    %rcx,%r15
  0.26 │      mov    -0x60(%rsp),%rcx
  0.25 │      xor    -0x68(%rsp),%r15
  0.26 │      xor    %rbp,%rdx
  0.25 │      mov    -0x30(%rsp),%rbp
  0.25 │      xor    %rdx,%r13
  0.24 │      mov    -0x10(%rsp),%rdx
  0.25 │      mov    %rcx,%r12
  0.24 │      xor    %rcx,%r13
  0.25 │      mov    $0x1,%ecx
  0.25 │      xor    %r11,%rdx
  0.24 │      mov    %r8,%r11
  0.25 │      mov    -0x28(%rsp),%r8
  0.26 │      xor    -0x58(%rsp),%r8
  0.24 │      xor    %rdx,%r8
  0.26 │      mov    -0x8(%rsp),%rdx
  0.25 │      xor    %rbp,%r8
  0.26 │      xor    %r11,%rdx
  0.25 │      mov    -0x20(%rsp),%r11
  0.25 │      xor    %rdx,%r10
[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 --- Comment #23 from Jan Hubicka --- Created attachment 56970 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56970 Patch I am testing Hi, this adds the -falign-all-functions parameter. It still looks like the more reasonable (and backward compatible) thing to do. I also poked at Richi's suggestion of extending the syntax of -falign-functions, but I think it is less readable.
[Bug ipa/92606] [11/12/13 Regression][avr] invalid merge of symbols in progmem and data sections
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92606 --- Comment #31 from Jan Hubicka --- This is Martin's code, but I agree that equals_wpa should reject pairs with "dangerous" attributes on them (ideally we should hash them). I think we could add a test for identical attributes to equals_wpa and eventually whitelist attributes we consider mergeable? There are attributes that serve no purpose once we enter the backend, so it may also be a good option to strip them so they do not confuse passes like ICF.
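A hedged sketch of why attribute-blind merging is dangerous (my own example; the section attribute here is just a portable stand-in for avr's progmem from the PR): the two bodies are byte-identical, so ICF without an attribute check could merge them and silently drop one symbol's placement.

```c
/* Identical bodies, different placement attributes: merging them into one
   symbol would lose the section constraint of one of the copies.  */
__attribute__ ((section (".text.unlikely"))) int in_special_section (int x)
{
  return x + 1;
}

int in_default_section (int x)
{
  return x + 1;
}
```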
[Bug ipa/81323] IPA-VRP doesn't handle return values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81323 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #9 from Jan Hubicka --- Note that r14-5628-g53ba8d669550d3 does just the easy part, propagating within a single translation unit. We will need to add the actual IPA bits into WPA in next stage1.
[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org --- Comment #18 from Jan Hubicka --- Reading all the discussion again, I am leaning towards -falign-all-functions plus a documentation update explaining that -falign-functions/-falign-loops are optimizations and are ignored for -Os. I do use -falign-functions/-falign-loops when tuning for new generations of CPUs, and I definitely want a way to specify alignment that is ignored for cold functions (as a performance optimization); we have had this behavior since profile code was introduced in 2002. As an optimization, we also want hot functions aligned to more than the 8-byte boundary needed for patching. I will prepare a patch for this and send it for discussion. Perhaps we want -flive-patching to also imply a FUNCTION_BOUNDARY increase on x86-64? Or is live patching useful even if function entries are not aligned?
[Bug tree-optimization/110062] missed vectorization in graphicsmagick
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110062 --- Comment #11 from Jan Hubicka ---

trunk -O3 -flto -march=native -fopenmp: Operation: Sharpen: 257 256 256 Average: 256 Iterations Per Minute
GCC 13 -O3 -flto -march=native -fopenmp: 257 256 256 Average: 256 Iterations Per Minute
clang 17 -O3 -flto -march=native -fopenmp: Operation: Sharpen: 257 256 256 Average: 256 Iterations Per Minute

So I guess I will need to try on zen3 to see if there is any difference. The internal loop is:

  0.00 │460:┌─→ movzbl 0x2(%rdx,%rax,4),%esi
  0.02 │    │   vmovss (%r8,%rax,4),%xmm2
  0.95 │    │   vcvtsi2ss %esi,%xmm0,%xmm1
 20.22 │    │   movzbl 0x1(%rdx,%rax,4),%esi
  0.01 │    │   vfmadd231ss %xmm1,%xmm2,%xmm3
 11.97 │    │   vcvtsi2ss %esi,%xmm0,%xmm1
 18.76 │    │   movzbl (%rdx,%rax,4),%esi
  0.00 │    │   inc %rax
  0.72 │    │   vfmadd231ss %xmm1,%xmm2,%xmm4
 12.55 │    │   vcvtsi2ss %esi,%xmm0,%xmm1
 14.95 │    │   vfmadd231ss %xmm1,%xmm2,%xmm5
 15.93 │    ├── cmp %rax,%r13
  0.35 │    └── jne 460

so it still does not get
[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811 --- Comment #18 from Jan Hubicka --- I made a typo: "Mainline with -O2 -flto -march=native run manually since build machinery patch is needed 23.03 22.85 23.04" should be "Mainline with -O3 -flto -march=native run manually since build machinery patch is needed 23.03 22.85 23.04". So with -O2 we still get a slightly lower score than clang; with -O3 we are slightly better. push_back inlining does not seem to be a problem (as tested by increasing the limits), so perhaps it is the more aggressive unrolling/vectorization settings clang has at -O2. I think upstream jpegxl should use -O3 or -Ofast instead of -O2. It is quite a typical kind of task that benefits from large optimization levels. I filed https://github.com/libjxl/libjxl/issues/2970
[Bug target/109812] GraphicsMagick resize is a lot slower in GCC 13.1 vs Clang 16 on Intel Raptor Lake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109812 --- Comment #20 from Jan Hubicka --- On zen4 hardware I now get GCC13 with -O3 -flto -march=native -fopenmp 2163 2161 2153 Average: 2159 Iterations Per Minute clang 17 with -O3 -flto -march=native -fopenmp 2004 1988 1991 Average: 1994 Iterations Per Minute trunk -O3 -flto -march=native -fopenmp Operation: Resizing: 2126 2135 2123 Average: 2128 Iterations Per Minute So no big changes here...
[Bug middle-end/112653] PTA should handle correctly escape information of values returned by a function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653 --- Comment #8 from Jan Hubicka --- On ARM32 and other targets, methods return the this pointer. Together with making the return value escape, this probably completely disables any chance for IPA tracking of C++ data types...
[Bug middle-end/110015] openjpeg is slower when built with gcc13 compared to clang16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110015 --- Comment #10 from Jan Hubicka --- Runtimes on zen4 hardware:

trunk -O3 -flto -march=native   42171 42964 42106
clang -O3 -flto -march=native   37393 37423 37508
gcc 13 -O3 -flto -march=native  42380 42314 43285

So it seems the performance did not change.
[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811 --- Comment #15 from Jan Hubicka --- With the SRA improvements in r:aae723d360ca26cd9fd0b039fb0a616bd0eae363 we finally get good performance at -O2. Improvements to the push_back implementation also help a bit.

Mainline with default flags (-O2): Input: JPEG - Quality: 90: 19.76 19.75 19.68
Mainline with -O2 -march=native: Input: JPEG - Quality: 90: 20.01 20 19.98
Mainline with -O2 -march=native -flto: Input: JPEG - Quality: 90: 19.95 19.98 19.81
Mainline with -O2 -march=native -flto --param max-inline-insns-auto=80 (this makes push_back inlined): Input: JPEG - Quality: 90: 19.98 20.05 20.03
Mainline with -O2 -flto -march=native -I/usr/include/c++/v1 -nostdinc++ -lc++ (so clang's libc++): 21.38 21.37 21.32
Mainline with -O2 -flto -march=native, run manually since a build machinery patch is needed: 23.03 22.85 23.04
Clang 17 with -O2 -march=native -flto and also -fno-tree-vectorize -fno-tree-slp-vectorize added by cmake, with system libstdc++ from GCC 13, so before the push_back improvements: 21.16 20.95 21.06
Clang 17 with -O2 -march=native -flto and also -fno-tree-vectorize -fno-tree-slp-vectorize added by cmake, with trunk libstdc++ with the push_back improvements: 21.2 20.93 20.98
Clang 17 with -O2 -march=native -flto -stdlib=libc++ and also -fno-tree-vectorize -fno-tree-slp-vectorize added by cmake, so clang's libc++: Input: JPEG - Quality: 90: 22.08 21.88 21.78
Clang 17 with -O3 -march=native -flto: 23.08 22.90 22.84

libc++ declares push_back always_inline and splits out the slow copying path. I think the inlined part is still a bit too large for inlining at -O2. We could still try to get the remaining approx. 10% without increasing code size at -O2, but the major part of the problem is solved.
[Bug middle-end/112706] New: missed simplification in FRE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112706 Bug ID: 112706 Summary: missed simplification in FRE Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- Compiling the following testcase (simplified from repeated std::vector::push_back expansion):

int *ptr;
void link_error ();
void test ()
{
  int *ptr1 = ptr + 10;
  int *ptr2 = ptr + 20;
  if (ptr1 == ptr2)
    link_error ();
}

with gcc -O2 t.C -fdump-tree-all-details, one can check that link_error is optimized away really late:

jh@ryzen4:/tmp> grep link_error a-t.C*
a-t.C.106t.cunrolli:  link_error ();
a-t.C.107t.backprop:  link_error ();
a-t.C.108t.phiprop:   link_error ();
a-t.C.109t.forwprop2: link_error ();

This is too late for some optimizations to catch up (in the case of std::vector we end up missing DSE, since the transform is delayed to forwprop3). I think this is something value numbering should catch.
[Bug middle-end/112653] PTA should handle correctly escape information of values returned by a function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653 --- Comment #7 from Jan Hubicka --- Thanks for the explanation. I think it is quite a common pattern that a new object is constructed, worked on, and later returned, so I think we ought to handle this correctly. Another example just came up in https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637878.html We should generate the same code for the following two functions:

#include <vector>

auto f()
{
  std::vector x;
  x.reserve(10);
  for (int i = 0; i < 10; ++i)
    x.push_back(0);
  return x;
}

auto g()
{
  return std::vector(10, 0);
}

but we don't, since we lose track of the values stored in x after every call to new.
[Bug rtl-optimization/112657] [13/14 Regression] missed optimization: cmove not used with multiple returns
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112657 --- Comment #8 from Jan Hubicka --- The negative-return-value branch predictor is set to a 98% hitrate (measured on SPEC2k17 some time ago). There is --param predictable-branch-outcome, which is set to 2%, so indeed we consider the branch well predictable by this heuristic. Reducing the --param should make the cmov happen. With the profile_probability data type we could try something smarter for guessing whether a given branch is predictable (such as ignoring guessed values and letting predictors optionally mark branches as (un)predictable). But it is not quite clear to me what the desired behavior would be... Guessing the predictability of data branches is generally quite a hard problem. Predictability of loop branches is easier, but we hardly apply BRANCH_COST to the branch closing a loop, since those are not if-conversion candidates.
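A hedged sketch of the heuristic being discussed (my own example): the branch on a negative value is guessed ~98% predictable by the return-value predictor, which is far above the 2% predictable-branch-outcome threshold, so the RTL if-converter is discouraged from turning it into a cmov even though it trivially could.

```c
/* The compiler may keep this as a branch (guessed well-predictable)
   instead of emitting the equivalent conditional move.  */
int sanitize_result (int r)
{
  if (r < 0)      /* negative-return-value predictor fires here */
    return -1;    /* cmov material: r = r < 0 ? -1 : r */
  return r;
}
```

Per the comment above, lowering --param predictable-branch-outcome (an assumption based on the discussion, not a tested recipe) should tip this back toward a cmov.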
[Bug ipa/98925] Extend ipa-prop to handle return functions for slot optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98925 --- Comment #3 from Jan Hubicka --- Return value range propagation was added in r:53ba8d669550d3a1f809048428b97ca607f95cf5; however, it works on scalar return values only for now. Extending it to aggregates is a logical next step and should not be terribly hard. The code also misses logic for IPA streaming, so it works only in early and late opts.
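A hedged sketch of what the scalar case already enables (my own example): within one translation unit, the callee's return range [0, 255] is propagated to the caller, so the caller's range check folds away. The aggregate analogue of this, for return-slot optimization, is what this PR asks for.

```c
/* Callee with a provable return range.  */
static int low_byte (int v)
{
  return v & 0xff;   /* return value range: [0, 255] */
}

int decode (int v)
{
  int b = low_byte (v);
  if (b > 255)       /* provably dead once the return range is known */
    return -1;
  return b;
}
```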
[Bug middle-end/88345] -Os overrides -falign-functions=N on the command line
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 Jan Hubicka changed: What|Removed |Added CC||hubicka at gcc dot gnu.org --- Comment #17 from Jan Hubicka --- -falign-functions/-falign-jumps/-falign-labels/-falign-loops are originally intended for performance tuning. Starting a function entry close to the end of a page of code cache may lead to wasted code cache space as well as higher overhead calling the function when the CPU fetches a page which contains just little useful information. As such I would like to keep them affecting only hot code (we should update the documentation for that). Internally we have FUNCTION_BOUNDARY, which specifies the minimal alignment needed by the ABI and is set to 8 bits for i386. My understanding is that -fpatchable-function-entry requires the alignment to be 64 bits in order to make it possible to atomically change the instruction. So perhaps we want to make FUNCTION_BOUNDARY 64 for functions where we output the patchable entry? I am also OK with extending the flag syntax or adding -fmin-function-alignment to specify an optional user-defined minimum (increasing FUNCTION_BOUNDARY) if that seems useful, but I think the first one is the most consistent way to go for live patching.
[Bug middle-end/112653] We should optimize memmove to memcpy using alias oracle
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653 --- Comment #3 from Jan Hubicka --- The PR82898 testcases seem to be about type-based alias analysis. However, PTA should be usable here.
[Bug middle-end/109849] suboptimal code for vector walking loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849 Bug 109849 depends on bug 110377, which changed state. Bug 110377 Summary: Early VRP and IPA-PROP should work out value ranges from __builtin_unreachable https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110377 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug libstdc++/110287] _M_check_len is expensive
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287 Bug 110287 depends on bug 110377, which changed state. Bug 110377 Summary: Early VRP and IPA-PROP should work out value ranges from __builtin_unreachable https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110377 What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED
[Bug middle-end/110377] Early VRP and IPA-PROP should work out value ranges from __builtin_unreachable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110377 Jan Hubicka changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #7 from Jan Hubicka --- Fixed.
[Bug middle-end/112653] New: We should optimize memmove to memcpy using alias oracle
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653 Bug ID: 112653 Summary: We should optimize memmove to memcpy using alias oracle Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- In this testcase (loosely based on the libstdc++ implementation of vectors) we should be able to turn memmove into memcpy, because we know that the two parameters cannot alias:

#include <stdlib.h>
#include <string.h>

char *test;

char *
copy_test ()
{
  char *test2 = malloc (1000);
  memmove (test2, test, 1000);
  return test2;
}
[Bug libstdc++/110287] _M_check_len is expensive
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287 Jan Hubicka changed: What|Removed |Added Status|UNCONFIRMED |NEW Ever confirmed|0 |1 Last reconfirmed||2023-11-21 --- Comment #10 from Jan Hubicka --- We now produce reasonable code for _M_check_len and propagate the value range of the return value. This helps us to notice that the later allocator call will not throw an exception on invalid size, so we are down from 3 throw calls to one. Current code is:

size_type std::vector::_M_check_len (const struct vector * const this, size_type __n, const char * __s)
{
  const size_type __len;
  long unsigned int _1;
  long unsigned int __n.3_2;
  size_type iftmp.4_3;
  long unsigned int _4;
  long unsigned int _7;
  long unsigned int _8;
  long int _9;
  long int _11;
  struct pair_t * _12;
  struct pair_t * _13;

  [local count: 1073741824]:
  _13 = this_6(D)->D.26060._M_impl.D.25361._M_finish;
  _12 = this_6(D)->D.26060._M_impl.D.25361._M_start;
  _11 = _13 - _12;
  _9 = _11 /[ex] 8;
  _7 = (long unsigned int) _9;
  _1 = 1152921504606846975 - _7;
  __n.3_2 = __n;
  if (_1 < __n.3_2)
    goto ; [0.00%]
  else
    goto ; [100.00%]

  [count: 0]:
  std::__throw_length_error (__s_14(D));

  [local count: 1073741824]:
  _8 = MAX_EXPR <__n.3_2, _7>;
  __len_10 = _7 + _8;
  if (_7 > __len_10)
    goto ; [35.00%]
  else
    goto ; [65.00%]

  [local count: 697932184]:
  _4 = MIN_EXPR <__len_10, 1152921504606846975>;

  [local count: 1073741824]:
  # iftmp.4_3 = PHI <1152921504606846975(4), _4(5)>
  return iftmp.4_3;
}

I still think we could play games with 2^63 being too large for the standard allocator and turn __throw_length_error into __builtin_unreachable for that case. This would help the early inliner to inline this function and save some throw calls in real code.
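A hedged C sketch of the "play games with 2^63" idea (names and the bound constant are mine, not the libstdc++ code): asserting via __builtin_unreachable that the grown length is never zero and never exceeds what the allocator could possibly satisfy lets VRP delete the impossible-size paths in callers, much like replacing __throw_length_error for sizes the standard allocator can never serve.

```c
#include <stddef.h>

/* Grows cur by max(add, cur), mimicking the MAX_EXPR in the dump above,
   and promises the result is a sane, nonzero size.  */
size_t grow_len (size_t cur, size_t add)
{
  size_t more = add > cur ? add : cur;    /* the MAX_EXPR */
  size_t len = cur + more;
  if (len == 0 || len > ((size_t) 1 << 60))
    __builtin_unreachable ();             /* promise: callers keep add >= 1 */
  return len;
}
```

With the hint in place, a caller's "allocate 0 elements" branch becomes provably dead.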
[Bug libstdc++/110287] _M_check_len is expensive
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287 --- Comment #9 from Jan Hubicka --- This is _M_realloc_insert at release_ssa time: Released 63 names, 165.79%, removed 63 holes

void std::vector::_M_realloc_insert (struct vector * const this, struct iterator __position, const struct pair_t & __args#0)
{
  struct pair_t * const __position;
  struct pair_t * __new_finish;
  struct pair_t * __old_finish;
  struct pair_t * __old_start;
  long unsigned int _1;
  struct pair_t * _2;
  struct pair_t * _3;
  long int _4;
  long unsigned int _5;
  struct pair_t * _6;
  const size_type _10;
  long int _13;
  struct pair_t * iftmp.5_15;
  struct pair_t * _17;
  struct _Vector_impl * _18;
  long unsigned int _22;
  long int _23;
  long unsigned int _24;
  long unsigned int _25;
  struct pair_t * _26;
  long unsigned int _36;

  [local count: 1073741824]:
  __position_27 = MEM[(struct __normal_iterator *)&__position];
  _10 = std::vector::_M_check_len (this_8(D), 1, "vector::_M_realloc_insert");
  __old_start_11 = this_8(D)->D.25975._M_impl.D.25282._M_start;
  __old_finish_12 = this_8(D)->D.25975._M_impl.D.25282._M_finish;
  _13 = __position_27 - __old_start_11;
  if (_10 != 0)
    goto ; [54.67%]
  else
    goto ; [45.33%]

  [local count: 587014656]:
  _18 = [(struct _Vector_base *)this_8(D)]._M_impl;
  _17 = std::__new_allocator::allocate (_18, _10, 0B);

  [local count: 1073741824]:
  # iftmp.5_15 = PHI <0B(2), _17(3)>
  _1 = (long unsigned int) _13;
  _2 = iftmp.5_15 + _1;
  *_2 = *__args#0_14(D);
  if (_13 > 0)
    goto ; [41.48%]
  else
    goto ; [58.52%]

  [local count: 445388112]:
  __builtin_memmove (iftmp.5_15, __old_start_11, _1);

  [local count: 1073741824]:
  _36 = _1 + 8;
  __new_finish_16 = iftmp.5_15 + _36;
  _23 = __old_finish_12 - __position_27;
  if (_23 > 0)
    goto ; [41.48%]
  else
    goto ; [58.52%]

  [local count: 445388112]:
  _24 = (long unsigned int) _23;
  __builtin_memcpy (__new_finish_16, __position_27, _24);

  [local count: 1073741824]:
  _25 = (long unsigned int) _23;
  _26 = __new_finish_16 + _25;
  _3 = this_8(D)->D.25975._M_impl.D.25282._M_end_of_storage;
  _4 = _3 - __old_start_11;
  if (__old_start_11 != 0B)
    goto ; [53.47%]
  else
    goto ; [46.53%]

  [local count: 574129752]:
  _22 = (long unsigned int) _4;
  operator delete (__old_start_11, _22);

  [local count: 1073741824]:
  this_8(D)->D.25975._M_impl.D.25282._M_start = iftmp.5_15;
  this_8(D)->D.25975._M_impl.D.25282._M_finish = _26;
  _5 = _10 * 8;
  _6 = iftmp.5_15 + _5;
  this_8(D)->D.25975._M_impl.D.25282._M_end_of_storage = _6;
  return;
}

First, it is not clear to me why we need memmove at all. So the first issue is:

  [local count: 1073741824]:
  __position_27 = MEM[(struct __normal_iterator *)&__position];
  _10 = std::vector::_M_check_len (this_8(D), 1, "vector::_M_realloc_insert");
  __old_start_11 = this_8(D)->D.25975._M_impl.D.25282._M_start;
  __old_finish_12 = this_8(D)->D.25975._M_impl.D.25282._M_finish;
  _13 = __position_27 - __old_start_11;
  if (_10 != 0)
    goto ; [54.67%]
  else
    goto ; [45.33%]

Without inlining _M_check_len early we cannot work out the return value range, since we need to know that parameter 2 is 1 and not 0. Adding a __builtin_unreachable check afterwards helps to reduce if (_10 != 0), but I need to do something about the inliner accounting the conditional to function body size.

  [local count: 1073741824]:
  # iftmp.5_15 = PHI <0B(2), _17(3)>
  _1 = (long unsigned int) _13;
  _2 = iftmp.5_15 + _1;
  *_2 = *__args#0_14(D);
  if (_13 > 0)
    goto ; [41.48%]
  else
    goto ; [58.52%]

  [local count: 445388112]:
  __builtin_memmove (iftmp.5_15, __old_start_11, _1);

Is this code about inserting a value into the middle? Since push_back always initializes the iterator to point to the end, this seems quite silly to do. Can't we do something like _M_realloc_append?
[Bug middle-end/109849] suboptimal code for vector walking loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849 --- Comment #21 from Jan Hubicka --- Patch https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637265.html gets us closer to inlining _M_realloc_insert at -O3 (3 insns away). Patch https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636935.html reduces the expense when _M_realloc_insert is not inlined at -O2 (where I think we should not inline it, unlike clang).
[Bug libstdc++/110287] _M_check_len is expensive
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110287 --- Comment #8 from Jan Hubicka --- With return value range propagation, https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637265.html reduces the --param max-inline-insns-auto needed for _M_realloc_insert to be inlined on my testcase from 39 to 35. This is done by eliminating two unnecessary throw calls by propagating the fact that check_len does not return incredibly large values. The default inline limit at -O3 is 30, so we are not that far off, and I think we really ought to solve this for the next release, since push_back is such a common case. Is it known that check_len cannot return 0 in this situation? Adding if (ret <= 0) __builtin_unreachable () saves another 2 instructions, because _M_realloc_insert otherwise contains a code path for the case that the vector gets increased to 0 elements.
[Bug tree-optimization/112618] New: internal compiler error: in expand_MASK_CALL, at internal-fn.cc:4529
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112618 Bug ID: 112618 Summary: internal compiler error: in expand_MASK_CALL, at internal-fn.cc:4529 Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: ---

jh@ryzen4:~/gcc/build4/stage1-gcc> cat b.c
/* PR tree-optimization/106433 */
int m, *p;

__attribute__ ((simd)) int
bar (int x)
{
  if (x)
    {
      if (m < 1)
        for (m = 0; m < 1; ++m)
          ++x;
      p = 
      for (;;)
        ++m;
    }
  return 0;
}

__attribute__ ((simd)) int
foo (int x)
{
  bar (x);
  return 0;
}

jh@ryzen4:~/gcc/build4/stage1-gcc> ./xgcc -B ./ -O2 b.c -fno-tree-vrp
during RTL pass: expand
b.c: In function ‘foo.simdclone.3’:
b.c:23:2: internal compiler error: in expand_MASK_CALL, at internal-fn.cc:5013
   23 |  bar (x);
      |  ^~~
0x12db307 expand_MASK_CALL(internal_fn, gcall*)
        ../../gcc/internal-fn.cc:5013
0x12daa47 expand_internal_call(internal_fn, gcall*)
        ../../gcc/internal-fn.cc:4920
0x12daa72 expand_internal_call(gcall*)
        ../../gcc/internal-fn.cc:4928
0xf7637e expand_call_stmt
        ../../gcc/cfgexpand.cc:2737
0xf7a5a8 expand_gimple_stmt_1
        ../../gcc/cfgexpand.cc:3880
0xf7ac2c expand_gimple_stmt
        ../../gcc/cfgexpand.cc:4044
0xf82d6f expand_gimple_basic_block
        ../../gcc/cfgexpand.cc:6100
0xf85322 execute
        ../../gcc/cfgexpand.cc:6835
Please submit a full bug report, with preprocessed source (by using -freport-bug). Please include the complete backtrace with any bug report. See <https://gcc.gnu.org/bugs/> for instructions.
[Bug tree-optimization/110641] [14 Regression] ICE in adjust_loop_info_after_peeling, at tree-ssa-loop-ivcanon.cc:1023 since r14-2230-g7e904d6c7f2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110641 Jan Hubicka changed: What|Removed |Added Status|NEW |ASSIGNED --- Comment #3 from Jan Hubicka --- mine.
[Bug target/109811] libjxl 0.7 is a lot slower in GCC 13.1 vs Clang 16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109811 --- Comment #13 from Jan Hubicka --- So I re-tested it with current mainline and clang 16/17. I get (megapixels per second, bigger is better):

mainline: 13.39 13.38 13.42
clang 16: 20.06 20.06 19.87
clang 17: 19.7 19.68 19.69

With mainline plus Martin's patch to enable SRA across calls where the parameter doesn't escape (improvement for PR109849) I get: 19.37 19.35 19.31. This is without inlining _M_realloc_insert, which we do at -O3 but not at -O2 since it is large (clang inlines it at both -O2 and -O3).
[Bug ipa/59948] Optimize std::function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59948 --- Comment #8 from Jan Hubicka --- Trunk optimizes this to return 0, but fails to optimize out functions which become unused after indirect inlining. With -fno-early-inlining we end up with:

int m ()
{
  void * D.48296;
  int __args#0;
  struct function h;
  int _12;
  bool (*) (union _Any_data & {ref-all}, const union _Any_data & {ref-all}, _Manager_operation) _24;
  bool (*) (union _Any_data & {ref-all}, const union _Any_data & {ref-all}, _Manager_operation) _27;
  long unsigned int _29;
  long unsigned int _35;
  vector(2) long unsigned int _37;
  void * _42;

  [local count: 1073741824]:
  _29 = (long unsigned int) _M_invoke;
  _35 = (long unsigned int) _M_manager;
  _37 = {_35, _29};
  h ={v} {CLOBBER};
  MEM [(struct _Function_base *) + 8B] = {};
  MEM[(int (*) (int) *)] = f;
  MEM [(void *) + 16B] = _37;
  __args#0 = 1;
  _12 = std::_Function_handler::_M_invoke (_M_functor, &__args#0);

  [local count: 1073312329]:
  __args#0 ={v} {CLOBBER(eol)};
  _24 = MEM[(struct _Function_base *)]._M_manager;
  if (_24 != 0B)
    goto ; [70.00%]
  else
    goto ; [30.00%]

  [local count: 751318634]:
  _24 ([(struct _Function_base *)]._M_functor, [(struct _Function_base *)]._M_functor, 3);

  [local count: 1073312329]:
  h ={v} {CLOBBER};
  h ={v} {CLOBBER(eol)};
  return _12;

  [count: 0]:
  :
  _27 = MEM[(struct _Function_base *)]._M_manager;
  if (_27 != 0B)
    goto ; [0.00%]
  else
    goto ; [0.00%]

  [count: 0]:
  _27 ([(struct _Function_base *)]._M_functor, [(struct _Function_base *)]._M_functor, 3);

  [count: 0]:
  h ={v} {CLOBBER};
  _42 = __builtin_eh_pointer (2);
  __builtin_unwind_resume (_42);
}

ipa-prop fails to track the pointer passed around:

IPA function summary for int m()/288 inlinable
  global time: 41.256800
  self size: 16
  global size: 41
  min size: 38
  self stack: 32
  global stack: 32
    size:19.00, time:8.66
    size:3.00, time:2.00, executed if:(not inlined)
  calls:
    std::function::~function()/286 inlined freq:0.00
      Stack frame offset 32, callee self size 0
    std::_Function_base::~_Function_base()/71 inlined freq:0.00
      Stack frame offset 32, callee self size 0
    indirect call loop depth: 0 freq:0.00 size: 6 time: 18
    std::function::~function()/404 inlined freq:1.00
      Stack frame offset 32, callee self size 0
    std::_Function_base::~_Function_base()/405 inlined freq:1.00
      Stack frame offset 32, callee self size 0
    indirect call loop depth: 0 freq:0.70 size: 6 time: 18
    _Res std::function<_Res(_ArgTypes ...)>::operator()(_ArgTypes ...) const [with _Res = int; _ArgTypes = {int}]/304 inlined freq:1.00
      Stack frame offset 32, callee self size 0
    void std::__throw_bad_function_call()/374 function body not available freq:0.00 loop depth: 0 size: 1 time: 10
    _M_empty.isra/384 inlined freq:1.00
      Stack frame offset 32, callee self size 0
    indirect call loop depth: 0 freq:1.00 size: 6 time: 18
    std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = int (&)(int); _Constraints = void; _Res = int; _ArgTypes = {int}]/302 inlined freq:1.00
      Stack frame offset 32, callee self size 0
    std::function<_Res(_ArgTypes ...)>::function(_Functor&&) [with _Functor = int (&)(int); _Constraints = void; _Res = int; _ArgTypes = {int}]/375 inlined freq:0.33
      Stack frame offset 32, callee self size 0
    static void std::_Function_base::_Base_manager<_Functor>::_M_init_functor(std::_Any_data&, _Fn&&) [with _Fn = int (&)(int); _Functor = int (*)(int)]/310 inlined freq:0.33
      Stack frame offset 32, callee self size 0
    _M_create.isra/383 inlined freq:0.33
      Stack frame offset 32, callee self size 0
    void* std::_Any_data::_M_access()/388 inlined freq:0.33
      Stack frame offset 32, callee self size 0
    operator new.isra/386 inlined freq:0.33
      Stack frame offset 32, callee self size 0
    static bool std::_Function_base::_Base_manager<_Functor>::_M_not_empty_function(_Tp*) [with _Tp = int(int); _Functor = int (*)(int)]/308 inlined freq:1.00
      Stack frame offset 32, callee self size 0
    constexpr std::_Function_base::_Function_base()/299 inlined freq:1.00
      Stack frame offset 32, callee self size 0
[Bug middle-end/111573] New: lambda functions often not inlined and optimized out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111573 Bug ID: 111573 Summary: lambda functions often not inlined and optimized out Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: ---

#include <functional>
using namespace std;

static int dosum(std::function<int (int, int)> fn)
{
  return fn(5,6);
}

int test()
{
  auto sum = [](int a, int b) { return a + b; };
  int s;
  for (s = 0; s < 10; s++)
    s += dosum(sum);
  return s;
}

This gets optimized well only with early inlining; compiled with -fno-early-inlining it yields:

_Z4testv:
.LFB2166:
        .cfi_startproc
        subq    $56, %rsp
        .cfi_def_cfa_offset 64
        xorl    %ecx, %ecx
        .p2align 4,,10
        .p2align 3
.L8:
        leaq    12(%rsp), %rdx
        leaq    8(%rsp), %rsi
        movl    $5, 8(%rsp)
        leaq    16(%rsp), %rdi
        movl    $6, 12(%rsp)
        call    _ZNSt17_Function_handlerIFiiiEZ4testvEUliiE_E9_M_invokeERKSt9_Any_dataOiS6_
        leal    1(%rcx,%rax), %ecx
        cmpl    $9, %ecx
        jle     .L8
        movl    %ecx, %eax
        addq    $56, %rsp
        .cfi_def_cfa_offset 8
        ret

So we fail to inline since ipa-prop fails to track the constant function address. I think this is really common in typical lambda function usage.
[Bug middle-end/111552] New: 549.fotonik3d_r regression with -O2 -flto -march=native on zen between g:85d613da341b7630 (2022-06-21 15:51) and g:ecd11acacd6be57a (2022-07-01 16:07)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111552 Bug ID: 111552 Summary: 549.fotonik3d_r regression with -O2 -flto -march=native on zen between g:85d613da341b7630 (2022-06-21 15:51) and g:ecd11acacd6be57a (2022-07-01 16:07) Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=297.527.0=296.527.0;
[Bug middle-end/111551] New: Fix for PR106081 is not working with profile feedback on imagemagick
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111551 Bug ID: 111551 Summary: Fix for PR106081 is not working with profile feedback on imagemagick Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: hubicka at gcc dot gnu.org Target Milestone: --- As seen in https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=471.507.0=473.507.0=475.507.0=477.507.0; Fix for PR106081 improved imagemagick significantly without FDO but not with FDO.