[Bug testsuite/82951] gcc.c-torture/execute/20040409-1.c undefined behavior
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82951 --- Comment #1 from Marc Glisse --- Or I should just add -fwrapv since those tests were added to test an RTL transformation ( https://gcc.gnu.org/ml/gcc-patches/2004-04/msg00615.html ).
[Bug testsuite/82951] New: gcc.c-torture/execute/20040409-1.c undefined behavior
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82951 Bug ID: 82951 Summary: gcc.c-torture/execute/20040409-1.c undefined behavior Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: testsuite Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: ---
While testing a VRP patch, I had failures for gcc.c-torture/execute/20040409-[1-3].c. If I run them with -fsanitize=undefined, I get:
20040409-1.c:27:12: runtime error: signed integer overflow: 0 - -2147483648 cannot be represented in type 'int'
20040409-1.c:17:12: runtime error: signed integer overflow: -2147483648 + -2147483648 cannot be represented in type 'int'
20040409-2.c:47:13: runtime error: signed integer overflow: 0 - -2147483648 cannot be represented in type 'int'
20040409-2.c:57:23: runtime error: signed integer overflow: 4660 - -2147483648 cannot be represented in type 'int'
20040409-2.c:27:13: runtime error: signed integer overflow: -2147483648 + -2147483648 cannot be represented in type 'int'
20040409-2.c:37:23: runtime error: signed integer overflow: -2147478988 + -2147483648 cannot be represented in type 'int'
20040409-2.c:111:18: runtime error: signed integer overflow: -2147483648 + -2147478988 cannot be represented in type 'int'
20040409-3.c:27:14: runtime error: signed integer overflow: 0 - -2147483648 cannot be represented in type 'int'
20040409-3.c:17:14: runtime error: signed integer overflow: -2147483648 + -2147483648 cannot be represented in type 'int'
Unless someone volunteers to improve the tests, I'll likely remove the offending cases (and probably more, since this is a grid and I don't want to look for every cell) from those 3 files.
[Bug bootstrap/82948] New: [8 Regression] prefix.c:202:15: error: 'char* strncpy(char*, const char*, size_t)' destination unchanged after copying no bytes [-Werror=stringop-truncation]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82948 Bug ID: 82948 Summary: [8 Regression] prefix.c:202:15: error: 'char* strncpy(char*, const char*, size_t)' destination unchanged after copying no bytes [-Werror=stringop-truncation] Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: bootstrap Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: ---
Hello, I cannot bootstrap currently (r254649) on gcc112 (powerpc64le-unknown-linux-gnu) with --with-system-zlib --disable-nls --enable-languages=all,obj-c++,go --enable-host-shared
/home/glisse/pristine/gcc/prefix.c: In function 'char* translate_name(char*)':
/home/glisse/pristine/gcc/prefix.c:202:15: error: 'char* strncpy(char*, const char*, size_t)' destination unchanged after copying no bytes [-Werror=stringop-truncation]
   strncpy (key, [1], keylen);
               ^~~
/home/glisse/pristine/gcc/prefix.c:202:15: error: 'char* strncpy(char*, const char*, size_t)' destination unchanged after copying no bytes [-Werror=stringop-truncation]
/home/glisse/pristine/gcc/prefix.c:202:15: error: 'char* strncpy(char*, const char*, size_t)' destination unchanged after copying no bytes [-Werror=stringop-truncation]
/home/glisse/pristine/gcc/prefix.c:202:15: error: 'char* strncpy(char*, const char*, size_t)' destination unchanged after copying no bytes [-Werror=stringop-truncation]
cc1plus: all warnings being treated as errors
make[3]: *** [prefix.o] Error 1
make[3]: *** Waiting for unfinished jobs
rm gfortran.pod fsf-funding.pod gcov.pod gpl.pod cpp.pod gfdl.pod gccgo.pod gcc.pod gcov-dump.pod gcov-tool.pod
make[3]: Leaving directory `/home/glisse/test/pristine/build/gcc'
make[2]: *** [all-stage2-gcc] Error 2
[Bug preprocessor/82939] genmatch fills up terminal with endless printing of periods
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82939 --- Comment #1 from Marc Glisse --- Is that during stage 1 or in a later stage?
[Bug target/82935] Unnecessary "sub rsp, 8", "call" and "add rsp, 8" instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82935 --- Comment #4 from Marc Glisse --- We keep *a1_2(D) = *a2_3(D); and only at expansion time turn it into a call to memcpy, so the gimple pass that detects tail calls doesn't have a chance to notice this case.
[Bug middle-end/82898] Aliasing knowledge is not used to replace memmove with memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82898 --- Comment #1 from Marc Glisse --- At least in the gcc model, the type of a pointer is meaningless as long as you do not dereference it using that type, so I am not sure what can be done here.
[Bug c++/82888] terrible code generation for initialization of POD array members vs. clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82888 --- Comment #4 from Marc Glisse --- The front-end internally uses VEC_INIT_EXPR, and gimplifies it to a loop. I believe we should end up with an empty CONSTRUCTOR instead.
[Bug middle-end/82885] memcpy does not propagate aliasing knowledge
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82885 --- Comment #1 from Marc Glisse --- gcc (illegally) generates some calls to memcpy(p,q,n) where p and q may be the same pointer, although they mustn't overlap in any more complicated way. That makes such an optimization problematic (although this memcpy generation seems to happen at expansion time, so doing the optimization earlier might be ok).
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #11 from Marc Glisse --- (In reply to Wilco from comment #9) > It works for any C where (divisor*C) MOD 2^32 == 1 (or -1). For x%3==0, i.e. z==0 for x==3*y+z with 0<=y<5556 and 0<=z<3. Indeed, x*0xaaab==y+z*0xaaab is in the right range precisely for z==0 and the same can be done for any odd number. > You can support any kind of comparison, it doesn't need to be with 0 (but > zero is the easiest). Any ==cst will yield a range test. It is less obvious that inequalities are transformed to a contiguous range... (try x%7<3 maybe) > I forgot whether I made it work for signed too, but it's certainly > possible to skip the sign handling in x % 4 == 0 even if x is signed. 4 is a completely different story, as a power of 2.
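The trick Wilco describes can be written down concretely for 32-bit values. Below is a hedged sketch, not code from the PR: the constant 0xAAAAAAAB is the modular inverse of 3 mod 2^32 (3 * 0xAAAAAAAB == 2^33 + 1), so multiplying by it maps the multiples of 3, and only those, into the range [0, 0x55555555]. The function name is hypothetical.

```c
#include <stdint.h>

/* Illustration of the strength reduction discussed above:
 * x % 3 == 0 exactly when x * 0xAAAAAAAB (mod 2^32) <= 0x55555555,
 * because 0xAAAAAAAB is the modular inverse of 3 mod 2^32. */
int divisible_by_3(uint32_t x) {
    return x * 0xAAAAAAABu <= 0x55555555u;
}
```

The same construction works for any odd divisor; turning inequalities like x%7<3 into a contiguous range needs the extra care mentioned above.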
[Bug target/82858] __builtin_add_overflow() generates suboptimal code with unsigned types on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82858 --- Comment #4 from Marc Glisse ---
unsigned c;
unsigned d = __builtin_add_overflow(a, b, &c) ? -1 : 0;
return c | d;
gives the expected asm. Ideally phiopt would recognize a saturating add pattern, but we have nothing to model it in gimple. We could turn it into the branchless BIT_IOR form though. (the problem isn't with __builtin_add_overflow but with what comes afterwards)
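Assembled into a compilable unit, the branchless form in the comment looks like this (a sketch; `sat_add` and the parameters are hypothetical names, and the third argument of __builtin_add_overflow is the address where the wrapped sum is stored):

```c
#include <limits.h>

/* Saturating unsigned add in the BIT_IOR form described above:
 * d is an all-ones mask when the addition overflows, zero otherwise,
 * so c | d yields either the exact sum or UINT_MAX. */
unsigned sat_add(unsigned a, unsigned b) {
    unsigned c;
    unsigned d = __builtin_add_overflow(a, b, &c) ? -1 : 0;
    return c | d;
}
```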
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #7 from Marc Glisse --- Is that a special case of a more generic transformation, which might apply for other values of 3, 0, == etc, or is this meant only literally for x%3==0?
[Bug middle-end/56888] memcpy implementation optimized as a call to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888 --- Comment #38 from Marc Glisse --- *** Bug 82845 has been marked as a duplicate of this bug. ***
[Bug c/82845] -ftree-loop-distribute-patterns creates recursive loops on function called "memset"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82845 Marc Glisse changed: What|Removed |Added Resolution|FIXED |DUPLICATE --- Comment #3 from Marc Glisse --- Please don't touch the status field, I marked it as "duplicate" pointing to the other PR, that's more useful than "fixed" (which is false). Indeed we can hope that it will serve as a reminder for people working on PR 56888. *** This bug has been marked as a duplicate of bug 56888 ***
[Bug c/82845] -ftree-loop-distribute-patterns creates recursive loops on function called "memset"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82845 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from Marc Glisse --- Richard's patch seems to have been forgotten :-( *** This bug has been marked as a duplicate of bug 56888 ***
[Bug middle-end/56888] memcpy implementation optimized as a call to memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888 Marc Glisse changed: What|Removed |Added CC||david at westcontrol dot com --- Comment #37 from Marc Glisse --- *** Bug 82845 has been marked as a duplicate of this bug. ***
[Bug middle-end/82839] missing -Wmaybe-uninitialized warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82839 --- Comment #1 from Marc Glisse --- You can simplify the function to int ts; g(); *t = ts; h(); Part of the analysis is not flow-sensitive: we see that ts escapes, we deduce that g() can write to it, so ts might be initialized and we do not warn. We miss the fact that the escape cannot happen before the call to g.
[Bug target/82830] New: short rotate with truncated length
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82830 Bug ID: 82830 Summary: short rotate with truncated length Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-*
#include <x86intrin.h>
unsigned short f(unsigned short x, int n){ return _rotwl(x, n & 15); }
	andl	$15, %ecx
	rolw	%cl, %ax
I believe the masking is unnecessary. We have some related things in i386.md, but only for SWI48.
[Bug c++/82818] Bad Codegen, delete does not check for nullptrs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82818 --- Comment #3 from Marc Glisse --- Please read the documentation for -flifetime-dse, your code is invalid.
[Bug tree-optimization/82776] Unable to optimize the loop when iteration count is unavailable.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82776 --- Comment #8 from Marc Glisse --- At some point, we could also think of taking advantage of what the C++ standard (for instance) says: "[intro.progress] The implementation may assume that any thread will eventually do one of the following: (1.1) — terminate, (1.2) — make a call to a library I/O function, (1.3) — perform an access through a volatile glvalue, or (1.4) — perform a synchronization operation or an atomic operation. [ Note: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. — end note ]" The only potential "progress" in this loop is the call to __builtin_ia32_pmovmskb128, but replacing it with a call to a function with attribute((const)) does not help. And if there is no progress in the loop, the loop must be finite. (we could have some new flag if people insist on for(;;); not being optimized away. I would even use a flag -fno-infinite-loop that says that no loop can be infinite, or -fmain-returns that says that no loop is infinite and the program cannot trap or terminate, etc, but that's getting a bit far from this PR)
[Bug c++/82781] [6/7/8 Regression] Vector extension operators return wrong result in constexpr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82781 Marc Glisse changed: What|Removed |Added Known to work||5.4.0 Target Milestone|--- |6.5 Summary|Vector extension operators |[6/7/8 Regression] Vector |return wrong result in |extension operators return |constexpr |wrong result in constexpr --- Comment #1 from Marc Glisse --- Have to write static_assert( ... ,"") with earlier compilers, but gcc-6 is the first that fails.
[Bug c++/82781] Vector extension operators return wrong result in constexpr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82781 Marc Glisse changed: What|Removed |Added Keywords||wrong-code Status|UNCONFIRMED |NEW Last reconfirmed||2017-10-31 Ever confirmed|0 |1
[Bug tree-optimization/82776] Unable to optimize the loop when iteration count is unavailable.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82776 --- Comment #1 from Marc Glisse --- That could be because gcc sadly refuses to optimize away infinite loops (happens for other cases, and cddce2 dump (the pass that removes the whole thing when the macro is defined) says "can not prove finiteness of loop 2"). Although ++chunk_ should be enough to prove that the loop terminates (otherwise chunk_ eventually overflows). (the unaligned vector use in this code seems strange)
[Bug c++/82760] Incorrect code generated for aligned new
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82760 --- Comment #2 from Marc Glisse --- In cp/call.c:
-      (**args)[0] = *size;
+      const_cast((*cand->args)[0]) = *size;
since in the aligned case we are using a copy align_args of the arguments. Of course it should be done in a way that doesn't require a const_cast.
[Bug c++/82760] Incorrect code generated for aligned new
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82760 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-10-28 Ever confirmed|0 |1 --- Comment #1 from Marc Glisse --- If I make the destructor do something (print hello) and I delete[] gFoo, then I get a crash with c++17 and not c++14, indeed. In c++17 we don't seem to allocate any extra space to store the array size.
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 --- Comment #3 from Marc Glisse --- Actually, what CSE1 does might be fine, and it is LRA that should have noticed that the register it assigned was clobbered, so it should have spilled (or better rematerialized). Assuming the i386 backend does say that this unspec clobbers the registers, which I am not seeing right now (but I may not be looking in the right place).
[Bug target/82735] _mm256_zeroupper does not invalidate previously computed registers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82735 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-10-26 Ever confirmed|0 |1 --- Comment #2 from Marc Glisse --- CSE1 happily turns uses of the second constant, loaded after vzeroupper, into uses of the first constant, loaded before, ignoring the fact that vzeroupper clobbers (the upper part of) all avx registers.
[Bug tree-optimization/82732] malloc+zeroing other than memset not optimized to calloc, so asm output is malloc+memset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82732 --- Comment #2 from Marc Glisse --- If you use size_t consistently (for size and i), then the resulting code is a call to calloc.
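A compilable sketch of the fixed testcase (hedged: `make_zeroed` is a hypothetical name; the point is only that the malloc size expression and the loop bound derive from the same size_t value, which is what lets GCC fold the pair into calloc):

```c
#include <stdlib.h>

/* With size_t used consistently for the allocation size and the loop
 * counter, GCC's loop distribution can recognize the zeroing loop as a
 * memset of the full allocation and emit a single calloc call. */
float *make_zeroed(size_t size) {
    float *p = malloc(size * sizeof *p);
    if (p)
        for (size_t i = 0; i < size; ++i)
            p[i] = 0.0f;
    return p;
}
```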
[Bug tree-optimization/82732] malloc+zeroing other than memset not optimized to calloc, so asm output is malloc+memset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82732 --- Comment #1 from Marc Glisse --- We do recognize the memset early enough. What we fail to recognize is that the size argument to malloc is the same as the length of the memset:
  _1 = (long unsigned int) size_8(D);
  _2 = _1 * 4;
  p_11 = malloc (_2);
  if (size_8(D) != 0)
    goto ; [85.00%] [count: INV]
  else
    goto ; [15.00%] [count: INV]
  [12.75%] [count: INV]:
  _18 = size_8(D) + 4294967295;
  _21 = (sizetype) _18;
  _7 = _21 + 1;
  _6 = _7 * 4;
  __builtin_memset (p_11, 0, _6);
VRP could be taught to simplify (unsigned long)(u-1)+1 to (unsigned long)u for unsigned int u non-zero (though there is no VRP between ldist and strlen), or we could try to generate some simpler code in ldist...
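The missing VRP fact is easy to state on its own; a sketch (the function name is mine, and it assumes 64-bit unsigned long as in the dump above):

```c
/* For a nonzero unsigned int u, u - 1 does not wrap, so widening
 * commutes with the decrement/increment pair:
 *   (unsigned long)(u - 1) + 1 == (unsigned long)u.
 * For u == 0 the left side is 2^32, which is why the simplification
 * needs the nonzero range information. */
unsigned long widen_roundtrip(unsigned u) {
    return (unsigned long)(u - 1) + 1;
}
```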
[Bug inline-asm/82677] Many projects (linux, coreutils, GMP, gcrypt, openSSL, etc) are misusing asm(divq/divl) etc, potentially resulting in faulty/unintended optimisations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82677 --- Comment #3 from Marc Glisse --- On x86, by default, the compiler already assumes that flags are clobbered. That's explained in a comment in GMP's longlong.h at least.
[Bug libstdc++/81797] gcc 7.1.0 fails to build on macOS 10.13 (High Sierra):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81797 --- Comment #32 from Marc Glisse --- (In reply to Misty De Meo from comment #31) > For what it's worth, Apple's response was: "We analyzed the issue and > determined the problem to be a latent bug in gcc’s build system that is > revealed by changes in macOS High Sierra. The FSF will need up issue a fix > in gcc." Thanks for forwarding. Their response is oh so precise and helpful... "bug on your side, washing my hands". I can't complain since I basically did the same thing in my previous comment, but if they really did analyze the issue, one might expect that they would share what the bug actually is :-(
[Bug libstdc++/81797] gcc 7.1.0 fails to build on macOS 10.13 (High Sierra):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81797 --- Comment #30 from Marc Glisse --- (In reply to Francois-Xavier Coudert from comment #29) > The result of "make -d --trace -j8 all-target-libstdc++-v3", in a build > where x86_64-apple-darwin17.0.0/libstdc++-v3 was entirely removed, can be > found here: > https://gist.github.com/fxcoudert/b621465a794d968593bc7ed90c0fc1fb make's I/O is not exactly a reliable way to debug multithreading issues, but the output looks right to me. If --disable-libstdcxx-pch works (does it?), and until someone can investigate more, I'd be tempted to consider it a mac bug and recommend that option in https://gcc.gnu.org/install/specific.html .
[Bug libstdc++/81797] gcc 7.1.0 fails to build on macOS 10.13 (High Sierra):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81797 --- Comment #28 from Marc Glisse --- I am also failing to see how this can happen without a bug in make or macos. The failing command is the recipe for ${pch1b_output}. That rule has ${allstamped} as a dependency, which includes stamp-bits-sup, whose recipe does link the header. At least, disabling precompiled headers should work around it (--disable-libstdcxx-pch IIRC) You could always remove the @ sign on the $(STAMP) lines (and the ones before) so it gets printed in the output, maybe that would show something suspicious. If you are building in a clean directory (the headers aren't there yet), you could also remove '-' at the beginning of the $(LN_S) lines, to make sure that no error occurs. Running make in verbose mode might also hint at something. Maybe print the date in the pch rule (or use the creation date of ${pch1_output_builddir}), and compare it to the creation date of the symlinks, etc. If the issue was with make, you could try replacing
all-local: ${allstamped} ${allcreated}
with
all-local:
	$(MAKE) ${allstamped}
	$(MAKE) ${allcreated}
Generally, I don't understand why we are linking sources in the build directory instead of passing -I flags pointing directly to the source directory.
[Bug tree-optimization/80511] [8 Regression] gcc.dg/Wstrict-overflow-18.c gcc.dg/Wstrict-overflow-7.c gcc.dg/pragma-diag-3.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80511 Marc Glisse changed: What|Removed |Added Summary|[8 Regression] |[8 Regression] |gcc.dg/Wstrict-overflow-18. |gcc.dg/Wstrict-overflow-18. |c |c ||gcc.dg/Wstrict-overflow-7.c ||gcc.dg/pragma-diag-3.c --- Comment #3 from Marc Glisse --- https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=253642
2 more testcases got xfailed: gcc.dg/Wstrict-overflow-7.c and gcc.dg/pragma-diag-3.c. Some possibilities:
- add the warning in match.pd: users keep complaining about those strict-overflow warnings, so we would have to take it out of Wall.
- add the warning in match.pd, restricted to GENERIC: that gets us close to the gcc-7 situation.
- reimplement the warning in the front-end.
In general, telling users that we simplified x+1
[Bug target/82498] Missed optimization for x86 rotate instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498 --- Comment #10 from Marc Glisse --- f1...f6 already have a LROTATE_EXPR in the .original dump. The others don't get one until forwprop1, which is after einline, so there is a small chance of inlining causing other optimizations that mess with rotate detection (or the large-ish code before rotate is recognized may prevent early inlining, missing optimizations). I guess without going through the large job of moving the rotate code from forwprop to match.pd it would be possible to add one basic transform to recognize precisely the case in those intrinsics, if we pick one in f7...f11.
[Bug target/82498] Missed optimization for x86 rotate instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498 --- Comment #7 from Marc Glisse --- (In reply to Uroš Bizjak from comment #6) > You can use __rol{b,w,d,q} and __ror{b,w,d,q} (and their aliases) from > ia32intrin.h. These are standardized; you have to include x86intrin.h header. Some of those break if you use -fsanitize=undefined.
#include <x86intrin.h>
int main(){ unsigned i = 0; return __rold(i,0); }
/usr/lib/gcc-snapshot/lib/gcc/x86_64-linux-gnu/8/include/ia32intrin.h:150:30: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
[Bug target/82498] Missed optimization for x86 rotate instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82498 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-10-10 Ever confirmed|0 |1 --- Comment #1 from Marc Glisse --- Looks like https://stackoverflow.com/q/44000956/1918193 . During combine, we try to match
(set (reg:SI 97)
     (rotate:SI (reg/v:SI 90 [ input ])
                (and:QI (subreg:QI (reg:SI 92 [ rot ]) 0)
                        (const_int 31 [0x1f]))))
But the pattern in i386.md has 'and' and 'subreg' reversed. For the other part, we have a very limited transform that removes the test in this case:
uint32_t rotate_left(uint32_t input, int rot) {
    if (rot == 0) return input;
    return static_cast<uint32_t>((input << rot) | (input >> (8*sizeof(uint32_t) - rot)));
}
But it only works when there is a single gimple insn involved, not and+cast+rotate.
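For comparison, here is the masked-count idiom that avoids both the rot==0 branch and the undefined shift, and that compilers generally turn into a single rotate instruction (a sketch; the function name is mine, not from the PR):

```c
#include <stdint.h>

/* Branchless rotate: masking rot keeps both shift counts in [0, 31],
 * so rot == 0 (and rot >= 32) need no special case and neither shift
 * can have an out-of-range count. */
uint32_t rotate_left_masked(uint32_t input, unsigned rot) {
    rot &= 31;
    return (input << rot) | (input >> ((32 - rot) & 31));
}
```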
[Bug c++/82505] g++ -O3 -funroll-loops generates weird code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82505 --- Comment #2 from Marc Glisse --- dest/src might alias anything (even themselves), so the compiler can't really optimize much.
[Bug middle-end/82504] Optimize away exception allocation and throws handled by catch(...){}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82504 --- Comment #3 from Marc Glisse --- Dup of PR53294?
[Bug libstdc++/82470] Structured bindings don't work with std::tuple if a type has a get member function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82470 --- Comment #3 from Marc Glisse --- As with all the issues caused by the EBCO in std::tuple, I believe the answer is PR 63579 (I think it can be done in a way that preserves the layout of tuple).
[Bug ipa/82476] C++: Inlining fails for a simple function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82476 --- Comment #2 from Marc Glisse --- What is the point of inlining it? It isn't a hot call (called once from main). And unless you are using something like -flto or -fwhole-program (which would turn the function static), it has to be emitted as a separate function as well, so inlining it increases code size.
[Bug tree-optimization/82434] -fstore-merging does not work reliably.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82434 --- Comment #2 from Marc Glisse --- -Dbool=char lets it merge the stores, I guess this is because bool has precision < bitsize.
[Bug target/82418] Division on a constant is suboptimal because of not using imul instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82418 Marc Glisse changed: What|Removed |Added Target||x86_64-*-* Status|UNCONFIRMED |NEW Last reconfirmed||2017-10-04 Ever confirmed|0 |1
[Bug target/82418] Division on a constant is suboptimal because of not using imul instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82418 --- Comment #4 from Marc Glisse --- (In reply to Alexander Monakov from comment #3) > it's likely that your test measured something else, You are right, my test was bogus and clang's version is faster.
[Bug target/82418] Division on a constant is suboptimal because of not using imul instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82418 --- Comment #1 from Marc Glisse --- If I time it, gcc's code is several times faster than clang's on skylake. Why is clang's version supposed to be better?
[Bug libstdc++/82417] Macros from defined in C++11
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82417 --- Comment #3 from Marc Glisse --- (In reply to Jonathan Wakely from comment #2) > Thinking about this further, I think we must not include at all > for strict -std=c++1* modes, Yes. Can we get a #warning in that case which explains that including in strict C++11+ mode makes no sense? Actually, could also do with a #warning explaining that it never makes sense to include it.
[Bug libstdc++/82417] Macros from defined in C++11
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82417 --- Comment #1 from Marc Glisse --- (In reply to Jonathan Wakely from comment #0) > The C++11 standard says that should just include the C++ > header and completely ignore the C library's header. I am very surprised that nobody has cared enough to get the standard fixed. But I can't complain, I didn't write a proposal either. > For C++11 mode we should #undef the macros that defines with > non-reserved names, and maybe consider not including at all for > -std=c++1* modes. I guess so.
[Bug c++/82394] Pointer imposes an optimization barrier
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82394 --- Comment #1 from Marc Glisse --- What compiler flags? At -O3 we do optimize both the same.
[Bug target/79709] Subobtimal code with -mavx and explicit vector
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79709 --- Comment #8 from Marc Glisse --- Thomas, the code generated by gcc has changed (after some patches by Jakub IIRC). Do you consider the issue fixed or is the generated asm still problematic?
.L13:
	vpextrq	$1, %xmm2, %rax
	testq	%rax, %rax
	je	.L2
	vextractf128	$0x1, %ymm2, %xmm2
	vmovq	%xmm2, %rax
	testq	%rax, %rax
	jne	.L2
	vpextrq	$1, %xmm2, %rax
	vmovapd	%ymm4, %ymm3
	testq	%rax, %rax
	jne	.L2
.L3:
	vmulpd	%ymm3, %ymm3, %ymm4
	vmulpd	%ymm8, %ymm3, %ymm3
	vsubpd	%ymm10, %ymm4, %ymm4
	vmulpd	%ymm9, %ymm3, %ymm3
	vaddpd	%ymm0, %ymm4, %ymm4
	vaddpd	%ymm1, %ymm3, %ymm9
	vaddpd	%ymm4, %ymm4, %ymm2
	vmulpd	%ymm9, %ymm9, %ymm10
	vaddpd	%ymm10, %ymm2, %ymm2
	vcmpltpd	%ymm7, %ymm2, %ymm2
	vpaddq	%xmm2, %xmm5, %xmm3
	vextractf128	$1, %ymm2, %xmm6
	vmovq	%xmm2, %rax
	vextractf128	$1, %ymm5, %xmm5
	testq	%rax, %rax
	vpaddq	%xmm6, %xmm5, %xmm5
	vinsertf128	$0x1, %xmm5, %ymm3, %ymm5
	jne	.L13
[Bug target/68924] No intrinsic for x86 `MOVQ m64, %xmm` in 32bit mode.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924 --- Comment #2 from Marc Glisse --- Does anything bad happen if you remove the #ifdef/#endif for _mm_cvtsi64_si128? (2 files in the testsuite would need updating for a proper patch)
[Bug target/82261] x86: missing peephole for SHLD / SHRD
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261 --- Comment #1 from Marc Glisse --- Related to PR 55583.
[Bug target/82242] x86_64 bad optimization with -march
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82242 --- Comment #2 from Marc Glisse --- Nothing gets vectorized :-( Note that to fill the vector, this would be better:
std::vector<double> array(size, 1e-9);
In the reduction, we seem to do strange things with the accumulator.
	addsd	(%rax), %xmm1
	addq	$8, %rax
	cmpq	%rbx, %rax
	movsd	%xmm1, (%rsp)
	jne	.L13
or
	vmovq	%rbp, %xmm2
	vaddsd	(%rax), %xmm2, %xmm1
	addq	$8, %rax
	vmovq	%xmm1, %rbp
	cmpq	%rbx, %rax
	jne	.L13
We aren't happy with xmm1, we save the value to memory in the first case, and to an integer register in the second case, where we even restore the value from that register...
[Bug middle-end/82223] Incorrect optimization for lossy round trips of arithmetic types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82223 --- Comment #2 from Marc Glisse --- (float)INT_MAX gets rounded to 2^31. When you try to convert it to int, it doesn't fit, so the compiler is at liberty to return INT_MAX if it likes. clang's -fsanitize=undefined does complain on your code (not gcc's though).
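Concretely (a sketch, function name mine): the int-to-float direction is well defined and rounds, but the round trip back to int overflows.

```c
#include <limits.h>

/* INT_MAX (2^31 - 1) has no float representation; the nearest float is
 * 2^31 exactly, so the conversion *to* float rounds up. Converting that
 * float back to int would overflow, which is undefined behavior the
 * compiler may resolve however it likes (e.g. returning INT_MAX). */
double int_max_as_float(void) {
    return (double)(float)INT_MAX;
}
```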
[Bug tree-optimization/81346] Missed constant propagation into comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346 --- Comment #18 from Marc Glisse --- (In reply to Gergö Barany from comment #17) > the division used to be replaced by a shift that updated the condition code > register (again, on ARM; r250337): (just my opinion) At a high level (gimple), (unsigned)x+3<=6 seems like a more canonical way to represent an interval than x/4==0. If the second one turns out to be more efficient on some targets, it sounds like we could later turn (unsigned)x+3<=6 into x/4==0 (even if the user did not write it that way), i.e. add a new transform at RTL time. Looks like a separate enhancement request would be appropriate.
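The two canonical forms mentioned here really are the same predicate; a quick sanity check (function names are hypothetical):

```c
/* x / 4 == 0 holds exactly for x in [-3, 3] with C's truncating
 * division; the unsigned form selects the same interval by shifting it
 * to [0, 6] and letting the conversion wrap negative values out of
 * range. */
int interval_form(int x) { return (unsigned)x + 3u <= 6u; }
int division_form(int x) { return x / 4 == 0; }
```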
[Bug target/82170] gcc optimizes int range-checking poorly on x86-64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82170 --- Comment #2 from Marc Glisse --- Note that n==(int)n (gcc documents that this must work) may work with more gcc versions and is more readable.
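A sketch of the suggested check (the function name is mine; as the comment notes, GCC documents out-of-range integer conversion as wrapping, so the comparison is reliable under GCC even though the C standard only makes it implementation-defined):

```c
#include <limits.h>

/* n fits in int exactly when converting to int and back is a round
 * trip; under GCC the narrowing conversion wraps modulo 2^32. */
int fits_in_int(long long n) {
    return n == (int)n;
}
```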
[Bug c++/82146] if () is always true error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82146 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |INVALID --- Comment #2 from Marc Glisse --- Null references are illegal, use pointers if you want to use null pointers.
[Bug tree-optimization/82135] Missed constant propagation through possible unsigned wraparound, with std::align() variable pointer, constant everything else.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82135 --- Comment #1 from Marc Glisse --- This PR is a bit messy, please minimize your examples... Looking at the dse2 dump (before reassoc messes things up):
  __intptr_2 = (const long unsigned int) voidp_9(D);
  _3 = __intptr_2 + 63;
  __aligned_4 = _3 & 18446744073709551552;
  __diff_5 = __aligned_4 - __intptr_2;
  _6 = __diff_5 + 64;
  if (_6 > 1024)
IIUC, essentially, you would like gcc to realize that __diff_5 is in [0,63], so the condition is always false. If aligned was not reused, we could simplify ((x+63)&-64)-x to 63&-x, but we don't want to do it in general. Maybe we could add a very special case in VRP (or CCP for nonzero bits)... (we could also add if(__diff>__align)__builtin_unreachable() in <memory> but that's getting really specific)
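The interval fact in question can be checked in isolation (a sketch; the helper name is mine):

```c
#include <stdint.h>

/* Rounding x up to a multiple of 64 moves it by (-x) mod 64, i.e.
 * 63 & -x, which is always in [0, 63] -- the range VRP would need in
 * order to prove the comparison above is always false. */
uint64_t align_up_diff(uint64_t x) {
    return ((x + 63) & ~(uint64_t)63) - x;
}
```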
[Bug lto/82027] [5/6/7/8 Regression] wrong code with -O3 -flto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82027 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-08-29 Summary|wrong code with -O3 -flto |[5/6/7/8 Regression] wrong ||code with -O3 -flto Ever confirmed|0 |1 --- Comment #1 from Marc Glisse --- gcc mistakenly thinks it found some UB (division by zero) and inserts a trap.
[Bug c++/82021] Unnecessary null pointer check in global placement new (and also in any class-specific placement new operator declared as noexcept)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82021 --- Comment #3 from Marc Glisse --- You can search for "Ville Voutilainen", the patch was this year, not long before the release so maybe March.
[Bug c++/82021] Unnecessary null pointer check in global placement new (and also in any class-specific placement new operator declared as noexcept)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82021 --- Comment #1 from Marc Glisse --- Did you try with -std=c++1z? (if that solves your issue, this is a DUP, it should be enabled in all mode, but it isn't yet)
[Bug c++/82000] Missed optimization of char_traits::length() on constant string
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82000 --- Comment #3 from Marc Glisse --- (In reply to Louis Dionne from comment #2) > > Downloading the one from godbolt, we simplify it to: [...] > > I have no idea what this is and how you feed that to GCC, but I'm curious. That's what -fdump-tree-optimized shows (end of high-level optimizations). You don't feed it to gcc (it is missing all information about internal_buffer for instance), although with -fgimple there are variants that gcc could read.
[Bug c++/82000] Missed optimization of char_traits::length() on constant string
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82000 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-08-28 Ever confirmed|0 |1 --- Comment #1 from Marc Glisse --- The example you wrote in the bug report makes no sense: missing includes, and with the includes added it optimizes to return 0. Downloading the one from godbolt, we simplify it to:

int main() ()
{
  struct string_view D.32298;
  long unsigned int _15;

  [14.44%] [count: INV]:
  _15 = __builtin_strlen (_buffer);
  MEM[(struct string_view *)] = _15;
  MEM[(struct string_view *) + 8B] = _buffer;
  __asm__ __volatile__("" :  : "i,r,m" D.32298 : "memory");
  return 0;
}

Indeed we don't seem to manage folding strlen there. I think there is a DUP asking to transform the buffer into STRING_CST or something like that. (btw, why do you use "g" for clang and not for gcc?)
[Bug c++/69433] missing -Wreturn-local-addr assigning address of a local to a static
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69433 --- Comment #3 from Marc Glisse --- f3: the inliner silently removes s (and the assignment to it) as write-only. You need to add a function that reads s (we don't warn in that case either, of course, but that's a first step). f2: the (atomic) initialization of the static is a lot of hard-to-optimize code. Still, since we manage to warn for f1:

  # iftmp.0_1 = PHI <(2), "def"(3)>
  a ={v} {CLOBBER};
  return iftmp.0_1;

we would probably manage it for f2:

  # prephitmp_14 = PHI
  a ={v} {CLOBBER};
  return prephitmp_14;

... if there was an isolate-path pass after PRE, since before that we only see:

  s = __cxa_guard_release (&_ZGVZ2f2vE1s);

  [100.00%] [count: INV]:
  _10 = s;
  a ={v} {CLOBBER};
  return _10;

IMO we should look into why this optimization doesn't happen before PRE (why not FRE, for instance?).
[Bug tree-optimization/81948] New: vectorize exp2 using exp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81948 Bug ID: 81948 Summary: vectorize exp2 using exp Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Using -Ofast -mavx2 and a recent glibc, g++ vectorizes

#include <cmath>
void f(double*d){
  d=(double*)__builtin_assume_aligned(d,256);
  for(int i=0;i<1024;++i) d[i]=std::exp(d[i]*std::log(2));
}

However, if I write d[i]=std::exp2(d[i]) instead, it fails to vectorize (libmvec does not provide a vector version of exp2). It would be good, when checking if a standard function like exp2 has a vector version, to also check related, more canonical functions (exp in this case). (this could also be vaguely related to PR 81706)
[Bug libstdc++/81912] std::distance not constexpr in C++17 mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81912 Marc Glisse changed: What|Removed |Added CC||alexbaroni68 at gmail dot com --- Comment #3 from Marc Glisse --- *** Bug 81944 has been marked as a duplicate of this bug. ***
[Bug c++/81944] constexpr std::distance
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81944 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #1 from Marc Glisse --- . *** This bug has been marked as a duplicate of bug 81912 ***
[Bug c++/81906] [7/8 Regression] Calls to rint() wrongly optimized away starting in g++ 6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81906 --- Comment #8 from Marc Glisse --- (In reply to Vadim Zeitlin from comment #5) > Perhaps you could consider this as a QoI issue, but it would be really great > if gcc could give a warning if the code tries to use fesetround() without > -frounding-math being on. First note that even with -frounding-math, there are several bugs related to rounding (maybe rint isn't considered pure, but operators like +-*/ are). Also, there are ways (inline asm that hides optimization opportunities) to use fesetround safely even with -fno-rounding-math (and it avoids the perf penalty in places where we don't care about the rounding). Still, I guess we could consider such a warning, if someone is willing to implement it...
[Bug c++/81906] [7/8 Regression] Calls to rint() wrongly optimized away starting in g++ 6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81906 Marc Glisse changed: What|Removed |Added Status|RESOLVED|NEW Last reconfirmed||2017-08-20 Resolution|INVALID |--- Summary|Calls to rint() wrongly |[7/8 Regression] Calls to |optimized away starting in |rint() wrongly optimized |g++ 6 |away starting in g++ 6 Ever confirmed|0 |1 --- Comment #2 from Marc Glisse --- Indeed you want -frounding-math, and with gcc-6 that makes things work, but starting with gcc-7 it doesn't anymore. (gimple looks fine, the problem comes later)
[Bug libstdc++/81905] New: partial_sort slower than sort
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81905 Bug ID: 81905 Summary: partial_sort slower than sort Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- (from https://stackoverflow.com/q/45455345/1918193 ) std::partial_sort of half an array can be slower than std::sort of the whole array, because it uses heap sort vs introsort. There may be a size threshold above which we could use a different algorithm than heap_select+sort_heap (say a variant of introsort where after partitioning (possibly with a biased pivot), depending where the pivot ends up, either we partial_sort the left and ignore the right, or we sort the left and partial_sort the right), or some other compromise.
[Bug target/81904] New: FMA and addsub instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81904 Bug ID: 81904 Summary: FMA and addsub instructions Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: glisse at gcc dot gnu.org Target Milestone: --- Target: x86_64-*-* (asked in https://stackoverflow.com/questions/45298855/how-to-write-portable-simd-code-for-complex-multiplicative-reduction/45401182#comment77780455_45401182 ) Intel has instructions like vfmaddsubps. Gcc manages, under certain circumstances, to merge mult and plus or mult and minus into FMA, but not mult and this strange addsub mix.

#include <immintrin.h>
__m128d f(__m128d x, __m128d y, __m128d z){
  return _mm_addsub_pd(_mm_mul_pd(x,y),z);
}
__m128d g(__m128d x, __m128d y, __m128d z){
  return _mm_fmaddsub_pd(x,y,z);
}

(the order of the arguments is probably not right) My first guess as to how this could be implemented without too much trouble is in ix86_gimple_fold_builtin: for IX86_BUILTIN_ADDSUBPD and others, check that we are late enough in the optimization pipeline (roughly where "widening_mul" is), that contractions are enabled, and that the first (?) argument is a single-use MULT_EXPR. I didn't check what the situation is with the vectorizer (which IIRC can now generate code that ends up as addsub).
[Bug c/81389] _mm_cmpestri segfault on -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81389 --- Comment #13 from Marc Glisse --- (In reply to rockeet from comment #7) > @Marc @Jakub @Martin > Intel CPU document says: operand of _mm_cmpestri can be memory or mm > register, when the operand is memory, it does not require alignment. That's the doc for the CPU instruction. The intrinsic, as a C function, always takes an object of type __m128i, not a register or memory. The only question is what the alignment of the type __m128i is. In gcc, it is 16 bytes. What does alignof (or _Alignof or whatever variant you can get working) return with Intel's compiler? > The issue is: GCC does not know this knowledge(memory operand need not > memory align), and there is no way to enforce gcc to generate a _mm_cmpestri > which always use memory operand, not mm register. Use inline asm? Intrinsics are not quite as low level as you seem to expect. > If I manually load the unaligned memory into an aligned `__m128i`, it has > performance penalty on optimizing compilation. Uh? With -O1, the compiler merges the unaligned load with pcmpestri (it knows that the insn can read unaligned memory). Did you mean to talk about the performance of code generated with -O0? We explicitly do not care about that.
[Bug c/81630] powl returns values with insufficient accuracy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81630 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |INVALID --- Comment #1 from Marc Glisse --- > Apple LLVM version 8.1.0 (clang-802.0.42) That's not gcc.
[Bug tree-optimization/81607] Conditional operator: "type mismatch in shift expression" error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81607 Marc Glisse changed: What|Removed |Added Keywords||ice-checking Status|UNCONFIRMED |NEW Last reconfirmed||2017-07-29 Target Milestone|--- |8.0 Ever confirmed|0 |1 --- Comment #1 from Marc Glisse --- The original dump has (void) (f = (char) ((long int) d.c << a)) (using long instead of int) This happens only for a bitfield of size 32, not 31 or 33, we probably get confused about a NOP conversion somewhere along the way.
[Bug c++/81606] A small program works as expected with -O0 but fails with -O1 on all tested gcc versions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81606 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |INVALID --- Comment #1 from Marc Glisse --- What do you expect "A" >= "B" to mean? You are comparing addresses, the result is arbitrary.
[Bug c++/81597] returns link to temporary value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81597 --- Comment #3 from Marc Glisse --- -Werror=return-local-addr (we cannot reject those programs by default, if the caller ignores what the function returns, the program may be valid)
[Bug c++/81597] returns link to temporary value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81597 --- Comment #1 from Marc Glisse --- Sorry, what change are you asking for? Compiling with current gcc, we get plenty of warnings, and at runtime:

int &&
zsh: segmentation fault  ./a.out
[Bug tree-optimization/81555] Wrong code at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81555 --- Comment #3 from Marc Glisse --- (In reply to Dmitry Babokin from comment #2) > Hmmm, but this one is triggered at -O1, another only at -O2. -fno-tree-reassoc should help both. It is often a combination of optimizations that causes the bug. Reassoc is doing a good transformation, but it leaves wrong information around, which only matters if some other pass (rightfully) takes advantage of that information. Still, it was good to report both, and I expect we may add (a modified version of) both to the testsuite once this is fixed, thanks.
[Bug tree-optimization/81555] Wrong code at -O1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81555 --- Comment #1 from Marc Glisse --- Same reassoc issue as PR 81556 it seems.
[Bug tree-optimization/81556] Wrong code at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81556 Marc Glisse changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-07-26 Ever confirmed|0 |1 --- Comment #1 from Marc Glisse --- Reassoc not clearing the VRP info?
[Bug tree-optimization/81503] [8 Regression] Wrong code at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81503 --- Comment #4 from Marc Glisse --- if (a + b * -2) c = (b-1073741824)*-2; might let you find an earlier culprit.
[Bug tree-optimization/81503] Wrong code at -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81503 --- Comment #1 from Marc Glisse --- Looks like SLSR does an overflow-unsafe transformation, then VRP2 takes advantage of it. Maybe.
[Bug middle-end/81502] In some cases the data is moved to memory unnecessarily [partial regression]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81502 Marc Glisse changed: What|Removed |Added Keywords||missed-optimization Status|UNCONFIRMED |NEW Last reconfirmed||2017-07-21 Ever confirmed|0 |1 --- Comment #1 from Marc Glisse --- .optimized dump:

int bar(void*) (void * ptr)
{
  int res;
  __m128i word;
  long unsigned int _2;
  vector(2) long long int word.3_3;
  unsigned int _4;

  [100.00%] [count: INV]:
  _2 = (long unsigned int) ptr_9(D);
  word = { 0, 0 };
  MEM[(char * {ref-all})] = _2;
  word.3_3 = word;
  word ={v} {CLOBBER};
  _4 = BIT_FIELD_REF;
  res_5 = (int) _4;
  return res_5;
}

We missed turning the memory write into a BIT_INSERT_EXPR, and passes like PRE missed following the bit_field_expr all the way to _2. .combine dump:

[...]
(insn 8 3 10 2 (set (reg/v:V2DI 90 [ word ])
        (vec_concat:V2DI (reg/v/f:DI 92 [ ptr ])
            (const_int 0 [0]))) "b.c":16 3712 {vec_concatv2di}
     (expr_list:REG_DEAD (reg/v/f:DI 92 [ ptr ])
        (nil)))
(insn 10 8 15 2 (set (reg:SI 94 [ res ])
        (vec_select:SI (subreg:V4SI (reg/v:V2DI 90 [ word ]) 0)
            (parallel [
                    (const_int 0 [0])
                ]))) "b.c":20 3697 {*vec_extractv4si_0}
     (expr_list:REG_DEAD (reg/v:V2DI 90 [ word ])
        (nil)))
[...]

combine tries

(set (reg:SI 94 [ res ])
    (vec_select:SI (subreg:V4SI (vec_concat:V2DI (reg/v/f:DI 92 [ ptr ])
                (const_int 0 [0])) 0)
        (parallel [
                (const_int 0 [0])
            ])))

which we fail to simplify. The xmm1-xmm0 mov is not considered a mov by the compiler but concatenation with 0, so not a RA problem. The change of mode (64-bit pointer to 32-bit int) seems to play a big role in confusing things here.
[Bug tree-optimization/81396] [7/8 Regression] Optimization of reading Little-Endian 64-bit number with portable code has a regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81396 --- Comment #9 from Marc Glisse --- Should we open a separate PR for the transformation you suggested in comment 4, or does that seem not useful enough now, or will be part of bitfield gimple lowering when that lands?
[Bug libstdc++/81476] severe slow-down with range-v3 library compared to clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81476 --- Comment #17 from Marc Glisse --- (In reply to Jonathan Wakely from comment #14) > The advantage of doing it as in comment 13, rather than: > [comment #11] > is that when inserting the inputrange causes reallocations we only have to > transfer the already inserted elements of the inputrange to the new storage, > not the elements preceding the insertion point ("the beginning of the > vector" and "what we already inserted at the end"). I see what you mean. Note that as soon as there is some reallocation going on at any point, we should be able to avoid calling (in-place) rotate, which is quite a bit more expensive than a simple range move.
[Bug libstdc++/81476] severe slow-down with range-v3 library compared to clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81476 --- Comment #11 from Marc Glisse --- Or one could (not legal) directly start a new allocation, copy the beginning of the vector, append the range, then append the end of the vector. Or a combination of all that: first try appending the range to the vector. If that works without reallocating, rotate. If a reallocation is necessary, switch to the "new allocation" strategy, create a new vector, copy the beginning of the vector, copy what we already inserted at the end, append the rest of the inputrange, copy the rest of the original vector, and finally adopt this new vector.
[Bug libstdc++/81476] severe slow-down with range-v3 library compared to clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81476 --- Comment #10 from Marc Glisse --- Inserting an InputRange (not even Forward) at the beginning of a vector is really a misuse of vector. It is true that we can do better than what libstdc++ currently does, though we shouldn't encourage the practice. Trivial idea would be first to copy the InputRange to some array (either something dynamic like a vector, or by block to a fixed size buffer), then insert that. Variants include tricks like inserting the InputRange at the end of the vector, then calling std::rotate to move it to the right position.
[Bug tree-optimization/81346] Missed constant propagation into comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346 --- Comment #13 from Marc Glisse --- (In reply to Jakub Jelinek from comment #12) > Created attachment 41781 [details] > gcc8-pr81346-2.patch > > Further optimization from build_range_check. I wonder if "1" is that special, this optimization basically applies to any range that ends at INT_MAX, turning (X-C1)<=C2 into (signed)X>=C3. Or do we consider that only the case that yields a simple sign check is a win?
[Bug target/81389] _mm_cmpestri segfault on -O0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81389 --- Comment #4 from Marc Glisse --- (In reply to rockeet from comment #3)
> @Martin Liška Yes, my use case is:
>
>   __m128i key128 = { key }; // key is an unsigned char
>   int idx = _mm_cmpestri(key128, 1,
>       *(const __m128i*)(data), // don't require memory align
>       len,
>       _SIDD_UBYTE_OPS|_SIDD_CMP_EQUAL_ORDERED|_SIDD_LEAST_SIGNIFICANT);

You should load the unaligned data using one of the loadu intrinsics and pass that to _mm_cmpestri. When optimizing, it should generate the code you want, but in a safe way.
[Bug lto/78795] LTO causes undefined reference errors when linking with GMP "make check"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78795 --- Comment #12 from Marc Glisse --- (In reply to Vincent Lefèvre from comment #11) > On Debian, after path canonicalization, this is /usr/lib/bfd-plugins, but > only packages should manage files under /usr/lib (unlike /usr/local, for > instance). I've sent a mail to the Debian GCC Maintainers so that they > provide a symlink: > > https://lists.debian.org/debian-gcc/2016/12/msg00122.html For the record, Debian (testing+unstable) recently added this symlink.
[Bug middle-end/81445] Dynamic stack allocation not optimized into static allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81445 --- Comment #6 from Marc Glisse --- (In reply to Wilco from comment #5) > Also it doesn't support these simple cases: > > void vla2(int x) > { > if (x == 10) > { > int arr[x]; > t (arr); > } > } Again, try something smaller. When the allocation is not always executed, the threshold is even lower.
[Bug tree-optimization/81346] Missed constant propagation into comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346 --- Comment #10 from Marc Glisse --- (In reply to Jakub Jelinek from comment #9) > (In reply to Marc Glisse from comment #8) > > I think always using an unsigned type for the range check would be simpler. > > If we try to check that x>=INT_MIN+2 && x<=INT_MAX-2 with -fwrapv, int is > > still not a suitable type in which to do > > x-(INT_MIN+2)<=INT_MAX-2-(INT_MIN+2), while the issue doesn't exist with an > > unsigned type. > > I'm trying to preserve what we did before, it can be tweaked incrementally > if needed. Then you may need to check for overflow in "hi = const_binop (MINUS_EXPR, etype, hi, lo);", current build_range_check has "if (value != 0 && !TREE_OVERFLOW (value))" for the result of that operation. That should matter for instance when simplifying X/INT_MAX==0.
[Bug middle-end/81445] Dynamic stack allocation not optimized into static allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81445 --- Comment #4 from Marc Glisse --- (In reply to Wilco from comment #2) > I don't see it happen for the simplest case in current trunk: 400 bytes is too large, try again with something smaller. (I'm with you if you want to increase the threshold)
[Bug tree-optimization/81346] Missed constant propagation into comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81346 --- Comment #8 from Marc Glisse --- I think always using an unsigned type for the range check would be simpler. If we try to check that x>=INT_MIN+2 && x<=INT_MAX-2 with -fwrapv, int is still not a suitable type in which to do x-(INT_MIN+2)<=INT_MAX-2-(INT_MIN+2), while the issue doesn't exist with an unsigned type. I notice you call build_range_check in GENERIC (and new code for GIMPLE). Is that temporary until match.pd can optimize range checks? Do we want :s on trunc_div?
[Bug middle-end/81445] Dynamic stack allocation not optimized into static allocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81445 --- Comment #1 from Marc Glisse --- Note that we already do it for VLA (aka BUILT_IN_ALLOCA_WITH_ALIGN) in CCP.
[Bug tree-optimization/81396] [7/8 Regression] Optimization of reading Little-Endian 64-bit number with portable code has a regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81396 --- Comment #6 from Marc Glisse --- (In reply to Jakub Jelinek from comment #5) > Or both this bswap change and the match.pd addition. Doing both sounds good to me :-)
[Bug bootstrap/81425] Bootstrap broken since r250158
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81425 --- Comment #1 from Marc Glisse --- Isn't that already fixed? https://gcc.gnu.org/ml/gcc-patches/2017-07/msg00614.html
[Bug c++/81410] [5/6/7/8 Regression] -O3 breaks code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81410 --- Comment #5 from Marc Glisse --- Seems related to vectorization. These lines look suspicious:

  vect__37.14_78 = MEM[(long int *)_30];
  vect__37.15_72 = MEM[(long int *)_30 + 16B];
  vect__37.16_70 = MEM[(long int *)_30 + 32B];
  vect__37.17_68 = MEM[(long int *)_30 + 48B];
  MEM[(long int *)_28] = vect__37.14_78;
  MEM[(long int *)_28 + 16B] = vect__37.15_72;
  MEM[(long int *)_28 + 32B] = vect__37.16_70;
  MEM[(long int *)_28 + 48B] = vect__37.17_68;

where _30 is for b, _28 is for a, and I would expect to see gaps in the reads from b (+24, +48, +72 instead of +16, +32 and +48). But I haven't checked, this is only a first guess.
[Bug tree-optimization/81409] Inefficient loops generated from range-v3 code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81409 --- Comment #1 from Marc Glisse --- The most obvious thing I notice is

  [100.00%] [count: INV]:
  # it$_M_current_23 = PHI
  _20 = _7 == it$_M_current_23;
  _5 = _20 | _53;
  if (_5 != 0)
    goto ; [7.36%] [count: INV]
  else
    goto ; [92.64%] [count: INV]

  [92.60%] [count: INV]:
  _27 = it$_M_current_23 + 4;
  if (_7 != _27)
    goto ; [3.75%] [count: INV]
  else
    goto ; [96.25%] [count: INV]

where 7 -> 6 means that _7 == _27 == it$_M_current_23 so _5 != 0 has to be true. However, we do not thread that (at thread4 time, we go from 7 to 12 (empty latch) to 6 instead of directly to 6).
[Bug tree-optimization/81403] [8 Regression] wrong code at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81403 --- Comment #3 from Marc Glisse --- /* x & C -> x if we know that x & ~C == 0. */ Not clear where it is getting the bogus range/nonzero information from, I thought we had fixed all the places reusing SSA_NAMEs with stale information.
[Bug tree-optimization/81403] wrong code at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81403 --- Comment #1 from Marc Glisse --- PRE losing "& 10393" at -O3 but not -O2 (the previous dumps are identical):

@@ -611,6 +639,7 @@
 ;;   6 [100.0%]  (FALLTHRU,EXECUTABLE)
   # .MEM_21 = PHI <.MEM_26(5), .MEM_25(6)>
   # prephitmp_34 = PHI <_30(5), _30(6)>
+  # prephitmp_35 = PHI <_30(5), _30(6)>
   # VUSE <.MEM_21>
   var_33.4_11 = var_33D.35372;
   if (var_33.4_11 != 0)
@@ -624,9 +653,7 @@
 ;;   prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)
 ;;   pred:       7 [54.0%]  (TRUE_VALUE,EXECUTABLE)
   # RANGE [0, 10393] NONZERO 10393
-  _29 = prephitmp_34 & 10393;
-  # RANGE [0, 10393] NONZERO 10393
-  _15 = (long intD.12) _29;
+  _15 = (long intD.12) prephitmp_35;
[Bug tree-optimization/81396] [7/8 Regression] Optimization of reading Little-Endian 64-bit number with portable code has a regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81396 --- Comment #2 from Marc Glisse --- bswap was happy dealing with

  _2 = MEM[(const unsigned char *)];
  _3 = (uint64_t) _2;
  _4 = MEM[(const unsigned char *) + 1B];
  _5 = (uint64_t) _4;
  _6 = _5 << 8;
  _8 = MEM[(const unsigned char *) + 2B];
  _9 = (uint64_t) _8;
  _10 = _9 << 16;
  _32 = _6 | _10;
  _11 = _3 | _32;

etc, but has trouble with

  _21 = word_31(D) & 255;
  _1 = BIT_FIELD_REF;
  _23 = (uint64_t) _1;
  _2 = _23 << 8;
  _4 = BIT_FIELD_REF ;
  _24 = (uint64_t) _4;
  _5 = _24 << 16;
  _32 = _2 | _5;
  _6 = _21 | _32;