[Bug c++/109283] Destructor of co_yield conditional argument called twice
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109283

--- Comment #3 from ncm at cantrip dot org ---
Appears fixed in 13.1. Still ICEs in trunk, g++ (Compiler-Explorer-Build-gcc-70d038235cc91ef1ea4fce519e628cfb2d297bff-binutils-2.40) 14.0.0 20230508 (experimental):

<source>: In function 'std::generator<...> source(int&, std::string)':
<source>:513:1: internal compiler error: in flatten_await_stmt, at cp/coroutines.cc:2899
  513 | }
[Bug c++/59498] [DR 1430][10/11/12/13 Regression] Pack expansion error in template alias
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59498 --- Comment #22 from ncm at cantrip dot org --- CWG 1430 seems to be about disallowing a construct that requires capturing an alias declaration into a name mangling. This bug and at least some of those referred to it do not ask for any such action.
[Bug c++/109283] Destructor of co_yield conditional argument called twice
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109283 --- Comment #2 from ncm at cantrip dot org --- Betting this one is fixed by deleting code.
[Bug c++/109291] type alias template rejects pack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109291 --- Comment #2 from ncm at cantrip dot org --- CWG 1430 is still marked Open, and is anyway only superficially analogous. Here, there is no need for an alias to be encoded into a type signature.
[Bug c++/109291] New: type alias template rejects pack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109291

            Bug ID: 109291
           Summary: type alias template rejects pack
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ncm at cantrip dot org
  Target Milestone: ---

template <typename T> struct type_identity { using type = T; };
template <typename T> using type_identity_t = typename type_identity<T>::type;

template <typename... Ts> struct S1 {
    using alias1 = typename type_identity<Ts...>::type;
};
template <typename... Ts> struct S2 {
    using alias2 = typename type_identity_t<Ts...>;
};

int main()
{
    S1<int>::alias1 a;  // OK
    S2<int>::alias2 b;  // Fails
}

// Here, alias1 is fine, but alias2, the same type, is not.
// MSVC accepts both declarations. Clang matches Gcc.
// error: pack expansion argument for non-pack parameter 'T' of alias template
// error: expected nested-name-specifier
[Bug c++/109283] New: Destructor of co_yield conditional argument called twice
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109283

            Bug ID: 109283
           Summary: Destructor of co_yield conditional argument called twice
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ncm at cantrip dot org
  Target Milestone: ---

Created attachment 54754
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54754&action=edit
Reproducer

Basically:

    co_yield a ? s : t;

segfaults, while

    if (a) co_yield s; else co_yield t;

does not. The segfault traces to s/t's destructor being called twice. Full reproducer attached, relying on Casey Carter's generator implementation, pasted in. This may be related to 101367.

Compiled with gcc-12.2, this program segfaults. Compiled with gcc-trunk or gcc-coroutines on Godbolt, identified as:

g++ (Compiler-Explorer-Build-gcc-13ec81eb4c3b484ad636000fa8f6d925e15fb983-binutils-2.38) 13.0.1 20230325 (experimental)

the compiler ICEs:

<source>:513:1: internal compiler error: in flatten_await_stmt, at cp/coroutines.cc:2899
  513 | }
[Bug c++/68703] __attribute__((vector_size(N))) template member confusion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68703

--- Comment #10 from ncm at cantrip dot org ---
(In reply to ncm from comment #9)
> This bug appears not to manifest in g++-8, 9, and 10.

Of the three code samples in comment 4, the first and third fail to compile because N is undefined. What code was intended there? It seems we should check corrected versions of those before declaring this fixed. The code sample in example 3 still fails with g++-10.2.
[Bug target/87085] with -march=i386, gcc should not generate code including endbr instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87085

ncm at cantrip dot org changed:
           What    |Removed     |Added
           CC      |            |ncm at cantrip dot org

--- Comment #8 from ncm at cantrip dot org ---
(In reply to H.J. Lu from comment #7)
> (In reply to chengming from comment #4)
> > Created attachment 44602 [details]
> > ELF file
> >
> > compiled with command
> > gcc -v -save-temps -m32 -march=i386 -fcf-protection=none -o onlyReturn
> > onlyReturn.c > output.txt 2>&1
>
> Fedora 28 run-time only supports i686 or above. You can't use any libraries
> on Fedora 28.

Not relevant: the reporter is not trying to run i386 code on Fedora 28, but only to generate i386 code to run on a cross target.
[Bug tree-optimization/97736] [9/10/11 Regression] switch codegen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97736 --- Comment #12 from ncm at cantrip dot org --- As it is, your probability of failure in 9 and 10 is exactly 1.0.
[Bug tree-optimization/97736] [9/10/11 Regression] switch codegen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97736

--- Comment #10 from ncm at cantrip dot org ---
I don't understand; the compiler we are using (9) has the regression. It looks like a trivial backport.
[Bug libstdc++/42857] std::istream::ignore(std::streamsize n) calls unnecessary underflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42857 --- Comment #9 from ncm at cantrip dot org --- (In reply to Jonathan Wakely from comment #8) > Probably changed by one of the patches for PR 94749 or PR 96161, although I > still see two reads for the first example. Thank you, I was mistaken. This bug is still present in g++-10.
[Bug c++/68703] __attribute__((vector_size(N))) template member confusion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68703 --- Comment #9 from ncm at cantrip dot org --- This bug appears not to manifest in g++-8, 9, and 10.
[Bug c++/66028] false positive, unused loop variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66028 --- Comment #2 from ncm at cantrip dot org --- This bug appears not to manifest in g++-10.2.
[Bug libstdc++/42857] std::istream::ignore(std::streamsize n) calls unnecessary underflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42857 --- Comment #7 from ncm at cantrip dot org --- This bug appears not to manifest in g++-10.
[Bug c++/58855] Attributes ignored on type alias in template
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58855 --- Comment #2 from ncm at cantrip dot org --- This bug is still present in g++-10.2
[Bug tree-optimization/97736] [9/10/11 Regression] switch codegen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97736 --- Comment #6 from ncm at cantrip dot org --- The referenced patch seems to have also deleted a fair bit of explanatory comment text, including a list of possible refinements for selected targets.
[Bug target/97736] New: [9/10 Regression] switch codegen
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97736

            Bug ID: 97736
           Summary: [9/10 Regression] switch codegen
           Product: gcc
           Version: 9.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ncm at cantrip dot org
  Target Milestone: ---

In Gcc 8 and previous, the following code

bool is_vowel(char c)
{
    switch (c)
        case 'a': case 'e': case 'i': case 'o': case 'u':
            return true;
    return false;
}

compiled with -O2 or better, for numerous x86-64 targets, resolves to a bitwise flag check, e.g.

        lea     ecx, [rdi-97]
        xor     eax, eax
        cmp     cl, 20
        ja      .L1
        mov     eax, 1
        sal     rax, cl
        test    eax, 1065233
        setne   al
.L1:
        ret

Starting in gcc-9, this optimization is not performed anymore at -O2 for many common targets (e.g. -march=skylake), and we get

        sub     edi, 97
        cmp     dil, 20
        ja      .L2
        movzx   edi, dil
        jmp     [QWORD PTR .L4[0+rdi*8]]
.L4:
        .quad   .L5
        .quad   .L2
        .quad   .L2
        .quad   .L2
        .quad   .L5
        .quad   .L2
        .quad   .L2
        .quad   .L2
        .quad   .L5
        .quad   .L2
        .quad   .L2
        .quad   .L2
        .quad   .L2
        .quad   .L2
        .quad   .L5
        .quad   .L2
        .quad   .L2
        .quad   .L2
        .quad   .L2
        .quad   .L2
        .quad   .L5
.L2:
        mov     eax, 0
        ret
.L5:
        mov     eax, 1
        ret

same as with -O0 or -O1.
[Bug target/94037] Runtime varies 2x just by order of two int assignments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037

--- Comment #10 from ncm at cantrip dot org ---
(In reply to Uroš Bizjak from comment #9)
> (In reply to ncm from comment #8)
> > It seems worth mentioning that the round trip through
> > L1 cache is just a workaround for the optimizer refusing
> > to ever emit two CMOV instructions in a basic block.
> >
> > Recognizing and replacing the construct with CMOVs
> > explicitly would speed up a great many algorithms.
>
> Not universally. See PR56309.

I am aware of that report. Transforming this rendition of swap_if as suggested would not create any _new_ dependencies, so may be done without fear of introducing regressions. Actually using this version of swap_if in algorithms requires careful consideration of whether it may build such dependency chains, but its use in partitioning, specifically, has been proven safe.
[Bug target/94037] Runtime varies 2x just by order of two int assignments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037

--- Comment #8 from ncm at cantrip dot org ---
It seems worth mentioning that the round trip through L1 cache is just a workaround for the optimizer refusing to ever emit two CMOV instructions in a basic block. Recognizing and replacing the construct with CMOVs explicitly would speed up a great many algorithms, although the L1 excursion remains necessary for the general case of user-defined types.

It also seems worth mentioning that there is no worry over dependency chains in partitioning: once the values are swapped, they are not looked at again until the next pass.
[Bug rtl-optimization/94037] New: Runtime varies 2x just by order of two int assignments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037

            Bug ID: 94037
           Summary: Runtime varies 2x just by order of two int assignments
           Product: gcc
           Version: 9.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ncm at cantrip dot org
  Target Milestone: ---

(This report re-uses some code from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 but identifies an entirely different problem.) Given:

```
bool swap_if(bool c, int& a, int& b)
{
    int v[2] = { a, b };
#ifdef FAST /* 4.6s */
    b = v[1-c], a = v[c];
#else /* SLOW, 9.8s */
    a = v[c], b = v[!c];
#endif
    return c;
}

int* partition(int* begin, int* end)
{
    int pivot = end[-1];
    int* left = begin;
    for (int* right = begin; right < end - 1; ++right) {
        left += swap_if(*right <= pivot, *left, *right);
    }
    int tmp = *left;
    *left = end[-1], end[-1] = tmp;
    return left;
}

void quicksort(int* begin, int* end)
{
    while (end - begin > 1) {
        int* mid = partition(begin, end);
        quicksort(begin, mid);
        begin = mid + 1;
    }
}

static const int size = 100'000'000;

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int, char**)
{
    int fd = ::open("1e8ints", O_RDONLY);
    int perms = PROT_READ|PROT_WRITE;
    int flags = MAP_PRIVATE|MAP_POPULATE|MAP_NORESERVE;
    auto* a = (int*) ::mmap(nullptr, size * sizeof(int), perms, flags, fd, 0);
    quicksort(a, a + size);
    return a[0] == a[size - 1];
}
```

after

```
$ dd if=/dev/urandom of=1e8ints bs=100 count=400
```

The run time of the program above, built "-O3 -march=skylake" vs. "-DFAST -O3 -march=skylake", varies by 2x on Skylake, similarly on Haswell. Both cases are almost equally fast on Clang, matching G++'s fast version. The difference between "!c" and "1-c" in the array index exacerbates the disparity.
Godbolt <https://godbolt.org/z/w-buUF> says, slow:

```
        movl    (%rax), %edx
        movl    (%rbx), %esi
        movl    %esi, 8(%rsp)
        movl    %edx, 12(%rsp)
        cmpl    %edx, %ecx
        setge   %sil
        movzbl  %sil, %esi
        movl    8(%rsp,%rsi,4), %esi
        movl    %esi, (%rbx)
        setl    %sil
        movzbl  %sil, %esi
        movl    8(%rsp,%rsi,4), %esi
        movl    %esi, (%rax)
```

and 2x as fast:

```
        movl    (%rax), %ecx
        cmpl    %ecx, %r8d
        setge   %dl
        movzbl  %dl, %edx
        movl    (%rbx), %esi
        movl    %esi, 8(%rsp)
        movl    %ecx, 12(%rsp)
        movl    %r9d, %esi
        subl    %edx, %esi
        movslq  %esi, %rsi
        movl    8(%rsp,%rsi,4), %esi
        movl    %esi, (%rax)
        movslq  %edx, %rdx
        movl    8(%rsp,%rdx,4), %edx
        movl    %edx, (%rbx)
        cmpl    %ecx, %r8d
```

Clang 9.0.0, -DFAST, for reference:

```
        movl    (%rcx), %r11d
        xorl    %edx, %edx
        xorl    %esi, %esi
        cmpl    %r8d, %r11d
        setle   %dl
        setg    %sil
        movl    (%rbx), %eax
        movl    %eax, (%rsp)
        movl    %r11d, 4(%rsp)
        movl    (%rsp,%rsi,4), %eax
        movl    %eax, (%rcx)
        movl    (%rsp,%rdx,4), %eax
        movl    %eax, (%rbx)
```
[Bug tree-optimization/67153] [8/9/10 Regression] integer optimizations 53% slower than std::bitset<>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #29 from ncm at cantrip dot org ---
> My reason for thinking this is not a bug is that the fastest choice will
> depend on the contents of the word list. Regardless of layout, there will
> be one option that is slightly faster than the other. I guess it's
> reasonable to ask, though, whether it's better by default to try to save one
> cycle on an already very fast empty loop, or to save one cycle on a more
> expensive loop. But the real gain (if there is one) will be matching the
> layout to the runtime behavior, for which the compiler requires outside
> information.

Saving one cycle on a two-cycle loop can have a much larger effect than saving one cycle on a fifty-cycle loop. Even if the fifty-cycle loop is the norm, an extra cycle costs only 2%, but if the two-cycle loop is the more common, as in this case, saving the one cycle is a big win.
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #7 from ncm at cantrip dot org ---
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #4)
> > (In reply to Alexander Monakov from comment #3)
> > > So perhaps an unpopular opinion, but I'd say a
> > > __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> > > optimization pipeline as a non-branchy COND_EXPR) is badly missing.
> >
> > I am going to say otherwise. Many of the time conditional move is faster
> > than using a branch; even if the branch is predictable (there are a few
> > exceptions) on most non-Intel/AMD targets. This is because the conditional
> > move is just one cycle and only a "predictable" branch is one cycle too.
>
> The issue with a conditional move is that it adds a data dependence while
> branches are usually speculated and thus have zero overhead in the execution
> stage. The extra dependence can easily slow things down dependent on the
> (three!) instructions feeding the conditional move (condition, first and
> second source). This is why well-predicted branches are often so much
> faster.
>
> > It is even worse when doing things like:
> > if (a && b)
> > where on aarch64, this can be done using only one cmp followed by one ccmp.
> > NOTE on PowerPC, you could use in theory crand/cror (though this is not done
> > currently and I don't know if they are profitable in any recent design).
> >
> > Plus aarch64 has conditional add and a few other things which improve the
> > idea of a conditional move.
>
> I can see conditional moves are almost always a win on less
> pipelined/speculative implementations.

Nobody wants a change that makes code slower on our pipelined/speculative targets, but this is a concrete case where code is already made slower. If the code before optimization has no branch, as in the case of "a = (m & b)|(~m & c)", we can be certain that replacing it with a cmov does not introduce any new data dependence.
Anyway, for the case of ?:, where cmov would replace a branch, Gcc is already happy to substitute a cmov instruction. Gcc just refuses to put in a second cmov, after it, for no apparent reason.
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #5 from ncm at cantrip dot org ---
(In reply to Alexander Monakov from comment #3)
> The compiler has no way of knowing ahead of time that you will be evaluating
> the result on random data; for mostly-sorted arrays branching is arguably
> preferable.
>
> __builtin_expect_with_probability is a poor proxy for unpredictability: a
> condition that is true every other time leads to a branch that is both very
> predictable and has probability 0.5.

If putting it in made my code slower, I would take it back out. The only value it has is if it changes something. If it doesn't improve matters, I need to try something else. For it to do nothing helps nobody.

> I think what you really need is a way to express branchless selection in the
> source code when you know you need it but the compiler cannot see that on
> its own. Other algorithms like constant-time checks for security-sensitive
> applications probably also need such computational primitive.
>
> So perhaps an unpopular opinion, but I'd say a
> __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> optimization pipeline as a non-branchy COND_EXPR) is badly missing.

We can quibble over whether the name of the intrinsic means anything with a value of 0.5, but no other meaning would be useful. But in general I would rather write portable code to get the semantics I need. I don't have a preference between the AND/OR notation and the indexing version, except that the former seems like a more generally useful optimization. Best would be both.
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #2 from ncm at cantrip dot org ---
Created attachment 47593
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47593&action=edit
a makefile

This duplicates code found on the linked archive. E.g.:

    make all
    make CC=g++-9 INDEXED
    make CC=clang++-9 ANDANDOR
    make CC=clang++-9 INDEXED_PESSIMAL_ON_GCC
    make CC=g++-9 CHECK=CHECK BOG
[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #1 from ncm at cantrip dot org ---
Created attachment 47592
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47592&action=edit
code demonstrating the failure
[Bug rtl-optimization/93165] New: avoidable 2x penalty on unpredicted overwrite
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

            Bug ID: 93165
           Summary: avoidable 2x penalty on unpredicted overwrite
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ncm at cantrip dot org
  Target Milestone: ---

=== Abstract:

In important cases where using a "conditional move" instruction would provide a 2x performance improvement, Gcc fails to generate the instruction. In testing, a random field is sorted by the simplest possible Quicksort algorithm, varying only the method for conditionally swapping values, with widely varying runtimes, many substantially faster than `std::sort`. The 2x result is obtainable only with Clang, which for amd64 generates cmov instructions. A simple testing apparatus may be found at https://github.com/ncm/sortfast/ that depends only on shell tools.

=== Details:

The basic sort is:

```
int* partition(int* begin, int* end)
{
    int pivot = end[-1];
    int* left = begin;
    for (int* right = begin; right < end - 1; ++right) {
        if (*right <= pivot) {
            std::swap(*left, *right);
            ++left;
        }
    }
    int tmp = *left;
    *left = end[-1], end[-1] = tmp;
    return left;
}

void quicksort(int* begin, int* end)
{
    while (end - begin > 1) {
        int* mid = partition(begin, end);
        quicksort(begin, mid);
        begin = mid + 1;
    }
}
```

which runs about as fast as `std::sort`, on truly random input. Replacing the body of the loop in `partition()` above with

```
left += swap_if(*right <= pivot, *left, *right);
```

where `swap_if` is defined

```
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, mask = -c;
    a = (b & mask) | (ta & ~mask);
    b = (ta & mask) | (b & ~mask);
    return c;
}
```

the sort is substantially faster than `std::sort` compiled with Gcc. Compiled with Clang 8 or later, it runs fully 2x as fast as `std::sort`. Clang recognizes the pattern and substitutes `cmov` instructions. (Clang also unrolls the loop, which helps a little.)
Another formulation,

```
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, tb = b;
    a = c ? tb : ta;
    b = c ? ta : tb;
    return c;
}
```

also results in the same object code, with Clang, but is 2x slower compiled with Gcc, which generates a branch. A third formulation,

```
inline bool swap_if(bool c, int& a, int& b)
{
    int v[2] = { a, b };
    b = v[1-c], a = v[c];
    return c;
}
```

is much faster than `std::sort` compiled by both Gcc and Clang, but detours values through L1 cache, at some cost, so is slower than the `cmov` version. A fourth version,

```
inline bool swap_if(bool c, int& a, int& b)
{
    int v[2] = { a, b };
    a = v[c], b = v[!c];
    return c;
}
```

is about the same speed as the third when built with Clang, but with Gcc is quite a lot slower than `std::sort`. Order matters, somehow, as does the choice of operator. (I don't know how to express a bug report for this last case. Advice welcome.) In

```
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, tb = b;
    a = c ? tb : ta;
    // b = c ? ta : tb;
    return c;
}
```

(at least when it is not inlined) Gcc seems happy to generate the `cmov` instruction. Apparently the optimization code is very jealous about what else is allowed in the basic block where a `cmov` is considered. Finally, even with

```
inline bool swap_if(bool c, int& a, int& b)
{
    int ta = a, tb = b;
    a = __builtin_expect_with_probability(c, 0, 0.5) ? tb : ta;
    b = __builtin_expect_with_probability(c, 0, 0.5) ? ta : tb;
    return c;
}
```

Gcc still will not generate the `cmov` instructions.

=== Discussion:

Replacing a branch with `cmov` may result in slower code, particularly on older CPU targets. However, when the programmer provides direct information that the branch is unpredictable, it seems like the compiler should be willing to act on that expectation. In that light, Clang's conversion of "`(mask & a)|(~mask & b)`" to `cmov` seems to be universally correct.
It is a portable formula that gives better results than the branching version even without specific optimization, but may easily be rewritten using `cmov` for even better results. In addition, when `__builtin_expect_with_probability` is used to indicate unpredictability, there seems to be no defensible reason not to rewrite the expression to use `cmov`. Finally, the indexed form of `swap_if` may be recognized and turned into `cmov` instructions without worry that a predictable branch has been replaced, avoiding the unnecessary detour through memory.
[Bug middle-end/89501] Odd lack of warning about missing initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501 --- Comment #13 from ncm at cantrip dot org --- What I am getting is that the compiler is leaving that permitted optimization -- eliminating the inode check -- on the table. It is doing that not so much because it would make Linus angry, but as an artifact of the particular optimization processes used in Gcc at the moment. Clang, or some later release of Gcc or Clang, or even this Gcc under different circumstances, might choose differently. But maybe there are some flavors of UB, among which returning uninitialized variables might be the poster child, that you don't ever want to use to drive some kinds of optimizations. Maybe Gcc's process has that baked in.
[Bug middle-end/89501] Odd lack of warning about missing initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

ncm at cantrip dot org changed:
           What    |Removed     |Added
           CC      |            |ncm at cantrip dot org

--- Comment #9 from ncm at cantrip dot org ---
What I don't understand is why it doesn't optimize away the check on (somecondition), since it is assuming the code in the dependent block always runs.
[Bug tree-optimization/67153] [6/7/8/9 Regression] integer optimizations 53% slower than std::bitset<>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #26 from ncm at cantrip dot org --- Still fails on Skylake (i7-6700HQ) and gcc 8.1.0. The good news is that clang++-7.0.0 code is slow on all four versions.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #22 from ncm at cantrip dot org ---
(In reply to Nathan Kurz from comment #21)
> My current belief is
> that everything here is expected behavior, and there is no bug with either
> the compiler or processor.
>
> The code spends most of its time in a tight loop that depends on runtime
> input, and the compiler doesn't have any reason to know which branch is more
> likely. The addition of "count" changes the heuristic in recent compilers,
> and by chance, changes it for the worse.

I am going to disagree, carefully. It seems clear, in any case, that Haswell is off the hook.

1. As a correction: *without* the count takes twice as long to run as with, or when using bitset<>.

2. As a heuristic, favoring a branch to skip a biggish loop body evidently has much less downside than favoring the branch into it. Maybe Gcc already has such a heuristic, and the addition of 7 conditional increments in the loop, or whatever overhead bitset<> adds, was enough to push it over?

Westmere runs both instruction sequences (with and without __builtin_expect) the same. Maybe on Westmere the loop takes two cycles regardless of code placement, and Gcc is (still) tuned for the older timings?
[Bug c++/58855] Attributes ignored on type alias in template
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58855

ncm at cantrip dot org changed:
           What    |Removed     |Added
           CC      |            |ncm at cantrip dot org

--- Comment #1 from ncm at cantrip dot org ---
This bug is still present in g++-5.3.1:

$ cat usingbug.cc
template <unsigned N>
struct S {
    typedef unsigned __attribute__((vector_size(N*sizeof(unsigned)))) T1;
    using T2 = unsigned __attribute__((vector_size(N*sizeof(unsigned))));
};

int main()
{
    S<4u>::T1 v1;
    S<4u>::T2 v2;
    return v1[1] + v2[2];
}
$ g++ -std=c++14 usingbug.cc
usingbug.cc: In function ‘int main()’:
usingbug.cc:13:24: error: invalid types ‘S<4u>::T2 {aka unsigned int}[int]’ for array subscript
     return v1[1] + v2[2];
                        ^
$ g++ --version
g++ (Debian 5.3.1-4) 5.3.1 20151219
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #18 from ncm at cantrip dot org ---
It is far from clear to me that gcc-5's choice to put the increment value in a register, and use just one loop body, is wrong. Rather, it appears that an incidental choice in the placement order of basic blocks or register assignment interacts badly with a bug in Haswell branch prediction, value dependency tracking, micro-op cache, or something. An actual fix for this would need to identify and step around Haswell's sensitivity to whichever detail of code generation this program happens upon.
[Bug c++/68703] New: __attribute__((vector_size(N))) template member confusion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68703

            Bug ID: 68703
           Summary: __attribute__((vector_size(N))) template member confusion
           Product: gcc
           Version: 5.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ncm at cantrip dot org
  Target Milestone: ---

$ cat vs2.cc
template <int N = 4>
struct D {
    int v//[N]
        __attribute__((vector_size(N * sizeof(int))));
    int f1() { return this->v[N-1]; }
    int f2() { return v[N-1]; }
};

int main(int ac, char**)
{
    D<> d = { { ac } };
    return d.f1() + d.f2();
}
$ g++ vs2.cc
vs2.cc: In member function ‘int D<N>::f2()’:
vs2.cc:6:28: error: invalid types ‘int[int]’ for array subscript
     int f2() { return v[N-1]; }
                            ^
Notice that f1, with "this->", is OK.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #14 from ncm at cantrip dot org ---
A notable difference between g++-4.9 output and g++-5 output is that, while both hoist the (word == seven) comparison out of the innermost loop, gcc-4.9 splits the inner loop into two versions, one that increments scores by 3 and another that increments by 1, where g++-5 saves 3 or 1 into a register and uses the same inner loop for both cases.

Rewriting the critical loop
- to run with separate inner loops
  - does not slow down the fast g++-4.9-compiled program, but
  - fails to speed up the slow g++-5-compiled program.
- to precompute a 1 or 3 increment, with one inner loop for both cases
  - does slow down the previously fast g++-4.9-compiled program, and
  - does not change the speed of the slow g++-5-compiled program.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #13 from ncm at cantrip dot org ---
This is essentially the entire difference between the versions of puzzlegen-int.cc without, and with, the added ++count; line referenced above (modulo register assignments and branch labels) that sidesteps the +50% pessimization:

(Asm is from g++ -fverbose-asm -std=c++14 -O3 -Wall -S $SRC.cc using g++ (Debian 5.2.1-15) 5.2.1 20150808, with no instruction-set extensions specified. Output with -mbmi -mbmi2 has different instructions, but they do not noticeably affect run time on Haswell i7-4770.)

@@ -793,25 +793,26 @@
 .L141:
 	movl	(%rdi), %esi	# MEM[base: _244, offset: 0], word
 	testl	%r11d, %esi	# D.66634, word
 	jne	.L138	#,
 	xorl	%eax, %eax	# tmp419
 	cmpl	%esi, %r12d	# word, seven
 	leaq	208(%rsp), %rcx	#, tmp574
 	sete	%al	#, tmp419
 	movl	%r12d, %edx	# seven, seven
 	leal	1(%rax,%rax), %r8d	#, D.66619
 	.p2align 4,,10
 	.p2align 3
 .L140:
 	movl	%edx, %eax	# seven, D.66634
 	negl	%eax	# D.66634
 	andl	%edx, %eax	# seven, D.66622
 	testl	%eax, %esi	# D.66622, word
 	je	.L139	#,
 	addl	%r8d, 24(%rcx)	# D.66619, MEM[base: _207, offset: 24B]
+	addl	$1, %ebx	#, count
 .L139:
 	notl	%eax	# D.66622
 	subq	$4, %rcx	#, ivtmp.424
 	andl	%eax, %edx	# D.66622, seven
 	jne	.L140	#,
 	addq	$4, %rdi	#, ivtmp.428
 	cmpq	%rdi, %r10	# ivtmp.428, D.66637
 	jne	.L141	#,

I tried a version of the program with a fixed-length loop (over 'place' in [6..0]) so that branches do not depend on results of rest &= ~-rest. The compiler unrolled the loop, but the program ran at pessimized speed with or without the ++count line.

I am very curious whether this has been reproduced on others' Haswells, and on Ivybridge and Skylake.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #12 from ncm at cantrip dot org ---
As regards hot spots, the program has two:

    int score[7] = { 0, };
    for (Letters word : words)
/**/    if (!(word & ~seven))
            for_each_in_seven([&](Letters letter, int place) {
                if (word & letter)
/**/                score[place] += (word == seven) ? 3 : 1;
            });

The first is executed 300M times, the second 3.3M times. Inserting a counter bump before the second eliminates the slowdown:

    if (word & letter) {
        ++count;
/**/    score[place] += (word == seven) ? 3 : 1;
    }

This fact seems consequential. The surrounding for_each_in_seven loop isn't doing popcounts, but is doing while (v &= ~-v). I have repeated tests using -m[no-]bmi[2], with identical results (i.e. no effect).
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #9 from ncm at cantrip dot org --- I did experiment with -m[no-]bmi[2] a fair bit. It all made a significant difference in the instructions emitted, but exactly zero difference in runtime. That's actually not surprising at all; those instructions get decomposed into micro-ops that exactly match those from the equivalent instructions, and are cached, and the loops that dominate runtime execute out of the micro-op cache. The only real effect is maybe slightly shorter object code, which could matter in a program dominated by bus traffic with loops too big to cache well. I say maybe slightly shorter because instruction-set extension instructions are actually huge, mostly prefixes. I.e. most of the BMI stuff is marketing fluff, added mainly to make the competition waste money matching them instead of improving the product.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #11 from ncm at cantrip dot org --- Aha, Uroš, I see your name in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011 Please forgive me for teaching you about micro-ops. The code being generated for all versions does use (e.g.) popcntq %rax, %rax almost everywhere. Not quite everywhere -- I see one popcntq %rax, %rdx -- but certainly in all the performance-sensitive bits.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #10 from ncm at cantrip dot org --- I found this, which at first blush seems like it might be relevant. The hardware complained about here is the same Haswell i7-4770. http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #6 from ncm at cantrip dot org --- It seems worth adding that the same failure occurs without -march=native.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #4 from ncm at cantrip dot org --- Also fails 5.2.1 (Debian 5.2.1-15) 5.2.1 20150808 As noted, the third version of the program, using bitset but not using lambdas, is as slow as the version using unsigned int -- even when built using gcc-4.9. (Recall the int version and the first bitset version run fast when built with gcc-4.9.) Confirmed that on Westmere, compiled -march=native, all versions run at about the same speed with all versions of the compiler reported, and this runtime is about the same as the slow Haswell speed despite the very different clock rate.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #5 from ncm at cantrip dot org --- My preliminary conclusion is that a hardware optimization provided in Haswell but not in Westmere is not recognizing, in the unsigned int test case, the opportunity that it finds in the original bitset version, as compiled by gcc-5. I have also observed that adding an assertion that the array index is not negative, before the first array access, slows the program a further 100% on Westmere. Note that the entire data set fits in L3 cache on all tested targets, so memory bandwidth does not figure. To my inexperienced eye the effects look like branch mispredictions. I do not understand why a 3.4 GHz DDR3 Haswell runs as slowly as a 2.4 GHz DDR2 Westmere when branch prediction (or whatever it is) fails.
[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #3 from ncm at cantrip dot org --- Created attachment 36159 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36159&action=edit bitset, but using an inlined container adapter, not lambdas, and slow This version compiles just as badly as the integer version, even by gcc-4.9.
[Bug c++/67153] New: integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 Bug ID: 67153 Summary: integer optimizations 53% slower than std::bitset Product: gcc Version: 5.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: ncm at cantrip dot org Target Milestone: --- Created attachment 36146 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36146&action=edit The std::bitset version I have attached two small, semantically equivalent C++14 programs. One uses std::bitset<26> for its operations; the other uses raw unsigned int. The one that uses unsigned int runs 53% slower than the bitset version, as compiled with g++-5.1 and running on a 2013-era Haswell i7-4770. While this represents, perhaps, a stunning triumph in the optimization of inline member and lambda functions operating on structs, it may represent an equally intensely embarrassing, even mystifying, failure for optimization of the underlying raw integer operations. For both, build and test was with

    $ g++-5 -O3 -march=native -mtune=native -g3 -Wall $PROG.cc
    $ time ./a.out | wc -l
    2818

Times on a 3.2GHz Haswell are consistently 0.25s for the unsigned int version, 0.16s for the std::bitset<26> version. These programs are archived at https://github.com/ncm/nytm-spelling-bee/. The runtimes of the two versions are identical as built and run on my 2009 Westmere 2.4GHz i5-M520, and about the same as the integer version on Haswell.
[Bug c++/67153] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 --- Comment #1 from ncm at cantrip dot org --- Created attachment 36147 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36147&action=edit The unsigned int version
[Bug c++/67153] integer optimizations 53% slower than std::bitset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153 ncm at cantrip dot org changed:

           What     |Removed |Added
    ----------------|--------|------------
    Target          |        |Linux amd64
    Known to work   |        |4.9.2
    Host            |        |Linux amd64
    Version         |5.1.0   |5.1.1
    Known to fail   |        |5.1.1
    Build           |        |Linux amd64

--- Comment #2 from ncm at cantrip dot org --- The 4.9.2 release, Debian 4.9.2-10, does not exhibit this bug. When built with g++-4.9, the unsigned int version is slightly faster than the std::bitset version. The g++-5 release used was Debian 5.1.1-9 20150602. The Haswell host is running under a VirtualBox VM, with /proc/cpuinfo reporting stepping 3, microcode 0x19, and flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm The compiler used for the test on the Westmere M520, which appears not to exhibit the bug, was a snapshot g++ (GCC) 6.0.0 20150504.
[Bug libstdc++/66055] New: hash containers missing required reserving constructors
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66055 Bug ID: 66055 Summary: hash containers missing required reserving constructors Product: gcc Version: 4.8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: ncm at cantrip dot org Target Milestone: --- Created attachment 35487 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35487&action=edit missing hash container constructors, provided. Hash containers in libstdc++ each lack two required reserving constructors from size_type and other arguments.
[Bug libstdc++/66055] hash containers missing required reserving constructors
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66055 ncm at cantrip dot org changed:

           What               |Removed |Added
    -------------------------|--------|------
    Attachment #35487 is     |0       |1
    obsolete                 |        |

--- Comment #1 from ncm at cantrip dot org --- Created attachment 35488 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35488&action=edit Better fix Previous patch was incompletely edited for inclusion.
[Bug c++/66028] New: false positive, unused loop variable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66028 Bug ID: 66028 Summary: false positive, unused loop variable Product: gcc Version: 5.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: ncm at cantrip dot org Target Milestone: ---

    struct range {
        int start;
        int stop;
        struct iter {
            int i;
            bool operator!=(iter other) { return other.i != i; };
            iter operator++() { ++i; return *this; };
            int operator*() { return i; }
        };
        iter begin() { return iter{start}; }
        iter end() { return iter{stop}; }
    };

    int main() {
        int power = 1;
        for (int i : range{0,10})
            power *= 10;
    }

bug.cc: In function ‘int main()’:
bug.cc:15:13: warning: unused variable ‘i’ [-Wunused-variable]
     for (int i : range{0,10})

Manifestly, i is used to count loop iterations. The warning cannot be suppressed by any decoration of the declaration; the best we can do is void(i), power *= 10; in the loop body. The warning is useful in most cases. The exception might be that, here, the iterator has no reference or pointer members, and the loop body changes external state. [This matches clang bug https://llvm.org/bugs/show_bug.cgi?id=23416]
[Bug libstdc++/19495] basic_string::_M_rep() can produce an unnaturally aligned pointer to _Rep
--- Additional Comments From ncm at cantrip dot org 2005-04-01 13:24 --- Subject: Re: basic_string::_M_rep() can produce an unnaturally aligned pointer to _Rep On Fri, Apr 01, 2005 at 11:42:27AM -, pcarlini at suse dot de wrote:

    What      |Removed |Added
    Severity  |normal  |enhancement
    http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19495

I don't see how this (or 8670, or the other one) is an enhancement request. Users are absolutely allowed to make allocators that enforce only the alignment of the type they are instantiated on, and string is certainly using the wrong kind of allocator. It's a fairly minor bug, but seems to me clearly a bug. N -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19495