[Bug c++/109283] Destructor of co_yield conditional argument called twice

2023-05-08 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109283

--- Comment #3 from ncm at cantrip dot org ---
Appears fixed in 13.1

Still ICEs in trunk,
g++ (Compiler-Explorer-Build-gcc-70d038235cc91ef1ea4fce519e628cfb2d297bff-binutils-2.40)
14.0.0 20230508 (experimental):

<source>: In function 'std::generator<...> source(int&, std::string)':
<source>:513:1: internal compiler error: in flatten_await_stmt, at
cp/coroutines.cc:2899
  513 | }

[Bug c++/59498] [DR 1430][10/11/12/13 Regression] Pack expansion error in template alias

2023-03-29 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59498

--- Comment #22 from ncm at cantrip dot org ---
CWG 1430 seems to be about disallowing a construct that requires capturing an
alias declaration into a name mangling. This bug and at least some of those
referred to it do not ask for any such action.

[Bug c++/109283] Destructor of co_yield conditional argument called twice

2023-03-29 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109283

--- Comment #2 from ncm at cantrip dot org ---
Betting this one is fixed by deleting code.

[Bug c++/109291] type alias template rejects pack

2023-03-27 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109291

--- Comment #2 from ncm at cantrip dot org ---
CWG 1430 is still marked Open, and is anyway only superficially
analogous. Here, there is no need for an alias to be encoded
into a type signature.

[Bug c++/109291] New: type alias template rejects pack

2023-03-27 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109291

Bug ID: 109291
   Summary: type alias template rejects pack
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

template <typename T>
struct type_identity { using type = T; };

template <typename T>
using type_identity_t = typename type_identity<T>::type;

template <typename... T>
struct S1 { using alias1 = typename type_identity<T...>::type; };

template <typename... T>
struct S2 { using alias2 = typename type_identity_t<T...>; };

int main() {
  S1<int>::alias1 a; // OK
  S2<int>::alias2 b; // Fails
}

// Here, alias1 is fine, but alias2, the same type, is not.
// MSVC accepts both declarations. Clang matches Gcc.

// error: pack expansion argument for non-pack parameter ‘T’ of alias template
// error: expected nested-name-specifier

[Bug c++/109283] New: Destructor of co_yield conditional argument called twice

2023-03-25 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109283

Bug ID: 109283
   Summary: Destructor of co_yield conditional argument called
twice
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

Created attachment 54754
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54754&action=edit
Reproducer

Basically:

  co_yield a ? s : t;

segfaults,

  if (a) co_yield s; else co_yield t;

does not. The segfault traces to s/t's destructor being called 
twice. Full reproducer attached, relying on Casey Carter's 
generator implementation, pasted in.

This may be related to 101367.

Compiled with gcc-12.2, this program segfaults.
Compiled with gcc-trunk or gcc-coroutines on Godbolt, identified as:

  g++
(Compiler-Explorer-Build-gcc-13ec81eb4c3b484ad636000fa8f6d925e15fb983-binutils-2.38)
13.0.1 20230325 (experimental)

the compiler ICEs:

  <source>:513:1: internal compiler error: in flatten_await_stmt, at
cp/coroutines.cc:2899
  513 | }

[Bug c++/68703] __attribute__((vector_size(N))) template member confusion

2021-05-04 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68703

--- Comment #10 from ncm at cantrip dot org ---
(In reply to ncm from comment #9)
> This bug appears not to manifest in g++-8, 9, and 10.
Of the three code samples in comment 4, the first and 
third fail to compile because N is undefined. What 
code was intended there? It seems like we should check
the corrected versions of those before declaring this 
fixed.

The code sample in example 3 still fails with g++-10.2.

[Bug target/87085] with -march=i386, gcc should not generate code including endbr instruction

2020-11-20 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87085

ncm at cantrip dot org changed:

   What|Removed |Added

 CC||ncm at cantrip dot org

--- Comment #8 from ncm at cantrip dot org ---
(In reply to H.J. Lu from comment #7)
> (In reply to chengming from comment #4)
> > Created attachment 44602 [details]
> > ELF file
> > 
> > compiled with command
> > gcc -v -save-temps -m32 -march=i386 -fcf-protection=none -o onlyReturn
> > onlyReturn.c > output.txt 2>&1
> 
> Fedora 28 run-time only supports i686 or above.  You can't use any libraries
> on Fedora 28.

Not relevant: the reporter is not trying to run i386 code on Fedora 28, but only
to generate i386 code to run on a cross target.

[Bug tree-optimization/97736] [9/10/11 Regression] switch codegen

2020-11-16 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97736

--- Comment #12 from ncm at cantrip dot org ---
As it is, your probability of failure in 9 and 10 is exactly 1.0.

[Bug tree-optimization/97736] [9/10/11 Regression] switch codegen

2020-11-16 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97736

--- Comment #10 from ncm at cantrip dot org ---
I don't understand; the compiler we are using (9) has the
regression. It looks like a trivial backport.

[Bug libstdc++/42857] std::istream::ignore(std::streamsize n) calls unnecessary underflow

2020-11-10 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42857

--- Comment #9 from ncm at cantrip dot org ---
(In reply to Jonathan Wakely from comment #8)
> Probably changed by one of the patches for PR 94749 or PR 96161, although I
> still see two reads for the first example.

Thank you, I was mistaken. This bug is still present in g++-10.

[Bug c++/68703] __attribute__((vector_size(N))) template member confusion

2020-11-10 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68703

--- Comment #9 from ncm at cantrip dot org ---
This bug appears not to manifest in g++-8, 9, and 10.

[Bug c++/66028] false positive, unused loop variable

2020-11-10 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66028

--- Comment #2 from ncm at cantrip dot org ---
This bug appears not to manifest in g++-10.2.

[Bug libstdc++/42857] std::istream::ignore(std::streamsize n) calls unnecessary underflow

2020-11-10 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42857

--- Comment #7 from ncm at cantrip dot org ---
This bug appears not to manifest in g++-10.

[Bug c++/58855] Attributes ignored on type alias in template

2020-11-10 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58855

--- Comment #2 from ncm at cantrip dot org ---
This bug is still present in g++-10.2

[Bug tree-optimization/97736] [9/10/11 Regression] switch codegen

2020-11-09 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97736

--- Comment #6 from ncm at cantrip dot org ---
The referenced patch seems to have also deleted a fair bit of explanatory
comment text, including a list of possible refinements for selected targets.

[Bug target/97736] New: [9/10 Regression] switch codegen

2020-11-05 Thread ncm at cantrip dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97736

Bug ID: 97736
   Summary: [9/10 Regression] switch codegen
   Product: gcc
   Version: 9.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

In Gcc 8 and previous, the following code 

  bool is_vowel(char c) {
    switch (c)
      case 'a': case 'e': case 'i': case 'o': case 'u':
        return true;
    return false;
  }

compiled with -O2 or better, for numerous x86-64 targets,
resolves to a bitwise flag check, e.g.

lea ecx, [rdi-97]
xor eax, eax
cmp cl, 20
ja  .L1
mov eax, 1
sal rax, cl
test eax, 1065233
setne   al
  .L1:
ret

Starting in gcc-9, this optimization is not performed 
anymore at -O2 for many common targets (e.g. -march=skylake),
and we get

sub edi, 97
cmp dil, 20
ja  .L2
movzx   edi, dil
jmp [QWORD PTR .L4[0+rdi*8]]
  .L4:
.quad   .L5
.quad   .L2
.quad   .L2
.quad   .L2
.quad   .L5
.quad   .L2
.quad   .L2
.quad   .L2
.quad   .L5
.quad   .L2
.quad   .L2
.quad   .L2
.quad   .L2
.quad   .L2
.quad   .L5
.quad   .L2
.quad   .L2
.quad   .L2
.quad   .L2
.quad   .L2
.quad   .L5
  .L2:
mov eax, 0
ret
  .L5:
mov eax, 1
ret

same as with -O0 or -O1.

[Bug target/94037] Runtime varies 2x just by order of two int assignments

2020-03-05 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037

--- Comment #10 from ncm at cantrip dot org ---
(In reply to Uroš Bizjak from comment #9)
> (In reply to ncm from comment #8)
> > It seems worth mentioning that the round trip through 
> > L1 cache is just a workaround for the optimizer refusing 
> > to ever emit two CMOV instructions in a basic block.
> > 
> > Recognizing and replacing the construct with CMOVs 
> > explicitly would speed up a great many algorithms.
>
> Not universally. See PR56309.

I am aware of that report.

Transforming this rendition of swap_if as suggested
would not create any _new_ dependencies, so may be done 
without fear of introducing regressions.

Actually using this version of swap_if in algorithms
requires careful consideration of whether it may build
such dependency chains, but its use in partitioning,
specifically, has been proven safe.

[Bug target/94037] Runtime varies 2x just by order of two int assignments

2020-03-05 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037

--- Comment #8 from ncm at cantrip dot org ---
It seems worth mentioning that the round trip through 
L1 cache is just a workaround for the optimizer refusing 
to ever emit two CMOV instructions in a basic block.

Recognizing and replacing the construct with CMOVs 
explicitly would speed up a great many algorithms.

Although, the L1 excursion remains necessary for the 
general case of user-defined types.

It also seems worth mentioning that there is no worry
about dependency chains in partitioning. Once the
values are swapped they are not looked at again
until the next pass.

[Bug rtl-optimization/94037] New: Runtime varies 2x just by order of two int assignments

2020-03-04 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94037

Bug ID: 94037
   Summary: Runtime varies 2x just by order of two int assignments
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

(This report re-uses some code from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165 but identifies an entirely
different problem.)

Given:

```
bool swap_if(bool c, int& a, int& b) {
  int v[2] = { a, b };
#ifdef FAST  /* 4.6s */
  b = v[1-c], a = v[c];
#else /* SLOW, 9.8s */
  a = v[c], b = v[!c];
#endif
  return c;
}

int* partition(int* begin, int* end) {
  int pivot = end[-1];
  int* left = begin;
  for (int* right = begin; right < end - 1; ++right) {
left += swap_if(*right <= pivot, *left, *right);
  }
  int tmp = *left; *left = end[-1], end[-1] = tmp;
  return left;
}

void quicksort(int* begin, int* end) {
  while (end - begin > 1) {
int* mid = partition(begin, end);
quicksort(begin, mid);
begin = mid + 1;
} }

static const int size = 100'000'000;

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int, char**) {
  int fd = ::open("1e8ints", O_RDONLY);
  int perms = PROT_READ|PROT_WRITE;
  int flags = MAP_PRIVATE|MAP_POPULATE|MAP_NORESERVE;
  auto* a = (int*) ::mmap(nullptr, size * sizeof(int), perms, flags, fd, 0);

  quicksort(a, a + size);

  return a[0] == a[size - 1];
}
```
after
```
  $ dd if=/dev/urandom of=1e8ints bs=100 count=400
```

The run time of the program above, built "-O3 -march=skylake"
vs. "-DFAST -O3 -march=skylake", varies by 2x on Skylake, and similarly
on Haswell. Both cases are almost equally fast with Clang, matching
G++'s fast version. The difference between "!c" and "1-c" in the
array index exacerbates the disparity.

Godbolt (https://godbolt.org/z/w-buUF) says, slow:
```
movl    (%rax), %edx
movl    (%rbx), %esi
movl    %esi, 8(%rsp)
movl    %edx, 12(%rsp)
cmpl    %edx, %ecx
setge   %sil
movzbl  %sil, %esi
movl    8(%rsp,%rsi,4), %esi
movl    %esi, (%rbx)
setl    %sil
movzbl  %sil, %esi
movl    8(%rsp,%rsi,4), %esi
movl    %esi, (%rax)
```
and 2x as fast:
```
movl    (%rax), %ecx
cmpl    %ecx, %r8d
setge   %dl
movzbl  %dl, %edx
movl    (%rbx), %esi
movl    %esi, 8(%rsp)
movl    %ecx, 12(%rsp)
movl    %r9d, %esi
subl    %edx, %esi
movslq  %esi, %rsi
movl    8(%rsp,%rsi,4), %esi
movl    %esi, (%rax)
movslq  %edx, %rdx
movl    8(%rsp,%rdx,4), %edx
movl    %edx, (%rbx)
cmpl    %ecx, %r8d
```
Clang 9.0.0, -DFAST, for reference:
```
movl    (%rcx), %r11d
xorl    %edx, %edx
xorl    %esi, %esi
cmpl    %r8d, %r11d
setle   %dl
setg    %sil
movl    (%rbx), %eax
movl    %eax, (%rsp)
movl    %r11d, 4(%rsp)
movl    (%rsp,%rsi,4), %eax
movl    %eax, (%rcx)
movl    (%rsp,%rdx,4), %eax
movl    %eax, (%rbx)
```

[Bug tree-optimization/67153] [8/9/10 Regression] integer optimizations 53% slower than std::bitset<>

2020-01-17 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #29 from ncm at cantrip dot org ---
> My reason for thinking this is not a bug is that the fastest choice will
> depend on the contents of the word list.   Regardless of layout, there will
> be one option that is slightly faster than the other.   I guess it's
> reasonable to ask, though, whether it's better by default to try to save one
> cycle on an already very fast empty loop, or to save one cycle on a more
> expensive loop. But the real gain (if there is one) will be matching the
> layout to the runtime behavior, for which the compiler requires outside
> information.

Saving one cycle on a two-cycle loop can have a much larger effect
than saving one cycle on a fifty-cycle loop. Even if the fifty-cycle loop is
the norm, an extra cycle costs only 2%; but if the two-cycle loop is the more
common, as in this case, saving the one cycle is a big win.

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-09 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #7 from ncm at cantrip dot org ---
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #4)
> > (In reply to Alexander Monakov from comment #3)
> > > So perhaps an unpopular opinion, but I'd say a
> > > __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> > > optimization pipeline as a non-branchy COND_EXPR) is badly missing.
> > 
> > I am going to say otherwise.  Many of the time conditional move is faster
> > than using a branch; even if the branch is predictable (there are a few
> > exceptions) on most non-Intel/AMD targets.  This is because the conditional
> move is just one cycle and only a "predictable" branch is one cycle too.
> 
> The issue with a conditional move is that it adds a data dependence while
> branches are usually speculated and thus have zero overhead in the execution
> stage.  The extra dependence can easily slow things down dependent on the
> (three!) instructions feeding the conditional move (condition, first and
> second source).  This is why well-predicted branches are often so much
> faster.
> 
> > It is even worse when doing things like:
> > if (a && b)
> > where on aarch64, this can be done using only one cmp followed by one ccmp.
> > NOTE on PowerPC, you could use in theory crand/cror (though this is not done
> > currently and I don't know if they are profitable in any recent design).
> > 
> > Plus aarch64 has conditional add and a few other things which improve the
> > idea of a conditional move.
> 
> I can see conditional moves are almost always a win on less
> pipelined/speculative implementations.

Nobody wants a change that makes code slower on our pipelined/
speculative targets, but this is a concrete case where code is 
already made slower. If the code before optimization has no 
branch, as in the case of "a = (m & b)|(~m & c)", we can be 
certain that replacing it with a cmov does not introduce any 
new data dependence.

Anyway, for the case of ?:, where cmov would replace a branch, 
Gcc is already happy to substitute a cmov instruction. Gcc just 
refuses to put in a second cmov, after it, for no apparent reason.

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-06 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #5 from ncm at cantrip dot org ---
(In reply to Alexander Monakov from comment #3)
> The compiler has no way of knowing ahead of time that you will be evaluating
> the result on random data; for mostly-sorted arrays branching is arguably
> preferable.
> 
> __builtin_expect_with_probability is a poor proxy for unpredictability: a
> condition that is true every other time leads to a branch that is both very
> predictable and has probability 0.5.

If putting it in made my code slower, I would take it back out. The only value
it has is if it changes something. If it doesn't improve matters, I need to try
something else. For it to do nothing helps nobody.

> I think what you really need is a way to express branchless selection in the
> source code when you know you need it but the compiler cannot see that on
> its own. Other algorithms like constant-time checks for security-sensitive
> applications probably also need such computational primitive.
> 
> So perhaps an unpopular opinion, but I'd say a
> __builtin_branchless_select(c, a, b) (guaranteed to live throughout
> optimization pipeline as a non-branchy COND_EXPR) is badly missing.

We can quibble over whether the name of the intrinsic means anything with a
value of 0.5, but no other meaning would be useful.

But in general I would rather write portable code to get the semantics I need.
I don't have a preference between the AND/OR notation and the indexing version,
except that the former seems like a more generally useful optimization. Best
would be both.

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-05 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #2 from ncm at cantrip dot org ---
Created attachment 47593
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47593&action=edit
a makefile

This duplicates code found on the linked archive.

E.g.:

make all
make CC=g++-9 INDEXED
make CC=clang++-9 ANDANDOR
make CC=clang++-9 INDEXED_PESSIMAL_ON_GCC
make CC=g++-9 CHECK=CHECK BOG

[Bug rtl-optimization/93165] avoidable 2x penalty on unpredicted overwrite

2020-01-05 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

--- Comment #1 from ncm at cantrip dot org ---
Created attachment 47592
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47592&action=edit
code demonstrating the failure

[Bug rtl-optimization/93165] New: avoidable 2x penalty on unpredicted overwrite

2020-01-05 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93165

Bug ID: 93165
   Summary: avoidable 2x penalty on unpredicted overwrite
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

=== Abstract:

In important cases where using a "conditional move" instruction would provide a
2x performance improvement, Gcc fails to generate the instruction. In testing,
a random field is sorted by the simplest possible Quicksort algorithm, varying
only the method for conditionally swapping values, with widely varying
runtimes, many substantially faster than `std::sort`. The 2x result is
obtainable only with Clang, which for amd64 generates cmov instructions.

A simple testing apparatus may be found at https://github.com/ncm/sortfast/ 
that depends only on shell tools.

=== Details:

The basic sort is:
```
int* partition(int* begin, int* end) {
  int pivot = end[-1];
  int* left = begin;
  for (int* right = begin; right < end - 1; ++right) {
if (*right <= pivot) {
  std::swap(*left, *right);
  ++left;
}
  }
  int tmp = *left; *left = end[-1], end[-1] = tmp;
  return left;
}

void quicksort(int* begin, int* end) {
  while (end - begin > 1) {
int* mid = partition(begin, end);
quicksort(begin, mid);
begin = mid + 1;
} }
```
which runs about as fast as `std::sort`, on truly random input.

Replacing the body of the loop in `partition()` above with
```
  left += swap_if(*right <= pivot, *left, *right);
```
where `swap_if` is defined
```
inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, mask = -c;
  a = (b & mask) | (ta & ~mask);
  b = (ta & mask) | (b & ~mask);
  return c;
}
```
the sort is substantially faster than `std::sort` compiled with Gcc. Compiled
with Clang 8 or later, it runs fully 2x as fast as `std::sort`. Clang
recognizes the pattern and substitutes `cmov` instructions. (Clang also unrolls
the loop, which helps a little.) Another formulation,
```
inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, tb = b;
  a = c ? tb : ta;
  b = c ? ta : tb;
  return c;
}
```
also results in the same object code, with Clang, but is 2x slower compiled
with Gcc, which generates a branch. A third formulation,
```
inline bool swap_if(bool c, int& a, int& b) {
  int v[2] = { a, b };
  b = v[1-c], a = v[c];
  return c;
}
```
is much faster than `std::sort` compiled by both Gcc and Clang, but detours
values through L1 cache, at some cost, so is slower than the `cmov` version. A
fourth version,
```
inline bool swap_if(bool c, int& a, int& b) {
  int v[2] = { a, b };
  a = v[c], b = v[!c];
  return c;
}
```
is about the same speed as the third when built with Clang, but with Gcc is
quite a lot slower than `std::sort`. Order matters, somehow, as does the choice
of operator. (I don't know how to express a bug report for this last case.
Advice welcome.)

In
```
inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, tb = b;
  a = c ? tb : ta;
//  b = c ? ta : tb;
  return c;
}
```
(at least when it is not inlined) Gcc seems happy to generate the `cmov`
instruction. Apparently the optimization code is very jealous about what else
is allowed in the basic block where a `cmov` is considered.

Finally, even with 
```
inline bool swap_if(bool c, int& a, int& b) {
  int ta = a, tb = b;
  a = __builtin_expect_with_probability(c, 0, 0.5) ? tb : ta;
  b = __builtin_expect_with_probability(c, 0, 0.5) ? ta : tb;
  return c;
}
```
Gcc still will not generate the `cmov` instructions.

=== Discussion:

Replacing a branch with `cmov` may result in slower code, particularly on older
CPU targets. However, when the programmer provides direct information that the
branch is unpredictable, it seems like the compiler should be willing to act on
that expectation.

In that light, Clang's conversion of "`(mask & a)|(~mask & b)`" to `cmov` seems
to be universally correct. It is a portable formula that gives better results
than the branching version even without specific optimization, but may easily
be rewritten using `cmov` for even better results.

In addition, when `__builtin_expect_with_probability` is used to indicate
unpredictability, there seems to be no defensible reason not to rewrite the
expression to use `cmov`.

Finally, the indexed form of `swap_if` may be recognized and turned into `cmov`
instructions without worry that a predictable branch has been replaced,
avoiding the unnecessary detour through memory.

[Bug middle-end/89501] Odd lack of warning about missing initialization

2019-03-04 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

--- Comment #13 from ncm at cantrip dot org ---
What I am getting is that the compiler is leaving that permitted optimization
-- eliminating the inode check -- on the table. It is doing that not so much
because it would make Linus angry, but as an artifact of the particular
optimization processes used in Gcc at the moment. Clang, or some later release
of Gcc or Clang, or even this Gcc under different circumstances, might choose
differently.

But maybe there are some flavors of UB, among which returning uninitialized
variables might be the poster child, that you don't ever want to use to drive
some kinds of optimizations. Maybe Gcc's process has that baked in.

[Bug middle-end/89501] Odd lack of warning about missing initialization

2019-03-04 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89501

ncm at cantrip dot org changed:

   What|Removed |Added

 CC||ncm at cantrip dot org

--- Comment #9 from ncm at cantrip dot org ---
What I don't understand is why it doesn't optimize away the check on
(somecondition), since it is assuming the code in the dependent block always
runs.

[Bug tree-optimization/67153] [6/7/8/9 Regression] integer optimizations 53% slower than std::bitset<>

2018-07-17 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #26 from ncm at cantrip dot org ---
Still fails on Skylake (i7-6700HQ) and gcc 8.1.0.

The good news is that clang++-7.0.0 code is slow on all four versions.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2016-02-08 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #22 from ncm at cantrip dot org ---
(In reply to Nathan Kurz from comment #21)
> My current belief is
> that everything here is expected behavior, and there is no bug with either
> the compiler or processor.  
> 
> The code spends most of its time in a tight loop that depends on runtime
> input, and the compiler doesn't have any reason to know which branch is more
> likely.   The addition of "count" changes the heuristic in recent compilers,
> and by chance, changes it for the worse.  

I am going to disagree, carefully.  It seems clear, in any case, that Haswell
is off the hook.

1. As a correction: *without* the count takes twice as long to run as with,
   or when using bitset<>.

2. As a heuristic, favoring a branch to skip a biggish loop body evidently 
   has much less downside than favoring the branch into it.  Maybe Gcc
   already has such a heuristic, and the addition of 7 conditional 
   increments in the loop, or whatever overhead bitset<> adds, was enough
   to push it over?  

Westmere runs both instruction sequences (with and without __builtin_expect)
the same.  Maybe on Westmere the loop takes two cycles regardless of code 
placement, and Gcc is (still) tuned for the older timings?

[Bug c++/58855] Attributes ignored on type alias in template

2015-12-25 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58855

ncm at cantrip dot org changed:

   What|Removed |Added

 CC||ncm at cantrip dot org

--- Comment #1 from ncm at cantrip dot org ---
This bug is still present in g++-5.3.1:

$ cat usingbug.cc

template <unsigned N>
struct S {
  typedef unsigned __attribute__((vector_size(N*sizeof(unsigned)))) T1;
  using T2 = unsigned __attribute__((vector_size(N*sizeof(unsigned))));
};

int main()
{
    S<4u>::T1 v1;
    S<4u>::T2 v2;

    return v1[1] + v2[2];
}

$ g++ -std=c++14 usingbug.cc 
usingbug.cc: In function ‘int main()’:
usingbug.cc:13:24: error: invalid types ‘S<4u>::T2 {aka unsigned int}[int]’ for
array subscript
 return v1[1] + v2[2];
^
$ g++ --version
g++ (Debian 5.3.1-4) 5.3.1 20151219

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-12-04 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #18 from ncm at cantrip dot org ---
It is far from clear to me that gcc-5's choice to put the increment value in a
register, and use just one loop body, is wrong. Rather, it appears that an
incidental choice in the placement order of basic blocks or register assignment
interacts badly with a bug in Haswell branch prediction, value dependency
tracking, micro-op cache, or something.  An actual fix for this would need to
identify and step around Haswell's sensitivity to whichever detail of code
generation this program happens upon.

[Bug c++/68703] New: __attribute__((vector_size(N))) template member confusion

2015-12-04 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68703

Bug ID: 68703
   Summary: __attribute__((vector_size(N))) template member
confusion
   Product: gcc
   Version: 5.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

$ cat vs2.cc
template <int N = 4>
struct D {
    int v //[N]
        __attribute__((vector_size(N * sizeof(int))));
    int f1() { return this->v[N-1]; }
    int f2() { return v[N-1]; }
};

int main(int ac, char**)
{
  D<> d = { { ac } };
  return d.f1() + d.f2();
}

$ g++ vs2.cc
vs2.cc: In member function ‘int D::f2()’:
vs2.cc:6:28: error: invalid types ‘int[int]’ for array subscript
 int f2() { return v[N-1]; }
^

Notice that f1, with "this->", is OK.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-16 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #14 from ncm at cantrip dot org ---
A notable difference between g++-4.9 output and g++-5 output is that,
while both hoist the (word == seven) comparison out of the innermost
loop, gcc-4.9 splits inner loop into two versions, one that increments 
scores by 3 and another that increments by 1, where g++-5 saves 3 or 1
into a register and uses the same inner loop for both cases.

Rewriting the critical loop
  - to run with separate inner loops
     - does not slow down the fast g++-4.9-compiled program, but
     - fails to speed up the slow g++-5-compiled program.
  - to precompute a 1 or 3 increment, with one inner loop for both cases
     - does slow down the previously fast g++-4.9-compiled program, and
     - does not change the speed of the slow g++-5-compiled program.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-16 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #13 from ncm at cantrip dot org ---
This is essentially the entire difference between the versions of
puzzlegen-int.cc without, and with, the added ++count; line 
referenced above (modulo register assignments and branch labels)
that sidesteps the +50% pessimization:

(Asm is from g++ -fverbose-asm -std=c++14 -O3 -Wall -S $SRC.cc using
g++ (Debian 5.2.1-15) 5.2.1 20150808, with no instruction-set extensions 
specified.  Output with -mbmi -mbmi2 has different instructions, but
they do not noticeably affect run time on Haswell i7-4770.)

@@ -793,25 +793,26 @@
 .L141:
    movl    (%rdi), %esi        # MEM[base: _244, offset: 0], word
    testl   %r11d, %esi         # D.66634, word
    jne     .L138               #,
    xorl    %eax, %eax          # tmp419
    cmpl    %esi, %r12d         # word, seven
    leaq    208(%rsp), %rcx     #, tmp574
    sete    %al                 #, tmp419
    movl    %r12d, %edx         # seven, seven
    leal    1(%rax,%rax), %r8d  #, D.66619
    .p2align 4,,10
    .p2align 3
 .L140:
    movl    %edx, %eax          # seven, D.66634
    negl    %eax                # D.66634
    andl    %edx, %eax          # seven, D.66622
    testl   %eax, %esi          # D.66622, word
    je      .L139               #,
    addl    %r8d, 24(%rcx)      # D.66619, MEM[base: _207, offset: 24B]
+   addl    $1, %ebx            #, count
 .L139:
    notl    %eax                # D.66622
    subq    $4, %rcx            #, ivtmp.424
    andl    %eax, %edx          # D.66622, seven
    jne     .L140               #,
    addq    $4, %rdi            #, ivtmp.428
    cmpq    %rdi, %r10          # ivtmp.428, D.66637
    jne     .L141               #,

I tried a version of the program with a fixed-length loop (over
'place' in [6..0]) so that branches do not depend on the results of
rest &= ~-rest.  The compiler unrolled the loop, but the program
ran at pessimized speed with or without the ++count line.

I am very curious whether this has been reproduced on others' Haswells,
and on Ivybridge and Skylake.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-13 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #12 from ncm at cantrip dot org ---
As regards hot spots, the program has two:

int score[7] = { 0, };
for (Letters word : words)
/**/if (!(word & ~seven))
        for_each_in_seven([&](Letters letter, int place) {
            if (word & letter)
/**/            score[place] += (word == seven) ? 3 : 1;
        });

The first is executed 300M times, the second 3.3M times.
Inserting a counter bump before the second eliminates the slowdown:

if (word & letter) {
    ++count;
/**/score[place] += (word == seven) ? 3 : 1;
}

This fact seems consequential.  The surrounding for_each_in_seven
loop isn't doing popcounts, but is doing while (v &= ~-v).

I have repeated tests using -m[no-]bmi[2], with identical results
(i.e. no effect).


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-12 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #9 from ncm at cantrip dot org ---
I did experiment with -m[no-]bmi[2] a fair bit.  It all made a significant
difference in the instructions emitted, but exactly zero difference in 
runtime. That's actually not surprising at all; those instructions get 
decomposed into micro-ops that exactly match those from the equivalent
instructions, and are cached, and the loops that dominate runtime execute 
out of the micro-op cache.  The only real effect is maybe slightly shorter
object code, which could matter in a program dominated by bus traffic
with loops too big to cache well.  I say maybe slightly shorter because
instruction-set extension instructions are actually huge, mostly prefixes.

I.e. most of the BMI stuff is marketing fluff, added mainly to make the 
competition waste money matching them instead of improving the product.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-12 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #11 from ncm at cantrip dot org ---
Aha, Uroš, I see your name in 

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Please forgive me for teaching you about micro-ops.

The code being generated for all versions does use (e.g.)
popcntq %rax, %rax almost everywhere.  Not quite everywhere -- I see 
one popcntq %rax, %rdx -- but certainly in all the performance-sensitive 
bits.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-12 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #10 from ncm at cantrip dot org ---
I found this, which at first blush seems like it might be relevant.
The hardware complained about here is the same Haswell i7-4770.

http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-11 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #6 from ncm at cantrip dot org ---
It seems worth adding that the same failure occurs without -march=native.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-10 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #4 from ncm at cantrip dot org ---
Also fails 5.2.1 (Debian 5.2.1-15) 5.2.1 20150808
As noted, the third version of the program, using bitset but not using
lambdas, is as slow as the version using unsigned int -- even when built
using gcc-4.9.  (Recall the int version and the first bitset version
run fast when built with gcc-4.9.)

Confirmed that on Westmere, compiled -march=native, all versions 
run about the same speed with all versions of the compiler reported,
and this runtime is about the same as the slow Haswell speed despite
the very different clock rate.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-10 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #5 from ncm at cantrip dot org ---
My preliminary conclusion is that a hardware optimization provided in Haswell
but not in Westmere is not recognizing, in the unsigned int test case, the
opportunity that it finds in the original bitset version as compiled by gcc-5.

I have also observed that adding an assertion that the array index is not
negative, before the first array access, slows the program a further 100%, 
on Westmere.

Note that the entire data set fits in L3 cache on all tested targets, so
memory bandwidth does not figure.

To my inexperienced eye the effects look like branch mispredictions.
I do not understand why a 3.4 GHz DDR3 Haswell runs as slowly as a 
2.4 GHz DDR2 Westmere, when branch prediction (or whatever it is) 
fails.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset

2015-08-09 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #3 from ncm at cantrip dot org ---
Created attachment 36159
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36159&action=edit
bitset, but using an inlined container adapter, not lambdas, and slow

This version compiles just as badly as the integer version, even with gcc-4.9.


[Bug c++/67153] New: integer optimizations 53% slower than std::bitset

2015-08-07 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

Bug ID: 67153
   Summary: integer optimizations 53% slower than std::bitset
   Product: gcc
   Version: 5.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

Created attachment 36146
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36146&action=edit
The std::bitset version

I have attached two small, semantically equivalent C++14 programs.

One uses std::bitset<26> for its operations; the other uses raw unsigned int.
The one that uses unsigned int runs 53% slower than the bitset version, as
compiled with g++-5.1 and running on a 2013-era Haswell i7-4770.  While this
represents, perhaps, a stunning triumph in the optimization of inline member
and lambda functions operating on structs, it may represent an equally
intensely embarrassing, even mystifying, failure for optimization of the
underlying raw integer operations.

For both, build and test was with

  $ g++-5 -O3 -march=native -mtune=native -g3 -Wall $PROG.cc
  $ time ./a.out | wc -l
  2818

Times on a 3.2GHz Haswell are consistently 0.25s for the unsigned int
version, 0.16s for the std::bitset<26> version.

These programs are archived at https://github.com/ncm/nytm-spelling-bee/.

The runtimes of the two versions are identical as built and run on my
2009 Westmere 2.4GHz i5-M520, and about the same as the integer version
on Haswell.


[Bug c++/67153] integer optimizations 53% slower than std::bitset

2015-08-07 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #1 from ncm at cantrip dot org ---
Created attachment 36147
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36147&action=edit
The unsigned int version


[Bug c++/67153] integer optimizations 53% slower than std::bitset

2015-08-07 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

ncm at cantrip dot org changed:

   What|Removed |Added

 Target||Linux amd64
  Known to work||4.9.2
   Host||Linux amd64
Version|5.1.0   |5.1.1
  Known to fail||5.1.1
  Build||Linux amd64

--- Comment #2 from ncm at cantrip dot org ---
The 4.9.2 release, Debian 4.9.2-10, does not exhibit this bug.  When built
with g++-4.9, the unsigned int version is slightly faster than the
std::bitset version.  The g++-5 release used was Debian 5.1.1-9 20150602.

The Haswell host is running under a virtualbox VM, with /proc/cpuinfo
reporting stepping 3, microcode 0x19, and flags: fpu vme de pse tsc msr 
pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 
ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm

The compiler used for the test on the Westmere M520, that appears not to 
exhibit the bug, was a snapshot g++ (GCC) 6.0.0 20150504.


[Bug libstdc++/66055] New: hash containers missing required reserving constructors

2015-05-07 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66055

Bug ID: 66055
   Summary: hash containers missing required reserving
constructors
   Product: gcc
   Version: 4.8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

Created attachment 35487
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35487&action=edit
missing hash container constructors, provided.

Hash containers in libstdc++ each lack two required reserving constructors
taking size_type and further arguments.


[Bug libstdc++/66055] hash containers missing required reserving constructors

2015-05-07 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66055

ncm at cantrip dot org changed:

   What|Removed |Added

  Attachment #35487|0   |1
is obsolete||

--- Comment #1 from ncm at cantrip dot org ---
Created attachment 35488
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=35488&action=edit
Better fix

Previous patch was incompletely edited for inclusion.


[Bug c++/66028] New: false positive, unused loop variable

2015-05-05 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66028

Bug ID: 66028
   Summary: false positive, unused loop variable
   Product: gcc
   Version: 5.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ncm at cantrip dot org
  Target Milestone: ---

struct range {
int start; int stop;
struct iter {
int i;
bool operator!=(iter other) { return other.i != i; };
iter operator++() { ++i; return *this; };
int operator*() { return i; }
};
iter begin() { return iter{start}; }
iter end() { return iter{stop}; }
};
int main()
{
   int power = 1;
   for (int i : range{0,10})
   power *= 10;
}
bug.cc: In function ‘int main()’:
bug.cc:15:13: warning: unused variable ‘i’ [-Wunused-variable]
for (int i : range{0,10})

Manifestly, i is used to count loop iterations.  The warning cannot be
suppressed by any decoration of the declaration; the best we can do is

  void(i), power *= 10;

in the loop body.  The warning is useful in most cases.  The exception might be
that, here, the iterator has no reference or pointer members, and the loop body
changes external state.

[This matches clang bug https://llvm.org/bugs/show_bug.cgi?id=23416]

[Bug libstdc++/19495] basic_string::_M_rep() can produce an unnaturally aligned pointer to _Rep

2005-04-01 Thread ncm at cantrip dot org

--- Additional Comments From ncm at cantrip dot org  2005-04-01 13:24 ---
Subject: Re:  basic_string::_M_rep() can produce an unnaturally aligned pointer 
to _Rep

On Fri, Apr 01, 2005 at 11:42:27AM -, pcarlini at suse dot de wrote:
What|Removed |Added
 
Severity|normal  |enhancement
 
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19495

I don't see how this (or 8670, or the other one) is an enhancement 
request.  Users are absolutely allowed to make allocators that 
enforce only the alignment of the type they are instantiated on, 
and string is certainly using the wrong kind of allocator.

It's a fairly minor bug, but seems to me clearly a bug.

N


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19495