[Bug middle-end/32661] New: __builtin_ia32_vec_ext suboptimal for pointer/ref args

2007-07-06 Thread scovich at gmail dot com
Compiling the following with g++ -msse3 -O3:

#include <emmintrin.h>
int foo(__m128i* val) {
  return __builtin_ia32_vec_ext_v4si(*val, 1);
}
int bar(__m128i* val) {
  union vs {
    __m128i *_v;
    int* _s;
  } v = {val};
  return v._s[1];
}

yields the following assembler output. Ideally, both functions would be the
same:

_Z3fooPU8__vectorx:
.LFB497:
        pshufd  $85, (%rdi), %xmm0
        movd    %xmm0, %rax
        movq    %xmm0, -8(%rsp)
        ret
_Z3barPU8__vectorx:
.LFB498:
        movl    4(%rdi), %eax
        ret
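
For comparison, here is a minimal sketch of the same two access patterns written against the generic vector extension instead of the ia32 builtin. The type v4si_t and both function names are made up for illustration and are not part of the report, and vector subscripting assumes a newer GCC or Clang than 4.1.2.

```cpp
#include <cstring>

// Generic 4 x int vector (GCC/Clang extension); v4si_t is a hypothetical name.
typedef int v4si_t __attribute__((vector_size(16)));

// Read one lane through memory, similar in spirit to bar() in the report.
int lane_via_memory(const v4si_t* val, int lane) {
    int out;
    std::memcpy(&out,
                reinterpret_cast<const char*>(val) + lane * sizeof(int),
                sizeof out);
    return out;
}

// Read one lane with vector subscripting, similar in spirit to foo().
int lane_via_subscript(const v4si_t* val, int lane) {
    return (*val)[lane];
}
```

Both functions should return the same value for any lane from 0 to 3, so comparing their generated code is one way to check whether a given compiler still treats the two forms differently.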


-- 
   Summary: __builtin_ia32_vec_ext suboptimal for pointer/ref args
   Product: gcc
   Version: 4.1.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: x86_64-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32661



[Bug middle-end/32662] New: Significant extra code generation for 64x64=128-bit multiply

2007-07-06 Thread scovich at gmail dot com
Consider the following functions:

typedef unsigned long long int u64;
void foo(u64* d, u64 const* s, u64 k) {
    *d = ((__uint128_t) *s*k) >> 64;
}
void foo(u64* d, u64 const* s, u64 k, u64 m) {
    *d = ((__uint128_t) (*s&m)*k) >> 64;
}
void foo2(u64* d, u64 const* s, u64 k) {
    foo(d,  s,  k);
    foo(d+1,s+1,k);
}
void foo2(u64* d, u64 const* s, u64 k, u64 m) {
    foo(d,  s,  k, m);
    foo(d+1,s+1,k, m);
}

Compiling them with g++ -O3 gives:

_Z3fooPyPKyy:
        movq    %rdx, %rax
        mulq    (%rsi)
        movq    %rdx, (%rdi)
        ret
_Z3fooPyPKyyy:
        andq    (%rsi), %rcx
        movq    %rcx, %rax
        mulq    %rdx
        movq    %rdx, (%rdi)
        ret
_Z4foo2PyPKyy:
        movq    (%rsi), %rax
        xorl    %r9d, %r9d
        movq    %rdx, %r8
        movq    %r9, %rcx
        imulq   %rax, %rcx
        mulq    %rdx
        leaq    (%rcx,%rdx), %rdx
        movq    %r9, %rcx
        movq    %rdx, (%rdi)
        movq    8(%rsi), %rax
        imulq   %rax, %rcx
        mulq    %r8
        leaq    (%rcx,%rdx), %rdx
        movq    %rdx, 8(%rdi)
        ret
_Z4foo2PyPKyyy:
        movq    %rcx, %rax
        andq    (%rsi), %rax
        movq    %rdx, %r10
        xorl    %r11d, %r11d
        xorl    %edx, %edx
        movq    %rdx, %r8
        movq    %r11, %r9
        imulq   %r10, %r8
        imulq   %rax, %r9
        mulq    %r10
        addq    %r9, %r8
        leaq    (%r8,%rdx), %rdx
        movq    %rdx, (%rdi)
        andq    8(%rsi), %rcx
        xorl    %edx, %edx
        movq    %r11, %rsi
        movq    %rcx, %rax
        movq    %rdx, %rcx
        imulq   %rax, %rsi
        imulq   %r10, %rcx
        mulq    %r10
        addq    %rsi, %rcx
        leaq    (%rcx,%rdx), %rdx
        movq    %rdx, 8(%rdi)
        ret

The two versions of foo() do exactly what you would expect: AND+MUL, then store
the high 64 bits of the product. The two versions of foo2(), on the other hand,
perform two and four signed multiplies, respectively, in addition to the two
unsigned multiplies that would be expected. In my debugger, at least,
xorl %edx, %edx zeros out all 64 bits, so every signed multiply has a zero
operand and gives zero for its result, making them completely redundant.

Compiling without optimizations gives the IMUL+IMUL+MUL combination even for
foo(), so it appears that the optimizer is missing something once it has more
than one multiply to deal with.
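
The operation under test can be restated as a small helper so that the expected results are easy to check by hand. This is only a sketch: mulhi and mulhi2 are hypothetical names, and __uint128_t assumes a 64-bit GCC or Clang target.

```cpp
typedef unsigned long long u64;

// High 64 bits of the full 128-bit unsigned product a*b.
static inline u64 mulhi(u64 a, u64 b) {
    return (u64)(((__uint128_t) a * b) >> 64);
}

// Two independent high-half multiplies, mirroring the three-argument foo2().
void mulhi2(u64* d, u64 const* s, u64 k) {
    d[0] = mulhi(s[0], k);
    d[1] = mulhi(s[1], k);
}
```

Ideally each call compiles to a single widening MUL plus a store, with no extra IMUL instructions; inspecting the output of -O3 -S for mulhi2() is a quick way to see whether a particular compiler version still shows the problem.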


-- 
   Summary: Significant extra code generation for 64x64=128-bit multiply
   Product: gcc
   Version: 4.1.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: x86_64-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32662



[Bug c++/32412] New: Passing struct as parameter breaks SRA for stack-allocated struct inside called function

2007-06-20 Thread scovich at gmail dot com
sra-bug.C (below) contains a function which stack-allocates a local struct
containing two small arrays. The function depends on SRA to eliminate repeated
memory accesses to the two arrays as it streams over a large, third array.

The performance of the executables resulting from
g++ -Wall -O3 -msse3 -fpeel-loops sra-bug.C
and
g++ -Wall -O3 -msse3 -fpeel-loops sra-bug.C -DTRIGGER_BUG
differs by exactly 2x on my machine (a 2.66GHz Core2 quad Xeon), with the
runtime increasing from .395 ns/value/entry to .790 ns/value/entry. 

The only difference between the two versions is whether the array pointer and
count are passed as separate arguments (fast) or wrapped in a struct (slow),
even though the latter gets copied into local variables before use. Use of the
__restrict keyword didn't seem to make a difference. The assembler output shows
that excessive loads and stores nearly double the instruction count of the
unrolled inner loop for the slower case.

FYI gcc-4.2.0 shows similar behavior, though its output is slower than 4.1 for
both cases (.420ns vs 1.10ns). gcc-4.3-20070617 performs equally badly on both
versions of the code (.690 ns/value/entry). 

sra-bug.C:
===
#include <emmintrin.h>
#include <stdint.h>
#include <cassert>
#include <cstdio>
#include <sys/time.h>

struct stopwatch_t {
    struct timeval tv; long long mark;
    stopwatch_t() { reset(); }
    double time_ns() {
        long long old_mark = mark; reset(); return 1e3*(mark - old_mark);
    }
    void reset() {
        // mark is in microseconds
        gettimeofday(&tv, NULL); mark = tv.tv_usec + tv.tv_sec*1000000ll;
    }
};

template<int N, class T, class Action>
inline void unrolled_loop(T* entries, Action& action) {
  for(int i=0; i < N; i++) action(entries[i]);
}

static __m128i const ALL_ZEROS = {0ull, 0ull};
static __m128i const ALL_ONES = {~0ull, ~0ull};
static int const COUNT=4;

struct Action16 {
  __m128i _results[COUNT];
  __m128i _values[COUNT];
  __m128i* _dest;
  Action16(__m128i* dest, uint64_t const* values) : _dest(dest) {
    for(int i=0; i < COUNT; i++) {
      _results[i] = ALL_ZEROS;
      _values[i] = _mm_set1_epi16((short) values[i]);
    }
  }
  void operator()(__m128i const entry) {
    for(int i=0; i < COUNT; i++)
      _results[i] |= _mm_cmpeq_epi16(_values[i], entry);
  }
  ~Action16() {
    for(int i=0; i < COUNT; i++)
      _dest[i] = _mm_movemask_epi8(_results[i])? ALL_ONES : ALL_ZEROS;
  }
};

struct wrapper {
  __m128i const* entries;
  int count;
};

#ifdef TRIGGER_BUG
void foo(__m128i* dest, uint64_t const* values, wrapper const& w) {
  __m128i const* entries = w.entries;  int count = w.count;
#else
void foo(__m128i* dest, uint64_t const* values, __m128i const* entries, int count) {
#endif
  static int const unroll_count=16;
  Action16 action(dest, values);
  assert((count % unroll_count) == 0);
  for(int i=0; i+unroll_count < count; i+=unroll_count)
    unrolled_loop<unroll_count>(&entries[i], action);
}

int main() {
  int VALUE_COUNT = 100;
  int LIST_SIZE = 2048;
  uint64_t* values = new uint64_t[VALUE_COUNT];
  __m128i* dest = (__m128i*) _mm_malloc(16*VALUE_COUNT, 16);
  __m128i entries[LIST_SIZE];
  wrapper w = {entries, LIST_SIZE};
  stopwatch_t timer;
  for(int j=0; j < 5; j++) {
    for(int i=0; i < VALUE_COUNT; i+= COUNT) {
#ifdef TRIGGER_BUG
      foo(dest+i, values+i, w);
#else
      foo(dest+i, values+i, entries, LIST_SIZE);
#endif
    }
    printf("%.3lf ns/value/entry\n", timer.time_ns()/LIST_SIZE/VALUE_COUNT);
  }
}


-- 
   Summary: Passing struct as parameter breaks SRA for stack-allocated struct inside called function
   Product: gcc
   Version: 4.1.2
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: x86_64-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32412



[Bug middle-end/32412] Passing struct as parameter breaks SRA for stack-allocated struct inside called function

2007-06-20 Thread scovich at gmail dot com


--- Comment #2 from scovich at gmail dot com  2007-06-20 17:49 ---
(In reply to comment #1)
> wrapper const& w
> 
> You are passing via reference which does not break SRA, just changes the ABI
> and such.
> 
> This is a very very hard problem to solve without the whole program.
> 
> I wondering if I should close it as won't fix.
> 

I'm not convinced the ABI change by itself is the culprit:
1. Passing w by value gives the same result. Granted, passing a struct at all
changes the ABI, but the const ref part isn't an issue, at least.
2. You have to actually use the wrapper's 'entries' pointer for the problem to
appear (diff for modified test case below).
3. The problem goes away if you convert Action16 to use scalars instead of
arrays, so SRA for structs is unaffected. 

Why does passing a pointer inside a struct on the stack instead of passing it
in a register suddenly require the whole program to analyze properly? There's
no way stack-allocated arrays can alias with arrays passed into the function. I
would have expected a few extra instructions in the function prologue to load
the values into registers, followed by business as usual. 

$ diff sra-bug.C.orig sra-bug.C
==
51a52,54
> void foo(__m128i* dest, uint64_t const* values,
>          __m128i const* _entries, int _count, wrapper w)
> {
53d55
< void foo(__m128i* dest, uint64_t const* values, wrapper const& w) {
56c58
< void foo(__m128i* dest, uint64_t const* values, __m128i const* entries, int count) {
---
>   __m128i const* entries = _entries; int count = _count;
75,79c77
< #ifdef TRIGGER_BUG
<   foo(dest+i, values+i, w);
< #else
<   foo(dest+i, values+i, entries, LIST_SIZE);
< #endif
---
>   foo(dest+i, values+i, entries, LIST_SIZE, w);


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32412



[Bug middle-end/32412] Passing struct as parameter breaks SRA for stack-allocated struct inside called function

2007-06-20 Thread scovich at gmail dot com


--- Comment #3 from scovich at gmail dot com  2007-06-20 18:22 ---
(In reply to comment #1)

Sorry for the double post, but I just tried creating a wrapper_foo() that
copies the values out of the struct, then passes them on to foo() as scalars.
The problem only appears if foo() gets inlined into wrapper_foo().
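
The shape of that experiment can be sketched with plain ints instead of __m128i to keep it short. The names foo_scalars and wrapper_foo and the noinline attribute are assumptions added here: the comment above only says that the slowdown shows up once the callee is inlined, so marking it noinline is one way to compare the two cases, not a confirmed fix.

```cpp
struct wrapper {
    int const* entries;
    int count;
};

// Stand-in for foo(): takes the pointer and count as scalars.
// noinline (a GCC/Clang attribute) keeps it out of wrapper_foo().
__attribute__((noinline))
long foo_scalars(int const* entries, int count) {
    long sum = 0;
    for (int i = 0; i < count; i++)
        sum += entries[i];
    return sum;
}

// Copies the fields out of the struct, then forwards them as scalars.
long wrapper_foo(wrapper const& w) {
    int const* entries = w.entries;
    int count = w.count;
    return foo_scalars(entries, count);
}
```

Removing the attribute and timing both builds would reproduce the inlined versus non-inlined comparison described above.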


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32412



[Bug c++/32291] New: -Wformat either too picky or not picky enough

2007-06-11 Thread scovich at gmail dot com
Compiling the following snippet with -Wformat (or -Wall) causes the compiler to
complain: wformat-bug.C:8: warning: format '%u' expects type 'unsigned int',
but argument 2 has type 'uint32_t'

The problem seems to be that stdint.h defines uint32_t as long in cygwin. I
realize that int != long on some platforms, but i686 isn't one of them. Why
should the user be forced to cast their uint32_t (read: 32-bit unsigned int) to
unsigned int before passing it to printf() when they are logically identical? 

On the other hand, there's no complaint about passing a signed integer into an
unsigned format or vice-versa, even though the output value might actually
change because of the oversight in those cases.

wformat.C:
===
#include <cstdio>
#include <stdint.h>

int main() {
  uint32_t a = ~0;
  unsigned int b = a;
  uint32_t c = b;
  printf("%u\n", c); // warning (?)

  int d = c;
  unsigned e = d;
  printf("%d\n", e); // no warning
  printf("%u\n", d); // no warning
}
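
One way to sidestep the warning without a cast is the PRIu32 format macro from the C99 inttypes header, which expands to whatever conversion matches uint32_t on the platform. A minimal sketch, assuming a toolchain that provides <cinttypes>; format_u32 is a hypothetical helper, not something from the report.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a uint32_t with the platform-correct conversion specifier.
std::string format_u32(uint32_t v) {
    char buf[16];  // 10 digits plus the terminator fit comfortably
    std::snprintf(buf, sizeof buf, "%" PRIu32, v);
    return std::string(buf);
}
```

This does not address the second half of the report, the missing warning for signed/unsigned mismatches.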


-- 
   Summary: -Wformat either too picky or not picky enough
   Product: gcc
   Version: 4.2.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: i686-pc-cygwin


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32291



[Bug c/32292] New: pthread_exit should have attribute __noreturn__

2007-06-11 Thread scovich at gmail dot com
The following generates a spurious warning about control reaching the end of a
non-void function:

#include <pthread.h>
void* foo(void*) {
  pthread_exit(1);
}


-- 
   Summary: pthread_exit should have attribute __noreturn__
   Product: gcc
   Version: 4.2.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: i686-pc-cygwin


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32292



[Bug target/30315] optimize unsigned-add overflow test on x86 to use cpu flags from addl

2007-06-05 Thread scovich at gmail dot com


--- Comment #1 from scovich at gmail dot com  2007-06-06 03:39 ---
Happens on x86_64-unknown-linux-gnu as well, for both 4.2.0 and 4.3 (20070605)

The problem is even worse for 128-bit arithmetic because it has to check two
registers (with associated branches) before making a decision. This in spite of
the fact that sbb sets the flags properly AFAIK:

bool sub128(__uint128_t& dest, __uint128_t a, __uint128_t b) {
  dest = a - b;
  if(dest > a) abort();
}

_Z6sub128Rooo:
.LFB557:
        movq    %rsi, %rax
        movq    %rdx, %r10
        pushq   %rbx
.LCFI0:
        subq    %rcx, %rax
        sbbq    %r8, %rdx
        movq    %rax, (%rdi)
        cmpq    %rdx, %r10
        movq    %rdx, 8(%rdi)
        ja      .L23
        jae     .L24
.L21:
        call    abort
        .p2align 4,,7
.L24:
        cmpq    %rax, %rsi
        .p2align 4,,6
        jb      .L21
        .p2align 4,,7
.L23:
        popq    %rbx
        .p2align 4,,5
        ret

There's not really a way to work around it with inline asm, either, because of
the branch on overflow that will most likely come right afterward...
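
Later compilers (GCC 5 and newer, and Clang) provide overflow-checking builtins that express this test directly. A minimal sketch, assuming such a compiler on a 64-bit target; sub128_overflows is a hypothetical name, and nothing here applies to the 4.2 and 4.3 versions discussed above.

```cpp
// Returns true if a - b wrapped around; dest receives the wrapped result.
bool sub128_overflows(__uint128_t& dest, __uint128_t a, __uint128_t b) {
    return __builtin_sub_overflow(a, b, &dest);
}
```

Whether the generated code branches on the carry flag from sbb is still up to the compiler, so the assembly is worth checking.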


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30315



[Bug c++/32186] New: -ggdb emits broken debug info

2007-06-02 Thread scovich at gmail dot com
Compiling with 'g++ -ggdb' confuses both gdb-6.5 and gdb-6.6 into thinking
they've got corrupted stacks. My programs all seem to execute properly, both
alone and inside gdb -- it's just hard to debug anything. Using 'g++ -g'
instead seems to work fine.

Consider 'g++ -ggdb foo.cpp' with the following code snippet:

void foo() {
  int i;
  i=1;
}

int main() {
  foo();
  return 0;
}

Below is the gdb output as I step through the resulting executable. 

Current directory is c:/cygwin/home/johnsory/experiments/
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as i686-pc-cygwin...
(gdb) start
Breakpoint 1 at 0x401071: file foo.cpp, line 6.
Starting program: /home/johnsory/experiments/a.exe 
Loaded symbols for /cygdrive/c/WINDOWS/system32/ntdll.dll
Loaded symbols for /cygdrive/c/WINDOWS/system32/kernel32.dll
Loaded symbols for /usr/bin/cygwin1.dll
Loaded symbols for /cygdrive/c/WINDOWS/system32/advapi32.dll
Loaded symbols for /cygdrive/c/WINDOWS/system32/rpcrt4.dll
main () at foo.cpp:6
(gdb) step
(gdb) bt
#0  main () at foo.cpp:7
(gdb) step
foo () at foo.cpp:1
(gdb) bt
#0  foo () at foo.cpp:1
#1  0x00401050 in mainCRTStartup ()
(gdb) step
(gdb) bt
#0  foo () at foo.cpp:3
#1  0x00401056 in foo () at foo.cpp:1
#2  0x00401056 in foo () at foo.cpp:1
#3  0x00401056 in foo () at foo.cpp:1
#4  0x00401056 in foo () at foo.cpp:1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) step
(gdb) bt
#0  foo () at foo.cpp:4
#1  0x0040105d in foo () at foo.cpp:3
#2  0x0040105d in foo () at foo.cpp:3
#3  0x0040105d in foo () at foo.cpp:3
#4  0x0040105d in foo () at foo.cpp:3
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) step
main () at foo.cpp:8
(gdb) bt
#0  main () at foo.cpp:8
(gdb) step
(gdb) bt
#0  main () at foo.cpp:9
(gdb) step
0x61006198 in dll_crt0_1 () from /usr/bin/cygwin1.dll
(gdb) bt
#0  0x61006198 in dll_crt0_1 () from /usr/bin/cygwin1.dll
#1  0x61004416 in _cygtls::call2 () from /usr/bin/cygwin1.dll
#2  0x in ?? ()
(gdb) step
Single stepping until exit from function _Z10dll_crt0_1Pv, 
which has no line number information.

Program exited normally.
(gdb)


-- 
   Summary: -ggdb emits broken debug info
   Product: gcc
   Version: 4.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: i686-pc-cygwin


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32186



[Bug c++/32186] -ggdb emits broken debug info

2007-06-02 Thread scovich at gmail dot com


--- Comment #1 from scovich at gmail dot com  2007-06-02 09:37 ---
It also appears that 'next' is broken and acts like 'step' (enter all
functions), while 'finish' acts like 'continue' (run to completion, barring a
breakpoint). 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32186



[Bug c++/32073] New: Loop unrolling does not exploit VRP for loop bound

2007-05-24 Thread scovich at gmail dot com
Loops with a bounded, small number of iterations unroll too much. They should
be peeled away instead. For example, if I compile the following function with
``-O3 -funroll-loops'':

void short_loop(int* dest, int* src, int count) {
  // same happens for assert(count <= 4) and if(count > 4) exit(-1)
  if(count > 4)
    count = 4;

  for(int i=0; i < count; i++)
    dest[i] = src[i];
}

The assembly output (for i686-pc-cygwin) is an 8x Duff's device, of which 75%
of the code will never execute (translated back to C++ here for readability):

void short_loop(int* dest, int* src, int count) {
  // same happens for assert(count <= 4) and if(count > 4) exit(-1)
  if(count > 4)
    count = 4;

  int mod = count % 8;
  switch(mod) {
  case 7:
    // loop body
    count--;
  case 6:
    // loop body
    count--;
  case 5:
    // loop body
    count--;
  case 4:
    // loop body
    count--;
  case 3:
    // loop body
    count--;
  case 2:
    // loop body
    count--;
  case 1:
    // loop body
    count--;
  default:
    for(int i=0; i < count; i+=8) {
      // 8x unrolled loop body
    }
  }
}

We need 25% of that code:

void short_loop(int* dest, int* src, int count) {
  // same happens for assert(count <= 4) and if(count > 4) exit(-1)
  if(count > 4)
    count = 4;

  switch(count) {
  case 4:
    // loop body
  case 3:
    // loop body
  case 2:
    // loop body
  case 1:
    // loop body
  default:
    break;
  }
}
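
Written out by hand, the fully peeled form might look like the sketch below. The name short_loop_peeled is hypothetical, and the copy order is reversed compared with the original loop, which is only safe under the assumption that dest and src do not overlap.

```cpp
// Hand-peeled copy of at most four ints; relies on switch fall-through.
void short_loop_peeled(int* dest, int* src, int count) {
    if (count > 4)
        count = 4;

    switch (count) {
    case 4: dest[3] = src[3];  // fall through
    case 3: dest[2] = src[2];  // fall through
    case 2: dest[1] = src[1];  // fall through
    case 1: dest[0] = src[0];  // fall through
    default: break;            // count <= 0: nothing to copy
    }
}
```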


-- 
   Summary: Loop unrolling does not exploit VRP for loop bound
   Product: gcc
   Version: 4.2.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32073



[Bug c/32074] New: Optimizer does not exploit assertions

2007-05-24 Thread scovich at gmail dot com
It would be nice if the optimizer took advantage of assertions. I realize that
assertions may not be enabled for production code, but even when disabled they
are still explicit statements of the programmer's assumptions; the compiler
should be able to exploit those assumptions if it yields better code (or avoids
annoying warnings). 

To me, ``assert(!bad_thing)'' indicates that ``bad_thing'' should not be
allowed to happen; compiling with assertions disabled means that ``bad_thing''
is assumed not to happen. Therefore, code that breaks when ``bad_thing ==
true'' is my bug, not the compiler's, and not necessarily worse than the bug(s)
caused by return values or side effects of ``correct'' code after an enabled
assertion would have terminated the program.

For example, -funroll-loops on the following code results in an 8x Duff's
device, even though no acceptable input will run more than twice. In this
particular case, ``if(bad_thing) exit(-1)'' does the same thing.


void short_loop(int* dest, int* src, int count) {
  // same happens for if(count > 2) exit(-1)
  assert(count <= 2);

  for(int i=0; i < count; i++)
    dest[i] = src[i];
}

As another example, compiling the following switch statement with -Wall causes
complaints about control reaching the end of a non-void function:

int limited_switch(int a, int b, int what) {
  switch(what) {
  case 0:
    return a+b;
  case 1:
    return a;
  case 2:
    return b;
  case 3:
    return a-b;
  default:
    // unreachable
    assert(false);
  }
}

The following variant of the previous switch statement, which also has an
undefined return value for (what < 0 || what >= 4), doesn't cause any warnings
at all, though it's arguably less correct -- at least with the first variant
the programmer indicated that she thought the matter through.


int limited_switch(int a, int b, int what) {
  int result;
  switch(what) {
  case 0:
    result = a+b;
    break;
  case 1:
    result = a;
    break;
  case 2:
    result = b;
    break;
  case 3:
    result = a-b;
    break;
  default:
    break;
  }
  return result;
}
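
GCC 4.5 and later (and Clang) offer __builtin_unreachable(), which covers part of this request. A minimal sketch, assuming such a compiler: the ASSUME macro and the function name are hypothetical, and unlike assert() the behavior is undefined if the condition is ever false.

```cpp
// Hypothetical macro: tells the optimizer that cond always holds.
// Undefined behavior if cond is false at run time.
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

int limited_switch_assumed(int a, int b, int what) {
    ASSUME(what >= 0 && what <= 3);
    switch (what) {
    case 0: return a + b;
    case 1: return a;
    case 2: return b;
    case 3: return a - b;
    }
    // Marks the fall-off path as impossible, so there is no missing return.
    __builtin_unreachable();
}
```

Because the end of the function is marked unreachable, the "control reaches end of non-void function" warning should not fire here.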


-- 
   Summary: Optimizer does not exploit  assertions
   Product: gcc
   Version: 4.2.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32074


