[Bug middle-end/32661] New: __builtin_ia32_vec_ext suboptimal for pointer/ref args
Compiling the following with g++ -msse3 -O3:

#include <emmintrin.h>

int foo(__m128i* val) {
    return __builtin_ia32_vec_ext_v4si(*val, 1);
}

int bar(__m128i* val) {
    union vs { __m128i *_v; int* _s; } v = {val};
    return v._s[1];
}

yields the following assembler output. Ideally, both functions would compile to the same code:

_Z3fooPU8__vectorx:
.LFB497:
        pshufd  $85, (%rdi), %xmm0
        movd    %xmm0, %rax
        movq    %xmm0, -8(%rsp)
        ret
_Z3barPU8__vectorx:
.LFB498:
        movl    4(%rdi), %eax
        ret

--
Summary: __builtin_ia32_vec_ext suboptimal for pointer/ref args
Product: gcc
Version: 4.1.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: x86_64-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32661
[Bug middle-end/32662] New: Significant extra code generation for 64x64=128-bit multiply
Consider the following functions:

typedef unsigned long long int u64;

void foo(u64* d, u64 const* s, u64 k) {
    *d = ((__uint128_t) *s*k) >> 64;
}
void foo(u64* d, u64 const* s, u64 k, u64 m) {
    *d = ((__uint128_t) (*s&m)*k) >> 64;
}
void foo2(u64* d, u64 const* s, u64 k) {
    foo(d, s, k);
    foo(d+1,s+1,k);
}
void foo2(u64* d, u64 const* s, u64 k, u64 m) {
    foo(d, s, k, m);
    foo(d+1,s+1,k, m);
}

Compiling them with g++ -O3 gives:

_Z3fooPyPKyy:
        movq    %rdx, %rax
        mulq    (%rsi)
        movq    %rdx, (%rdi)
        ret
_Z3fooPyPKyyy:
        andq    (%rsi), %rcx
        movq    %rcx, %rax
        mulq    %rdx
        movq    %rdx, (%rdi)
        ret
_Z4foo2PyPKyy:
        movq    (%rsi), %rax
        xorl    %r9d, %r9d
        movq    %rdx, %r8
        movq    %r9, %rcx
        imulq   %rax, %rcx
        mulq    %rdx
        leaq    (%rcx,%rdx), %rdx
        movq    %r9, %rcx
        movq    %rdx, (%rdi)
        movq    8(%rsi), %rax
        imulq   %rax, %rcx
        mulq    %r8
        leaq    (%rcx,%rdx), %rdx
        movq    %rdx, 8(%rdi)
        ret
_Z4foo2PyPKyyy:
        movq    %rcx, %rax
        andq    (%rsi), %rax
        movq    %rdx, %r10
        xorl    %r11d, %r11d
        xorl    %edx, %edx
        movq    %rdx, %r8
        movq    %r11, %r9
        imulq   %r10, %r8
        imulq   %rax, %r9
        mulq    %r10
        addq    %r9, %r8
        leaq    (%r8,%rdx), %rdx
        movq    %rdx, (%rdi)
        andq    8(%rsi), %rcx
        xorl    %edx, %edx
        movq    %r11, %rsi
        movq    %rcx, %rax
        movq    %rdx, %rcx
        imulq   %rax, %rsi
        imulq   %r10, %rcx
        mulq    %r10
        addq    %rsi, %rcx
        leaq    (%rcx,%rdx), %rdx
        movq    %rdx, 8(%rdi)
        ret

The two versions of foo() do exactly what you would expect: an AND (in the four-argument version), one MUL, then a store of the high 64 bits of the product. The two versions of foo2(), on the other hand, perform two and four signed multiplies, respectively, in addition to the two unsigned multiplies that would be expected. In my debugger, at least, xorl %edx, %edx zeros out all 64 bits, so the signed multiplies always produce zero, making them completely redundant. Compiling without optimizations gives the IMUL+IMUL+MUL combination even for foo(), so it appears that the optimizer is missing something once it has more than one multiply to deal with.

--
Summary: Significant extra code generation for 64x64=128-bit multiply
Product: gcc
Version: 4.1.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: x86_64-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32662
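For readers who want to reproduce the pattern outside the report, here is a minimal sketch of the same high-half multiply. The names mulhi_u64 and mulhi_pair are made up for illustration; __uint128_t is a GCC/Clang extension on 64-bit targets, not standard C++.

```cpp
#include <cstdint>

// High 64 bits of the full 128-bit product a*b.
// __uint128_t is a compiler extension (GCC/Clang, 64-bit targets).
inline std::uint64_t mulhi_u64(std::uint64_t a, std::uint64_t b) {
    return static_cast<std::uint64_t>((static_cast<__uint128_t>(a) * b) >> 64);
}

// Mirrors the three-argument foo2() from the report: two independent
// high-half multiplies on adjacent elements.
inline void mulhi_pair(std::uint64_t* d, std::uint64_t const* s, std::uint64_t k) {
    d[0] = mulhi_u64(s[0], k);
    d[1] = mulhi_u64(s[1], k);
}
```

Ideally each call would compile to a single unsigned MUL plus a store, as the single-call foo() does in the report.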
[Bug c++/32412] New: Passing struct as parameter breaks SRA for stack-allocated struct inside called function
sra-bug.C (below) contains a function which stack-allocates a local struct containing two small arrays. The function depends on SRA to eliminate repeated memory accesses to the two arrays as it streams over a large, third array. The performance of the executables resulting from g++ -Wall -O3 -msse3 -fpeel-loops sra-bug.C and g++ -Wall -O3 -msse3 -fpeel-loops sra-bug.C -DTRIGGER_BUG differs by exactly 2x on my machine (a 2.66GHz Core2 quad Xeon), with the runtime increasing from .395 ns/value/entry to .790 ns/value/entry. The only difference between the two versions is whether the array pointer and count are passed as separate arguments (fast) or wrapped in a struct (slow), even though the latter gets copied into local variables before use. Use of the __restrict keyword didn't seem to make a difference. The assembler output shows that excessive loads and stores nearly double the instruction count of the unrolled inner loop for the slower case. FYI gcc-4.2.0 shows similar behavior, though its output is slower than 4.1 for both cases (.420ns vs 1.10ns). gcc-4.3-20070617 performs equally badly on both versions of the code (.690 ns/value/entry). 
sra-bug.C:
===
#include <emmintrin.h>
#include <stdint.h>
#include <cassert>
#include <cstdio>
#include <sys/time.h>

struct stopwatch_t {
    struct timeval tv;
    long long mark;   // microseconds
    stopwatch_t() { reset(); }
    double time_ns() {
        long long old_mark = mark;
        reset();
        return 1e3*(mark - old_mark);
    }
    void reset() {
        gettimeofday(&tv, NULL);
        // 1000000 microseconds per second
        mark = tv.tv_usec + tv.tv_sec*1000000ll;
    }
};

template<int N, class T, class Action>
inline void unrolled_loop(T* entries, Action &action) {
    for(int i=0; i < N; i++)
        action(entries[i]);
}

static __m128i const ALL_ZEROS = {0ull, 0ull};
static __m128i const ALL_ONES = {~0ull, ~0ull};
static int const COUNT=4;

struct Action16 {
    __m128i _results[COUNT];
    __m128i _values[COUNT];
    __m128i* _dest;
    Action16(__m128i* dest, uint64_t const* values)
        : _dest(dest)
    {
        for(int i=0; i < COUNT; i++) {
            _results[i] = ALL_ZEROS;
            _values[i] = _mm_set1_epi16((short) values[i]);
        }
    }
    void operator()(__m128i const entry) {
        for(int i=0; i < COUNT; i++)
            _results[i] |= _mm_cmpeq_epi16(_values[i], entry);
    }
    ~Action16() {
        for(int i=0; i < COUNT; i++)
            _dest[i] = _mm_movemask_epi8(_results[i])? ALL_ONES : ALL_ZEROS;
    }
};

struct wrapper {
    __m128i const* entries;
    int count;
};

#ifdef TRIGGER_BUG
void foo(__m128i* dest, uint64_t const* values, wrapper const &w) {
    __m128i const* entries = w.entries;
    int count = w.count;
#else
void foo(__m128i* dest, uint64_t const* values, __m128i const* entries, int count) {
#endif
    static int const unroll_count=16;
    Action16 action(dest, values);
    assert((count % unroll_count) == 0);
    for(int i=0; i+unroll_count < count; i+=unroll_count)
        unrolled_loop<unroll_count>(&entries[i], action);
}

int main() {
    int VALUE_COUNT = 100;
    int LIST_SIZE = 2048;
    uint64_t* values = new uint64_t[VALUE_COUNT];
    __m128i* dest = (__m128i*) _mm_malloc(16*VALUE_COUNT, 16);
    __m128i entries[LIST_SIZE];
    wrapper w = {entries, LIST_SIZE};
    stopwatch_t timer;
    for(int j=0; j < 5; j++) {
        for(int i=0; i < VALUE_COUNT; i+= COUNT) {
#ifdef TRIGGER_BUG
            foo(dest+i, values+i, w);
#else
            foo(dest+i, values+i, entries, LIST_SIZE);
#endif
        }
        printf("%.3lf ns/value/entry\n", timer.time_ns()/LIST_SIZE/VALUE_COUNT);
    }
}

--
Summary: Passing struct as parameter breaks SRA for stack-allocated struct inside called function
Product: gcc
Version: 4.1.2
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: x86_64-unknown-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32412
[Bug middle-end/32412] Passing struct as parameter breaks SRA for stack-allocated struct inside called function
--- Comment #2 from scovich at gmail dot com 2007-06-20 17:49 ---
(In reply to comment #1)
> > wrapper const &w
> You are passing via reference which does not break SRA, just changes the ABI
> and such. This is a very very hard problem to solve without the whole program.
> I wondering if I should close it as won't fix.

I'm not convinced the ABI change by itself is the culprit:

1. Passing w by value gives the same result. Granted, passing a struct at all changes the ABI, but the const ref part isn't an issue, at least.
2. You have to actually use the wrapper's 'entries' pointer for the problem to appear (diff for modified test case below).
3. The problem goes away if you convert Action16 to use scalars instead of arrays, so SRA for structs is unaffected.

Why does passing a pointer inside a struct on the stack instead of passing it in a register suddenly require the whole program to analyze properly? There's no way stack-allocated arrays can alias with arrays passed into the function. I would have expected a few extra instructions in the function prologue to load the values into registers, followed by business as usual.

$ diff sra-bug.C.orig sra-bug.C
==
51a52,54
> void foo(__m128i* dest, uint64_t const* values, __m128i const* _entries, int _count, wrapper w) {
53d55
< void foo(__m128i* dest, uint64_t const* values, wrapper const &w) {
56c58
< void foo(__m128i* dest, uint64_t const* values, __m128i const* entries, int count) {
---
> __m128i const* entries = _entries; int count = _count;
75,79c77
< #ifdef TRIGGER_BUG
< foo(dest+i, values+i, w);
< #else
< foo(dest+i, values+i, entries, LIST_SIZE);
< #endif
---
> foo(dest+i, values+i, entries, LIST_SIZE, w);

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32412
[Bug middle-end/32412] Passing struct as parameter breaks SRA for stack-allocated struct inside called function
--- Comment #3 from scovich at gmail dot com 2007-06-20 18:22 --- (In reply to comment #1) Sorry for the double post, but I just tried creating a wrapper_foo() that copies the values out of the struct, then passes them on to foo() as scalars. The problem only appears if foo() gets inlined into wrapper_foo(). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32412
[Bug c++/32291] New: -Wformat either too picky or not picky enough
Compiling the following snippet with -Wformat (or -Wall) causes the compiler to complain:

wformat-bug.C:8: warning: format '%u' expects type 'unsigned int', but argument 2 has type 'uint32_t'

The problem seems to be that stdint.h defines uint32_t as long in cygwin. I realize that int != long on some platforms, but i686 isn't one of them. Why should the user be forced to cast their uint32_t (read: 32-bit unsigned int) to unsigned int before passing it to printf() when they are logically identical?

On the other hand, there's no complaint about passing a signed integer into an unsigned format or vice-versa, even though the output value might actually change because of the oversight in those cases.

wformat.C:
===
#include <cstdio>
#include <stdint.h>

int main() {
    uint32_t a = ~0;
    unsigned int b = a;
    uint32_t c = b;
    printf("%u\n", c); // warning (?)
    int d = c;
    unsigned e = d;
    printf("%d\n", e); // no warning
    printf("%u\n", d); // no warning
}

--
Summary: -Wformat either too picky or not picky enough
Product: gcc
Version: 4.2.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: i686-pc-cygwin

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32291
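One portable way to sidestep this kind of warning, sketched here as an illustration rather than as anything proposed in the report, is the PRIu32 format macro from the C99 <inttypes.h> header (<cinttypes> in C++11). It expands to whichever conversion specifier matches the platform's uint32_t. The helper name format_u32 is made up for the example.

```cpp
#include <cinttypes>
#include <cstdio>
#include <string>

// Formats a uint32_t without assuming it is the same type as unsigned int.
// PRIu32 expands to the conversion specifier matching this platform's uint32_t.
std::string format_u32(std::uint32_t v) {
    char buf[16];  // 10 digits at most, plus the terminating NUL
    std::snprintf(buf, sizeof buf, "%" PRIu32, v);
    return buf;
}
```

This avoids the cast to unsigned int, although it does not address the report's second point about signed/unsigned mismatches going undiagnosed.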
[Bug c/32292] New: pthread_exit should have attribute __noreturn__
The following generates a spurious warning about control reaching the end of a non-void function:

#include <pthread.h>

void* foo(void*) {
    pthread_exit((void*) 1);
}

--
Summary: pthread_exit should have attribute __noreturn__
Product: gcc
Version: 4.2.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: i686-pc-cygwin

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32292
[Bug target/30315] optimize unsigned-add overflow test on x86 to use cpu flags from addl
--- Comment #1 from scovich at gmail dot com 2007-06-06 03:39 ---
Happens on x86_64-unknown-linux-gnu as well, for both 4.2.0 and 4.3 (20070605).

The problem is even worse for 128-bit arithmetic because it has to check two registers (with associated branches) before making a decision. This in spite of the fact that sbb sets the flags properly AFAIK:

bool sub128(__uint128_t &dest, __uint128_t a, __uint128_t b) {
    dest = a - b;
    if(dest > a)
        abort();
}

_Z6sub128Rooo:
.LFB557:
        movq    %rsi, %rax
        movq    %rdx, %r10
        pushq   %rbx
.LCFI0:
        subq    %rcx, %rax
        sbbq    %r8, %rdx
        movq    %rax, (%rdi)
        cmpq    %rdx, %r10
        movq    %rdx, 8(%rdi)
        ja      .L23
        jae     .L24
.L21:
        call    abort
        .p2align 4,,7
.L24:
        cmpq    %rax, %rsi
        .p2align 4,,6
        jb      .L21
        .p2align 4,,7
.L23:
        popq    %rbx
        .p2align 4,,5
        ret

There's not really a way to work around it with inline asm, either, because of the branch on overflow that will most likely come right afterward...

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30315
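As a self-contained illustration of the idiom under discussion (not code from the report), the sketch below shows the "result > minuend" wraparound test at 64 and 128 bits. The function names are hypothetical. The third function uses __builtin_sub_overflow, which only exists in compilers newer than the ones discussed here (GCC 5 and later, and Clang), so treat it as an assumption about the toolchain.

```cpp
#include <cstdint>

// Returns true if a - b wrapped around; dest receives the wrapped difference.
// For unsigned operands, dest > a holds exactly when a < b (a borrow occurred).
inline bool sub_wraps(std::uint64_t& dest, std::uint64_t a, std::uint64_t b) {
    dest = a - b;
    return dest > a;
}

// Same test at 128 bits; __uint128_t is a GCC/Clang extension.
inline bool sub128_wraps(__uint128_t& dest, __uint128_t a, __uint128_t b) {
    dest = a - b;
    return dest > a;
}

// Assumes GCC >= 5 or Clang: the builtin reports the borrow directly.
inline bool sub_wraps_builtin(std::uint64_t& dest, std::uint64_t a, std::uint64_t b) {
    return __builtin_sub_overflow(a, b, &dest);
}
```

In all three cases the desired code is a subtract (or SUB/SBB pair) followed by a single branch on the carry flag.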
[Bug c++/32186] New: -ggdb emits broken debug info
Compiling with 'g++ -ggdb' confuses both gdb-6.5 and gdb-6.6 into thinking they've got corrupted stacks. My programs all seem to execute properly, both alone and inside gdb -- it's just hard to debug anything. Using 'g++ -g' instead seems to work fine.

Consider 'g++ -ggdb foo.cpp' with the following code snippet:

void foo() {
  int i;
  i=1;
}

int main() {
  foo();
  return 0;
}

Below is the gdb output as I step through the resulting executable.

Current directory is c:/cygwin/home/johnsory/experiments/
GNU gdb 6.6
Copyright (C) 2006 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-cygwin"...
(gdb) start
Breakpoint 1 at 0x401071: file foo.cpp, line 6.
Starting program: /home/johnsory/experiments/a.exe
Loaded symbols for /cygdrive/c/WINDOWS/system32/ntdll.dll
Loaded symbols for /cygdrive/c/WINDOWS/system32/kernel32.dll
Loaded symbols for /usr/bin/cygwin1.dll
Loaded symbols for /cygdrive/c/WINDOWS/system32/advapi32.dll
Loaded symbols for /cygdrive/c/WINDOWS/system32/rpcrt4.dll
main () at foo.cpp:6
(gdb) step
(gdb) bt
#0 main () at foo.cpp:7
(gdb) step
foo () at foo.cpp:1
(gdb) bt
#0 foo () at foo.cpp:1
#1 0x00401050 in mainCRTStartup ()
(gdb) step
(gdb) bt
#0 foo () at foo.cpp:3
#1 0x00401056 in foo () at foo.cpp:1
#2 0x00401056 in foo () at foo.cpp:1
#3 0x00401056 in foo () at foo.cpp:1
#4 0x00401056 in foo () at foo.cpp:1
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) step
(gdb) bt
#0 foo () at foo.cpp:4
#1 0x0040105d in foo () at foo.cpp:3
#2 0x0040105d in foo () at foo.cpp:3
#3 0x0040105d in foo () at foo.cpp:3
#4 0x0040105d in foo () at foo.cpp:3
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) step
main () at foo.cpp:8
(gdb) bt
#0 main () at foo.cpp:8
(gdb) step
(gdb) bt
#0 main () at foo.cpp:9
(gdb) step
0x61006198 in dll_crt0_1 () from /usr/bin/cygwin1.dll
(gdb) bt
#0 0x61006198 in dll_crt0_1 () from /usr/bin/cygwin1.dll
#1 0x61004416 in _cygtls::call2 () from /usr/bin/cygwin1.dll
#2 0x in ?? ()
(gdb) step
Single stepping until exit from function _Z10dll_crt0_1Pv,
which has no line number information.
Program exited normally.
(gdb)

--
Summary: -ggdb emits broken debug info
Product: gcc
Version: 4.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com
GCC target triplet: i686-pc-cygwin

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32186
[Bug c++/32186] -ggdb emits broken debug info
--- Comment #1 from scovich at gmail dot com 2007-06-02 09:37 --- It also appears that 'next' is broken and acts like 'step' (enter all functions), while 'finish' acts like 'continue' (run to completion, barring a breakpoint). -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32186
[Bug c++/32073] New: Loop unrolling does not exploit VRP for loop bound
Loops with a bounded, small number of iterations unroll too much. They should be peeled away instead. For example, if I compile the following function with ``-O3 -funroll-loops'':

void short_loop(int* dest, int* src, int count) {
    // same happens for assert(count <= 4) and if(count > 4) exit(-1)
    if(count > 4)
        count = 4;
    for(int i=0; i < count; i++)
        dest[i] = src[i];
}

The assembly output (for i686-pc-cygwin) is an 8x Duff's device, of which 75% of the code will never execute (translated back to C++ here for readability):

void short_loop(int* dest, int* src, int count) {
    // same happens for assert(count <= 4) and if(count > 4) exit(-1)
    if(count > 4)
        count = 4;
    int mod = count % 8;
    switch(mod) {
    case 7:
        // loop body
        count--;
    case 6:
        // loop body
        count--;
    case 5:
        // loop body
        count--;
    case 4:
        // loop body
        count--;
    case 3:
        // loop body
        count--;
    case 2:
        // loop body
        count--;
    case 1:
        // loop body
        count--;
    default:
        for(int i=0; i < count; i+=8)
            // 8x unrolled loop body
    }
}

We need 25% of that code:

void short_loop(int* dest, int* src, int count) {
    // same happens for assert(count <= 4) and if(count > 4) exit(-1)
    if(count > 4)
        count = 4;
    switch(count) {
    case 4:
        // loop body
    case 3:
        // loop body
    case 2:
        // loop body
    case 1:
        // loop body
    default:
        break;
    }
}

--
Summary: Loop unrolling does not exploit VRP for loop bound
Product: gcc
Version: 4.2.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c++
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32073
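For concreteness, the peeled form requested above can be written out by hand as follows. This is only a sketch of the desired output, with the loop body filled in; the name short_loop_peeled is made up, and the const qualifier on src is an addition for the example.

```cpp
#include <algorithm>

// Hand-peeled equivalent of short_loop(): at most four copies, no loop.
void short_loop_peeled(int* dest, int const* src, int count) {
    count = std::min(count, 4);   // same clamp as "if(count > 4) count = 4;"
    switch (count) {
        case 4: dest[3] = src[3]; // fall through
        case 3: dest[2] = src[2]; // fall through
        case 2: dest[1] = src[1]; // fall through
        case 1: dest[0] = src[0]; // fall through
        default: break;           // count <= 0: nothing to copy
    }
}
```

Because the elements are independent, the order of the copies does not matter, which is what lets the switch fall through from the highest index down.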
[Bug c/32074] New: Optimizer does not exploit assertions
It would be nice if the optimizer took advantage of assertions. I realize that assertions may not be enabled for production code, but even when disabled they are still explicit statements of the programmer's assumptions; the compiler should be able to exploit those assumptions if it yields better code (or avoids annoying warnings).

To me, ``assert(!bad_thing)'' indicates that ``bad_thing'' should not be allowed to happen; compiling with assertions disabled means that ``bad_thing'' is assumed not to happen. Therefore, code that breaks when ``bad_thing == true'' is my bug, not the compiler's, and not necessarily worse than the bug(s) caused by return values or side effects of ``correct'' code after an enabled assertion would have terminated the program.

For example, -funroll-loops on the following code results in an 8x Duff's device, even though no acceptable input will run more than twice. In this particular case, ``if(bad thing) exit(-1)'' does the same thing.

void short_loop(int* dest, int* src, int count) {
    // same happens for if(count > 2) exit(-1)
    assert(count <= 2);
    for(int i=0; i < count; i++)
        dest[i] = src[i];
}

As another example, compiling the following switch statement with -Wall causes complaints about control reaching the end of a non-void function:

int limited_switch(int a, int b, int what) {
    switch(what) {
    case 0:
        return a+b;
    case 1:
        return a;
    case 2:
        return b;
    case 3:
        return a-b;
    default:
        // unreachable
        assert(false);
    }
}

The following variant of the previous switch statement, which also has an undefined return value for (what < 0 || what >= 4), doesn't cause any warnings at all, though it's arguably less correct -- at least with the first variant the programmer indicated that she thought the matter through.

int limited_switch(int a, int b, int what) {
    int result;
    switch(what) {
    case 0:
        result = a+b;
        break;
    case 1:
        result = a;
        break;
    case 2:
        result = b;
        break;
    case 3:
        result = a-b;
        break;
    default:
        break;
    }
    return result;
}

--
Summary: Optimizer does not exploit assertions
Product: gcc
Version: 4.2.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: scovich at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32074
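As an illustration of what "exploiting the assumption" could look like in user code, here is a sketch built on __builtin_unreachable(). That builtin is not available in the 4.2.0 release this report targets; it appeared in later compilers (GCC 4.5 and newer, and Clang), so this is an assumption about the toolchain. The ASSUME macro is a hypothetical helper, not part of GCC or any standard header.

```cpp
#include <cassert>

// Hypothetical helper: checks the condition in debug builds, and in NDEBUG
// builds tells the compiler that the condition always holds.
#ifdef NDEBUG
#  define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#  define ASSUME(cond) assert(cond)
#endif

int limited_switch(int a, int b, int what) {
    ASSUME(what >= 0 && what < 4);
    switch (what) {
        case 0: return a + b;
        case 1: return a;
        case 2: return b;
        case 3: return a - b;
    }
    // Not reached for valid input; this marker also silences the
    // "control reaches end of non-void function" warning.
    __builtin_unreachable();
}
```

Violating the stated assumption in an NDEBUG build is undefined behavior, which matches the report's position that such a failure is the programmer's bug rather than the compiler's.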