[Bug tree-optimization/111036] New: Code generation error in handling __builtin_constant_p
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111036

Bug ID: 111036
Summary: Code generation error in handling __builtin_constant_p
Product: gcc
Version: 13.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com
Target Milestone: ---

Compile and run the following code:

    #include <stdio.h>

    #define __align(n) __attribute__((aligned(n)))

    __attribute__((aligned(32))) static struct {
      unsigned long long available_cmd_ids_per_core[2];
    } _rl2c_cmd_id_data;

    static inline void __attribute__((always_inline))
    foo (void *base, size_t length)
    {
      unsigned long int p = (unsigned long int) base;
      if (__builtin_constant_p(p) && (p & 31) == 0) {
        printf("constant p && aligned to 32\n");
      } else if (__builtin_constant_p(length)) {
        printf("constant length\n");
      } else {
        printf("else\n");
      }
    }

    int main(int argc, char **argv)
    {
      foo(&_rl2c_cmd_id_data, sizeof(*(&_rl2c_cmd_id_data)));
      return 0;
    }

With gcc 12.1.0 and gcc 13.1.0, I get a segmentation fault. With 11.1.0 and below, I get the correct result.

I examined the dumped tree IR. In the einline pass, a __builtin_unreachable is inserted for the else if/else branches, as the compiler probably thinks __builtin_constant_p(p) & (p&31) is always true. But the later passes think __builtin_constant_p(p) is always false. Therefore all the code is optimized away.
[Bug tree-optimization/71264] [4.9/5 Regression] ICE in convert_move
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71264 --- Comment #17 from Bingfeng Mei --- OK, I will skip the vectorization check on our port then. Thanks.
[Bug tree-optimization/71264] [4.9/5 Regression] ICE in convert_move
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71264

Bingfeng Mei changed:

    What |Removed |Added
    CC   |        |bmei at broadcom dot com

--- Comment #15 from Bingfeng Mei ---
Hi, Richard,
I updated to the latest patches, but our target still fails in the same way as other people reported. footype gets V4QI instead of SI because we have it supported in vector_mode_supported_p. Thus the following error:

    not vectorized: vector stmt in loop:temp_14 = VIEW_CONVERT_EXPR(_8);

I guess your patch in vect_init_vector is supposed to fix this. But the execution doesn't even hit vect_init_vector.
[Bug tree-optimization/71383] New: Misoptimized branch with inline assembly code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71383

Bug ID: 71383
Summary: Misoptimized branch with inline assembly code.
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com
Target Milestone: ---

For the following example:

    #include <stdio.h>

    static int a, b;

    static void bar()
    {
      asm volatile ("" : : : "memory");
    }

    void foo ()
    {
      a = 0;
      bar ();
      if (a == 0)
        printf ("HERE\n");
    }

If compiled with:

    ~/work/install-x86/bin/gcc tst.c -O2 -S -fno-inline

the conditional printf becomes unconditional: if (a==0) is optimized away.

    foo:
    .LFB1:
            .cfi_startproc
            subq    $8, %rsp
            .cfi_def_cfa_offset 16
            xorl    %eax, %eax
            movl    $0, a(%rip)
            call    bar
            movl    $.LC0, %edi
            addq    $8, %rsp
            .cfi_def_cfa_offset 8
            jmp     puts
            .cfi_endproc

However, if we compile with

    ~/work/install-x86/bin/gcc tst.c -O2 -S

and allow inlining, gcc produces correct code.

    foo:
    .LFB12:
            .cfi_startproc
            movl    $0, a(%rip)
            movl    a(%rip), %eax
            testl   %eax, %eax
            je      .L4
            rep; ret
            .p2align 4,,10
            .p2align 3
    .L4:
            movl    $.LC0, %edi
            jmp     puts

I guess it goes wrong in one of the IPA passes. My compiler is GCC: (GNU) 7.0.0 20160602 (experimental) [trunk revision 14336]. I can also reproduce this issue on our port of gcc 6.1.
[Bug c/67769] New: VRP pass does wrong optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67769

Bug ID: 67769
Summary: VRP pass does wrong optimization
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: major
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com
Target Milestone: ---

    #include <stdlib.h>

    static int clamp (int x, int lo, int hi)
    {
      return (x < lo) ? lo : ((x > hi) ? hi : x);
    }

    __attribute__((noinline)) short foo (int N)
    {
      short value = clamp (N, 0, 16);
      return value;
    }

    int main ()
    {
      if (foo (-5) != 0)
        abort();
      return 0;
    }

Compile this simple code and run.

    bash:bmei:xl-cam-21:34271> ~/scratch/install-x86/bin/gcc tst.c -O2
    bash:bmei:xl-cam-21:34272> ./a.out
    Aborted
    bash:bmei:xl-cam-21:34273> ~/scratch/install-x86/bin/gcc -v
    Using built-in specs.
    COLLECT_GCC=/home/bmei/scratch/install-x86/bin/gcc
    COLLECT_LTO_WRAPPER=/projects/firepath_tools1_scratch/bmei/install-x86/libexec/gcc/x86_64-unknown-linux-gnu/6.0.0/lto-wrapper
    Target: x86_64-unknown-linux-gnu
    Configured with: ../trunk/configure --prefix=/projects/firepath_tools1_scratch/bmei/install-x86 --disable-nls --with-mpfr=/projects/firepath_tools/work/bmei/packages/mpfr/2.4.1/x86-64 --with-gmp=/projects/firepath_tools/work/bmei/packages/gmp/4.3.0/x86-64 --with-mpc=/projects/firepath_tools/work/bmei/packages/mpc/0.8.1/x86-64 --disable-libsanitizer --disable-target-libsanitizer CFLAGS='-O0 -g3' CXXFLAGS='-O0 -g3' --enable-languages=c --no-recursion --disable-bootstrap : (reconfigured) ../trunk/configure --prefix=/projects/firepath_tools1_scratch/bmei/install-x86 --disable-nls --with-mpfr=/projects/firepath_tools/work/bmei/packages/mpfr/2.4.1/x86-64 --with-gmp=/projects/firepath_tools/work/bmei/packages/gmp/4.3.0/x86-64 --with-mpc=/projects/firepath_tools/work/bmei/packages/mpc/0.8.1/x86-64 --disable-libsanitizer --disable-target-libsanitizer CFLAGS='-O0 -g3' CXXFLAGS='-O0 -g3' --disable-bootstrap --enable-languages=c,lto --no-create --no-recursion
    Thread model: posix
    gcc version 6.0.0 20150929 (experimental) [trunk revision 143368] (GCC)

I looked into the tree dump. The problem seems to be in the VRP2 pass: the second statement, the MAX_EXPR, is folded away.

    Folding statement: iftmp.0_3 = MIN_EXPR ;
    Not folded
    Folding statement: iftmp.0_6 = MAX_EXPR ;
    Folded into: iftmp.0_6 = iftmp.0_3;

    Folding statement: value_4 = (short int) iftmp.0_6;
    Not folded
    Folding statement: return value_4;
    Not folded

    foo (int N)
    [ noinline ]
    {
      short int value;
      int iftmp.0_3;
      int iftmp.0_6;

      :
      iftmp.0_3 = MIN_EXPR ;
      iftmp.0_6 = iftmp.0_3;
      value_4 = (short int) iftmp.0_6;
      return value_4;
    }
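For illustration only, here is a minimal sketch (not taken from the report) of why dropping the MAX step matters. It assumes the usual lowering of the ternary clamp into a MIN followed by a MAX; the helper names min_int, max_int, clamp_minmax and clamp_without_max are hypothetical.

```c
#include <assert.h>

/* Hypothetical helpers; these names do not come from GCC. */
static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* The ternary form from the report. */
int clamp_ternary(int x, int lo, int hi)
{
    return (x < lo) ? lo : ((x > hi) ? hi : x);
}

/* Equivalent MIN-then-MAX form (valid when lo <= hi). */
int clamp_minmax(int x, int lo, int hi)
{
    return max_int(min_int(x, hi), lo);
}

/* What is left if the MAX step is dropped: the lower bound is lost. */
int clamp_without_max(int x, int hi)
{
    return min_int(x, hi);
}

/* Self-check: both correct forms agree on a small range of inputs. */
void clamp_selfcheck(void)
{
    for (int x = -40; x <= 40; x++)
        assert(clamp_ternary(x, 0, 16) == clamp_minmax(x, 0, 16));
}
```

With N = -5, clamp_minmax(-5, 0, 16) yields 0 while clamp_without_max(-5, 16) yields -5, which is consistent with the abort seen in the test case.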
[Bug c/65219] New: GCC wrongly deletes a function which is not completely inlined.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65219

Bug ID: 65219
Summary: GCC wrongly deletes a function which is not completely inlined.
Product: gcc
Version: 5.0
Status: UNCONFIRMED
Severity: major
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com

Compile the following code with gcc 5.0 (Target: x86_64-unknown-linux-gnu, gcc version 5.0.0 20150226 (experimental) [trunk revision 143368] (GCC)):

    ~/scratch/install-x86/bin/gcc tst.c -O2 -S

    #include <stdio.h>

    inline int foo()
    {
      printf ("HERE\n");
      printf ("HERE\n");
      printf ("HERE\n");
      printf ("HERE\n");
      return 0;
    }

    int bar1 ()
    {
      return foo();
    }

    __attribute__((optimize("-funsafe-loop-optimizations")))
    int bar2 ()
    {
      return foo();
    }

Resulting assembly code:

            .file   "tst.c"
            .section        .rodata.str1.1,"aMS",@progbits,1
    .LC0:
            .string "HERE"
            .section        .text.unlikely,"ax",@progbits
    .LCOLDB1:
            .text
    .LHOTB1:
            .p2align 4,,15
            .globl  bar1
            .type   bar1, @function
    bar1:
    .LFB12:
            .cfi_startproc
            subq    $8, %rsp
            .cfi_def_cfa_offset 16
            movl    $.LC0, %edi
            call    puts
            movl    $.LC0, %edi
            call    puts
            movl    $.LC0, %edi
            call    puts
            movl    $.LC0, %edi
            call    puts
            xorl    %eax, %eax
            addq    $8, %rsp
            .cfi_def_cfa_offset 8
            ret
            .cfi_endproc
    .LFE12:
            .size   bar1, .-bar1
            .section        .text.unlikely
    .LCOLDE1:
            .text
    .LHOTE1:
            .section        .text.unlikely
    .LCOLDB2:
            .text
    .LHOTB2:
            .p2align 4,,-1
            .globl  bar2
            .type   bar2, @function
    bar2:
    .LFB13:
            .cfi_startproc
            xorl    %eax, %eax
            jmp     foo
            .cfi_endproc
    .LFE13:
            .size   bar2, .-bar2
            .section        .text.unlikely
    .LCOLDE2:
            .text
    .LHOTE2:
            .ident  "GCC: (GNU) 5.0.0 20150226 (experimental) [trunk revision 143368]"
            .section        .note.GNU-stack,"",@progbits

The function body of foo is gone, but there is still a call to foo left in bar2. I did some initial investigation. The bar1 function inlines foo in the einline pass. But bar2 cannot inline it because it has a function-specific optimize attribute. For some reason the body of foo is removed anyway.
[Bug lto/61868] -frandom-seed always results in random_seed of 0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61868 Bingfeng Mei changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #3 from Bingfeng Mei --- Fixed in r213321
[Bug lto/61868] -frandom-seed always results in random_seed of 0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61868 Bingfeng Mei changed: What|Removed |Added Component|driver |lto --- Comment #1 from Bingfeng Mei --- Change the component to lto as gcc should generate lto section name with specified random seed.
[Bug driver/61868] New: -frandom-seed always results in random_seed of 0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61868

Bug ID: 61868
Summary: -frandom-seed always results in random_seed of 0
Product: gcc
Version: 4.10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: driver
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com

Compile any simple file with the -frandom-seed and -flto options.

    #include <stdio.h>
    extern int foo (int);
    int bar (int a)
    {
      return a * 5;
    }

    int main ()
    {
      printf("%d\n", foo (100));
      return 0;
    }

    ~/scratch/install-x86/bin/gcc tst2.c -flto -c -frandom-seed=12345
    objdump -D tst2.o|less

You can see that every lto section has a suffix of 0 instead of the specified random_seed.

    <.gnu.lto_.inline.0>

This is because of the following code in toplev.c. If flag_random_seed is set, get_random_seed never calls init_random_seed, even though init_random_seed contains the code that derives random_seed from flag_random_seed.

    static void
    init_random_seed (void)
    {
      if (flag_random_seed)
        {
          char *endp;

          /* When the driver passed in a hex number don't crc it again */
          random_seed = strtoul (flag_random_seed, &endp, 0);
          if (!(endp > flag_random_seed && *endp == 0))
            random_seed = crc32_string (0, flag_random_seed);
        }
      else if (!random_seed)
        random_seed = local_tick ^ getpid ();  /* Old racey fallback method */
    }

    /* Obtain the random_seed.  Unless NOINIT, initialize it if
       it's not provided in the command line.  */

    HOST_WIDE_INT
    get_random_seed (bool noinit)
    {
      if (!flag_random_seed && !noinit)
        init_random_seed ();
      return random_seed;
    }
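The control flow described above can be reproduced in a small stand-alone model. This is only a sketch: all model_* names are hypothetical, the crc32 and time/pid fallbacks are omitted, and the "alternative" guard is one possible repair for illustration, not necessarily the change that was committed.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the toplev.c globals (hypothetical names). */
static const char *model_flag_random_seed;  /* -frandom-seed= value, or NULL */
static unsigned long model_random_seed;     /* 0 means "not initialized yet" */

/* Derive the seed from the user string; fallbacks are left out of the model. */
static void model_init_random_seed(void)
{
    if (model_flag_random_seed)
        model_random_seed = strtoul(model_flag_random_seed, NULL, 0);
}

/* Guard as quoted in the report: initialization is skipped exactly
   when the user supplied a seed, so the seed stays 0. */
unsigned long model_get_seed_as_reported(const char *flag)
{
    model_flag_random_seed = flag;
    model_random_seed = 0;
    if (!model_flag_random_seed)
        model_init_random_seed();
    return model_random_seed;
}

/* One possible alternative guard (an assumption, for illustration):
   initialize whenever the seed has not been set yet. */
unsigned long model_get_seed_alternative(const char *flag)
{
    model_flag_random_seed = flag;
    model_random_seed = 0;
    if (!model_random_seed)
        model_init_random_seed();
    return model_random_seed;
}

/* Self-check mirroring the -frandom-seed=12345 invocation above. */
void model_seed_selfcheck(void)
{
    assert(model_get_seed_as_reported("12345") == 0UL);
    assert(model_get_seed_alternative("12345") == 12345UL);
}
```

In this model the reported guard returns 0 for the seed string "12345", matching the ".0" section suffix seen in the objdump output, while the alternative guard returns 12345.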
[Bug tree-optimization/60012] New: Vectorizer generates unnecessary loop versioning for alias
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60012

Bug ID: 60012
Summary: Vectorizer generates unnecessary loop versioning for alias
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com

    typedef struct
    {
      short real;
      short imag;
    } complex16_t;

    void
    libvector_AccSquareNorm_ref (unsigned long long *acc,
                                 const complex16_t *x, unsigned len)
    {
      for (unsigned i = 0; i < len; i++)
        {
          acc[i] += ((unsigned long long)((int)x[i].real * x[i].real))
                    + ((unsigned long long)((int)x[i].imag * x[i].imag));
        }
    }

Compile the code with

    ~/scratch/install-x86/bin/gcc tst.c -O2 -S -ftree-vectorize -fdump-tree-vect-details -std=c99

GCC generates unnecessary loop versioning because it cannot disambiguate the memory accesses.

    tst.c:12:5: note: versioning for alias required: can't determine dependence between *_8 and _12->real
    tst.c:12:5: note: mark for run-time aliasing test between *_8 and _12->real

This should be handled by TBAA info, as acc and x clearly point to different data types. Unfortunately, TBAA doesn't handle anti- and output dependencies.
[Bug tree-optimization/59651] [4.9 Regression] Vectorizer failing to spot dependence causes incorrect code generation.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651 --- Comment #5 from Bingfeng Mei --- Created attachment 31559 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31559&action=edit initial patch Hi, Tejas, vect_create_cond_for_alias_checks contains a bug in handling negative step. The computed data access range should be shifted by TYPE_SIZE_UNIT of bytes. Could you test the attached patch on aarch64 (I don't have simulation environment setup)? Meanwhile I will check whether there is any regression on x86-64. If everything is right, I am going to submit the patch. Thanks.
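To make the "shift by the element size" point concrete, here is a rough model of the byte range touched by a strided access. It is a sketch under stated assumptions, not the code from the attached patch: struct byte_range and access_range are hypothetical names, the segment length is assumed to be step * niters, and the range is conservative when the absolute step is larger than the element size.

```c
#include <assert.h>

/* Half-open byte range [lo, hi) touched by a strided access. */
struct byte_range { long lo; long hi; };

/* Hypothetical model: base is the address of the first element accessed,
   step the signed byte step per iteration (assumed non-zero), niters the
   iteration count, elem_size the size of one element in bytes. */
struct byte_range access_range(long base, long step, long niters, long elem_size)
{
    struct byte_range r;
    long seg_len = step * niters;

    if (step > 0) {
        r.lo = base;
        r.hi = base + seg_len;
    } else {
        /* The unshifted range [base + seg_len, base) would miss the first
           element at base and cover one element too many at the low end,
           so both ends move up by elem_size. */
        r.lo = base + seg_len + elem_size;
        r.hi = base + elem_size;
    }
    return r;
}

/* Self-check: four 4-byte elements at 100, 96, 92, 88 occupy [88, 104). */
void access_range_selfcheck(void)
{
    struct byte_range r = access_range(100, -4, 4, 4);
    assert(r.lo == 88 && r.hi == 104);
}
```

Without the shift, the same negative-step access would be described as [84, 100), which excludes bytes 100 to 103 even though the first iteration touches them; a run-time alias check built on that range could miss a real overlap.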
[Bug tree-optimization/59651] [4.9 Regression] Vectorizer failing to spot dependence causes incorrect code generation.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651

--- Comment #3 from Bingfeng Mei ---
I can reproduce it on aarch64. I am still trying to understand why. I constructed a similar test but with a positive loop step.

    extern void abort (void);
    int a[] = { 6, 0, 0, 0 };
    int b;
    int main ()
    {
      for (;;)
        {
          b = 0;
          for (; b<3; b += 1)
            a[b] = a[0] > 1;
          break;
        }
      if (a[2] != 0)
        abort ();
      return 0;
    }

GCC actually behaves similarly during vectorization and does vectorize the loop. The only difference is around loop versioning.

pr52943.c:

    : if (1 != 0) goto ; else goto ;

bb 11 leads to the vectorized version, so the scalar version gets optimized out.

Above example:

    : if (0 != 0) goto ; else goto ;

So the vectorized version goes away and only the scalar version remains.
[Bug tree-optimization/59651] Vectorizer failing to spot dependence causes incorrect code generation.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651

--- Comment #1 from Bingfeng Mei ---
That is interesting. On x86-64, GCC does say it cannot determine the dist vector between a[3] & a[b] and needs a run-time aliasing test. In the end it gives up due to too few iterations.

    note: === vect_analyze_data_ref_dependences ===
    (compute_affine_dependence
      stmt_a: _5 = a[3];
      stmt_b: a[b.0_16] = _7;
    (analyze_overlapping_iterations
      (chrec_a = 3)
      (chrec_b = {3, +, -1}_1)
    (analyze_siv_subscript
    )
      (overlap_iterations_a = [0])
      (overlap_iterations_b = [0]))
    (Dependence relation cannot be represented by distance vector.)
    )
    (compute_affine_dependence
      stmt_a: _5 = a[3];
      stmt_b: _5 = a[3];
    (analyze_overlapping_iterations
      (chrec_a = 3)
      (chrec_b = 3)
      (overlap_iterations_a = [0])
      (overlap_iterations_b = [0]))
    )
    (compute_affine_dependence
      stmt_a: a[b.0_16] = _7;
      stmt_b: a[b.0_16] = _7;
    (analyze_overlapping_iterations
      (chrec_a = {3, +, -1}_1)
      (chrec_b = {3, +, -1}_1)
      (overlap_iterations_a = [0])
      (overlap_iterations_b = [0]))
    )
    /projects/firepath_tools1_scratch/bmei/trunk/gcc/testsuite/gcc.dg/torture/pr52943.c:13:7: note: versioning for alias required: bad dist vector for a[3] and a[b.0_16]
    /projects/firepath_tools1_scratch/bmei/trunk/gcc/testsuite/gcc.dg/torture/pr52943.c:13:7: note: mark for run-time aliasing test between a[3] and a[b.0_16]
[Bug tree-optimization/59544] Vectorizing store with negative step
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544 Bingfeng Mei changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #2 from Bingfeng Mei --- Patch checked in at r206148. It triggers pr59569 that is fixed by a separate patch (r206179).
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 59544, which changed state. Bug 59544 Summary: Vectorizing store with negative step http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544 What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED
[Bug middle-end/59569] [4.9 Regression] r206148 causes internal compiler error: in vect_create_destination_var, at tree-vect-data-refs.c:4294
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59569

--- Comment #9 from Bingfeng Mei ---
It seems a simple patch is to just bypass the permutation for a constant operand, as vec_oprnd is then a constant vector with identical elements.

    Index: tree-vect-stmts.c
    ===
    --- tree-vect-stmts.c (revision 206176)
    +++ tree-vect-stmts.c (working copy)
    @@ -5353,7 +5353,8 @@ vectorizable_store (gimple stmt, gimple_
                   set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
                                           misalign);

    -             if (negative)
    +             if (negative
    +                 && !CONSTANT_CLASS_P (gimple_assign_rhs1 (stmt)))
                     {
                       tree perm_mask = perm_mask_for_reverse (vectype);
                       tree perm_dest
[Bug middle-end/59569] [4.9 Regression] r206148 causes internal compiler error: in vect_create_destination_var, at tree-vect-data-refs.c:4294
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59569 --- Comment #8 from Bingfeng Mei --- Sorry for the regression. The assertion happens if storing a constant value with negative step. Doing permutation of constant is not the best optimization here. So the easy way to fix is to skip vectorizing this statement in the same way as before the patch. Or maybe better way is to form a constant vector to store.
[Bug tree-optimization/59544] New: Vectorizing store with negative stop
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544

Bug ID: 59544
Summary: Vectorizing store with negative stop
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com

Created attachment 31467
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31467&action=edit
The patch against r206016

I was looking at some loops that can be vectorized by LLVM, but not GCC. One type of loop is a store with a negative step.

    void test1(short * __restrict__ x, short * __restrict__ y, short * __restrict__ z)
    {
      int i;
      for (i=127; i>=0; i--) {
        x[i] = y[127-i] + z[127-i];
      }
    }

I don't know why GCC only implements negative step for loads, but not for stores. I implemented a patch (attached), very similar to the code in vectorizable_load.

    ~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:

    test1:
    .LFB0:
            addq    $254, %rdi
            xorl    %eax, %eax
            .p2align 4,,10
            .p2align 3
    .L2:
            movzwl  (%rsi,%rax), %ecx
            subq    $2, %rdi
            addw    (%rdx,%rax), %cx
            addq    $2, %rax
            movw    %cx, 2(%rdi)
            cmpq    $256, %rax
            jne     .L2
            rep; ret

With patch:

    test1:
    .LFB0:
            vmovdqa .LC0(%rip), %xmm1
            xorl    %eax, %eax
            .p2align 4,,10
            .p2align 3
    .L2:
            vmovdqu (%rsi,%rax), %xmm0
            movq    %rax, %rcx
            negq    %rcx
            vpaddw  (%rdx,%rax), %xmm0, %xmm0
            vpshufb %xmm1, %xmm0, %xmm0
            addq    $16, %rax
            cmpq    $256, %rax
            vmovups %xmm0, 240(%rdi,%rcx)
            jne     .L2
            rep; ret

Performance is definitely improved here. The patch is bootstrapped for x86_64-unknown-linux-gnu and has no additional regressions on my machine.

For reference, LLVM seems to use different instructions and generates slightly worse code. I am not so familiar with x86 assembly code. The patch was originally written for our private port.

    test1:                                  # @test1
            .cfi_startproc
    # BB#0:                                 # %entry
            addq    $240, %rdi
            xorl    %eax, %eax
            .align  16, 0x90
    .LBB0_1:                                # %vector.body
                                            # =>This Inner Loop Header: Depth=1
            movdqu  (%rsi,%rax,2), %xmm0
            movdqu  (%rdx,%rax,2), %xmm1
            paddw   %xmm0, %xmm1
            shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
            pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
            pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
            movdqu  %xmm0, (%rdi)
            addq    $8, %rax
            addq    $-16, %rdi
            cmpq    $128, %rax
            jne     .LBB0_1
    # BB#2:                                 # %for.end
            ret
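The transformation in this report (load forward, add, reverse within the vector, store at the mirrored offset) can be modelled in plain C. This is only a scalar sketch of the idea, assuming 8 shorts per 128-bit vector; test1_chunked, test1_models_agree, N and VF are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <string.h>

#define N 128
#define VF 8   /* assumed vectorization factor: 8 shorts per 128-bit vector */

/* Scalar loop from the report: the store walks x backwards. */
void test1_scalar(short *x, const short *y, const short *z)
{
    for (int i = N - 1; i >= 0; i--)
        x[i] = (short)(y[N - 1 - i] + z[N - 1 - i]);
}

/* Chunked model of the vectorized loop: add VF elements in forward order,
   reverse the chunk (the permute), then store it at the mirrored position. */
void test1_chunked(short *x, const short *y, const short *z)
{
    for (int j = 0; j < N; j += VF) {
        short sum[VF], rev[VF];
        for (int k = 0; k < VF; k++)
            sum[k] = (short)(y[j + k] + z[j + k]);
        for (int k = 0; k < VF; k++)
            rev[k] = sum[VF - 1 - k];
        /* Element offset N - VF - j is byte offset 240 - 2*j for shorts. */
        memcpy(&x[N - VF - j], rev, sizeof rev);
    }
}

/* Returns 1 if both versions produce the same output array. */
int test1_models_agree(void)
{
    short y[N], z[N], a[N], b[N];
    for (int i = 0; i < N; i++) {
        y[i] = (short)(3 * i);
        z[i] = (short)(100 - i);
    }
    test1_scalar(a, y, z);
    test1_chunked(b, y, z);
    return memcmp(a, b, sizeof a) == 0;
}
```

The store address N - VF - j in the model corresponds to the 240(%rdi,%rcx) operand in the patched output, where %rcx holds the negated byte index.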
[Bug tree-optimization/59249] if-conversion doesn't handle basic-blocks with only critical predecessor edges
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59249

--- Comment #4 from Bingfeng Mei ---
Even if I split one critical predecessor edge, the predicate of BB6 is still the ORed result of two conditions from BB4 & BB5. ORing two conditions results in a sequence of statements that cannot be vectorized. The vectorizer complains of "bit-precision arithmetic not supported" because of the boolean operations. I am not sure how to transform the code except by reverting to a form similar to the one before jump threading.
[Bug tree-optimization/59249] if-conversion doesn't handle basic-blocks with only critical predecessor edges
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59249

--- Comment #3 from Bingfeng Mei ---
Richard, I am not sure I understand how to split the edge.

        BB4
       /   \
      /     \
    BB5      |
     |\      |
     | \     |
     |  \    |
     |   BB6
     |   /
     |  /
    BB7

The compiler (HEAD) complains "only critical predecessors of BB6" (its predecessor BB5 has more than one successor). Do you suggest splitting the edge between BB5 & BB6 and inserting an empty BB?

In the email thread, you blame the poor implementation of tree-level if-conversion. But the RTL-level CE passes cannot handle that either.
[Bug tree-optimization/59249] New: Jump threading makes if-conversion and following vectorization impossible.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59249

Bug ID: 59249
Summary: Jump threading makes if-conversion and following vectorization impossible.
Product: gcc
Version: 4.8.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com

I am investigating loops that can be vectorized by LLVM, but not by GCC. One example is a loop that contains more than one if-else construct.

    typedef signed char int8;
    #define FFT 128
    typedef struct { int8 exp[FFT]; } feq_t;

    void test(feq_t *feq)
    {
      int k;
      int feqMinimum = 15;
      int8 *exp = feq->exp;
      for (k=0;k15) exp[k] = 15;
      }
    }

Compile it with 4.8.2 on x86_64:

    ~/install-4.8/bin/gcc ghs-algorithms_380.c -O2 -fdump-tree-ifcvt-details -ftree-vectorize -save-temps

It is not vectorized because the if-else constructs inside the loop cannot be if-converted. Looking into the .ifcvt file, this is due to a bad if-else structure (the ifcvt pass complains "only critical predecessors"). One branch jumps directly into another branch. Digging a bit deeper, I found that this structure is generated by the dom1 pass doing jump threading. So recompile with

    ~/install-4.8/bin/gcc ghs-algorithms_380.c -O2 -fdump-tree-ifcvt-details -ftree-vectorize -save-temps -fno-tree-dominator-opts

and it is magically if-converted and vectorized! The same holds on our target, where performance is improved greatly in this example.

It seems to me that doing jump threading for architectures that support if-conversion is not a good idea. The original if-else structures are damaged so that if-conversion cannot proceed, and neither can vectorization and maybe other optimizations. Should we try to identify those "bad" jump threadings and skip them for such architectures?

Andrew Pinski slightly modified the code, and the -fno-tree-dominator-opts trick no longer works.

    #define FFT 128
    typedef struct { signed char exp[FFT]; } feq_t;

    void test(feq_t *feq)
    {
      int k;
      int feqMinimum = 15;
      signed char *exp = feq->exp;
      for (k=0;k15) temp = 15;
        exp[k] = temp;
      }
    }

This time it is jump threading in the VRP pass that causes the trouble. With -fno-tree-vrp, the code can be if-converted and vectorized again.
[Bug tree-optimization/57512] Vectorizer: cannot handle accumulation loop of signed char type
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57512 --- Comment #1 from Bingfeng Mei --- Created attachment 30250 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30250&action=edit Vectorized assembly code with unsigned char type
[Bug tree-optimization/57512] New: Vectorizer: cannot handle accumulation loop of signed char type
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57512

Bug ID: 57512
Summary: Vectorizer: cannot handle accumulation loop of signed char type
Product: gcc
Version: 4.7.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bmei at broadcom dot com

Created attachment 30249
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30249&action=edit
Unvectorized with signed char type.

GCC (I used 4.7.2 with the x86-64 target) cannot vectorize this accumulation loop.

    gcc tst.c -O2 -S -ftree-vectorize -fdump-tree-vect-details

    signed short mac_char (signed char * __restrict__ in1,
                           signed char * __restrict__ in2)
    {
      unsigned i;
      signed short sum = 0;
      for (i = 0; i < 256; i++)
        {
          signed char d1 = in1[i];
          signed char d2 = in2[i];
          sum += ((signed short)d1 * (signed short)d2);
        }
      return sum;
    }

If I change signed char to unsigned char, vectorization does work.

    unsigned short mac_uchar (unsigned char * __restrict__ in1,
                              unsigned char * __restrict__ in2)
    {
      unsigned i;
      unsigned short sum = 0;
      for (i = 0; i < 256; i++)
        {
          unsigned char d1 = in1[i];
          unsigned char d2 = in2[i];
          sum += ((unsigned short)d1 * d2);
        }
      return sum;
    }

Looking into the .vect file, I think the problem is with handling the following gimple stmts. Because of integer promotion, GCC converts the short additions to unsigned short additions and then converts the result back to short. This confuses the vectorizer so that it cannot find the correct vector reduction pattern.

    D.3015_14 = (short unsigned int) D.3014_13;
    sum.0_15 = (short unsigned int) sum_25;
    D.3017_16 = D.3015_14 + sum.0_15;
    sum_17 = (short int) D.3017_16;
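As a rough illustration of the conversion sequence quoted above, the two functions below compute the same accumulation, once as written in the source and once with the explicit unsigned short round trip. This is only a sketch: mac_source_form and mac_gimple_form are hypothetical names, and the conversion of an out-of-range value back to short is assumed to wrap modulo 2^16, which is what GCC documents for its targets.

```c
#include <assert.h>

/* Accumulation as written in the source: short += int product. */
short mac_source_form(const signed char *in1, const signed char *in2, unsigned n)
{
    short sum = 0;
    for (unsigned i = 0; i < n; i++)
        sum += (short) in1[i] * (short) in2[i];
    return sum;
}

/* The same accumulation spelled like the quoted GIMPLE: convert both
   addends to unsigned short, add, then convert the result back to short.
   Assumes the conversion back to short wraps modulo 2^16. */
short mac_gimple_form(const signed char *in1, const signed char *in2, unsigned n)
{
    short sum = 0;
    for (unsigned i = 0; i < n; i++) {
        short prod = (short) (in1[i] * in2[i]);
        unsigned short t =
            (unsigned short) ((unsigned short) prod + (unsigned short) sum);
        sum = (short) t;
    }
    return sum;
}

/* Self-check on a small input whose running sum stays in range:
   1*4 + 2*5 + 3*6 = 32. */
void mac_selfcheck(void)
{
    const signed char a[3] = { 1, 2, 3 };
    const signed char b[3] = { 4, 5, 6 };
    assert(mac_source_form(a, b, 3) == 32);
    assert(mac_gimple_form(a, b, 3) == 32);
}
```

Both forms give the same result, which suggests the extra conversions are a matter of representation rather than semantics; the difficulty described in the report is that the reduction pattern is hidden behind them.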
[Bug rtl-optimization/47258] Extra instruction generated in 4.5.2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258 --- Comment #7 from Bingfeng Mei 2011-12-15 10:18:06 UTC --- Yes, the patch fixes the bug. Thanks.
[Bug rtl-optimization/49157] New: Unnecessary stack save/restore code generated for a leaf function (arm-elf-gcc)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49157

Summary: Unnecessary stack save/restore code generated for a leaf function (arm-elf-gcc)
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: b...@broadcom.com

For the following example:

    struct Complex16 {
      short a;
      short b;
    };

    short foo (struct Complex16 s)
    {
      return s.a + s.b;
    }

Compile with:

    arm-elf-gcc tst.c -O2 -S -mstructure-size-boundary=8

It produces:

    foo:
            @ args = 0, pretend = 0, frame = 4
            @ frame_needed = 0, uses_anonymous_args = 0
            @ link register save eliminated.
            mov     r3, r0, asl #16
            mov     r3, r3, lsr #16
            add     r0, r3, r0, lsr #16
            mov     r0, r0, asl #16
            sub     sp, sp, #4
            mov     r0, r0, asr #16
            add     sp, sp, #4
            bx      lr

The problem is that with -mstructure-size-boundary=8, the structure has BLKmode and is mapped to memory after RTL expand. The memory accesses are optimized away later, but GCC records a stack item anyway and generates stack frame save/restore code for this leaf function.

If we compile without -mstructure-size-boundary=8 (the default is 32), it generates much better code.

    foo:
            @ args = 0, pretend = 0, frame = 0
            @ frame_needed = 0, uses_anonymous_args = 0
            @ link register save eliminated.
            add     r0, r0, r0, asr #16
            mov     r0, r0, asl #16
            mov     r0, r0, asr #16
            bx      lr

This is not limited to ARM gcc. Our target has the same issue because it sets STRUCTURE_SIZE_BOUNDARY = 8 to save data memory. Though I only tested gcc 4.6, I believe trunk gcc probably has the same problem.
[Bug middle-end/45416] [4.5/4.6/4.7 Regression] Code size regression between 4.6/4.7(4.5) and 4.4 for ARM
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45416 --- Comment #8 from Bingfeng Mei 2011-04-28 15:22:26 UTC --- I am currently on vacation until 4/5/2011. I may access my mail irregularly. Cheers, Bingfeng Mei
[Bug rtl-optimization/47258] Extra instruction generated in 4.5.2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258

--- Comment #5 from Bingfeng Mei 2011-01-13 15:49:23 UTC ---
It works. But I have no idea about the debug info issue in your other comment.

> (In reply to comment #2)
> > After tried patches one-by-one, I believe the misoptimization is down to the
> > following patch.
>
> Which is a correctness patch. You can try dumbing it down somewhat with
>
> if (TYPE_MAIN_VARIANT (TREE_TYPE (root1)) != TYPE_MAIN_VARIANT (TREE_TYPE
> (root2))
> || !types_compatible_p (TREE_TYPE (root1), TREE_TYPE (root2)))
>
> and see if that helps.
[Bug rtl-optimization/47258] Extra instruction generated in 4.5.2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258

--- Comment #2 from Bingfeng Mei 2011-01-11 16:16:28 UTC ---
After trying the patches one by one, I believe the misoptimization is down to the following patch.

    Index: tree-ssa-copyrename.c
    ===
    RCS file: /cvs/dev/tools/src/fp_gcc/gcc/tree-ssa-copyrename.c,v
    retrieving revision 1.1.2.5.2.1
    retrieving revision 1.1.2.5.2.2
    diff -u -r1.1.2.5.2.1 -r1.1.2.5.2.2
    --- tree-ssa-copyrename.c 12 Apr 2010 13:15:43 -1.1.2.5.2.1
    +++ tree-ssa-copyrename.c 13 Dec 2010 05:51:45 -1.1.2.5.2.2
    @@ -225,11 +225,11 @@
           ign2 = false;
         }

    -  /* Don't coalesce if the two variables aren't type compatible. */
    -  if (!types_compatible_p (TREE_TYPE (root1), TREE_TYPE (root2)))
    +  /* Don't coalesce if the two variables are not of the same type. */
    +  if (TREE_TYPE (root1) != TREE_TYPE (root2))
         {
           if (debug)
    -fprintf (debug, " : Incompatible types. No coalesce.\n");
    +fprintf (debug, " : Different types. No coalesce.\n");
           return false;
         }
[Bug rtl-optimization/47258] Extra instruction generated in 4.5.2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258 --- Comment #1 from Bingfeng Mei 2011-01-11 13:38:13 UTC --- Created attachment 22944 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22944 Preprocessed test case
[Bug rtl-optimization/47258] New: Extra instruction generated in 4.5.2
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258

Summary: Extra instruction generated in 4.5.2
Product: gcc
Version: 4.5.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: b...@broadcom.com

I encountered a performance regression in 4.5.2 (4.6 as well) compared with 4.5.1. The code is from Core Mark. Compile the attached .i file.

    ~/work/install-x86-452/bin/gcc core_matrix.i -O2 -S -o x86-452.s

    ...
    .L5:
            movl    %r8d, %r10d
    .L3:
            mov     %r9d, %r8d
            movswl  (%rcx,%rax), %r11d
            addq    $2, %rax
            movswl  (%rdx,%r8,2), %r8d
            addl    $1, %r9d
            imull   %r11d, %r8d
            addl    %r10d, %r8d
            cmpq    %rbx, %rax
            jne     .L5
    ...

    ~/work/install-x86-451/bin/gcc core_matrix.i -O2 -S -o x86-451.s

    ...
    .L3:
            mov     %r9d, %r8d
            movswl  (%rcx,%rax), %r11d
            addq    $2, %rax
            movswl  (%rdx,%r8,2), %r8d
            addl    $1, %r9d
            imull   %r11d, %r8d
            addl    %r8d, %r10d
            cmpq    %rbx, %rax
            jne     .L3
    ...

The performance hit is even worse on our architecture because the zero-overhead loop instruction cannot be used for the irregular loop produced by 4.5.2.

The configuration used is:

    ../gcc-4.5.1/configure --prefix=/projects/firepath/tools/work/bmei/install-x86-451 --with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/x86-64 --with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/x86-64 --with-mpc=/projects/firepath/tools/work/bmei/packages/mpc/0.8.1/x86-64 --with-elf=/projects/firepath/tools/work/bmei/packages/libelf/x86-64 --disable-bootstrap --enable-languages=c --no-create --no-recursion

The difference between 4.5.1 and 4.5.2 seems to occur in the RTL expand pass.
[Bug c/45834] Redundant inter-loop edges in DDG
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45834

--- Comment #5 from Bingfeng Mei 2010-10-18 13:53:37 UTC ---
>
> Sure, but we have other means of dealing with that (MEM_ALIAS_SET == 0).

Do you mean this check is redundant here? I dug out the ancient code (from 1997):

    /* If both references are struct references, or both are not, nothing
       is known about aliasing.

       If either reference is QImode or BLKmode, ANSI C permits aliasing.

       If both addresses are constant, or both are not, nothing is known
       about aliasing. */
    if (MEM_IN_STRUCT_P (x) == MEM_IN_STRUCT_P (mem)
        || mem_mode == QImode || mem_mode == BLKmode
        || GET_MODE (x) == QImode || GET_MODE (mem) == BLKmode
        || varies (x_addr) == varies (mem_addr))
      return 1;

The comment indicates that the check for QImode is there to satisfy the aliasing rule for the char type.

>
> > But I am not sure whether a
> > restrict qualifier will override that rule.
>
> restrict is a different concept from type-based aliasing.
>

Sure, but in this example, on the one hand a char pointer is supposed to alias any other data type, and on the other hand all the char pointers have restrict qualifiers. What is the correct behaviour, alias or not?
[Bug c/45834] Redundant inter-loop edges in DDG
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45834 --- Comment #3 from Bingfeng Mei 2010-10-18 12:16:59 UTC --- I think that standard specifies that char * may refer to an alias of any object, that's why QImode is different here. But I am not sure whether a restrict qualifier will override that rule.
[Bug c/45834] Redundant inter-loop edges in DDG
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45834

Bingfeng Mei changed:

    What |Removed |Added
    CC   |        |richard.guenther at gmail dot com

--- Comment #1 from Bingfeng Mei 2010-10-18 11:33:23 UTC ---
Before rtx_refs_may_alias_p is used in may_alias_p, the following statement is executed.

    /* We cannot use aliases_everything_p to test MEM, since we must look
       at MEM_ADDR, rather than XEXP (mem, 0). */
    if (GET_MODE (mem) == QImode || GET_CODE (mem_addr) == AND)
      return 1;

Basically, it means that a QImode memory access always aliases everything else. That explains why the char data type doesn't work here. The code in may_alias_p is mostly copied from true_dependence_1. The comment is not very clear to me. Richard, could you shed some light on this? Why do we need to treat QImode differently?
[Bug c/45416] Code size regression between 4.6(4.5) and 4.4
--- Comment #3 from bmei at broadcom dot com 2010-08-26 12:55 ---
I found I can reproduce the bug with ARM.

ARM trunk -Os:

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov     r2, #1024
        mov     r3, #0
        and     r2, r2, r0
        and     r3, r3, r1
        orrs    r1, r2, r3
        moveq   r0, #0
        movne   r0, #1
        mov     pc, lr
        .size   foo, .-foo

ARM 4.4.0 -Os:

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov     r0, r0, lsr #10
        and     r0, r0, #1
        bx      lr

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45416
[Bug c/45416] Code size regression between 4.6(4.5) and 4.4
--- Comment #2 from bmei at broadcom dot com 2010-08-26 12:47 --- Sorry, I first observed this on our target. Then I tried to reproduce it on x86, but I forgot to turn on the optimization flags. It does work for x86. Please delete this report. I will figure out what happens with my target. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45416
[Bug c/45416] New: Code size regression between 4.6(4.5) and 4.4
This is a performance/size regression between 4.6 (4.5) and 4.4.

The C code:

int foo(long long a)
{
  if (a & (long long) 0x400)
    return 1;
  return 0;
}

Assembly code generated by 4.6 trunk:

foo:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        movq    %rsp, %rbp
        .cfi_offset 6, -16
        .cfi_def_cfa_register 6
        movq    %rdi, -8(%rbp)
        movq    -8(%rbp), %rax
        andl    $1024, %eax
        testq   %rax, %rax
        je      .L2
        movl    $1, %eax
        jmp     .L3
.L2:
        movl    $0, %eax
.L3:
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc

Assembly code generated by 4.4.0:

foo:
.LFB0:
        .cfi_startproc
        shrq    $10, %rdi
        movl    %edi, %eax
        andl    $1, %eax
        ret
        .cfi_endproc

After the tree optimizations, the two compilers produce different but essentially equivalent forms. The RTL expander and later passes then apply different optimizations and generate very different code.

-- 
           Summary: Code size regression between 4.6(4.5) and 4.4
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: bmei at broadcom dot com
  GCC host triplet: x86_64-unknown-linux

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45416
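The transformation visible in the 4.4.0 output (shift bit 10 down, then mask with 1) can be written at the source level as a rough sketch. foo is the function from the report; foo_shift and check_foo_equivalence are names made up here, and the cast to unsigned long long is an assumption added so that the shift is well defined for negative inputs.

```c
#include <assert.h>

/* The function from the report: test bit 10 with a mask. */
int foo(long long a)
{
    if (a & (long long) 0x400)
        return 1;
    return 0;
}

/* Hand-written equivalent of the 4.4.0 code: shift bit 10 into bit 0
   and mask.  The unsigned cast avoids right-shifting a negative value. */
int foo_shift(long long a)
{
    return (int)(((unsigned long long)a >> 10) & 1);
}

/* Self-check over a few values around bit 10, including negatives. */
void check_foo_equivalence(void)
{
    long long samples[] = { 0, 0x3ff, 0x400, 0x401, 0x7ff, 0x800, -1, -0x401 };
    unsigned i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(foo(samples[i]) == foo_shift(samples[i]));
}
```

Both forms return the same result, so the difference between the two compilers is purely one of code size and speed.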
[Bug c/45176] restrict qualifier is not used in a manually unrolled loop
--- Comment #5 from bmei at broadcom dot com 2010-08-05 13:44 --- I tried to apply the patches Richard suggested (this one alone is not enough). It becomes a chain of too many patches in the end. I am not confident any more about applying them to 4.5. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45176
[Bug c/45176] New: restrict qualifier is not used in a manually unrolled loop
void foo (int * restrict a, int * restrict b, int * restrict c)
{
  int i;
  for(i = 0; i < 100; i+=4)
    {
      a[i] = b[i] * c[i];
      a[i+1] = b[i+1] * c[i+1];
      a[i+2] = b[i+2] * c[i+2];
      a[i+3] = b[i+3] * c[i+3];
    }
}

The trunk x86-64 compiler (162821) produces code in which the later load instructions are not scheduled ahead of the earlier store instructions, as would be expected. Clearly, the restrict qualifier is not used here.

~/work/install-x86/bin/gcc tst3.c -O2 -S -std=c99 -da -fschedule-insns -frename-registers

.L2:
        movl    (%rdx,%rax), %r10d
        imull   (%rsi,%rax), %r10d
        movl    %r10d, (%rdi,%rax)
        movl    4(%rdx,%rax), %r9d
        imull   4(%rsi,%rax), %r9d
        movl    %r9d, 4(%rdi,%rax)
        movl    8(%rdx,%rax), %r8d
        imull   8(%rsi,%rax), %r8d
        movl    %r8d, 8(%rdi,%rax)
        movl    12(%rdx,%rax), %ecx
        imull   12(%rsi,%rax), %ecx
        movl    %ecx, 12(%rdi,%rax)
        addq    $16, %rax
        cmpq    $400, %rax

Richard has a patch and it seems to work for this example.

Index: expr.c
===
--- expr.c (revision 162841)
+++ expr.c (working copy)
@@ -8665,7 +8665,7 @@ expand_expr_real_1 (tree exp, rtx target
        set_mem_addr_space (temp, as);
        base = get_base_address (TMR_ORIGINAL (exp));
        if (base
-           && INDIRECT_REF_P (base)
+           && (INDIRECT_REF_P (base) || TREE_CODE (base) == MEM_REF)
            && TMR_BASE (exp)
            && TREE_CODE (TMR_BASE (exp)) == SSA_NAME
            && POINTER_TYPE_P (TREE_TYPE (TMR_BASE (exp

The code generated:

.L2:
        movl    (%rdx,%rax), %r10d
        movl    4(%rdx,%rax), %r9d
        imull   (%rsi,%rax), %r10d
        imull   4(%rsi,%rax), %r9d
        movl    8(%rdx,%rax), %r8d
        movl    12(%rdx,%rax), %ecx
        imull   8(%rsi,%rax), %r8d
        imull   12(%rsi,%rax), %ecx
        movl    %r10d, (%rdi,%rax)
        movl    %r9d, 4(%rdi,%rax)
        movl    %r8d, 8(%rdi,%rax)
        movl    %ecx, 12(%rdi,%rax)
        addq    $16, %rax
        cmpq    $400, %rax
        jne     .L2

-- 
           Summary: restrict qualifier is not used in a manually unrolled loop
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45176
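What the scheduler is allowed to do once restrict is honoured can be pictured at the source level. The following is only a sketch: foo is the loop from the report (with the bound named N here), while foo_loads_first and schedules_agree are hypothetical names introduced for illustration. The second function performs all four loads and multiplies before any store, which mirrors the instruction order in the patched output and is valid only because a, b and c cannot overlap.

```c
#include <assert.h>

enum { N = 100 };

/* The loop from the report. */
void foo(int *restrict a, int *restrict b, int *restrict c)
{
    int i;
    for (i = 0; i < N; i += 4) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }
}

/* Source-level picture of the improved schedule: every load and multiply
   of one iteration is done before the first store.  This reordering is
   only safe because restrict promises that the arrays do not overlap. */
void foo_loads_first(int *restrict a, int *restrict b, int *restrict c)
{
    int i;
    for (i = 0; i < N; i += 4) {
        int t0 = b[i]     * c[i];
        int t1 = b[i + 1] * c[i + 1];
        int t2 = b[i + 2] * c[i + 2];
        int t3 = b[i + 3] * c[i + 3];
        a[i]     = t0;
        a[i + 1] = t1;
        a[i + 2] = t2;
        a[i + 3] = t3;
    }
}

/* Returns 1 when both versions give the same result on disjoint arrays. */
int schedules_agree(void)
{
    int b[N], c[N], out1[N], out2[N];
    int i;
    for (i = 0; i < N; i++) {
        b[i] = i;
        c[i] = 3 - i;
        out1[i] = 0;
        out2[i] = 0;
    }
    foo(out1, b, c);
    foo_loads_first(out2, b, c);
    for (i = 0; i < N; i++)
        if (out1[i] != out2[i])
            return 0;
    return 1;
}
```

If the arrays were allowed to overlap, a store to a[i] could change a later b[i+1] or c[i+1], and the second ordering would no longer be equivalent; that is the dependence the unpatched compiler conservatively keeps.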
[Bug c/44365] New: ICE with -fdump-tree-all
GCC produces an ICE for the following code with -fdump-tree-all. This happens in both 4.4.x and 4.5.0. It is caused by an infinitely recursive call to dump_generic_node (tree-pretty-print.c).

gcc t.c -fdump-tree-all

int main(int argc, char *argv[]){
  int n;
  if(argc==2)
    n=atoi(argv[1]);
  else{
    exit(1);
  }
#define offset(x,y) ((char *)&(x->y))-((char *)x)
  struct {
    int a[n];
    char b[n];
    char c;
  }*bar;
  printf("%d %d %d %d \n",offset(bar,a[0]),offset(bar,b[0]),offset(bar,c),sizeof(*bar));
  return 0;
}

-- 
           Summary: ICE with -fdump-tree-all
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: bmei at broadcom dot com
GCC target triplet: x86_64-unknown-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44365
[Bug lto/41376] collect2 does not handle static libraries
--- Comment #10 from bmei at broadcom dot com 2010-05-24 13:29 --- Annotating functions with externally_visible sounds a bit difficult to maintain. The programmer needs to know whether a function is used outside of LTO objects. This can change over time, and extra effort is needed to keep it correct. It would be better if GCC could derive that information with -fwhole-program, whether it is dealing with LTO object files only or with a mix of LTO and regular object files, since it should have all the symbol reference information by then. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41376
[Bug lto/41376] collect2 does not handle static libraries
--- Comment #8 from bmei at broadcom dot com 2010-05-24 09:31 --- I integrated Dave's patch into LD with some modifications (only emit those with LTO sections) and hacked collect2 to support that. The size gain from LTO, our main concern, is quite limited for our application. A large number of functions that are called only once cannot be inlined across files because the compiler doesn't know whether they are referenced in non-LTO compiled code (mostly hand-coded assembly in our case). We really need a full resolution file, especially the LDPR_PREVAILING_DEF_IRONLY type. Next I will try to make LD emit a full resolution file. Since GNU LD doesn't have plugin support like GOLD, won't any changes here be too invasive or too LTO-specific to be accepted by LD? We are fine living with that in our private port. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41376
[Bug lto/41376] collect2 does not handle static libraries
--- Comment #6 from bmei at broadcom dot com 2010-05-04 16:54 ---
> So this is a rough first draft of the-kind-of-thing-i-was-thinking-of. We get
> collect2 to run a dummy link early, and extract the output from the
> --lto-assist flag to get a list of archive members that we need lto to
> recompile for us.
>

Well, I spent some time reading the collect2/lto code to understand the pros and cons of the different approaches. So far, adding --lto-assist to ld and hacking collect2 looks like a reasonable approach to me, though it does require GNU ld. What extra information should be in a complete symbol resolution file?

-- 
bmei at broadcom dot com changed:

           What    |Removed                     |Added
                 CC|                            |bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41376
[Bug middle-end/34668] [4.3 Regression] ICE in find_compatible_field with -combine
--- Comment #12 from bmei at broadcom dot com 2010-03-09 14:20 --- It seems that this bug still fails on my build: ~/work/install-x86/bin/gcc /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-1.c --combine -O2 /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-2.c -S -o pr34668-1.s /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-2.c: In function 'set_conv_libfunc': /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-2.c:5:15: error: type mismatch in array reference struct optab struct optab # .MEM_3 = VDEF <.MEM_1(D)> optab_table[0].code = 57005; /projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-2.c:5:15: internal compiler error: verify_stmts failed ... My build is revision 143368, target x86_64-unknown-linux-gnu. ../trunk/configure --prefix=/home/bmei/work/install-x86 --with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/x86-64 --with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/x86-64 --with-mpc=/projects/firepath/tools/work/bmei/packages/mpc/0.8.1/x86-64 --enable-languages=c,c++ --disable-bootstrap : (reconfigured) ../trunk/configure --prefix=/home/bmei/work/install-x86 --with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/x86-64 --with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/x86-64 --with-mpc=/projects/firepath/tools/work/bmei/packages/mpc/0.8.1/x86-64 --enable-languages=c --disable-bootstrap : (reconfigured) ../trunk/configure --prefix=/home/bmei/work/install-x86 --with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/x86-64 --with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/x86-64 --with-mpc=/projects/firepath/tools/work/bmei/packages/mpc/0.8.1/x86-64 --disable-bootstrap CC='gcc -static' CFLAGS='-g -O0' --enable-languages=c --no-create --no-recursion -- bmei at broadcom dot com changed: What|Removed |Added CC| |bmei at broadcom dot com 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34668
[Bug tree-optimization/43220] New: Paritially optimized __builtin_save_stack/__builtin_restore_stack causes segmentation fault
I encountered a segmentation fault when executing an unrolled version of 20040811-1.c (tested with -O2).

void *volatile p;

int main (void)
{
  int n = 0;
lab:;
  {
    int x[n % 1000 + 1];
    x[0] = 1;
    x[n % 1000] = 2;
    p = x;
    n++;
  }
  {
    int x[n % 1000 + 1];
    x[0] = 1;
    x[n % 1000] = 2;
    p = x;
    n++;
  }
  if (n < 100)
    goto lab;
  return 0;
}

The problem is that the first __builtin_stack_save/__builtin_stack_restore pair of the unrolled loop is optimized out in optimize_stack_restore (tree-ssa-ccp.c) of the fab pass. Consequently, the dynamically allocated memory grows bigger and bigger and causes the segfault. The following is from tst.c.139t.optimized:

lab:
  saved_stack.1_3 = 0B;
  D.2723_4 = n_1 % 1000;
  D.2724_5 = D.2723_4 + 1;
  D.2728_15 = (long unsigned int) D.2724_5;
  D.2730_16 = D.2728_15 * 4;
  D.2732_17 = __builtin_alloca (D.2730_16);
  x.0_18 = (int[0:D.2727] *) D.2732_17;
  (*x.0_18)[0] = 1;
  (*x.0_18)[D.2723_4] = 2;
  p ={v} x.0_18;
  D.2770_66 = (unsigned int) n_1;
  D.2771_65 = D.2770_66 + 1;
  n_64 = (int) D.2771_65;
  GIMPLE_NOP
  saved_stack.3_21 = __builtin_stack_save ();
  D.2723_22 = n_64 % 1000;
  D.2734_23 = D.2723_22 + 1;
  D.2738_33 = (long unsigned int) D.2734_23;
  D.2740_34 = D.2738_33 * 4;
  D.2742_35 = __builtin_alloca (D.2740_34);
  x.2_36 = (int[0:D.2737] *) D.2742_35;
  (*x.2_36)[0] = 1;
  (*x.2_36)[D.2723_22] = 2;
  p ={v} x.2_36;
  D.2773_62 = D.2770_66 + 2;
  n_61 = (int) D.2773_62;
  __builtin_stack_restore (saved_stack.3_21);
  if (n_61 != 100)
    goto (lab);
  else
    goto ;

-- 
           Summary: Paritially optimized __builtin_save_stack/__builtin_restore_stack causes segmentation fault
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: bmei at broadcom dot com
GCC target triplet: x86_64-unknown-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43220
[Bug c/43098] New: ICE in tree-sra.c with floating point exception
GCC (156804, x86_64-unknown-linux-gnu) generates an ICE when compiling the following code.

typedef __builtin_va_list va_list;
struct __attribute__((aligned (4))) S238 { struct{}a[24]; short b; } ;
struct __attribute__((aligned (4))) S238 a238[5];
extern int fails;

void foo (int z, ...)
{
  struct __attribute__((aligned (4))) S238 arg, *p;
  va_list ap;
  int i;
  __builtin_va_start(ap,z);
  for (i = 0; i < 5; ++i)
    {
      p = ((void *)0);
      p = &a238[2];
      arg = __builtin_va_arg(ap,struct __attribute__((aligned (4))) S238);
      if (p->b != arg.b)
        ++fails;
    }
  __builtin_va_end(ap);
}

~/work/install-x86/bin/gcc t001_y.c -O2 -w
t001_y.c: In function 'foo':
t001_y.c:24:1: internal compiler error: Floating point exception
Please submit a full bug report, with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

The error happens at tree-sra.c:1445, where el_size is 0:

  offset = offset % el_size;

It is likely caused by the following change:

   if (lacc && racc
       && (sra_mode == SRA_MODE_EARLY_INTRA || sra_mode == SRA_MODE_INTRA)
       && !lacc->grp_unscalarizable_region
@@ -1288,7 +1398,12 @@
   if (!tr_size || !host_integerp (tr_size, 1))
     continue;
   size = tree_low_cst (tr_size, 1);
-  if (pos > offset || (pos + size) <= offset)
+  if (size == 0)
+    {
+      if (pos != offset)
+        continue;
+    }
+  else if (pos > offset || (pos + size) <= offset)
     continue;

Here, size = 0, pos = 0 and offset = 0. Previously "continue" was executed in this case, but with this change it no longer is, which causes the ICE later. I am not sure what the intention of the change is, so I will leave it to others to fix.

-- 
           Summary: ICE in tree-sra.c with floating point exception
           Product: gcc
           Version: 4.5.0
            Status: UNCONFIRMED
          Severity: major
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: bmei at broadcom dot com
GCC target triplet: x86_64-unknown-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43098
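The "Floating point exception" here is an integer modulo by zero: in GNU C an empty struct has size 0, so the element size of the array a[24] in S238 is 0. A minimal sketch of the kind of guard that avoids the trap is shown below. It is not the tree-sra.c code and not the eventual fix; offset_in_element and its error convention are assumptions made for this example.

```c
#include <assert.h>

/* Hypothetical helper, not taken from tree-sra.c: reduce an offset to an
   offset within one array element.  Returns 0 on success, or -1 when the
   element size is zero, in which case offset % el_size would trap. */
int offset_in_element(unsigned long offset, unsigned long el_size,
                      unsigned long *result)
{
    if (el_size == 0)
        return -1;              /* refuse to compute offset % 0 */
    *result = offset % el_size;
    return 0;
}

/* Self-check covering the normal case and the zero-size case. */
void check_offset_in_element(void)
{
    unsigned long r = 0;
    assert(offset_in_element(70, 32, &r) == 0 && r == 6);
    assert(offset_in_element(0, 0, &r) == -1);
    assert(r == 6);             /* result untouched on failure */
}
```

Whether a zero-sized element should be skipped, rejected, or treated as matching only at an equal offset is a design decision for the pass; the sketch only shows that the division has to be guarded one way or another.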
[Bug rtl-optimization/36712] Inefficient loop unrolling
--- Comment #6 from bmei at broadcom dot com 2009-05-21 08:38 --- I have only submitted small patches before. Adding a pass (which may need a new command-line option and disabling the old rtl-level unrolling) seems to be a big step to me, and I don't know what the procedure is. My code also contains my own implementation of #pragma unroll. I need to clean it up for a public patch. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712
[Bug rtl-optimization/36712] Inefficient loop unrolling
--- Comment #4 from bmei at broadcom dot com 2009-05-20 14:17 --- I implemented a tree-level loop-unrolling pass in our private port, which takes advantage of the later tree ivopt pass. It produces much better code than rtl-level loop unrolling in such scenarios. I am not sure whether I should submit it for 4.5. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712
[Bug rtl-optimization/36712] New: Inefficient loop unrolling
are/install/bin/arm-elf-as" LD_FOR_TARGET="/home/aashley/work/sourceware/install/bin/arm-elf-ld" ../src/configure --prefix=/home/bmei/work/trunck-arm --enable-languages=c --disable-nls --target=arm-elf --disable-shared --with-mpfr=/projects/firepath/tools/team/packages/x86_64-rhel3-32/mpfr/2.3.0 --with-gmp=/projects/firepath/tools/team/packages/x86_64-rhel3-32/gmp/4.2.2 --disable-libssp -- Summary: Inefficient loop unrolling Product: gcc Version: 4.4.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization AssignedTo: unassigned at gcc dot gnu dot org ReportedBy: bmei at broadcom dot com GCC target triplet: arm-elf-gcc http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712