from:"bmei at broadcom dot com"

[Bug tree-optimization/111036] New: Code generation error in handling __builtin_constant_p

2023-08-16 Thread bmei at broadcom dot com via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111036

Bug ID: 111036
   Summary: Code generation error in handling __builtin_constant_p
   Product: gcc
   Version: 13.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com
  Target Milestone: ---

Compile and run the following code:


#include <stdio.h>
#define __align(n) __attribute__((aligned(n)))

__attribute__((aligned(32))) static struct {
  unsigned long long available_cmd_ids_per_core[2];
} _rl2c_cmd_id_data;

static inline void __attribute__((always_inline))
foo (void *base, size_t length)
{
  unsigned long int p = (unsigned long int) base;
  if (__builtin_constant_p(p) && (p & 31) == 0) {
    printf("constant p && aligned to 32\n");
  } else if (__builtin_constant_p(length)) {
    printf("constant length\n");
  } else {
    printf("else\n");
  }
}

int main(int argc, char **argv)
{
  foo(&_rl2c_cmd_id_data, sizeof(*(&_rl2c_cmd_id_data)));
  return 0;
}


With gcc 12.1.0 and gcc 13.1.0, I get a segmentation fault. With 11.1.0 and
below, I get the correct result. I examined the dumped tree IR. In the einline
pass, a __builtin_unreachable is inserted for the else-if/else branches, as the
compiler probably thinks __builtin_constant_p(p) && (p & 31) == 0 is always
true. But the later passes think __builtin_constant_p(p) is always false.
Therefore all of the code is optimized away.
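
For readers less familiar with the builtin, the following is a minimal sketch of the contract the test case relies on. It does not reproduce the bug. The helper names are made up for illustration; the only assumption is a GCC-compatible compiler, where __builtin_constant_p yields 1 for a literal and 0 for a value that comes from a volatile read.

```c
/* Illustrative only: helper names are hypothetical, not from the report. */

/* A literal is a compile-time constant, so the builtin folds to 1. */
static int literal_is_constant(void)
{
    return __builtin_constant_p(42);
}

/* A volatile object must be read at run time, so the value loaded
   from it can never be treated as a compile-time constant. */
static volatile int runtime_source = 7;

static int runtime_is_constant(void)
{
    int v = runtime_source;
    return __builtin_constant_p(v);
}
```

In the reported test case the argument is the address of a static object, which sits between these two extremes; that is presumably why different passes can reach different answers.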

[Bug tree-optimization/71264] [4.9/5 Regression] ICE in convert_move

2016-07-08 Thread bmei at broadcom dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71264

--- Comment #17 from Bingfeng Mei  ---
OK, I will skip the vectorization check on our port then. Thanks.

[Bug tree-optimization/71264] [4.9/5 Regression] ICE in convert_move

2016-07-08 Thread bmei at broadcom dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71264

Bingfeng Mei  changed:

   What|Removed |Added

 CC||bmei at broadcom dot com

--- Comment #15 from Bingfeng Mei  ---
Hi Richard, I updated to the latest patches, but our target still fails in
the same way that other people reported. footype gets V4QI instead of SI because
our vector_mode_supported_p supports it, hence the following error.

 not vectorized: vector stmt in loop:temp_14 = VIEW_CONVERT_EXPR(_8);

I guess your patch in vect_init_vector is supposed to fix this. But the
execution doesn't even hit vect_init_vector.

[Bug tree-optimization/71383] New: Misoptimized branch with inline assembly code.

2016-06-02 Thread bmei at broadcom dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71383

Bug ID: 71383
   Summary: Misoptimized branch with inline assembly code.
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com
  Target Milestone: ---

For the following example:

#include <stdio.h>
static int a, b;

static void bar()
{
  asm volatile ("" : : : "memory");
}
void foo ()
{
  a = 0;
  bar ();
  if (a == 0)
    printf ("HERE\n");
}

If compiled with:
~/work/install-x86/bin/gcc  tst.c -O2 -S -fno-inline

The conditional printf becomes unconditional. if (a==0) is optimized away.
foo:
.LFB1:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        xorl    %eax, %eax
        movl    $0, a(%rip)
        call    bar
        movl    $.LC0, %edi
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        jmp     puts
        .cfi_endproc

However, if we compile with
~/work/install-x86/bin/gcc  tst.c -O2 -S and allow inlining, gcc produces
correct code. 

foo:
.LFB12:
        .cfi_startproc
        movl    $0, a(%rip)
        movl    a(%rip), %eax
        testl   %eax, %eax
        je      .L4
        rep; ret
        .p2align 4,,10
        .p2align 3
.L4:
        movl    $.LC0, %edi
        jmp     puts

I guess it goes wrong in one of the IPA passes.

My compiler is GCC: (GNU) 7.0.0 20160602 (experimental) [trunk revision 14336].
I can also reproduce this issue on our port of gcc 6.1.

[Bug c/67769] New: VRP pass does wrong optimization

2015-09-29 Thread bmei at broadcom dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67769

Bug ID: 67769
   Summary: VRP pass does wrong optimization
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com
  Target Milestone: ---

#include <stdlib.h>

static int
clamp (int x, int lo, int hi)
{
  return (x < lo) ? lo : ((x > hi) ? hi : x);
}


__attribute__((noinline))
short
foo (int N)
{
  short value =
    clamp (N, 0, 16);


  return value;
}

int main ()
{
  if (foo (-5) != 0)
    abort();
  return 0;
}


Compile this simple code and run it.

bash:bmei:xl-cam-21:34271> ~/scratch/install-x86/bin/gcc tst.c -O2
bash:bmei:xl-cam-21:34272> ./a.out
Aborted
bash:bmei:xl-cam-21:34273> ~/scratch/install-x86/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/home/bmei/scratch/install-x86/bin/gcc
COLLECT_LTO_WRAPPER=/projects/firepath_tools1_scratch/bmei/install-x86/libexec/gcc/x86_64-unknown-linux-gnu/6.0.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../trunk/configure
--prefix=/projects/firepath_tools1_scratch/bmei/install-x86 --disable-nls
--with-mpfr=/projects/firepath_tools/work/bmei/packages/mpfr/2.4.1/x86-64
--with-gmp=/projects/firepath_tools/work/bmei/packages/gmp/4.3.0/x86-64
--with-mpc=/projects/firepath_tools/work/bmei/packages/mpc/0.8.1/x86-64
--disable-libsanitizer --disable-target-libsanitizer CFLAGS='-O0 -g3'
CXXFLAGS='-O0 -g3' --enable-languages=c --no-recursion --disable-bootstrap :
(reconfigured) ../trunk/configure
--prefix=/projects/firepath_tools1_scratch/bmei/install-x86 --disable-nls
--with-mpfr=/projects/firepath_tools/work/bmei/packages/mpfr/2.4.1/x86-64
--with-gmp=/projects/firepath_tools/work/bmei/packages/gmp/4.3.0/x86-64
--with-mpc=/projects/firepath_tools/work/bmei/packages/mpc/0.8.1/x86-64
--disable-libsanitizer --disable-target-libsanitizer CFLAGS='-O0 -g3'
CXXFLAGS='-O0 -g3' --disable-bootstrap --enable-languages=c,lto --no-create
--no-recursion
Thread model: posix
gcc version 6.0.0 20150929 (experimental) [trunk revision 143368] (GCC)


I looked into the tree dump, and the problem seems to be in the VRP2 pass: the
second statement, the MAX_EXPR, is folded away.

Folding statement: iftmp.0_3 = MIN_EXPR ;
Not folded
Folding statement: iftmp.0_6 = MAX_EXPR ;
Folded into: iftmp.0_6 = iftmp.0_3;

Folding statement: value_4 = (short int) iftmp.0_6;
Not folded
Folding statement: return value_4;
Not folded
foo (int N)
[ noinline ]
{
  short int value;
  int iftmp.0_3;
  int iftmp.0_6;

  :
  iftmp.0_3 = MIN_EXPR ;
  iftmp.0_6 = iftmp.0_3;
  value_4 = (short int) iftmp.0_6;
  return value_4;

}
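
To make the effect of the fold concrete, here is a small sketch. It assumes, since the operands are missing from the dump above, that the MIN_EXPR takes (N, 16) and the MAX_EXPR takes (that result, 0). The helper names are made up for illustration.

```c
/* Illustrative only: helper names are hypothetical. */

/* Reference clamp, as written in the report. */
static int clamp_ref(int x, int lo, int hi)
{
    return (x < lo) ? lo : ((x > hi) ? hi : x);
}

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* MIN followed by MAX; equivalent to clamp_ref when lo <= hi. */
static int clamp_minmax(int x, int lo, int hi)
{
    return max_int(min_int(x, hi), lo);
}

/* What remains once the MAX step is folded into a plain copy:
   the lower bound is no longer enforced. */
static int clamp_min_only(int x, int hi)
{
    return min_int(x, hi);
}
```

With x = -5 the full form gives 0, while the MIN-only form passes -5 through, which matches the abort seen in the test case.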

[Bug c/65219] New: GCC wrongly deletes a function which is not completely inlined.

2015-02-26 Thread bmei at broadcom dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65219

Bug ID: 65219
   Summary: GCC wrongly deletes a function which is not completely
inlined.
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com

Compile the following code with gcc 5.0 (
Target: x86_64-unknown-linux-gnu gcc version 5.0.0 20150226 (experimental)
[trunk revision 143368] (GCC))

~/scratch/install-x86/bin/gcc tst.c -O2 -S

#include <stdio.h>
inline int foo()
{
  printf ("HERE\n");
  printf ("HERE\n");
  printf ("HERE\n");
  printf ("HERE\n");
  return 0;
}


int bar1 ()
{
  return foo();
}


__attribute__((optimize("-funsafe-loop-optimizations")))
int bar2 ()
{
 return foo();
}

Resulting assembly code:

        .file   "tst.c"
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "HERE"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB1:
        .text
.LHOTB1:
        .p2align 4,,15
        .globl  bar1
        .type   bar1, @function
bar1:
.LFB12:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movl    $.LC0, %edi
        call    puts
        movl    $.LC0, %edi
        call    puts
        movl    $.LC0, %edi
        call    puts
        movl    $.LC0, %edi
        call    puts
        xorl    %eax, %eax
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE12:
        .size   bar1, .-bar1
        .section        .text.unlikely
.LCOLDE1:
        .text
.LHOTE1:
        .section        .text.unlikely
.LCOLDB2:
        .text
.LHOTB2:
        .p2align 4,,-1
        .globl  bar2
        .type   bar2, @function
bar2:
.LFB13:
        .cfi_startproc
        xorl    %eax, %eax
        jmp     foo
        .cfi_endproc
.LFE13:
        .size   bar2, .-bar2
        .section        .text.unlikely
.LCOLDE2:
        .text
.LHOTE2:
        .ident  "GCC: (GNU) 5.0.0 20150226 (experimental) [trunk revision 143368]"
        .section        .note.GNU-stack,"",@progbits


The function body of foo is gone, but there is still a call to foo left in
bar2. I did some initial investigation. The bar1 function inlines foo in the
einline pass, but bar2 cannot inline it because it has a function-specific
optimize attribute. For some reason the body of foo is removed anyway.

[Bug lto/61868] -frandom-seed always results in random_seed of 0

2014-07-31 Thread bmei at broadcom dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61868

Bingfeng Mei  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from Bingfeng Mei  ---
Fixed in r213321

[Bug lto/61868] -frandom-seed always results in random_seed of 0

2014-07-29 Thread bmei at broadcom dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61868

Bingfeng Mei  changed:

   What|Removed |Added

  Component|driver  |lto

--- Comment #1 from Bingfeng Mei  ---
Changing the component to lto, as gcc should generate LTO section names with
the specified random seed.

[Bug driver/61868] New: -frandom-seed always results in random_seed of 0

2014-07-21 Thread bmei at broadcom dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61868

Bug ID: 61868
   Summary: -frandom-seed always results in random_seed of 0
   Product: gcc
   Version: 4.10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: driver
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com

Compile any simple file with the -frandom-seed and -flto options.

#include <stdio.h>
extern int foo (int);
int bar (int a)
{
  return a * 5;
}

int main ()
{
  printf("%d\n", foo (100));
  return 0;
}

 ~/scratch/install-x86/bin/gcc tst2.c -flto -c -frandom-seed=12345
objdump -D tst2.o|less

You can see that all the LTO sections have a suffix of 0 instead of the
random_seed specified.
<.gnu.lto_.inline.0>

This is because of the following code in toplev.c. If flag_random_seed is set,
get_random_seed does not call init_random_seed, even though init_random_seed
contains the code that derives random_seed from flag_random_seed.

static void
init_random_seed (void)
{
  if (flag_random_seed)
    {
      char *endp;

      /* When the driver passed in a hex number don't crc it again */
      random_seed = strtoul (flag_random_seed, &endp, 0);
      if (!(endp > flag_random_seed && *endp == 0))
        random_seed = crc32_string (0, flag_random_seed);
    }
  else if (!random_seed)
    random_seed = local_tick ^ getpid ();  /* Old racey fallback method */
}

/* Obtain the random_seed.  Unless NOINIT, initialize it if
   it's not provided in the command line.  */

HOST_WIDE_INT
get_random_seed (bool noinit)
{
  if (!flag_random_seed && !noinit)
    init_random_seed ();
  return random_seed;
}
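
The control-flow problem can be modelled outside GCC with the small sketch below. It is a simplified stand-in, not the toplev.c code: only the numeric strtoul path is kept, the crc32_string and local_tick fallbacks are left out, and get_random_seed_sketch is a hypothetical alternative guard, not necessarily what the committed fix does. set_flag is a made-up helper that resets the model state.

```c
#include <stdlib.h>

/* Stand-ins for the GCC globals; the names mirror the excerpt above. */
static const char *flag_random_seed;
static unsigned long random_seed;

/* Simplified init: only the numeric path is modelled. */
static void init_random_seed(void)
{
    if (flag_random_seed)
        random_seed = strtoul(flag_random_seed, NULL, 0);
}

/* Guard as quoted in the report: init is skipped exactly when a seed
   was given on the command line, so random_seed stays 0. */
static unsigned long get_random_seed_reported(int noinit)
{
    if (!flag_random_seed && !noinit)
        init_random_seed();
    return random_seed;
}

/* Hypothetical alternative: initialise whenever no seed has been
   computed yet, regardless of where it will come from. */
static unsigned long get_random_seed_sketch(int noinit)
{
    if (!random_seed && !noinit)
        init_random_seed();
    return random_seed;
}

/* Hypothetical helper: set the option string and clear the seed. */
static void set_flag(const char *s)
{
    flag_random_seed = s;
    random_seed = 0;
}
```

With the option string "12345", the reported guard returns 0 and the alternative guard returns 12345, which mirrors the 0 suffix seen in the section names.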

[Bug tree-optimization/60012] New: Vectorizer generates unnecessary loop versioning for alias

2014-01-31 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60012

Bug ID: 60012
   Summary: Vectorizer generates unnecessary loop versioning for
alias
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com

typedef struct
{
   short real;
   short imag;
} complex16_t;

void
libvector_AccSquareNorm_ref (unsigned long long  *acc,
 const complex16_t *x, unsigned len)
{
for (unsigned i = 0; i < len; i++)
{
acc[i] +=
((unsigned long long)((int)x[i].real * x[i].real)) +
((unsigned long long)((int)x[i].imag * x[i].imag));
}
}

Compile the code with
~/scratch/install-x86/bin/gcc tst.c -O2 -S -ftree-vectorize
-fdump-tree-vect-details -std=c99

GCC generates unnecessary loop versioning because it cannot disambiguate the
memory accesses.

tst.c:12:5: note: versioning for alias required: can't determine dependence
between *_8 and _12->real
tst.c:12:5: note: mark for run-time aliasing test between *_8 and _12->real

This should be handled by TBAA info as acc & x clearly point to different data
types. But unfortunately, TBAA doesn't handle Anti- & Output- dependencies.
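
For comparison, the standard C99 way to state the no-overlap property explicitly is the restrict qualifier. The variant below is a sketch only: the function name is made up, and whether it removes the runtime alias check in this GCC version has not been verified here.

```c
typedef struct
{
    short real;
    short imag;
} complex16_t;

/* Same loop as in the report, with restrict added to both pointers.
   The caller promises that acc and x do not overlap. */
void acc_square_norm_restrict(unsigned long long *restrict acc,
                              const complex16_t *restrict x,
                              unsigned len)
{
    for (unsigned i = 0; i < len; i++) {
        acc[i] += ((unsigned long long)((int)x[i].real * x[i].real))
                + ((unsigned long long)((int)x[i].imag * x[i].imag));
    }
}
```

This does not address the missed optimization itself, since the point of the report is that the differing types should already be enough.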

[Bug tree-optimization/59651] [4.9 Regression] Vectorizer failing to spot dependence causes incorrect code generation.

2014-01-02 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651

--- Comment #5 from Bingfeng Mei  ---
Created attachment 31559
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31559&action=edit
initial patch

Hi Tejas, vect_create_cond_for_alias_checks contains a bug in handling a
negative step. The computed data access range should be shifted by
TYPE_SIZE_UNIT bytes. Could you test the attached patch on aarch64 (I don't
have a simulation environment set up)? Meanwhile I will check whether there is
any regression on x86-64. If everything is right, I am going to submit the
patch. Thanks.
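
The arithmetic behind the shift can be illustrated with a small model. This is not the GCC code; the type and function names are made up, and ranges are half-open byte intervals relative to the start of the array. For a store a[b] with b running from 3 down to 0 on a 4-byte int, the first access is at byte 12 and the step is -4.

```c
/* Illustrative model only: names are hypothetical. */

/* Half-open byte range [lo, hi). */
typedef struct { long lo; long hi; } byte_range;

/* Naive range for a negative step: from start + step * niters up to
   start.  It excludes the first element accessed and includes one
   element below the last. */
static byte_range range_naive(long start, long step, long niters)
{
    byte_range r = { start + step * niters, start };
    return r;
}

/* The same range shifted up by one element size, so that it covers
   exactly the bytes that are touched. */
static byte_range range_shifted(long start, long step, long niters,
                                long elem_size)
{
    byte_range r = range_naive(start, step, niters);
    r.lo += elem_size;
    r.hi += elem_size;
    return r;
}

static int ranges_overlap(byte_range a, byte_range b)
{
    return a.lo < b.hi && b.lo < a.hi;
}
```

With start = 12, step = -4 and 4 iterations, the naive range is [-4, 12) and misses the load of a[3] at [12, 16); the shifted range is [0, 16) and detects the overlap.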

[Bug tree-optimization/59651] [4.9 Regression] Vectorizer failing to spot dependence causes incorrect code generation.

2013-12-31 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651

--- Comment #3 from Bingfeng Mei  ---
I can reproduce it on aarch64. I am still trying to understand why. I
constructed a similar test but with a positive loop step.

extern void abort (void);
int a[] = { 6, 0, 0, 0 };

int b;
int
main ()
{
  for (;;)
    {
      b = 0;
      for (; b<3; b += 1)
        a[b] = a[0] > 1;
      break;
    }
  if (a[2] != 0)
    abort ();
  return 0;
}

Actually GCC behaves similarly during vectorization and does vectorize the
loop. The only difference is around loop versioning. 

pr52943.c
  :
  if (1 != 0)
goto ;
  else
goto ;

bb 11 leads to vectorized version. So scalar version gets optimized out.

Above example:
  :
  if (0 != 0)
goto ;
  else
goto ;
So vectorized version goes away and only scalar version remains.

[Bug tree-optimization/59651] Vectorizer failing to spot dependence causes incorrect code generation.

2013-12-31 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59651

--- Comment #1 from Bingfeng Mei  ---
That is interesting. On x86-64, GCC does say it cannot determine dist vector
between a[3] & a[b] and needs run-time aliasing test. In the end it gives up
due to too few iterations. 

note: === vect_analyze_data_ref_dependences ===
(compute_affine_dependence
  stmt_a: _5 = a[3];
  stmt_b: a[b.0_16] = _7;
(analyze_overlapping_iterations 
  (chrec_a = 3)
  (chrec_b = {3, +, -1}_1)
(analyze_siv_subscript 
)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
(Dependence relation cannot be represented by distance vector.) 
)
(compute_affine_dependence
  stmt_a: _5 = a[3];
  stmt_b: _5 = a[3];
(analyze_overlapping_iterations 
  (chrec_a = 3)
  (chrec_b = 3)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
(compute_affine_dependence
  stmt_a: a[b.0_16] = _7;
  stmt_b: a[b.0_16] = _7;
(analyze_overlapping_iterations 
  (chrec_a = {3, +, -1}_1)
  (chrec_b = {3, +, -1}_1)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
/projects/firepath_tools1_scratch/bmei/trunk/gcc/testsuite/gcc.dg/torture/pr52943.c:13:7:
note: versioning for alias required: bad dist vector for a[3] and a[b.0_16]
/projects/firepath_tools1_scratch/bmei/trunk/gcc/testsuite/gcc.dg/torture/pr52943.c:13:7:
note: mark for run-time aliasing test between a[3] and a[b.0_16]

[Bug tree-optimization/59544] Vectorizing store with negative step

2013-12-30 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544

Bingfeng Mei  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #2 from Bingfeng Mei  ---
Patch checked in at r206148. It triggers pr59569, which is fixed by a separate
patch (r206179).

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2013-12-30 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947

Bug 53947 depends on bug 59544, which changed state.

Bug 59544 Summary: Vectorizing store with negative step
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

[Bug middle-end/59569] [4.9 Regression] r206148 causes internal compiler error: in vect_create_destination_var, at tree-vect-data-refs.c:4294

2013-12-23 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59569

--- Comment #9 from Bingfeng Mei  ---
It seems a simple patch is to bypass the permutation for a constant operand,
as vec_oprnd is then a constant vector with identical elements.

Index: tree-vect-stmts.c
===
--- tree-vect-stmts.c   (revision 206176)
+++ tree-vect-stmts.c   (working copy)
@@ -5353,7 +5353,8 @@ vectorizable_store (gimple stmt, gimple_
set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
misalign);

- if (negative)
+ if (negative
+ && !CONSTANT_CLASS_P (gimple_assign_rhs1 (stmt)))
{
  tree perm_mask = perm_mask_for_reverse (vectype);
  tree perm_dest

[Bug middle-end/59569] [4.9 Regression] r206148 causes internal compiler error: in vect_create_destination_var, at tree-vect-data-refs.c:4294

2013-12-23 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59569

--- Comment #8 from Bingfeng Mei  ---
Sorry for the regression. The assertion fires when storing a constant value
with a negative step. Permuting a constant is not the best optimization
here. So the easy fix is to skip vectorizing this statement, in the same
way as before the patch. Or maybe a better way is to form a constant vector to
store.
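
The reason the permutation is unnecessary for a constant can be shown with a small model: reversing the lanes of a vector whose elements are all identical gives back the same vector. The sketch below uses plain arrays in place of vector registers; LANES and the function names are made up for illustration.

```c
#include <string.h>

#define LANES 8   /* hypothetical lane count, e.g. 8 shorts in 128 bits */

/* Reverse the lane order, as a reversing permutation would. */
static void reverse_lanes(const short *in, short *out)
{
    for (int i = 0; i < LANES; i++)
        out[i] = in[LANES - 1 - i];
}

/* Returns 1 when reversing leaves the vector unchanged. */
static int reverse_is_noop(const short *v)
{
    short r[LANES];
    reverse_lanes(v, r);
    return memcmp(v, r, sizeof r) == 0;
}
```

A splat such as {5, 5, ..., 5} is unchanged by the reversal, while a ramp such as {0, 1, ..., 7} is not.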

[Bug tree-optimization/59544] New: Vectorizing store with negative stop

2013-12-18 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59544

Bug ID: 59544
   Summary: Vectorizing store with negative stop
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com

Created attachment 31467
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31467&action=edit
The patch against r206016

I was looking at some loops that can be vectorized by LLVM, but not GCC. One
type of loop has a store with a negative step.

void test1(short * __restrict__ x, short * __restrict__ y,
           short * __restrict__ z)
{
    int i;
    for (i=127; i>=0; i--) {
        x[i] = y[127-i] + z[127-i];
    }
}

I don't know why GCC only implements negative step for load, but not store. I
implemented a patch (attached), very similar to code in vectorizable_load. 
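
The transformation can be emulated in scalar C to show what a vectorized negative-step store has to do: load a chunk forwards, add, reverse the lanes, and store the chunk at the descending address. This is a sketch of the idea, not the patch; VF and the function names are made up, with VF = 8 standing for 8 shorts in a 128-bit register.

```c
#define N 128
#define VF 8   /* hypothetical lanes per vector */

/* Scalar reference, same loop as in the report. */
static void test1_ref(short *x, const short *y, const short *z)
{
    for (int i = N - 1; i >= 0; i--)
        x[i] = y[N - 1 - i] + z[N - 1 - i];
}

/* Chunked emulation of the vectorized loop. */
static void test1_chunked(short *x, const short *y, const short *z)
{
    for (int j = 0; j < N; j += VF) {
        short sum[VF];

        /* Forward loads and lane-wise add. */
        for (int l = 0; l < VF; l++)
            sum[l] = y[j + l] + z[j + l];

        /* Lowest address written by this chunk. */
        short *dst = x + (N - VF - j);

        /* Store with the lane order reversed. */
        for (int l = 0; l < VF; l++)
            dst[l] = sum[VF - 1 - l];
    }
}
```

The offset N - VF - j is 120 elements, or 240 bytes, for the first chunk.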

~/scratch/install-x86/bin/gcc ghs-dec.c -ftree-vectorize -S -O2 -mavx

Without patch:
test1:
.LFB0:
        addq    $254, %rdi
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        movzwl  (%rsi,%rax), %ecx
        subq    $2, %rdi
        addw    (%rdx,%rax), %cx
        addq    $2, %rax
        movw    %cx, 2(%rdi)
        cmpq    $256, %rax
        jne     .L2
        rep; ret

With patch:
test1:
.LFB0:
        vmovdqa .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L2:
        vmovdqu (%rsi,%rax), %xmm0
        movq    %rax, %rcx
        negq    %rcx
        vpaddw  (%rdx,%rax), %xmm0, %xmm0
        vpshufb %xmm1, %xmm0, %xmm0
        addq    $16, %rax
        cmpq    $256, %rax
        vmovups %xmm0, 240(%rdi,%rcx)
        jne     .L2
        rep; ret

Performance is definitely improved here. The patch is bootstrapped on
x86_64-unknown-linux-gnu and causes no additional regressions on my machine.

For reference, LLVM seems to use different instructions and generates slightly
worse code. I am not so familiar with x86 assembly code. The patch was
originally written for our private port.
test1:                                  # @test1
        .cfi_startproc
# BB#0:                                 # %entry
        addq    $240, %rdi
        xorl    %eax, %eax
        .align  16, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  (%rdx,%rax,2), %xmm1
        paddw   %xmm0, %xmm1
        shufpd  $1, %xmm1, %xmm1        # xmm1 = xmm1[1,0]
        pshuflw $27, %xmm1, %xmm0       # xmm0 = xmm1[3,2,1,0,4,5,6,7]
        pshufhw $27, %xmm0, %xmm0       # xmm0 = xmm0[0,1,2,3,7,6,5,4]
        movdqu  %xmm0, (%rdi)
        addq    $8, %rax
        addq    $-16, %rdi
        cmpq    $128, %rax
        jne     .LBB0_1
# BB#2:                                 # %for.end
        ret

[Bug tree-optimization/59249] if-conversion doesn't handle basic-blocks with only critical predecessor edges

2013-11-26 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59249

--- Comment #4 from Bingfeng Mei  ---
Even if I split one critical predecessor edge, the predicate of BB6 is still
the ORed result of two conditions from BB4 & BB5. ORing two conditions results
in a sequence of statements that cannot be vectorized. The vectorizer complains
of "bit-precision arithmetic not supported" because of the boolean operations.

I am not sure how to transform the code except by reverting to a form similar
to the one before jump threading.

[Bug tree-optimization/59249] if-conversion doesn't handle basic-blocks with only critical predecessor edges

2013-11-25 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59249

--- Comment #3 from Bingfeng Mei  ---
Richard, I am not sure I understand how to split the edge.

 BB4 
 / \
/   \
  BB5|
   |\|
   | \   |
   |  \  |
   |   BB6
   |   /
   |  /
   BB7


The compiler (HEAD) complains "only critical predecessors of BB6" (its
predecessor BB5 has more than one successor). Do you suggest splitting the edge
between BB5 & BB6 and inserting an empty BB?

In the email thread, you blame the poor implementation of tree-level
if-conversion. But the RTL-level CE passes cannot handle that either.

[Bug tree-optimization/59249] New: Jump threading makes if-conversion and following vectorization impossible.

2013-11-22 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59249

Bug ID: 59249
   Summary: Jump threading makes if-conversion and following
vectorization impossible.
   Product: gcc
   Version: 4.8.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com

I am doing some investigation on loops that can be vectorized by LLVM, but not
GCC. One example is a loop that contains more than one if-else construct.

typedef signed char int8;
#define FFT 128

typedef struct {
int8   exp[FFT];
} feq_t;

void test(feq_t *feq)
{
int k;
int feqMinimum = 15;
int8 *exp = feq->exp;

for (k=0;k15) exp[k]  = 15;
}
}

Compile it with 4.8.2 on x86_64
~/install-4.8/bin/gcc ghs-algorithms_380.c -O2 -fdump-tree-ifcvt-details
-ftree-vectorize  -save-temps

It is not vectorized because if-else constructs inside the loop cannot be
if-converted. Looking into .ifcvt file, this is due to bad if-else structure
(ifcvt pass complains "only critical predecessors"). One branch jumps directly
into another branch. Digging a bit deeper, I found such structure is generated
by dom1 pass doing jump threading optimization. 

So recompile with 

~/install-4.8/bin/gcc ghs-algorithms_380.c -O2 -fdump-tree-ifcvt-details
-ftree-vectorize  -save-temps -fno-tree-dominator-opts

It is magically if-converted and vectorized! Same on our target, performance is
improved greatly in this example.

It seems to me that doing jump threading for architectures that support
if-conversion is not a good idea. The original if-else structures are damaged
so that if-conversion cannot proceed, and neither can vectorization and maybe
other optimizations. Should we try to identify those "bad" jump threadings and
skip them for such architectures?

Andrew Pinski slightly modified the code, and the -fno-tree-dominator-opts
trick no longer works.

#define FFT 128

typedef struct {
signed char   exp[FFT];
} feq_t;

void test(feq_t *feq)
{
int k;
int feqMinimum = 15;
signed char *exp = feq->exp;

for (k=0;k15) temp  = 15;
exp[k] = temp;
}
}

But this time it is jump threading in the VRP pass that causes the trouble.
With -fno-tree-vrp, the code can be if-converted and vectorized again.
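
For readers unfamiliar with the term, if-conversion replaces a conditional store with an unconditional store of a selected value, which is the shape the vectorizer needs. The sketch below shows the idea on a one-sided saturation. It is not the loop from the report, whose body is partly lost in the listing above; the function names and the limit parameter are made up for illustration.

```c
/* Illustrative only: not the loop from the report. */

/* Branchy form: the store happens only when the condition holds. */
static void saturate_branchy(signed char *v, int n, int limit)
{
    for (int k = 0; k < n; k++)
        if (v[k] > limit)
            v[k] = (signed char)limit;
}

/* If-converted form: every iteration stores, and the condition only
   selects which value is stored. */
static void saturate_select(signed char *v, int n, int limit)
{
    for (int k = 0; k < n; k++) {
        signed char cur = v[k];
        v[k] = (cur > limit) ? (signed char)limit : cur;
    }
}
```

Both forms compute the same result; only the control flow differs.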

[Bug tree-optimization/57512] Vectorizer: cannot handle accumulation loop of signed char type

2013-06-03 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57512

--- Comment #1 from Bingfeng Mei  ---
Created attachment 30250
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30250&action=edit
Vectorized assembly code with unsigned char type

[Bug tree-optimization/57512] New: Vectorizer: cannot handle accumulation loop of signed char type

2013-06-03 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57512

Bug ID: 57512
   Summary: Vectorizer: cannot handle accumulation loop of signed
char type
   Product: gcc
   Version: 4.7.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bmei at broadcom dot com

Created attachment 30249
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30249&action=edit
Unvectorized with signed char type.

GCC (I used 4.7.2 x86-64 target) cannot vectorize this accumulation loop.
gcc tst.c -O2 -S -ftree-vectorize -fdump-tree-vect-details

signed short mac_char (signed char * __restrict__ in1,
                       signed char * __restrict__ in2)
{
  unsigned i;
  signed short sum = 0;
  for (i = 0; i < 256; i++)
  {
    signed char d1 = in1[i];
    signed char d2 = in2[i];

    sum += ((signed short)d1 * (signed short)d2);
  }
  return sum;
}

If I change signed char to unsigned char, vectorization does work.

unsigned short mac_uchar (unsigned char * __restrict__ in1,
                          unsigned char * __restrict__ in2)
{
  unsigned i;
  unsigned short sum = 0;
  for (i = 0; i < 256; i++)
  {
    unsigned char d1 = in1[i];
    unsigned char d2 = in2[i];

    sum += ((unsigned short)d1 * d2);
  }
  return sum;
}


Looking into the .vect file, I think the problem is with the handling of the
following gimple stmts. GCC converts short additions to unsigned short
additions and then converts the result back to short because of integer
promotion. This confuses the vectorizer, so it cannot find the correct vector
reduction pattern.

  D.3015_14 = (short unsigned int) D.3014_13;
  sum.0_15 = (short unsigned int) sum_25;
  D.3017_16 = D.3015_14 + sum.0_15;
  sum_17 = (short int) D.3017_16;
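
The four gimple statements can be written back as C to show that the detour through unsigned short computes the same value as the source loop. This is a sketch with made-up function names; the comments map each line to the statement it is assumed to mirror. It relies on GCC's documented modulo behaviour when converting an out-of-range value to short.

```c
/* Illustrative only: function names are hypothetical. */

/* Source-level form: accumulate directly in a signed short. */
static short mac_signed(const signed char *in1, const signed char *in2,
                        unsigned n)
{
    short sum = 0;
    for (unsigned i = 0; i < n; i++)
        sum += (short)in1[i] * (short)in2[i];
    return sum;
}

/* Gimple-like form: add in unsigned short, then convert back. */
static short mac_via_unsigned(const signed char *in1,
                              const signed char *in2, unsigned n)
{
    short sum = 0;
    for (unsigned i = 0; i < n; i++) {
        /* D.3015_14 = (short unsigned int) D.3014_13; */
        unsigned short prod =
            (unsigned short)((short)in1[i] * (short)in2[i]);
        /* sum.0_15 = (short unsigned int) sum_25; */
        unsigned short acc = (unsigned short)sum;
        /* D.3017_16 = D.3015_14 + sum.0_15; */
        unsigned short res = (unsigned short)(prod + acc);
        /* sum_17 = (short int) D.3017_16; */
        sum = (short)res;
    }
    return sum;
}
```

The two forms agree, so the extra conversions are only an obstacle to pattern matching, not a change in semantics.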

[Bug rtl-optimization/47258] Extra instruction generated in 4.5.2

2011-12-15 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258

--- Comment #7 from Bingfeng Mei 2011-12-15 10:18:06 UTC ---
Yes, the patch fixes the bug. Thanks.

[Bug rtl-optimization/49157] New: Unnecessary stack save/restore code generated for a leaf function (arm-elf-gcc)

2011-05-25 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49157

   Summary: Unnecessary stack save/restore code generated for a
leaf function (arm-elf-gcc)
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: b...@broadcom.com


For the following example:
struct  Complex16{
  short a;
  short b;
};


short foo (struct Complex16 s)
{
  return s.a + s.b;
}


Compile with:

arm-elf-gcc tst.c -O2 -S -mstructure-size-boundary=8

It produces:
foo:
        @ args = 0, pretend = 0, frame = 4
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov     r3, r0, asl #16
        mov     r3, r3, lsr #16
        add     r0, r3, r0, lsr #16
        mov     r0, r0, asl #16
        sub     sp, sp, #4
        mov     r0, r0, asr #16
        add     sp, sp, #4
        bx      lr

The problem is that with -mstructure-size-boundary=8 the structure has BLKmode
and is mapped to memory after RTL expand. The memory accesses are optimized
away later, but GCC records a stack item anyway and generates stack frame
save/restore code for this leaf function.

If we compile without -mstructure-size-boundary=8 (default is 32), it generates
much better code.

foo:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        add     r0, r0, r0, asr #16
        mov     r0, r0, asl #16
        mov     r0, r0, asr #16
        bx      lr

This is not limited to ARM gcc. Our target has the same issue because
STRUCTURE_SIZE_BOUNDARY = 8 to save data memory size.

Though I only tested gcc 4.6, I believe trunk gcc probably has the same
problem.

[Bug middle-end/45416] [4.5/4.6/4.7 Regression] Code size regression between 4.6/4.7(4.5) and 4.4 for ARM

2011-04-28 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45416

--- Comment #8 from Bingfeng Mei 2011-04-28 15:22:26 UTC ---
I am currently on vacation until 4/5/2011. I may access my mail irregularly.
Cheers,
Bingfeng Mei

[Bug rtl-optimization/47258] Extra instruction generated in 4.5.2

2011-01-13 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258

--- Comment #5 from Bingfeng Mei 2011-01-13 15:49:23 UTC ---
It works. But I have no idea about the debug info issue in your other comment. 

> (In reply to comment #2)
> > After tried patches one-by-one, I believe the misoptimization is down to the
> > following patch.
> 
> Which is a correctness patch.  You can try dumbing it down somewhat with
> 
> if (TYPE_MAIN_VARIANT (TREE_TYPE (root1)) != TYPE_MAIN_VARIANT (TREE_TYPE
> (root2))
> || !types_compatible_p (TREE_TYPE (root1), TREE_TYPE (root2)))
> 
> and see if that helps.

[Bug rtl-optimization/47258] Extra instruction generated in 4.5.2

2011-01-11 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258

--- Comment #2 from Bingfeng Mei 2011-01-11 16:16:28 UTC ---
After trying the patches one by one, I believe the misoptimization is down to
the following patch.

Index: tree-ssa-copyrename.c
===
RCS file: /cvs/dev/tools/src/fp_gcc/gcc/tree-ssa-copyrename.c,v
retrieving revision 1.1.2.5.2.1
retrieving revision 1.1.2.5.2.2
diff -u -r1.1.2.5.2.1 -r1.1.2.5.2.2
--- tree-ssa-copyrename.c12 Apr 2010 13:15:43 -1.1.2.5.2.1
+++ tree-ssa-copyrename.c13 Dec 2010 05:51:45 -1.1.2.5.2.2
@@ -225,11 +225,11 @@
   ign2 = false;
 }

-  /* Don't coalesce if the two variables aren't type compatible.  */
-  if (!types_compatible_p (TREE_TYPE (root1), TREE_TYPE (root2)))
+  /* Don't coalesce if the two variables are not of the same type.  */
+  if (TREE_TYPE (root1) != TREE_TYPE (root2))
 {
   if (debug)
-fprintf (debug, " : Incompatible types.  No coalesce.\n");
+fprintf (debug, " : Different types.  No coalesce.\n");
   return false;
 }

[Bug rtl-optimization/47258] Extra instruction generated in 4.5.2

2011-01-11 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258

--- Comment #1 from Bingfeng Mei 2011-01-11 13:38:13 UTC ---
Created attachment 22944
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22944
Preprocessed test case

[Bug rtl-optimization/47258] New: Extra instruction generated in 4.5.2

2011-01-11 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47258

   Summary: Extra instruction generated in 4.5.2
   Product: gcc
   Version: 4.5.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: b...@broadcom.com


I encountered a performance regression in 4.5.2 (4.6 as well) compared with
4.5.1.

The code is from CoreMark.

Compile the attached .i file.

~/work/install-x86-452/bin/gcc core_matrix.i -O2 -S -o x86-452.s
...
.L5:
        movl    %r8d, %r10d
.L3:
        mov     %r9d, %r8d
        movswl  (%rcx,%rax), %r11d
        addq    $2, %rax
        movswl  (%rdx,%r8,2), %r8d
        addl    $1, %r9d
        imull   %r11d, %r8d
        addl    %r10d, %r8d
        cmpq    %rbx, %rax
        jne     .L5
...

~/work/install-x86-451/bin/gcc core_matrix.i -O2 -S -o x86-451.s
...
.L3:
        mov     %r9d, %r8d
        movswl  (%rcx,%rax), %r11d
        addq    $2, %rax
        movswl  (%rdx,%r8,2), %r8d
        addl    $1, %r9d
        imull   %r11d, %r8d
        addl    %r8d, %r10d
        cmpq    %rbx, %rax
        jne     .L3
...

The performance hit is even worse on our architecture because the
zero-overhead loop instruction cannot be used in the irregular loop produced by
4.5.2.

The configuration used is:
../gcc-4.5.1/configure
--prefix=/projects/firepath/tools/work/bmei/install-x86-451
--with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/x86-64
--with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/x86-64
--with-mpc=/projects/firepath/tools/work/bmei/packages/mpc/0.8.1/x86-64
--with-elf=/projects/firepath/tools/work/bmei/packages/libelf/x86-64
--disable-bootstrap --enable-languages=c --no-create --no-recursion


The difference between 4.5.1 and 4.5.2 seems to arise in the RTL expand pass.
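
For readers without the attachment, a loop of roughly the following shape would
be consistent with the listings above (two sign-extending 16-bit loads, a
multiply, and an add into an accumulator). This is only a sketch: the name
dot_s16 and its signature are hypothetical, and it is not the attached
core_matrix.i. In the 4.5.1 listing the accumulator %r10d is updated in place,
while in the 4.5.2 listing the sum lands in %r8d and is copied back at .L5.

```c
#include <stddef.h>

/* Hypothetical reduction loop, assumed to resemble the hot loop in the
   listings; it is NOT the attached core_matrix.i test case. */
int dot_s16(const short *a, const short *b, size_t n)
{
    int acc = 0;                        /* accumulator (%r10d in the listings) */
    for (size_t i = 0; i < n; i++)
        acc += (int)a[i] * (int)b[i];   /* two movswl loads, imull, addl */
    return acc;
}
```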

[Bug c/45834] Redundant inter-loop edges in DDG

2010-10-18 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45834

--- Comment #5 from Bingfeng Mei  2010-10-18 13:53:37 
UTC ---
> 
> Sure, but we have other means of dealing with that (MEM_ALIAS_SET == 0).

Do you mean this check is redundant here? I dug out the ancient code (from
1997):

  /* If both references are struct references, or both are not, nothing
 is known about aliasing.

 If either reference is QImode or BLKmode, ANSI C permits aliasing.

 If both addresses are constant, or both are not, nothing is known
 about aliasing.  */
  if (MEM_IN_STRUCT_P (x) == MEM_IN_STRUCT_P (mem)
  || mem_mode == QImode || mem_mode == BLKmode
  || GET_MODE (x) == QImode || GET_MODE (mem) == BLKmode
  || varies (x_addr) == varies (mem_addr))
return 1;

The comment indicates that the check for QImode is there to satisfy the aliasing
rule for the char type.

> 
> > But I am not sure whether a
> > restrict qualifier will override that rule.
> 
> restrict is a different concept from type-based aliasing.
> 
Sure, but in this example, on the one hand a char pointer is supposed to alias
any other data type, and on the other hand all the char pointers have restrict
qualifiers. What is the correct behaviour: alias or not?
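
For illustration only, the situation being asked about can be sketched as
follows. The function name add_bytes and its parameters are hypothetical and
are not taken from the test case attached to this PR. All three pointers have
character type, which type-based aliasing treats as able to alias anything,
yet each carries a restrict qualifier, by which the programmer asserts that
the objects accessed through them do not overlap during the call.

```c
#include <stddef.h>

/* Hypothetical example (not the PR's test case): every pointer is a
   character pointer, and every pointer is restrict-qualified. */
void add_bytes(char *restrict dst, const char *restrict a,
               const char *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (char)(a[i] + b[i]);   /* store through dst, loads via a, b */
}
```

As the C99 definition of restrict appears to read, that assertion does not
depend on the pointed-to type, which is in line with the quoted remark that
restrict is a different concept from type-based aliasing.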

[Bug c/45834] Redundant inter-loop edges in DDG

2010-10-18 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45834

--- Comment #3 from Bingfeng Mei  2010-10-18 12:16:59 
UTC ---
I think the standard specifies that a char * may alias any object; that is
why QImode is different here. But I am not sure whether a
restrict qualifier will override that rule.
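
A small sketch of that rule, with a hypothetical function name
(store_then_load) chosen only for illustration: because an unsigned char
lvalue may access the bytes of any object, the store through cp below may
modify *ip, so the compiler cannot assume the value stored to *ip survives.

```c
/* Hypothetical illustration of character-type aliasing. */
int store_then_load(int *ip, unsigned char *cp)
{
    *ip = 0;
    *cp = 1;        /* may write into a byte of *ip */
    return *ip;     /* must be reloaded, not folded to 0 */
}
```

Called as store_then_load(&x, (unsigned char *)&x), the result is non-zero,
since one byte of x has been set to 1.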

[Bug c/45834] Redundant inter-loop edges in DDG

2010-10-18 Thread bmei at broadcom dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45834

Bingfeng Mei  changed:

   What|Removed |Added

 CC||richard.guenther at gmail
   ||dot com

--- Comment #1 from Bingfeng Mei  2010-10-18 11:33:23 
UTC ---
Before rtx_refs_may_alias_p is called in may_alias_p, the following statement is
executed.


  /* We cannot use aliases_everything_p to test MEM, since we must look
 at MEM_ADDR, rather than XEXP (mem, 0).  */
  if (GET_MODE (mem) == QImode || GET_CODE (mem_addr) == AND)
return 1;

Basically, it means that a QImode memory access always aliases
everything else. That explains why the char data type doesn't work here. The code
in may_alias_p is mostly copied from true_dependence_1. The comment is not very
clear to me. Richard, could you shed some light on this? Why do we need to treat
QImode differently?

[Bug c/45416] Code size regression between 4.6(4.5) and 4.4

2010-08-26 Thread bmei at broadcom dot com



--- Comment #3 from bmei at broadcom dot com  2010-08-26 12:55 ---
I found I can reproduce the bug with ARM

ARM trunk -Os:
foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
mov r2, #1024
mov r3, #0
and r2, r2, r0
and r3, r3, r1
orrs    r1, r2, r3
moveq   r0, #0
movne   r0, #1
mov pc, lr
.size   foo, .-foo
ARM 4.4.0 -Os:

foo:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
mov r0, r0, lsr #10
and r0, r0, #1
bx  lr


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45416

[Bug c/45416] Code size regression between 4.6(4.5) and 4.4

2010-08-26 Thread bmei at broadcom dot com



--- Comment #2 from bmei at broadcom dot com  2010-08-26 12:47 ---
Sorry, I first observed this on our target. Then I tried to reproduce it on x86,
but I forgot to turn on optimization flags. It does work for x86. Please delete
this report. I will figure out what is happening with my target.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45416

[Bug c/45416] New: Code size regression between 4.6(4.5) and 4.4

2010-08-26 Thread bmei at broadcom dot com

This is a performance/size regression between 4.6 (4.5) and 4.4. 

The C code:
int foo(long long a)
{
   if (a & (long long) 0x400)
  return 1;
   return 0;
}

Assembly code generated by 4.6 trunk:
foo:
.LFB0:
.cfi_startproc
pushq   %rbp
.cfi_def_cfa_offset 16
movq    %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movq    %rdi, -8(%rbp)
movq    -8(%rbp), %rax
andl    $1024, %eax
testq   %rax, %rax
je      .L2
movl    $1, %eax
jmp     .L3
.L2:
movl    $0, %eax
.L3:
popq    %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc

Assembly code generated by 4.4.0:
foo:
.LFB0:
.cfi_startproc
shrq    $10, %rdi
movl    %edi, %eax
andl    $1, %eax
ret
.cfi_endproc

After tree optimizations, the two compilers produce different
but essentially equivalent forms. The RTL expander and later passes
then go on to apply different optimizations and generate very
different code.
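
To make the comparison concrete, the two shapes can be written side by side in
C. The names foo_branch and foo_shift are hypothetical; foo_branch is the
function from this report, and foo_shift is a sketch of what the 4.4.0 listing
computes (shift right by 10, then mask with 1). Both return the same value for
every input, which is why the shorter sequence is a valid replacement.

```c
/* Branchy form, as written in the report. */
int foo_branch(long long a)
{
    if (a & (long long) 0x400)
        return 1;
    return 0;
}

/* Shift-and-mask form, mirroring the shrq/andl sequence in the 4.4.0
   listing. The cast to unsigned makes the right shift well defined for
   negative inputs. */
int foo_shift(long long a)
{
    return (int) (((unsigned long long) a >> 10) & 1);
}
```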


-- 
   Summary: Code size regression between 4.6(4.5) and 4.4
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: bmei at broadcom dot com
  GCC host triplet: x86_64-unknown-linux


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45416

[Bug c/45176] restrict qualifier is not used in a manually unrolled loop

2010-08-05 Thread bmei at broadcom dot com



--- Comment #5 from bmei at broadcom dot com  2010-08-05 13:44 ---
I tried to apply the patches Richard suggested (this one alone is not enough).
It becomes a chain of too many patches in the end. I am not confident any more
about applying them to 4.5.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45176

[Bug c/45176] New: restrict qualifier is not used in a manually unrolled loop

2010-08-04 Thread bmei at broadcom dot com

void foo (int * restrict a, int * restrict b, int * restrict c)
{
   int i;
   for(i = 0; i < 100; i+=4)
 {
   a[i] = b[i] * c[i];
   a[i+1] = b[i+1] * c[i+1];
   a[i+2] = b[i+2] * c[i+2];
   a[i+3] = b[i+3] * c[i+3];
 }
}   

The trunk x86-64 compiler (162821) produces code in which the later load
instructions are not scheduled ahead of the preceding store instructions as
expected. Clearly, the restrict qualifier is not used here.

 ~/work/install-x86/bin/gcc tst3.c -O2 -S -std=c99 -da -fschedule-insns
-frename-registers
.L2:
movl    (%rdx,%rax), %r10d
imull   (%rsi,%rax), %r10d
movl    %r10d, (%rdi,%rax)
movl    4(%rdx,%rax), %r9d
imull   4(%rsi,%rax), %r9d
movl    %r9d, 4(%rdi,%rax)
movl    8(%rdx,%rax), %r8d
imull   8(%rsi,%rax), %r8d
movl    %r8d, 8(%rdi,%rax)
movl    12(%rdx,%rax), %ecx
imull   12(%rsi,%rax), %ecx
movl    %ecx, 12(%rdi,%rax)
addq    $16, %rax
cmpq    $400, %rax

Richard has a patch and it seems to work for this example. 
Index: expr.c
===
--- expr.c  (revision 162841)
+++ expr.c  (working copy)
@@ -8665,7 +8665,7 @@ expand_expr_real_1 (tree exp, rtx target
set_mem_addr_space (temp, as);
base = get_base_address (TMR_ORIGINAL (exp));
if (base
-   && INDIRECT_REF_P (base)
+   && (INDIRECT_REF_P (base) || TREE_CODE (base) == MEM_REF)
&& TMR_BASE (exp)
&& TREE_CODE (TMR_BASE (exp)) == SSA_NAME
&& POINTER_TYPE_P (TREE_TYPE (TMR_BASE (exp

The code generated:
.L2:
movl    (%rdx,%rax), %r10d
movl    4(%rdx,%rax), %r9d
imull   (%rsi,%rax), %r10d
imull   4(%rsi,%rax), %r9d
movl    8(%rdx,%rax), %r8d
movl    12(%rdx,%rax), %ecx
imull   8(%rsi,%rax), %r8d
imull   12(%rsi,%rax), %ecx
movl    %r10d, (%rdi,%rax)
movl    %r9d, 4(%rdi,%rax)
movl    %r8d, 8(%rdi,%rax)
movl    %ecx, 12(%rdi,%rax)
addq    $16, %rax
cmpq    $400, %rax
jne     .L2
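
At source level, the schedule in the patched output corresponds roughly to the
sketch below: all loads and multiplies of one iteration first, then the four
stores. The name foo_scheduled and the temporaries t0..t3 are hypothetical and
only serve to show the reordering; this is not code from the patch.

```c
/* Hypothetical source-level picture of the patched schedule. */
void foo_scheduled(int * restrict a, int * restrict b, int * restrict c)
{
    int i;
    for (i = 0; i < 100; i += 4)
      {
        /* All loads and multiplies first ... */
        int t0 = b[i] * c[i];
        int t1 = b[i+1] * c[i+1];
        int t2 = b[i+2] * c[i+2];
        int t3 = b[i+3] * c[i+3];
        /* ... then the four stores. */
        a[i] = t0;
        a[i+1] = t1;
        a[i+2] = t2;
        a[i+3] = t3;
      }
}
```

The reordering is only valid because of restrict. If a overlapped b (say
a == b + 1), the store to a[i] in the original loop would feed the following
load of b[i+1], and hoisting that load above the store would change the result.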


-- 
   Summary: restrict qualifier is not used in a manually unrolled
loop
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
    ReportedBy: bmei at broadcom dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45176

[Bug c/44365] New: ICE with -fdump-tree-all

2010-06-01 Thread bmei at broadcom dot com

GCC produces an ICE for the following code with -fdump-tree-all. This happens
in both 4.4.x and 4.5.0. It is caused by an infinitely recursive call to
dump_generic_node (tree-pretty-print.c).

gcc t.c -fdump-tree-all


int main(int argc, char *argv[]){
int n;
if(argc==2)
n=atoi(argv[1]);
else{
exit(1);
}

#define offset(x,y) ((char *)&(x->y))-((char *)x)

struct {
int a[n];
char b[n];
char c;
}*bar;
printf("%d %d %d %d\n",offset(bar,a[0]),offset(bar,b[0]),offset(bar,c),sizeof(*bar));
return 0;
}


-- 
   Summary: ICE with -fdump-tree-all
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: c
AssignedTo: unassigned at gcc dot gnu dot org
    ReportedBy: bmei at broadcom dot com
GCC target triplet: x86_64-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44365

[Bug lto/41376] collect2 does not handle static libraries

2010-05-24 Thread bmei at broadcom dot com



--- Comment #10 from bmei at broadcom dot com  2010-05-24 13:29 ---
Annotating functions with externally_visible sounds a bit difficult to
maintain. The programmer needs to know whether a function is used outside of the
LTO objects. This can change over time, and extra effort is needed to keep it
correct. It would be better if GCC could derive that info with -fwhole-program,
whether it is dealing with LTO object files only or with LTO and regular object
files, since it should have all the symbol reference information by then.
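
For reference, the annotation under discussion looks like this. The function
name firmware_entry is hypothetical; the assumption is that it is called from
hand-written assembly or from an object that is not part of the LTO link, so
-fwhole-program must not treat it as local. The maintenance burden described
above comes from having to keep such annotations in sync with the real set of
outside callers.

```c
/* Hypothetical entry point, assumed to be referenced from non-LTO code.
   externally_visible is a GCC function attribute that keeps the symbol
   visible outside the current unit under -fwhole-program. */
__attribute__((externally_visible))
int firmware_entry(int x)
{
    return x + 1;
}
```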


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41376

[Bug lto/41376] collect2 does not handle static libraries

2010-05-24 Thread bmei at broadcom dot com



--- Comment #8 from bmei at broadcom dot com  2010-05-24 09:31 ---
I integrated Dave's patch into LD with some modifications (it only emits those with
LTO sections) and hacked collect2 to support that. The size gain from LTO, our
main concern, is quite limited for our application. A large number of functions
that are called only once cannot be inlined across files because the compiler
doesn't know whether they are referenced from non-LTO code (mostly hand-coded
assembly in our case). We really need a full resolution file, especially the
LDPR_PREVAILING_DEF_IRONLY type. Next I will try to make LD emit a full
resolution file.

GNU LD doesn't have plugin support like GOLD, so won't any changes here be
too invasive or too LTO-specific to be accepted by LD? We are fine with living
with that in our private port.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41376

[Bug lto/41376] collect2 does not handle static libraries

2010-05-04 Thread bmei at broadcom dot com



--- Comment #6 from bmei at broadcom dot com  2010-05-04 16:54 ---
> So this is a rough first draft of the-kind-of-thing-i-was-thinking-of.  We get
> collect2 to run a dummy link early, and extract the output from the
> --lto-assist flag to get a list of archive members that we need lto to
> recompile for us.
> 

Well, I spent some time reading the collect2/lto code to understand the pros and
cons of the different approaches. So far, the approach of adding --lto-assist to
ld and hacking collect2 looks reasonable to me, though it does require GNU ld.
What extra info should be in a complete symbol resolution file?


-- 

bmei at broadcom dot com changed:

   What|Removed |Added

 CC|        |bmei at broadcom dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41376

[Bug middle-end/34668] [4.3 Regression] ICE in find_compatible_field with -combine

2010-03-09 Thread bmei at broadcom dot com



--- Comment #12 from bmei at broadcom dot com  2010-03-09 14:20 ---
It seems that this bug still reproduces on my build:

~/work/install-x86/bin/gcc 
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-1.c
--combine -O2
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-2.c
-S -o pr34668-1.s

/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-2.c:
In function 'set_conv_libfunc':
/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-2.c:5:15:
error: type mismatch in array reference
struct optab

struct optab

# .MEM_3 = VDEF <.MEM_1(D)>
optab_table[0].code = 57005;

/projects/firepath/tools/work/bmei/gcc-head/src/gcc/testsuite/gcc.dg/pr34668-2.c:5:15:
internal compiler error: verify_stmts failed
...

My build is revision 143368, target x86_64-unknown-linux-gnu. 
../trunk/configure --prefix=/home/bmei/work/install-x86
--with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/x86-64
--with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/x86-64
--with-mpc=/projects/firepath/tools/work/bmei/packages/mpc/0.8.1/x86-64
--enable-languages=c,c++ --disable-bootstrap : (reconfigured)
../trunk/configure --prefix=/home/bmei/work/install-x86
--with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/x86-64
--with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/x86-64
--with-mpc=/projects/firepath/tools/work/bmei/packages/mpc/0.8.1/x86-64
--enable-languages=c --disable-bootstrap : (reconfigured) ../trunk/configure
--prefix=/home/bmei/work/install-x86
--with-mpfr=/projects/firepath/tools/work/bmei/packages/mpfr/2.4.1/x86-64
--with-gmp=/projects/firepath/tools/work/bmei/packages/gmp/4.3.0/x86-64
--with-mpc=/projects/firepath/tools/work/bmei/packages/mpc/0.8.1/x86-64
--disable-bootstrap CC='gcc -static' CFLAGS='-g -O0' --enable-languages=c
--no-create --no-recursion


-- 

bmei at broadcom dot com changed:

   What|Removed |Added

 CC|            |bmei at broadcom dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34668

[Bug tree-optimization/43220] New: Paritially optimized __builtin_save_stack/__builtin_restore_stack causes segmentation fault

2010-03-01 Thread bmei at broadcom dot com

I encountered a segmentation fault when executing an unrolled version of
20040811-1.c (tested with -O2)

void *volatile p;

int
main (void)
{
  int n = 0;
 lab:;
  {
  int x[n % 1000 + 1];
  x[0] = 1;
  x[n % 1000] = 2;
  p = x;
  n++;
  }


  {
  int x[n % 1000 + 1];
  x[0] = 1;
  x[n % 1000] = 2;
  p = x;
  n++;
  }

  if (n < 100)
goto lab;

  return 0;
}

The problem is that the first pair of
__builtin_stack_save/__builtin_stack_restore calls in the unrolled loop is
optimized out in optimize_stack_restore (tree-ssa-ccp.c) in the fab pass.
Consequently, the dynamically allocated stack memory keeps growing and
eventually causes a segfault. The following is from tst.c.139t.optimized:


lab:
  saved_stack.1_3 = 0B;
  D.2723_4 = n_1 % 1000;
  D.2724_5 = D.2723_4 + 1;
  D.2728_15 = (long unsigned int) D.2724_5;
  D.2730_16 = D.2728_15 * 4;
  D.2732_17 = __builtin_alloca (D.2730_16);
  x.0_18 = (int[0:D.2727] *) D.2732_17;
  (*x.0_18)[0] = 1;
  (*x.0_18)[D.2723_4] = 2;
  p ={v} x.0_18;
  D.2770_66 = (unsigned int) n_1;
  D.2771_65 = D.2770_66 + 1;
  n_64 = (int) D.2771_65;
  GIMPLE_NOP
  saved_stack.3_21 = __builtin_stack_save ();
  D.2723_22 = n_64 % 1000;
  D.2734_23 = D.2723_22 + 1;
  D.2738_33 = (long unsigned int) D.2734_23;
  D.2740_34 = D.2738_33 * 4;
  D.2742_35 = __builtin_alloca (D.2740_34);
  x.2_36 = (int[0:D.2737] *) D.2742_35;
  (*x.2_36)[0] = 1;
  (*x.2_36)[D.2723_22] = 2;
  p ={v} x.2_36;
  D.2773_62 = D.2770_66 + 2;
  n_61 = (int) D.2773_62;
  __builtin_stack_restore (saved_stack.3_21);
  if (n_61 != 100)
goto  (lab);
  else
goto ;


-- 
   Summary: Paritially optimized
__builtin_save_stack/__builtin_restore_stack causes
segmentation fault
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: bmei at broadcom dot com
GCC target triplet: x86_64-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43220

[Bug c/43098] New: ICE in tree-sra.c with floating point exception

2010-02-16 Thread bmei at broadcom dot com

GCC (156804, x86_64-unknown-linux-gnu) generates an ICE when compiling the
following code.

typedef __builtin_va_list va_list;


struct __attribute__((aligned (4))) S238 { struct{}a[24]; short b; } ; 
struct __attribute__((aligned (4))) S238 a238[5]; 

extern int fails;

void foo (int z, ...) { 
  struct __attribute__((aligned (4))) S238 arg, *p; 
  va_list ap; 
  int i; 

__builtin_va_start(ap,z);
for (i = 0; i < 5; ++i) 
{ 
  p = ((void *)0); 
  p = &a238[2]; 
  arg = __builtin_va_arg(ap,struct __attribute__((aligned (4))) S238); 
  if (p->b != arg.b) ++fails; 
}
 __builtin_va_end(ap); 
}

~/work/install-x86/bin/gcc t001_y.c -O2 -w
t001_y.c: In function 'foo':
t001_y.c:24:1: internal compiler error: Floating point exception
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

The error happens in tree-sra.c:1445, where el_size is 0
offset = offset % el_size;

It is likely caused by the following change:

   if (lacc && racc
   && (sra_mode == SRA_MODE_EARLY_INTRA || sra_mode == SRA_MODE_INTRA)
   && !lacc->grp_unscalarizable_region
@@ -1288,7 +1398,12 @@
  if (!tr_size || !host_integerp (tr_size, 1))
continue;
  size = tree_low_cst (tr_size, 1);
- if (pos > offset || (pos + size) <= offset)
+ if (size == 0)
+   {
+ if (pos != offset)
+   continue;
+   }
+ else if (pos > offset || (pos + size) <= offset)
continue;


Here, size = 0, pos = 0, offset = 0. So "continue" was executed before this
change, but it is not with this patch, which causes the ICE later. I am not sure
what the intention of the patch is, so I will leave it to others to fix.
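
The change in behaviour can be checked in isolation by lifting the two
conditions out of the diff. The function names skips_field_before and
skips_field_after are hypothetical, as is the use of unsigned long for the
three values; each function returns 1 where the loop would execute "continue".

```c
/* Condition before the change: skip the field unless
   pos <= offset < pos + size. */
int skips_field_before(unsigned long pos, unsigned long size,
                       unsigned long offset)
{
    return pos > offset || (pos + size) <= offset;
}

/* Condition after the change: a zero-sized field is no longer skipped
   when it starts exactly at offset. */
int skips_field_after(unsigned long pos, unsigned long size,
                      unsigned long offset)
{
    if (size == 0)
        return pos != offset;
    return pos > offset || (pos + size) <= offset;
}
```

With pos = 0, size = 0, offset = 0, the first function returns 1 (the field is
skipped) and the second returns 0 (the field is processed), which is how a
zero-sized element reaches the later offset % el_size.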


-- 
   Summary: ICE in tree-sra.c  with floating point exception
   Product: gcc
   Version: 4.5.0
Status: UNCONFIRMED
  Severity: major
  Priority: P3
 Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: bmei at broadcom dot com
GCC target triplet: x86_64-unknown-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43098

[Bug rtl-optimization/36712] Inefficient loop unrolling

2009-05-21 Thread bmei at broadcom dot com



--- Comment #6 from bmei at broadcom dot com  2009-05-21 08:38 ---
I have only submitted small patches before. Adding a pass (which may need a new
command-line option and disabling the old rtl-level unrolling) seems to be a big
undertaking to me. I don't know what the procedure is.

My code also contains my own implementation of #pragma unroll. I need to clean
it up for the public patch. 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712

[Bug rtl-optimization/36712] Inefficient loop unrolling

2009-05-20 Thread bmei at broadcom dot com



--- Comment #4 from bmei at broadcom dot com  2009-05-20 14:17 ---
I implemented a tree-level loop-unrolling pass in our private port, which
takes advantage of the later tree ivopt pass. It produces much better code than
rtl-level loop unrolling in such scenarios. Not sure whether I should submit it
for 4.5.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712

[Bug rtl-optimization/36712] New: Inefficient loop unrolling

2008-07-03 Thread bmei at broadcom dot com

are/install/bin/arm-elf-as"
LD_FOR_TARGET="/home/aashley/work/sourceware/install/bin/arm-elf-ld"
../src/configure --prefix=/home/bmei/work/trunck-arm --enable-languages=c
--disable-nls --target=arm-elf  --disable-shared
--with-mpfr=/projects/firepath/tools/team/packages/x86_64-rhel3-32/mpfr/2.3.0
--with-gmp=/projects/firepath/tools/team/packages/x86_64-rhel3-32/gmp/4.2.2
--disable-libssp


-- 
   Summary: Inefficient loop unrolling
   Product: gcc
   Version: 4.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: bmei at broadcom dot com
GCC target triplet: arm-elf-gcc


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36712
