from:"siarhei.siamashka at gmail dot com"

[Bug d/102765] [11 Regression] GDC11 stopped inlining library functions and lambdas used by a binary search one-liner code

2022-01-31 Thread siarhei.siamashka at gmail dot com via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102765

--- Comment #4 from Siarhei Siamashka  ---
First of all, it's my own fault for not simply bisecting the GDC code from day
one to figure out all the relevant details many months earlier. The code is
large and takes a long time to compile, so I was lazy. I apologise for this.

Now the comments from
https://forum.dlang.org/thread/sspkdp$1m4n$1...@digitalmars.com
have provided some important missing information. I may still be wrong, so
please correct me if necessary, but the root cause of this performance
regression appears to be an attempt to fix the actual problem, PR104317, in
GDC11 via the excessively invasive PR99914, which ended up evolving GDC in the
wrong direction.

Just imagine someone encountering something like the examples from
https://stackoverflow.com/questions/3691835/why-uninitialized-global-variable-is-weak-symbol
and then suddenly drawing the strange conclusion that all template functions
should be non-inlineable in a C++ compiler (unless LTO is enabled). It looks
like that's exactly what happened to GDC. The D language standard documentation
is incomplete, which doesn't help. But the developers of the other D compilers
seem to be of the opinion that inlining template functions is okay (due to the
same or at least similar ODR rules as in C++).

[Bug d/104317] D language: rt.config module doesn't work as expected in GDC 9/10 (multiple definition linker error)

2022-01-31 Thread siarhei.siamashka at gmail dot com via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104317

Siarhei Siamashka  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |siarhei.siamashka at gmail dot com

--- Comment #2 from Siarhei Siamashka  ---
Created attachment 52322
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52322&action=edit
proof of concept patch for gdc10

The attached proof of concept patch for GDC10 fixes the problem in a much less
invasive way. The idea is to just use weak attributes for global variables in
druntime instead of enclosing them in a "template {}" block.
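
For readers more familiar with C, the weak-attribute idea can be sketched
roughly as follows. This is only an analogy in C, not the druntime patch
itself; the names demo_cmdline_enabled and demo_should_parse_cmdline are
hypothetical, and the behaviour described assumes GCC or Clang on an ELF
target.

```c
#include <assert.h>
#include <stdbool.h>

/* Library side (hypothetical name): a weak default definition.
 * If another object file provides a normal (strong) definition of the
 * same symbol, the linker uses that one instead of reporting a
 * "multiple definition" error. */
__attribute__((weak)) bool demo_cmdline_enabled = true;

/* Library code reads the variable as usual. */
bool demo_should_parse_cmdline(void)
{
    return demo_cmdline_enabled;
}
```

With no overriding definition linked in, the weak default is what the library
sees. An application that defines the same variable normally would replace it
at link time.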

A preliminary pull request for upstream druntime is tracked here:
https://github.com/dlang/druntime/pull/3716

The same simple fix also works fine for GDC11 if we undo PR99914:
https://gist.github.com/ssvb/d8a67fb445e96f9e66d0516a3ba62475

I first tried to toggle "flag_weak_templates" in "gcc/d/lang.opt" from 1 to 0
in GDC11 instead of reverting PR99914, but the resulting toolchain was unable
to compile and link even the most simple applications due to missing symbols
from Phobos.

The part preventing undesirable removal of cmdline arguments is cherry picked
from:
https://github.com/dlang/druntime/commit/ae9581c1e4b96de6707c71eb45dcc9c10dd4d402

[Bug d/104317] D language: rt.config module doesn't work as expected in GDC 9/10 (multiple definition linker error)

2022-01-31 Thread siarhei.siamashka at gmail dot com via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104317

--- Comment #1 from Siarhei Siamashka  ---
An attempted fix for the linker error had been introduced in GDC11 via:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99914

But it made function templates non-inlineable as a side effect:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102765

Also cmdline arguments with "--DRT-" prefix are still incorrectly filtered out:

$ gdc-11.2.0 test.d && ./a.out --DRT-this-cmdline-argument-should-not-be-filtered-out
["./a.out"]

It would be useful to have a better fix for this problem.

[Bug d/104317] New: D language: rt.config module doesn't work as expected in GDC 9/10 (multiple definition linker error)

2022-01-31 Thread siarhei.siamashka at gmail dot com via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104317

Bug ID: 104317
   Summary: D language: rt.config module doesn't work as expected
in GDC 9/10 (multiple definition linker error)
   Product: gcc
   Version: 10.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: siarhei.siamashka at gmail dot com
  Target Milestone: ---

The rt.config module provides a set of configuration variables with various
ways to override them as documented here:
   https://dlang.org/phobos/rt_config.html

The following small application can be used to test it:

import std.stdio;
extern(C) __gshared bool rt_cmdline_enabled = false;
void main(string[] args) { writeln(args); }

== Expected correct result: ==

$ gdc test.d && ./a.out --DRT-this-cmdline-argument-should-not-be-filtered-out
["./a.out", "--DRT-this-cmdline-argument-should-not-be-filtered-out"]

== Got: ==

$ gdc-9.3.0 test.d
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/../../../../x86_64-pc-linux-gnu/bin/ld:
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/libgdruntime.a(lt8-config.o):/var/tmp/portage/sys-devel/gcc-9.3.0-r1/work/gcc-9.3.0/libphobos/libdruntime/rt/config.d:48:
multiple definition of `rt_cmdline_enabled'; /tmp/ccvDzGs7.o:(.bss+0x0): first
defined here
collect2: error: ld returned 1 exit status

$ gdc-12.0.1 test.d && ./a.out --DRT-this-cmdline-argument-should-not-be-filtered-out
["./a.out", "--DRT-this-cmdline-argument-should-not-be-filtered-out"]

[Bug d/102765] [11 Regression] GDC11 stopped inlining library functions and lambdas used by a binary search one-liner code

2021-12-08 Thread siarhei.siamashka at gmail dot com via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102765

--- Comment #3 from Siarhei Siamashka  ---
Thanks for the explanations. Is there a small example which demonstrates
template inlining causing a real practical problem for older versions of GDC?
A link to a bug tracker, commit message, mailing list post, forum thread or any
other source of information would be very welcome. How is LDC able to work
around this without sacrificing template inlining and without enforcing the
use of LTO?

It's also good to know about `-fno-weak-templates`. If it just reverts to the
old behaviour, then it's probably somewhat less risky than `-flto` for those
who are just upgrading from older versions of GDC and don't want any
surprises.

[Bug tree-optimization/103615] New: [8/9 Regression] wrong code with "-O3" or "-O1 -ftree-vectorize" on x86_64-pc-linux-gnu

2021-12-08 Thread siarhei.siamashka at gmail dot com via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103615

Bug ID: 103615
   Summary: [8/9 Regression] wrong code with "-O3" or "-O1
-ftree-vectorize" on x86_64-pc-linux-gnu
   Product: gcc
   Version: 9.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: siarhei.siamashka at gmail dot com
  Target Milestone: ---

A reduced testcase from https://codeforces.com/blog/entry/97433

$ cat test.c
int z = 5;
int a[6] = {0, 0, 0, 0, 0, 1};
int main() {
  for (int x = 5; x; x--)
    for (int y = z; y >= x; y--)
      a[y - x] += a[y];
  if (a[0] != 7)
    __builtin_abort ();
  return 0;
}

$ gcc-7.3.0 -O3 test.c && ./a.out

$ gcc-8.3.0 -O3 test.c && ./a.out
Aborted

$ gcc-9.3.0 -O3 test.c && ./a.out
Aborted

$ gcc-9.4.1 -O3 test.c && ./a.out
Aborted

$ gcc-10.3.0 -O3 test.c && ./a.out
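
The expected value 7 can be cross-checked with a reference version of the same
loop. The sketch below is not part of the original report: the function name
reference_a0 is hypothetical, and the array is made volatile on the assumption
that this keeps the compiler from vectorizing the accesses, so the recurrence
runs in plain program order.

```c
#include <assert.h>

/* Same recurrence as the reduced testcase, but on a volatile array so
 * that every load and store is performed in program order. */
int reference_a0(void)
{
    volatile int a[6] = {0, 0, 0, 0, 0, 1};
    int z = 5;

    for (int x = 5; x; x--)
        for (int y = z; y >= x; y--)
            a[y - x] += a[y];

    return a[0];
}
```

Tracing the loop by hand gives a[0] = 1, 1, 1, 2, 7 after the iterations
x = 5, 4, 3, 2, 1, which matches the check in the testcase.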

Only GCC versions 8.x and 9.x are affected and the bug is triggered by "-O3" or
"-O1 -ftree-vectorize" optimization option.

[Bug d/102765] New: [11 Regression] GDC11 stopped inlining library functions and lambdas used by a binary search one-liner code

2021-10-14 Thread siarhei.siamashka at gmail dot com via Gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102765

Bug ID: 102765
   Summary: [11 Regression] GDC11 stopped inlining library
functions and lambdas used by a binary search
one-liner code
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: siarhei.siamashka at gmail dot com
  Target Milestone: ---

The performance of the following simple binary search code regressed a lot
starting from GDC11:

/***/
import std.algorithm, std.range, std.stdio, std.stdint;

// calculate integer square root using binary search
int64_t isqrt(int64_t x) {
  return iota(0, min(x, 3037000499) + 1)
 .map!(v => (v * v > x))
 .assumeSorted.lowerBound(true)
 .length - 1;
}

// print the sum of 20M square roots
void main() { 20_000_000.iota.map!isqrt.sum.writeln; }
/***/

$ gdc-6.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out 
59618479180

real    0m1.924s
user    0m1.924s
sys     0m0.000s

$ gdc-9.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out 
59618479180

real    0m2.100s
user    0m2.099s
sys     0m0.000s

$ gdc-10.3.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out 
59618479180

real    0m1.776s
user    0m1.776s
sys     0m0.000s

$ gdc-11.2.0 -g -O3 -frelease -fno-bounds-check test.d && time ./a.out 
59618479180

real    0m6.889s
user    0m6.887s
sys     0m0.000s


My expectation is that the compilers should inline everything here and generate
code for a small and efficient binary search loop. But GDC11 stopped doing
this, as can be confirmed by running "perf record ./a.out && perf report":

27.86%  a.out  a.out  [.]
_D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc__T18getTransitionIndexVEQGrQGq12SearchPolicyi3SQHoQHn__TQHkTQHaVQDha5_61203c2062ZQIj3geqTbZQDlMFNaNbNiNfbZm
15.02%  a.out  a.out  [.]
_D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc__T3geqTbTbZQjMFNaNbNiNfbbZb
10.34%  a.out  a.out  [.]
_D3std9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQCm5range__T4iotaTiTlZQkFilZ6ResultZQCv7opIndexMFNaNbNiNfmZb
10.31%  a.out  a.out  [.]
_D3std10functional__T9binaryFunVAyaa5_61203c2062VQra1_61VQza1_62Z__TQBvTbTbZQCdFNaNbNiNfKbKbZb
 3.03%  a.out  a.out  [.]
_D3std5range__T4iotaTiTlZQkFilZ6Result7opIndexMNgFNaNbNiNfmZNgl
 2.34%  a.out  a.out  [.] 0x00031a09
 2.28%  a.out  a.out  [.]
_D4core6atomic__T7casImplTmTxmTmZQqFNaNbNiNePOmxmmZb
 2.11%  a.out  a.out  [.]
_D3std5range__T11SortedRangeTSQBc9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQDnQDm__T4iotaTiTlZQkFilZ6ResultZQCsVAyaa5_61203c2062ZQFc7opSliceMFNaNbNiNfmmZSQGoQGn__TQGkTQGaVQCha5_61203c2062ZQHj
 2.02%  a.out  a.out  [.]
_D3std5range__T12assumeSortedVAyaa5_61203c2062TSQBu9algorithm9iteration__T9MapResultS4test5isqrtFlZ9__lambda2TSQEfQEe__T4iotaTiTlZQkFilZ6ResultZQCsZQFdFNaNbNiNfQEjZSQGhQGg__T11SortedRangeTQFlVQGga5_61203c2062ZQBj


Using either -fwhole-program or -flto cmdline options resolves the performance
problem and allows all of these functions to be inlined again:

$ gdc-11.2.0 -g -O3 -frelease -fno-bounds-check -flto test.d && time ./a.out 
59618479180

real    0m2.085s
user    0m2.085s
sys     0m0.000s


But is this expected? Does GDC now require using -flto option for getting
reasonable performance starting from version 11? Or is this a real performance
regression and something can be done to improve the inlining behaviour?

[Bug c/93893] New: MIPS32r2: GCC is unable to figure out that it can use a single INS instruction instead of SLL+OR

2020-02-23 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93893

Bug ID: 93893
   Summary: MIPS32r2: GCC is unable to figure out that it can use
a single INS instruction instead of SLL+OR
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: siarhei.siamashka at gmail dot com
  Target Milestone: ---

Created attachment 47891
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47891&action=edit
Testcase for a single INS instruction vs. SLL+OR

$ mipsel-unknown-linux-gnu-gcc -c -Os -march=mips32r2 testcase.c 
$ mipsel-unknown-linux-gnu-objdump -d testcase.o

testcase.o: file format elf32-tradlittlemips


Disassembly of section .text:

 :
   0:   8c820000    lw      v0,0(a0)
   4:   94a30000    lhu     v1,0(a1)
   8:   00021400    sll     v0,v0,0x10
   c:   00431025    or      v0,v0,v1
  10:   03e00008    jr      ra
  14:   acc20000    sw      v0,0(a2)

0018 :
  18:   8ca20000    lw      v0,0(a1)
  1c:   8c830000    lw      v1,0(a0)
  20:   7c62fc04    ins     v0,v1,0x10,0x10
  24:   03e00008    jr      ra
  28:   acc20000    sw      v0,0(a2)
  2c:   00000000    nop


The C implementation uses an extra instruction compared to the inline assembly
variant of the same function.
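
The attached testcase is not reproduced in this message, so the following is
only a guess at its C half, reconstructed from the first disassembled function
(lw, lhu, sll, or, sw). The names combine_c and insert_bits_16_16 and the
parameter types are assumptions. The second helper spells out in portable C
what the "ins v0,v1,0x10,0x10" instruction computes.

```c
#include <assert.h>
#include <stdint.h>

/* Plain C: shift one word left by 16 and OR in a 16-bit value.
 * This is the pattern that was compiled to SLL + OR above. */
void combine_c(const uint32_t *hi, const uint16_t *lo, uint32_t *out)
{
    *out = (*hi << 16) | *lo;
}

/* Effect of "ins dst, src, 16, 16": the low 16 bits of src replace
 * bits 16..31 of dst, and bits 0..15 of dst are kept. */
uint32_t insert_bits_16_16(uint32_t dst, uint32_t src)
{
    return (dst & 0x0000ffffu) | (src << 16);
}
```

Both forms produce the same word whenever the low half of the destination
already holds the 16-bit value, which is why a single INS could replace the
SLL+OR pair.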

[Bug target/53659] ARM: Using -mcpu=cortex-a9 option results in bad performance for Cortex-A9 processor in C-Ray phoronix benchmark

2017-01-25 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53659

--- Comment #8 from Siarhei Siamashka  ---
Since my report predates bug 68664 by several years, shouldn't bug 68664 be a
duplicate? In addition, my report was much more detailed, since it also
provided a practical use case, showcasing the importance of this problem.

Also, if I understand it correctly, you have still not fixed the issue, so
closing it seems a bit premature. I'll keep a watch on bug 68664 and will be
sure to reopen my bug report if the fix does not help on ARM Cortex-A9.

Thanks for generating some sort of activity anyway. It's surely better than
nothing.

[Bug rtl-optimization/64208] New: [4.9 Regression][iwmmxt] ICE: internal compiler error: Max. number of generated reload insns per insn is achieved (90)

2014-12-06 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64208

Bug ID: 64208
   Summary: [4.9 Regression][iwmmxt] ICE: internal compiler error:
Max. number of generated reload insns per insn is
achieved (90)
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: siarhei.siamashka at gmail dot com

GCC 4.9.2

$ arm-none-linux-gnueabi-gcc -c -O1 -march=iwmmxt test.c
test.c: In function 'x2':
test.c:16:1: internal compiler error: Max. number of generated reload insns per
insn is achieved (90)

The test program itself (generated by creduce):

//
long long x6(void);
void x7(long long, long long);
void x8(long long);

int x0;
long long *x1;

void x2(void) {
  long long *x3 = x1;
  while (x1) {
    long long x4 = x0, x5 = x6();
    x7(x4, x5);
    x8(x5);
    *x3 = 0;
  }
}
//

[Bug target/64172] [4.9/5 Regression] Wrong code with GCC vector extensions on ARM when compiled without NEON

2014-12-04 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64172

--- Comment #6 from Siarhei Siamashka  ---
(In reply to ktkachov from comment #4)
> I can't reproduce with -O2 and -mfpu=neon.
> Can you please give the exact configuration of your GCC?
> The output of 'arm-none-linux-gnueabi-gcc -v' should be good

My apologies, you are right. Adding -march=armv7-a option appears to be also
needed ("-O2 -march=armv7-a") and I had it as part of my GCC configuration:

Using built-in specs.
COLLECT_GCC=arm-none-linux-gnueabi-gcc-4.9.2
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/arm-none-linux-gnueabi/4.9.2/lto-wrapper
Target: arm-none-linux-gnueabi
Configured with:
/var/tmp/portage/cross-arm-none-linux-gnueabi/gcc-4.9.2/work/gcc-4.9.2/configure
--host=x86_64-pc-linux-gnu --target=arm-none-linux-gnueabi
--build=x86_64-pc-linux-gnu --prefix=/usr
--bindir=/usr/x86_64-pc-linux-gnu/arm-none-linux-gnueabi/gcc-bin/4.9.2
--includedir=/usr/lib/gcc/arm-none-linux-gnueabi/4.9.2/include
--datadir=/usr/share/gcc-data/arm-none-linux-gnueabi/4.9.2
--mandir=/usr/share/gcc-data/arm-none-linux-gnueabi/4.9.2/man
--infodir=/usr/share/gcc-data/arm-none-linux-gnueabi/4.9.2/info
--with-gxx-include-dir=/usr/lib/gcc/arm-none-linux-gnueabi/4.9.2/include/g++-v4
--with-python-dir=/share/gcc-data/arm-none-linux-gnueabi/4.9.2/python
--enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror
--with-system-zlib --disable-nls --enable-checking=release
--with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.9.2'
--enable-libstdcxx-time --enable-poison-system-directories
--with-sysroot=/usr/arm-none-linux-gnueabi --disable-bootstrap
--enable-__cxa_atexit --enable-clocale=gnu --disable-multilib --disable-altivec
--disable-fixed-point --disable-libgcj --disable-libgomp --disable-libmudflap
--disable-libssp --disable-libquadmath --enable-lto --without-cloog
--enable-libsanitizer --with-arch=armv7-a --with-float=hard
Thread model: posix
gcc version 4.9.2 (Gentoo 4.9.2)

[Bug target/64172] [4.9/5 Regression] Wrong code with GCC vector extensions on ARM when compiled without NEON

2014-12-04 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64172

--- Comment #3 from Siarhei Siamashka  ---
(In reply to Richard Biener from comment #2)
> So it works with GCC 4.8?

Yes, the testcase works with GCC 4.8. It started to fail only with GCC 4.9 and
only on ARM hardware. Originally reported at
https://bugs.freedesktop.org/show_bug.cgi?id=81229

This looks like some sort of stack corruption, because the assert catches
modification of a part of data which is not supposed to be changed in this
particular testcase.

[Bug target/64172] [4.9 Regression] Wrong code with GCC vector extensions on ARM when compiled without NEON

2014-12-03 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64172

Siarhei Siamashka  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |siarhei.siamashka at gmail dot com

--- Comment #1 from Siarhei Siamashka  ---
Created attachment 34183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=34183&action=edit
partially reduced testcase

GCC 4.9.2

== Fails: ==

$ arm-none-linux-gnueabi-gcc -O2 preliminary-testcase-pr64172.c && ./a.out
a.out: preliminary-testcase-pr64172.c:91: main: Assertion `prng_state.d[3] ==
0xA7223834' failed.

== Works: ==

$ arm-none-linux-gnueabi-gcc -O1 preliminary-testcase-pr64172.c && ./a.out

$ arm-none-linux-gnueabi-gcc -O2 -mfpu=neon preliminary-testcase-pr64172.c && ./a.out

[Bug target/64172] New: [4.9 Regression] Wrong code with GCC vector extensions on ARM when compiled without NEON

2014-12-03 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64172

Bug ID: 64172
   Summary: [4.9 Regression] Wrong code with GCC vector extensions
on ARM when compiled without NEON
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: siarhei.siamashka at gmail dot com

The attached partially reduced testcase misbehaves at runtime if it is compiled
using GCC 4.9.2 on ARM with -O2 optimizations and without -mfpu=neon. Reducing
optimizations to -O1 helps.

[Bug tree-optimization/61299] [4.9/4.10 Regression] Performance regression for the SIMD rotate operation with GCC vector extensions

2014-05-24 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61299

--- Comment #3 from Siarhei Siamashka  ---
(In reply to Marc Glisse from comment #2)
> That's PR 57233 I believe.

Oh, sorry for the duplicate. Don't know how I missed it when searching for
similar bugs.

[Bug tree-optimization/61299] New: [4.9 Regression] Performance regression for the SIMD rotate operation with GCC vector extensions

2014-05-23 Thread siarhei.siamashka at gmail dot com

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61299

Bug ID: 61299
   Summary: [4.9 Regression] Performance regression for the SIMD
rotate operation with GCC vector extensions
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: siarhei.siamashka at gmail dot com

A small test:

/**/
typedef unsigned int uint32x4 __attribute__ ((vector_size(16)));
typedef struct { uint32x4 a, b; } prng_t;
void foo(prng_t *x)
{
x->a ^= ((x->b << 17) ^ (x->b >> (32 - 17)));
}
/**/

Gets compiled into the following slow code with GCC 4.9 (CFLAGS="-O3"):

 :
   0:   66 0f 6f 47 10          movdqa 0x10(%rdi),%xmm0
   5:   66 0f 70 c8 55          pshufd $0x55,%xmm0,%xmm1
   a:   66 0f 7e c0             movd   %xmm0,%eax
   e:   c1 c8 0f                ror    $0xf,%eax
  11:   89 44 24 e8             mov    %eax,-0x18(%rsp)
  15:   66 0f 7e c8             movd   %xmm1,%eax
  19:   66 0f 6f c8             movdqa %xmm0,%xmm1
  1d:   c1 c8 0f                ror    $0xf,%eax
  20:   66 0f 6a c8             punpckhdq %xmm0,%xmm1
  24:   89 44 24 ec             mov    %eax,-0x14(%rsp)
  28:   66 0f 70 c0 ff          pshufd $0xff,%xmm0,%xmm0
  2d:   66 0f 6e 5c 24 ec       movd   -0x14(%rsp),%xmm3
  33:   66 0f 7e c8             movd   %xmm1,%eax
  37:   c1 c8 0f                ror    $0xf,%eax
  3a:   89 44 24 f0             mov    %eax,-0x10(%rsp)
  3e:   66 0f 7e c0             movd   %xmm0,%eax
  42:   66 0f 6e 44 24 e8       movd   -0x18(%rsp),%xmm0
  48:   66 0f 6e 4c 24 f0       movd   -0x10(%rsp),%xmm1
  4e:   c1 c8 0f                ror    $0xf,%eax
  51:   66 0f 62 c3             punpckldq %xmm3,%xmm0
  55:   89 44 24 f4             mov    %eax,-0xc(%rsp)
  59:   66 0f 6e 54 24 f4       movd   -0xc(%rsp),%xmm2
  5f:   66 0f 62 ca             punpckldq %xmm2,%xmm1
  63:   66 0f 6c c1             punpcklqdq %xmm1,%xmm0
  67:   66 0f ef 07             pxor   (%rdi),%xmm0
  6b:   0f 29 07                movaps %xmm0,(%rdi)
  6e:   c3                      retq

It used to be a lot better with GCC 4.8 (CFLAGS="-O3"):

 :
   0:   66 0f 6f 4f 10  movdqa 0x10(%rdi),%xmm1
   5:   66 0f 6f c1 movdqa %xmm1,%xmm0
   9:   66 0f 72 d1 0f  psrld  $0xf,%xmm1
   e:   66 0f 72 f0 11  pslld  $0x11,%xmm0
  13:   66 0f ef c1 pxor   %xmm1,%xmm0
  17:   66 0f ef 07 pxor   (%rdi),%xmm0
  1b:   66 0f 7f 07 movdqa %xmm0,(%rdi)
  1f:   c3  retq
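
The expression in the test is a per-lane rotate left by 17 (equivalently a
rotate right by 15, which is why the slow code contains "ror $0xf"). The
following sketch, with hypothetical helper names rotl17, rotl17_vec and
lanes_match, checks that the vector form agrees with a scalar rotate on each
lane. It assumes GCC-style vector extensions and a 32-bit unsigned int.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned int uint32x4 __attribute__ ((vector_size(16)));

/* Scalar rotate-left by 17 bits. */
static uint32_t rotl17(uint32_t v)
{
    return (v << 17) | (v >> (32 - 17));
}

/* The shift/xor expression from the test, applied to a whole vector. */
static uint32x4 rotl17_vec(uint32x4 v)
{
    return (v << 17) ^ (v >> (32 - 17));
}

/* Returns 1 if every lane of the vector result matches the scalar rotate. */
int lanes_match(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32x4 v = { a, b, c, d };
    uint32x4 r = rotl17_vec(v);

    return r[0] == rotl17(a) && r[1] == rotl17(b)
        && r[2] == rotl17(c) && r[3] == rotl17(d);
}
```

The two shifted halves never overlap, so XOR and OR give the same result here.
The GCC 4.8 output keeps this as two vector shifts plus a pxor, while the GCC
4.9 output extracts each lane and rotates it in a scalar register.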

[Bug rtl-optimization/29294] 4.1, 4.2 (possibly 4.0?) not finding postmodify address mode on ARM

2012-12-19 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29294



--- Comment #10 from Siarhei Siamashka  
2012-12-20 05:47:30 UTC ---

(In reply to comment #9)

And some performance measurements (for working with L1 cache):

> $ arm-none-eabi-gcc-4.7.2 -O2 -mcpu=cortex-a8 -c test.c
> $ objdump -d test.o
>
> 00000000 <fill>:
>    0:   e2511010    subs    r1, r1, #16
>    4:   412fff1e    bxmi    lr
>    8:   e2511010    subs    r1, r1, #16
>    c:   e1c020f0    strd    r2, [r0]
>   10:   e1c020f8    strd    r2, [r0, #8]
>   14:   e2800010    add     r0, r0, #16
>   18:   5afffffa    bpl     8 <fill+0x8>
>   1c:   e12fff1e    bx      lr

Cortex-A8  - 5   cycles per iteration
Cortex-A9  - 4.5 cycles per iteration
Cortex-A15 - 3   cycles per iteration

> $ arm-none-eabi-gcc-4.8.0 -O2 -mcpu=cortex-a8 -c test.c
> $ objdump -d test.o
>
> 00000000 <fill>:
>    0:   e351000f    cmp     r1, #15
>    4:   d12fff1e    bxle    lr
>    8:   e2411010    sub     r1, r1, #16
>    c:   e280c010    add     ip, r0, #16
>   10:   e3c1100f    bic     r1, r1, #15
>   14:   e08c1001    add     r1, ip, r1
>   18:   e1c020f0    strd    r2, [r0]
>   1c:   e2800010    add     r0, r0, #16
>   20:   e14020f8    strd    r2, [r0, #-8]
>   24:   e1500001    cmp     r0, r1
>   28:   1afffffa    bne     18 <fill+0x18>
>   2c:   e12fff1e    bx      lr

Cortex-A8  - 6 cycles per iteration
Cortex-A9  - 4 cycles per iteration
Cortex-A15 - 3 cycles per iteration

While we could have expected something like the following code for the inner
loop:

1:  strd    V, [BUF], #8
    subs    N, N, #16
    strd    V, [BUF], #8
    bpl     1b

Cortex-A8  - 4 cycles per iteration
Cortex-A9  - 4 cycles per iteration
Cortex-A15 - 2.5 cycles per iteration

[Bug rtl-optimization/29294] 4.1, 4.2 (possibly 4.0?) not finding postmodify address mode on ARM

2012-12-19 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29294



Siarhei Siamashka  changed:



           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |siarhei.siamashka at gmail dot com



--- Comment #9 from Siarhei Siamashka  
2012-12-20 04:45:10 UTC ---

(In reply to comment #3)
> Actually this case should not be using post modify at all except how many bits
> does ARM have to use for an offset? I thought 16bits which means you don't need
> that at all and GCC should generate it without an increment.  Oh and this is a
> RTL opt issue.

Seems like gcc 4.7.2 and 4.8.0 20121219 (experimental) are already doing this,
which hides the postincrement issue for the currently attached testcase.

However, postincrement is still a performance problem for ARM. The code I'm
having trouble with is the following:

/***/

typedef unsigned long long T;

void fill(T *buf, int n, T v)
{
    while ((n -= 16) >= 0)
    {
        *buf++ = v;
        *buf++ = v;
    }
}

/***/
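
As a usage sketch (not part of the original comment): each loop iteration
subtracts 16 from n and stores two 8-byte values, so n appears to act as a byte
count. The helper count_filled and the fill pattern below are hypothetical and
only illustrate that reading.

```c
#include <assert.h>

typedef unsigned long long T;

/* Same routine as in the comment: 16 bytes are written per iteration. */
void fill(T *buf, int n, T v)
{
    while ((n -= 16) >= 0)
    {
        *buf++ = v;
        *buf++ = v;
    }
}

/* Fill the first 64 bytes of a 10-element buffer and count how many
 * elements now hold the pattern. */
int count_filled(void)
{
    T buf[10] = {0};
    int written = 0;

    fill(buf, 64, 0x1122334455667788ULL);
    for (int i = 0; i < 10; i++)
        written += (buf[i] == 0x1122334455667788ULL);
    return written;
}
```

With n = 64 the loop runs four times and writes eight elements, leaving the
last two entries of the buffer untouched.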



$ arm-none-eabi-gcc-4.7.2 -O2 -mcpu=cortex-a8 -c test.c
$ objdump -d test.o

00000000 <fill>:
   0:   e2511010    subs    r1, r1, #16
   4:   412fff1e    bxmi    lr
   8:   e2511010    subs    r1, r1, #16
   c:   e1c020f0    strd    r2, [r0]
  10:   e1c020f8    strd    r2, [r0, #8]
  14:   e2800010    add     r0, r0, #16
  18:   5afffffa    bpl     8 <fill+0x8>
  1c:   e12fff1e    bx      lr


$ arm-none-eabi-gcc-4.8.0 -O2 -mcpu=cortex-a8 -c test.c
$ objdump -d test.o

00000000 <fill>:
   0:   e351000f    cmp     r1, #15
   4:   d12fff1e    bxle    lr
   8:   e2411010    sub     r1, r1, #16
   c:   e280c010    add     ip, r0, #16
  10:   e3c1100f    bic     r1, r1, #15
  14:   e08c1001    add     r1, ip, r1
  18:   e1c020f0    strd    r2, [r0]
  1c:   e2800010    add     r0, r0, #16
  20:   e14020f8    strd    r2, [r0, #-8]
  24:   e1500001    cmp     r0, r1
  28:   1afffffa    bne     18 <fill+0x18>
  2c:   e12fff1e    bx      lr

[Bug target/43364] Suboptimal code for the use of ARM NEON intrinsic "vset_lane_f32"

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43364



Siarhei Siamashka  changed:



           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
      Known to fail|                            |



--- Comment #5 from Siarhei Siamashka  
2012-12-10 02:12:05 UTC ---

This seems to have improved a lot. Thanks for your hard work.



        .cpu cortex-a8
        .eabi_attribute 27, 3
        .eabi_attribute 28, 1
        .fpu neon
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "test.c"
        .text
        .align  2
        .global neon_add
        .type   neon_add, %function
neon_add:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vmov.f32        d16, #0.0  @ v2sf
        vmov    d17, d16  @ v2sf
        vld1.32 {d16[0]}, [r1]
        vld1.32 {d17[0]}, [r2]
        vadd.f32        d16, d16, d17
        vst1.32 {d16[0]}, [r0]
        bx      lr
        .size   neon_add, .-neon_add
        .ident  "GCC: (GNU) 4.8.0 20121209 (experimental)"

[Bug target/39469] Calculated values replaced with constants even if the constants cost more than the calculations

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39469



--- Comment #6 from Siarhei Siamashka  
2012-12-10 00:24:12 UTC ---

(In reply to comment #5)
> (In reply to comment #4)
> > The ARM backend should do a splitter just like the rs6000 back-end does if it
> > is faster/smaller to load a constant via the instructions.
>
> I'm not sure if rs6000 is any better. It looks just as bad as ARM, based on my
> experience trying to optimize
> http://lists.freedesktop.org/archives/pixman/2012-December/002394.html

And the testcase attached to this bug compiles to the following code with
powerpc-unknown-linux-gnu-gcc (-O2 optimizations):

        .file   "test.c"
        .section        ".text"
        .align 2
        .globl foo
        .type   foo, @function
foo:
        lis 8,0x5f5
        lis 10,array@ha
        ori 8,8,57600
        la 9,array@l(10)
        stw 8,array@l(10)
        lis 10,0xbeb
        ori 10,10,49664
        stw 10,4(9)
        lis 10,0x17d7
        ori 10,10,33792
        stw 10,8(9)
        lis 10,0x2faf
        ori 10,10,2048
        stw 10,12(9)
        blr
        .size   foo, .-foo
        .align 2
        .globl bar
        .type   bar, @function
bar:
        lis 10,array@ha
        slwi 6,3,1
        la 9,array@l(10)
        slwi 7,3,2
        slwi 8,3,3
        stw 3,array@l(10)
        stw 6,4(9)
        stw 7,8(9)
        stw 8,12(9)
        blr
        .size   bar, .-bar
        .ident  "GCC: (GNU) 4.8.0 20121209 (experimental)"
        .section        .note.GNU-stack,"",@progbits

That's 15 instructions in "foo" vs. 10 in "bar". For MIPS the difference is 16
instructions vs. 11 (I'm not posting the code because it is rather similar).



Is this really an ARM target bug?
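
The testcase itself is only attached to the bug and not quoted here. Judging
from the PowerPC assembly above, it presumably looks something like the sketch
below. The constants are read off the lis/ori pairs (0x05f5e100 = 100000000,
and so on), the element type of array is a guess, and the shifts in bar mirror
the three slwi instructions. Treat this as a reconstruction, not the actual
attachment.

```c
#include <assert.h>

unsigned int array[4];

/* Stores four constants. Each one needs a lis + ori pair on PowerPC,
 * which is where the extra instructions in "foo" come from. */
void foo(void)
{
    array[0] = 100000000;
    array[1] = 200000000;
    array[2] = 400000000;
    array[3] = 800000000;
}

/* Stores x, 2x, 4x and 8x, computed with one shift each. */
void bar(unsigned int x)
{
    array[0] = x;
    array[1] = x << 1;
    array[2] = x << 2;
    array[3] = x << 3;
}
```

Calling bar(100000000) stores the same four values as foo(), which shows that
the constants in foo could also be derived from a single loaded value with
three shifts.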

[Bug target/39469] Calculated values replaced with constants even if the constants cost more than the calculations

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39469



--- Comment #5 from Siarhei Siamashka  
2012-12-09 23:31:33 UTC ---

(In reply to comment #4)
> The ARM backend should do a splitter just like the rs6000 back-end does if it
> is faster/smaller to load a constant via the instructions.

I'm not sure if rs6000 is any better. It looks just as bad as ARM, based on my
experience trying to optimize
http://lists.freedesktop.org/archives/pixman/2012-December/002394.html

[Bug tree-optimization/55614] [4.6/4.7 Regression] vector extensions cause movdqa to be generated for memcpy on unaligned buffer

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55614



--- Comment #9 from Siarhei Siamashka  
2012-12-09 22:25:17 UTC ---

*** Bug 55454 has been marked as a duplicate of this bug. ***

[Bug target/55454] [PPC] unaligned memory accesses do not work correctly for vector extensions when using altivec

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55454



Siarhei Siamashka  changed:



           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |DUPLICATE



--- Comment #5 from Siarhei Siamashka  
2012-12-09 22:25:17 UTC ---

This appears to be a duplicate of
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55614

As for memcpy, it looks like this is indeed the preferable "portable" way of
storing vectors to unaligned memory (albeit somewhat buggy at the moment).

And ARM just happens to have a performance issue related to memcpy, but it can
be tracked elsewhere: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55634



*** This bug has been marked as a duplicate of bug 55614 ***

[Bug target/55634] New: ARM: gcc vector extensions: storing vector to unaligned memory location does not use VST1.8 NEON instruction

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55634



             Bug #: 55634
           Summary: ARM: gcc vector extensions: storing vector to
                    unaligned memory location does not use VST1.8 NEON
                    instruction
    Classification: Unclassified
           Product: gcc
           Version: 4.7.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: siarhei.siamas...@gmail.com


The following test program tries to use GCC vector extensions to add two
vectors together and store the result to unaligned memory location in a
"portable" way with memcpy:



/***/

#include <string.h>

typedef unsigned int T __attribute__ ((vector_size (16)));

void foo (void *result, T *a, T *b)
{
  T tmp = *a + *b;
  memcpy (result, &tmp, sizeof(tmp));
}

/***/
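
A possible way to exercise foo with a destination that is not guaranteed to be
16-byte aligned is sketched below. The helper unaligned_store_demo and the
input values are hypothetical. The result is read back with memcpy as well, so
the sketch itself never dereferences a misaligned pointer.

```c
#include <assert.h>
#include <string.h>

typedef unsigned int T __attribute__ ((vector_size (16)));

void foo (void *result, T *a, T *b)
{
  T tmp = *a + *b;
  memcpy (result, &tmp, sizeof(tmp));
}

/* Store the sum one byte into a char buffer, then read the last lane back. */
unsigned int unaligned_store_demo (void)
{
  T a = { 1, 2, 3, 4 };
  T b = { 10, 20, 30, 40 };
  unsigned char bytes[sizeof(T) + 1];
  unsigned int lane3;

  foo (bytes + 1, &a, &b);
  memcpy (&lane3, bytes + 1 + 3 * sizeof(unsigned int), sizeof(lane3));
  return lane3;
}
```

The last lane holds 4 + 40 = 44 regardless of how the compiler chooses to
implement the store.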



Compiling with gcc 4.7.2:

$ arm-none-linux-gnueabi-gcc -O2 -mcpu=cortex-a8 -mfpu=neon -c test.c
$ objdump -d test.o

00000000 <foo>:
   0:   e52d4004    push    {r4}        ; (str r4, [sp, #-4]!)
   4:   ecd12b04    vldmia  r1, {d18-d19}
   8:   e24dd014    sub     sp, sp, #20
   c:   ecd20b04    vldmia  r2, {d16-d17}
  10:   e28dc010    add     ip, sp, #16
  14:   f26208e0    vadd.i32        q8, q9, q8
  18:   ed6c0b04    vstmdb  ip!, {d16-d17}
  1c:   e1a0c00d    mov     ip, sp
  20:   e1a04000    mov     r4, r0
  24:   e8bc000f    ldm     ip!, {r0, r1, r2, r3}
  28:   e5840000    str     r0, [r4]
  2c:   e5841004    str     r1, [r4, #4]
  30:   e5842008    str     r2, [r4, #8]
  34:   e584300c    str     r3, [r4, #12]
  38:   e28dd014    add     sp, sp, #20
  3c:   e8bd0010    pop     {r4}
  40:   e12fff1e    bx      lr

The same test program results in the following code if compiled for x86-64:

0000000000000000 <foo>:
   0:   66 0f 6f 06             movdqa (%rsi),%xmm0
   4:   66 0f fe 02             paddd  (%rdx),%xmm0
   8:   f3 0f 7f 07             movdqu %xmm0,(%rdi)
   c:   c3                      retq

So the x86-64 target is able to use the MOVDQU instruction. Hence the ARM
target should be able to use VST1.8 as well.

[Bug tree-optimization/55614] [4.6/4.7 Regression] vector extensions cause movdqa to be generated for memcpy on unaligned buffer

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55614



--- Comment #8 from Siarhei Siamashka  
2012-12-09 20:59:42 UTC ---

FWIW, the current gentoo patchset for gcc-4.7.2 contains
10_all_default-fortify-source.patch intended to "Enable -D_FORTIFY_SOURCE=2 by
default". It makes this bug non-reproducible there (or just latent?).

[Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623



--- Comment #4 from Siarhei Siamashka  
2012-12-09 11:21:42 UTC ---

Created attachment 28905

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28905

badschedmul.c



The testcase, converted to use multiplications. It can be used to demonstrate
the problem on all architectures, including x86-64.

[Bug middle-end/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623



--- Comment #3 from Siarhei Siamashka  
2012-12-09 11:18:56 UTC ---

(In reply to comment #2)

> This is an ARM (both arm32 and arm64) specific issue due to the shifts being

> "free".  If you look at the mips assembly, it looks good for a dual issue

> processor as it is scheduled as an add followed by a shift.

> 

> I think the issue is reassoc does not know that shifts are free on arm.



This does not look like an ARM-only issue. To properly demonstrate it on MIPS,
even without dual-issue, all the additions can simply be replaced with
multiplications (because multiplication is a long latency instruction). In this
case we get:



unsigned int f1(unsigned int x)

{

unsigned int a, b;

a = x >> 1;

b = x >> 2;

a *= x >> 3;

b *= x >> 4;

a *= x >> 5;

b *= x >> 6;

a *= x >> 7;

b *= x >> 8;

a *= x >> 9;

b *= x >> 10;

a *= x >> 11;

b *= x >> 12;

a *= x >> 13;

b *= x >> 14;

a *= x >> 15;

b *= x >> 16;

a *= x >> 17;

b *= x >> 18;

a *= x >> 19;

b *= x >> 20;

a *= x >> 21;

b *= x >> 22;

a *= x >> 23;

b *= x >> 24;

return a * b;

}



unsigned int f2(unsigned int x)

{

unsigned int a, b;

a = x >> 1;

b = x >> 2;

a *= x >> 3;

b *= x >> 4;

a *= x >> 5;

b *= x >> 6;

a *= x >> 7;

b *= x >> 8;

a *= x >> 9;

b *= x >> 10;

a *= x >> 11;

b *= x >> 12;

a *= x >> 13;

b *= x >> 14;

a *= x >> 15;

b *= x >> 16;

a *= x >> 17;

b *= x >> 18;

a *= x >> 19;

b *= x >> 20;

a *= x >> 21;

b *= x >> 22;

a *= x >> 23;

b *= x >> 24;

asm ("" : "+r" (a));

return a * b;

}



And the benchmark run on MIPS 74K:



$ gcc -O2 -march=mips32r2 -mtune=74kc -o badschedmul badschedmul.c
$ time ./badschedmul 1

real    0m34.934s
user    0m34.689s
sys     0m0.073s

$ time ./badschedmul 2

real    0m19.261s
user    0m19.122s
sys     0m0.050s



The symptoms are still the same. GCC just merges two independent calculations
into a single dependency chain, while I would have expected it to do the
opposite (breaking dependency chains to run faster on the target CPU).
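
To make the two shapes easier to compare, here is a compact sketch of the same idea written with loops. It is not the attached badschedmul.c. The function names mul_single_chain, mul_two_chains and chains_demo are invented for this illustration. Both functions compute the same product (unsigned multiplication wraps, so the order of the factors does not matter). Only the length of the dependency chain differs, and whether the compiler keeps the two chains apart is exactly what this report is about.

```c
#include <assert.h>

/* One long chain: every multiply has to wait for the previous result.  */
unsigned int mul_single_chain (unsigned int x)
{
  unsigned int a = x >> 1;
  int i;

  for (i = 2; i <= 24; i++)
    a *= x >> i;
  return a;
}

/* Two independent chains (odd and even shift counts), merged at the end.  */
unsigned int mul_two_chains (unsigned int x)
{
  unsigned int a = x >> 1;
  unsigned int b = x >> 2;
  int i;

  for (i = 3; i <= 23; i += 2)
    {
      a *= x >> i;        /* shifts 3, 5, ..., 23 */
      b *= x >> (i + 1);  /* shifts 4, 6, ..., 24 */
    }
  /* Same empty asm barrier as in f2: hides 'a' from reassociation.  */
  __asm__ ("" : "+r" (a));
  return a * b;
}

/* Self-check: both variants must agree for any input.  */
void chains_demo (void)
{
  assert (mul_single_chain (0xdeadbeefu) == mul_two_chains (0xdeadbeefu));
  assert (mul_single_chain (0xffffffffu) == mul_two_chains (0xffffffffu));
}
```

Timing the two functions in a loop, as the badschedmul runs above do, shows whether the second form is actually faster on a given CPU.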

[Bug tree-optimization/55623] [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623



--- Comment #1 from Siarhei Siamashka  
2012-12-09 10:00:59 UTC ---

Created attachment 28904

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28904

badsched.c

[Bug tree-optimization/55623] New: [ARM] GCC should not prefer long dependency chains, they inhibit performance on superscalar processors

2012-12-09 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55623



 Bug #: 55623

   Summary: [ARM] GCC should not prefer long dependency chains,

they inhibit performance on superscalar processors

Classification: Unclassified

   Product: gcc

   Version: 4.7.2

Status: UNCONFIRMED

  Severity: enhancement

  Priority: P3

 Component: tree-optimization

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: siarhei.siamas...@gmail.com





This is a missing optimization. Or in this particular case, it's more like GCC

is reversing an attempt of a programmer to optimize the code for superscalar

dual-issue processors.



$ arm-none-linux-gnueabi-gcc -O2 -mcpu=cortex-a8 -o badsched badsched.c

$ objdump -d badsched



00000000 <f1>:
   0:   e1a03120    lsr     r3, r0, #2
   4:   e08330a0    add     r3, r3, r0, lsr #1
   8:   e08331a0    add     r3, r3, r0, lsr #3
   c:   e0833220    add     r3, r3, r0, lsr #4
  10:   e08332a0    add     r3, r3, r0, lsr #5
  14:   e0833320    add     r3, r3, r0, lsr #6
  18:   e08333a0    add     r3, r3, r0, lsr #7
  1c:   e0833420    add     r3, r3, r0, lsr #8
  20:   e08334a0    add     r3, r3, r0, lsr #9
  24:   e0833520    add     r3, r3, r0, lsr #10
  28:   e08335a0    add     r3, r3, r0, lsr #11
  2c:   e0833620    add     r3, r3, r0, lsr #12
  30:   e08336a0    add     r3, r3, r0, lsr #13
  34:   e0833720    add     r3, r3, r0, lsr #14
  38:   e08337a0    add     r3, r3, r0, lsr #15
  3c:   e0833820    add     r3, r3, r0, lsr #16
  40:   e08338a0    add     r3, r3, r0, lsr #17
  44:   e0833920    add     r3, r3, r0, lsr #18
  48:   e08339a0    add     r3, r3, r0, lsr #19
  4c:   e0833a20    add     r3, r3, r0, lsr #20
  50:   e0833aa0    add     r3, r3, r0, lsr #21
  54:   e0833b20    add     r3, r3, r0, lsr #22
  58:   e0833ba0    add     r3, r3, r0, lsr #23
  5c:   e0830c20    add     r0, r3, r0, lsr #24
  60:   e12fff1e    bx      lr

00000064 <f2>:
  64:   e1a031a0    lsr     r3, r0, #3
  68:   e1a02220    lsr     r2, r0, #4
  6c:   e08330a0    add     r3, r3, r0, lsr #1
  70:   e0822120    add     r2, r2, r0, lsr #2
  74:   e08332a0    add     r3, r3, r0, lsr #5
  78:   e0822320    add     r2, r2, r0, lsr #6
  7c:   e08333a0    add     r3, r3, r0, lsr #7
  80:   e0822420    add     r2, r2, r0, lsr #8
  84:   e08334a0    add     r3, r3, r0, lsr #9
  88:   e0822520    add     r2, r2, r0, lsr #10
  8c:   e08335a0    add     r3, r3, r0, lsr #11
  90:   e0822620    add     r2, r2, r0, lsr #12
  94:   e08336a0    add     r3, r3, r0, lsr #13
  98:   e0822720    add     r2, r2, r0, lsr #14
  9c:   e08337a0    add     r3, r3, r0, lsr #15
  a0:   e0822820    add     r2, r2, r0, lsr #16
  a4:   e08338a0    add     r3, r3, r0, lsr #17
  a8:   e0822920    add     r2, r2, r0, lsr #18
  ac:   e08339a0    add     r3, r3, r0, lsr #19
  b0:   e0822a20    add     r2, r2, r0, lsr #20
  b4:   e0833aa0    add     r3, r3, r0, lsr #21
  b8:   e0822b20    add     r2, r2, r0, lsr #22
  bc:   e0833ba0    add     r3, r3, r0, lsr #23
  c0:   e0820c20    add     r0, r2, r0, lsr #24
  c4:   e0800003    add     r0, r0, r3
  c8:   e12fff1e    bx      lr



Guess which one of these two functions will be faster?



=== Cortex-A8 @1000MHz ===

$ time ./badsched 1

real    0m2.512s
user    0m2.500s
sys     0m0.000s

$ time ./badsched 2

real    0m2.064s
user    0m2.008s
sys     0m0.008s

=== Cortex-A15 @1700MHz ===

real    0m2.786s
user    0m2.770s
sys     0m0.005s

real    0m1.451s
user    0m1.440s
sys     0m0.005s



There is function call and loop overhead, which prevents Cortex-A8 from
showing ~2x better performance when using the "f2" function. We can try to
mark these functions as static in order to get them inlined, but in this case
the asm workaround becomes ineffective in a rather interesting way, which also
demonstrates instruction scheduling issues.

[Bug c/55457] Having some predefined macros to get more information about gcc vector extensions capabilities would be nice

2012-12-08 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55457



--- Comment #2 from Siarhei Siamashka  
2012-12-09 02:33:24 UTC ---

(In reply to comment #1)

> The whole point of the gcc vector extensions is that you don't need to depend

> on what the hardware can do under neath as it should produce good code in

> either case.



I think that you are a bit too idealistic and overlooking some practical

implications:

1. GCC just does not generate really good code with vector extensions when

there is no real SIMD backend.

2. There may be some high level algorithmic optimizations possible. For

example, branches to skip calculations for some special cases. We really want

to have these branches in scalar code, while SIMD can sacrifice the branches

and gain a lot more performance from parallel processing.



So right now GCC is forcing me and other users to infest the code with
ifdefs checking for __x86__, __amd64__, __arm__, __powerpc__, __SSE2__,
__ALTIVEC__, __ARM_NEON__, etc. to disable the use of vector extensions when
there is actually no real SIMD. It kind of defeats the purpose, because it turns
out that I need to know about all the CPU architectures supported by GCC and
their SIMD implementations before I can expect that the code will work
reasonably fast everywhere.



Relying just on GCC vector extensions means non-portable code, which will not

work with the other compilers. So in any case, everyone is likely to already

have an alternative implementation written in standard C. Having two

alternative implementations to select from, we want to be able to make a

reasonably good guess about which implementation is going to be preferable for

this particular build.
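
A small sketch of what this selection tends to look like in practice. USE_VECTOR_EXT is a hypothetical project-local switch (not a GCC macro), and add4 / add4_demo are names invented for this illustration. The list of target macros is deliberately incomplete, which is the point of the complaint above.

```c
#include <assert.h>
#include <string.h>

/* USE_VECTOR_EXT is a hypothetical project-local switch, not a GCC macro.
   Every new SIMD-capable target needs another entry in this list.  */
#if defined(__GNUC__) && \
    (defined(__SSE2__) || defined(__ALTIVEC__) || defined(__ARM_NEON__))
#define USE_VECTOR_EXT 1
#else
#define USE_VECTOR_EXT 0
#endif

#if USE_VECTOR_EXT
typedef unsigned int u32x4 __attribute__ ((vector_size (16)));
#endif

/* Add four unsigned ints, via vector extensions or plain C.  */
void add4 (unsigned int *dst, const unsigned int *a, const unsigned int *b)
{
#if USE_VECTOR_EXT
  u32x4 va, vb;

  memcpy (&va, a, sizeof (va));   /* alignment-agnostic load */
  memcpy (&vb, b, sizeof (vb));
  va += vb;
  memcpy (dst, &va, sizeof (va)); /* alignment-agnostic store */
#else
  int i;

  for (i = 0; i < 4; i++)
    dst[i] = a[i] + b[i];
#endif
}

/* Self-check: both code paths must produce the same wrapping sums.  */
void add4_demo (void)
{
  const unsigned int a[4] = { 1, 2, 3, 0xffffffffu };
  const unsigned int b[4] = { 10, 20, 30, 1 };
  unsigned int r[4];

  add4 (r, a, b);
  assert (r[0] == 11 && r[1] == 22 && r[2] == 33 && r[3] == 0);
}
```

A predefined macro describing the preferred vector width would let the #if at the top shrink to a single target-independent test.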

[Bug target/46128] There is no mechanism for detecting VFP revisions in ARM GCC.

2012-12-04 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46128



--- Comment #6 from Siarhei Siamashka  
2012-12-05 00:06:39 UTC ---

(In reply to comment #5)

> This is really an enhancement request...



Is there anything that can be done with this enhancement request?



I can see that __ARM_FEATURE_DSP and __ARM_FEATURE_UNALIGNED have been added:

http://gcc.gnu.org/ml/gcc-patches/2011-06/msg01849.html

http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00878.html



This is a step in the right direction, but still not enough.

[Bug target/55454] [PPC] unaligned memory accesses do not work correctly for vector extensions when using altivec

2012-11-25 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55454



--- Comment #4 from Siarhei Siamashka  
2012-11-25 21:16:53 UTC ---

(In reply to comment #3)

> Also fails with GCC trunk (gcc version 4.8.0 20120518 (experimental))

 ^^

Sorry, I accidentally compiled GCC from a stale old directory. The recent
trunk 4.8.0 20121120 (experimental) has the memcpy issue fixed. Still, the STVX
problem is there:



 :

   0:7c 00 18 ce lvx v0,r0,r3

   4:3d 40 00 00 lis r10,0

   8:39 20 00 0a li  r9,10

   c:39 4a 00 00 addi r10,r10,0

  10:7c 0a 49 ce stvx v0,r10,r9

  14:4e 80 00 20 blr

[Bug target/55454] [PPC] unaligned memory accesses do not work correctly for vector extensions when using altivec

2012-11-25 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55454



--- Comment #3 from Siarhei Siamashka  
2012-11-25 19:32:02 UTC ---

Also fails with GCC trunk (gcc version 4.8.0 20120518 (experimental))



The disassembly listing for "init_buffer" function:



 :

   0:7d 80 42 a6 mfvrsave r12

   4:94 21 ff e0 stwu r1,-32(r1)

   8:91 81 00 1c stw r12,28(r1)

   c:65 8c 80 00 oris r12,r12,32768

  10:7d 80 43 a6 mtvrsave r12

  14:3d 40 00 00 lis r10,0

  18:7c 00 18 ce lvx v0,r0,r3

  1c:39 20 00 0a li  r9,10

  20:39 4a 00 00 addi r10,r10,0

  24:7c 0a 49 ce stvx v0,r10,r9



Here it happily tries to use the STVX instruction. And this instruction just
silently aligns the address down to a 16-byte boundary, effectively doing the
write at &buffer[0] instead of &buffer[10].



  28:81 81 00 1c lwz r12,28(r1)

  2c:7d 80 43 a6 mtvrsave r12

  30:38 21 00 20 addi r1,r1,32

  34:4e 80 00 20 blr
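
The address rounding done by STVX in the listing above can be modelled in plain C. This is only a sketch of the effective-address calculation (the low 4 bits are ignored), not Altivec code, and the names altivec_effective_address and stvx_demo are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the address actually used by lvx/stvx: the low 4 bits of the
   effective address are dropped, so it always points at a 16-byte block.  */
uintptr_t altivec_effective_address (uintptr_t ea)
{
  return ea & ~(uintptr_t) 15;
}

/* Self-check using a 16-byte aligned buffer like the one in the testcase.  */
void stvx_demo (void)
{
  static char buffer[32] __attribute__ ((aligned (16)));
  uintptr_t base = (uintptr_t) buffer;

  /* A store intended for &buffer[10] lands at &buffer[0].  */
  assert (altivec_effective_address (base + 10) == base);
  /* An address inside the second block rounds down to &buffer[16].  */
  assert (altivec_effective_address (base + 26) == base + 16);
}
```

This is why the assertion on buffer[0] in the testcase fails: the 16 bytes meant for buffer[10..25] overwrite buffer[0..15] instead.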





And by the way, the memcpy workaround mentioned above is also broken in GCC

4.8, because it tries to be clever and generates exactly the same code relying

on STVX :)





With GCC 4.7.2, at least the memcpy variant used to work correctly:



 :

   0:3d 40 00 00 lis r10,0

   4:80 a3 00 00 lwz r5,0(r3)

   8:80 c3 00 04 lwz r6,4(r3)

   c:80 e3 00 08 lwz r7,8(r3)

  10:39 2a 00 0a addi r9,r10,10

  14:81 03 00 0c lwz r8,12(r3)

  18:90 aa 00 0a stw r5,10(r10)

  1c:90 c9 00 04 stw r6,4(r9)

  20:90 e9 00 08 stw r7,8(r9)

  24:91 09 00 0c stw r8,12(r9)

  28:4e 80 00 20 blr

[Bug target/55454] [PPC] unaligned memory accesses do not work correctly for vector extensions when using altivec

2012-11-25 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55454



--- Comment #2 from Siarhei Siamashka  
2012-11-25 18:18:16 UTC ---

(In reply to comment #1)

> Besides from whether the testcase is valid



According to http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html



"packed - This attribute, attached to struct or union type definition,

specifies that each member (other than zero-width bit-fields) of the structure

or union is placed to minimize the memory required. When attached to an enum

definition, it indicates that the smallest integral type should be used."



Is it safe to assume that the size of this "foo" struct is always expected to
be 17 bytes in the testcase? If yes, then it must be safe to use any alignment
for this struct, because an array of "foo" will have elements at addresses with
every possible alignment. As such, any memory location can be safely cast to
foo* and used. Is there anything wrong with these assumptions?





But in fact what I want is just to somehow tell gcc that I'm going to write

this vector data type at an unaligned memory location. For example, x86 SSE2

and ARM NEON have unaligned load/store instructions. PPC Altivec can't do it

easily, but that's a headache for GCC and the application developer (me) should

not care. After all, if running out of options, one can always use



memcpy(buffer + 10, a, sizeof(*a));



instead of



((foo *)(buffer + 9))->data = *a;



The performance goes down the toilet though. Which would be in fact an

acceptable solution for PPC, but x86 and ARM can definitely do much better.
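
For reference, a self-contained sketch of the memcpy variant mentioned above, reusing the buffer and vector type from the testcase. The names init_buffer_memcpy and init_buffer_demo are invented for this illustration. It only checks that the bytes end up where intended and says nothing about the quality of the generated code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t uint8x16 __attribute__ ((vector_size (16)));

char __attribute__ ((aligned (16))) buffer[32];

/* Store the vector at an unaligned offset without any pointer casts.  */
void init_buffer_memcpy (const uint8x16 *a)
{
  memcpy (buffer + 10, a, sizeof (*a));
}

/* Self-check: only buffer[10..25] may change.  */
void init_buffer_demo (void)
{
  const uint8x16 a = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 };

  init_buffer_memcpy (&a);
  assert (buffer[0] == 0);
  assert (buffer[9] == 0);
  assert (buffer[10] == 1);
  assert (buffer[25] == 16);
  assert (buffer[26] == 0);
}
```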



> 4.8 should do a better job here.



Thanks, I'll check GCC 4.8 a bit later.

[Bug c/55457] New: Having some predefined macros to get more information about gcc vector extensions capabilities would be nice

2012-11-24 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55457



 Bug #: 55457

   Summary: Having some predefined macros to get more information

about gcc vector extensions capabilities would be nice

Classification: Unclassified

   Product: gcc

   Version: 4.7.2

Status: UNCONFIRMED

  Severity: enhancement

  Priority: P3

 Component: c

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: siarhei.siamas...@gmail.com





One practical problem is how to identify whether vector extensions are
beneficial or whether it is better to fall back to standard C code. In the
case of OpenCL, there are param values such as
CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
etc.



   

http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html



If gcc had some sort of predefined macro telling that the "preferred vector
width is 1", it could be used in the code to avoid a performance penalty
by just using normal C code when compiling for non-SIMD capable platforms.

[Bug target/55454] New: [PPC] unaligned memory accesses do not work correctly for vector extensions when using altivec

2012-11-23 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55454



 Bug #: 55454

   Summary: [PPC] unaligned memory accesses do not work correctly

for vector extensions when using altivec

Classification: Unclassified

   Product: gcc

   Version: 4.7.2

Status: UNCONFIRMED

  Severity: normal

  Priority: P3

 Component: target

AssignedTo: unassig...@gcc.gnu.org

ReportedBy: siarhei.siamas...@gmail.com





The following test program reproduces the problem:



/***/



#include <assert.h>
#include <stdint.h>



typedef uint8_t uint8x16 __attribute__ ((vector_size(16)));

typedef struct { char dummy; uint8x16 data; } __attribute__((packed)) foo;



char __attribute__((aligned(16))) buffer[32];



void __attribute__((noinline)) init_buffer(const uint8x16 *a)
{
    ((foo *)(buffer + 9))->data = *a;
}

int main (void)
{
    const uint8x16 a = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 };
    assert(sizeof(foo) == 17);
    init_buffer(&a);
    assert(buffer[0] == 0);
    return 0;
}



/***/



$ gcc -O2 -maltivec -o test test.c

$ ./test

test: test.c:19: main: Assertion `buffer[0] == 0' failed.

Aborted

[Bug tree-optimization/54965] [4.6 Regression] sorry, unimplemented: inlining failed in call to 'foo': function not considered for inlining

2012-10-18 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54965



--- Comment #5 from Siarhei Siamashka  
2012-10-19 00:17:13 UTC ---

(In reply to comment #4)

> In the above case you probably want big_function_a to have all

> calls inlined.  You can then conveniently use the flatten attribute:

> 

> void __attribute__((flatten)) big_function_b (...)

> {

>   big_function_template(..., per_pixel_operation_b);

> }

> 

> GCC will then inline all calls in that function but not ICE

> when it fails to inline one case for some weird reason.



That's nice, but the "flatten" attribute does not seem to be widely supported
by compilers. For example, clang-3.1 does not support it yet and the
enhancement request has been open since 2010:
http://llvm.org/bugs/show_bug.cgi?id=7559



As far as I know, a few different compilers are currently in real use for

building pixman for various systems: GCC, Clang, Solaris Studio and MSVC. All

of them have some sort of "always_inline" attribute support, which makes it

more universal than "flatten".



> Don't use always-inline or don't use indirect function calls to

> always-inline functions.  It makes always-inline function calls

> survive until IPA inlining where we seem to honor limits even

> though we say we should disregard them.



Is it too intrusive to fix GCC so that it would disregard limits in this case?

Or maybe introduce one more attribute which would be a strong inlining hint,

but still not cause compilation failure if some function can't be really

inlined?

[Bug tree-optimization/54965] [4.6 Regression] sorry, unimplemented: inlining failed in call to 'foo': function not considered for inlining

2012-10-18 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54965



--- Comment #3 from Siarhei Siamashka  
2012-10-18 10:47:51 UTC ---

(In reply to comment #2)

> void combine_conjoint_xor_ca_float ()

> {

> combine_channel_t j = pd_combine_conjoint_xor, k = 
> pd_combine_conjoint_xor;

> a[0] = k (0, b, 0, a[0]);

> a[0] = k (0, b, 0, a[0]);

> a[0] = k (0, b, 0, a[0]);

> a[0] = j (0, c[0], 0, a[0]);

> a[0] = k (0, c[0], 0, a[0]);

> a[0] = k (0, c[0], 0, a[0]);

> a[0] = k (0, c[0], 0, a[0]);

> 

> you are using indirect function calls here, GCC in 4.6 is not smart enough

> to transform them to direct calls before inlining.  Inlining of

> always-inline indirect function calls is not going to work reliably.



Does this only apply to GCC 4.6?



> Don't use always-inline or don't use indirect function calls to always-inline

> functions.



This looks like it might be really inconvenient. Pixman relies on this

functionality in a number of places by doing something like this:



void always_inline per_pixel_operation_a(...)

{

...

}



void always_inline per_pixel_operation_b(...)

{

...

}



void always_inline big_function_template(..., per_pixel_operation_ptr foo)

{

...

/* do some calls to foo() in an inner loop */

...

}



void big_function_a(...)

{

big_function_template(..., per_pixel_operation_a);

}



void big_function_b(...)

{

big_function_template(..., per_pixel_operation_b);

}



Needless to say that we want to be absolutely sure that per-pixel operations

are always inlined. Otherwise the performance gets really bad if the compiler

ever makes a bad inlining decision.



The same functionality can probably be achieved by replacing always_inline
functions with macros. But the code becomes less readable, more error-prone and
somewhat more difficult to maintain.
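
A compilable sketch of the pattern outlined above, with the elided parts filled in by made-up placeholders. The force_inline macro, the uint8_t signatures, the two trivial per-pixel operations and big_function_demo are assumptions for this illustration and are not pixman code. It is expected to build with compilers that handle always_inline indirect calls gracefully. The failure in this report is specific to the GCC 4.6 branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shorthand for this sketch.  */
#define force_inline static inline __attribute__ ((always_inline))

typedef uint8_t (*per_pixel_operation_ptr) (uint8_t src, uint8_t dst);

/* Placeholder operation A: wrapping add.  */
force_inline uint8_t per_pixel_operation_a (uint8_t src, uint8_t dst)
{
  return (uint8_t) (src + dst);
}

/* Placeholder operation B: maximum.  */
force_inline uint8_t per_pixel_operation_b (uint8_t src, uint8_t dst)
{
  return src > dst ? src : dst;
}

/* The shared loop; 'op' is meant to be resolved and inlined per caller.  */
force_inline void big_function_template (uint8_t *dst, const uint8_t *src,
                                         size_t n, per_pixel_operation_ptr op)
{
  size_t i;

  for (i = 0; i < n; i++)
    dst[i] = op (src[i], dst[i]);
}

void big_function_a (uint8_t *dst, const uint8_t *src, size_t n)
{
  big_function_template (dst, src, n, per_pixel_operation_a);
}

void big_function_b (uint8_t *dst, const uint8_t *src, size_t n)
{
  big_function_template (dst, src, n, per_pixel_operation_b);
}

/* Self-check of both instantiations.  */
void big_function_demo (void)
{
  const uint8_t src[4] = { 1, 200, 30, 255 };
  uint8_t dst_a[4] = { 1, 100, 40, 1 };
  uint8_t dst_b[4] = { 1, 100, 40, 1 };

  big_function_a (dst_a, src, 4);
  big_function_b (dst_b, src, 4);
  assert (dst_a[0] == 2 && dst_a[1] == 44 && dst_a[2] == 70 && dst_a[3] == 0);
  assert (dst_b[0] == 1 && dst_b[1] == 200 && dst_b[2] == 40 && dst_b[3] == 255);
}
```

Looking at the disassembly of big_function_a and big_function_b (for example with objdump -d) shows whether the per-pixel calls were really inlined into the loops.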

[Bug tree-optimization/54965] [4.6] sorry, unimplemented: inlining failed in call to 'foo': function not considered for inlining

2012-10-17 Thread siarhei.siamashka at gmail dot com



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54965



--- Comment #1 from Siarhei Siamashka  
2012-10-18 01:56:34 UTC ---

Created attachment 28476

  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28476

pixman-combine-float.i.gz - the original full preprocessed source



Applying the changes from

http://gcc.gnu.org/git/?p=gcc.git;a=commit;h=526b36a8a249c8c8698ca48ffeb8bff552f5a6fd

to 'passes.c' in GCC 4.6 branch "fixes" the reduced testcase. But pixman still

can't be compiled successfully, so also attaching the original full

preprocessed source.

[Bug tree-optimization/54965] New: [4.6] sorry, unimplemented: inlining failed in call to 'foo': function not considered for inlining

2012-10-17 Thread siarhei.siamashka at gmail dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54965

 Bug #: 54965
   Summary: [4.6] sorry, unimplemented: inlining failed in call to
'foo': function not considered for inlining
Classification: Unclassified
   Product: gcc
   Version: 4.6.4
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com


Created attachment 28474
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28474
reduced.i

GCC 4.6 fails when compiling current git versions of pixman:
https://bugs.freedesktop.org/show_bug.cgi?id=55630

Bisecting shows that this problem started occurring in 4.6 branch after the
following commit:
http://gcc.gnu.org/git/?p=gcc.git;a=commit;h=3d5f815b529fe4b8b79d4f2a04e6eb670faee04d

3d5f815b529fe4b8b79d4f2a04e6eb670faee04d is the first bad commit
commit 3d5f815b529fe4b8b79d4f2a04e6eb670faee04d
Author: hubicka 
Date:   Thu Nov 11 22:08:26 2010 +0000

PR tree-optimize/40436
* gcc.dg/tree-ssa/inline-5.c: New testcase.
* gcc.dg/tree-ssa/inline-6.c: New testcase.

* ipa-inline.c (likely_eliminated_by_inlining_p): Rename to ...
(eliminated_by_inlining_prob): ... this one; return 50% probability for
SRA.
(estimate_function_body_sizes): Update use of
eliminated_by_inlining_prob;
estimate static function size for 2 instructions.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@166624
138bc75d-0d04-0410-961f-82ee72b054a4


The problem disappears in 4.7 branch after:
http://gcc.gnu.org/git/?p=gcc.git;a=commit;h=526b36a8a249c8c8698ca48ffeb8bff552f5a6fd

526b36a8a249c8c8698ca48ffeb8bff552f5a6fd is the first bad commit
commit 526b36a8a249c8c8698ca48ffeb8bff552f5a6fd
Author: rguenth 
Date:   Fri Mar 25 11:59:19 2011 +0000

2011-03-25  Richard Guenther  

* passes.c (init_optimization_passes): Add FRE pass after
early SRA.

* g++.dg/tree-ssa/pr41186.C: Scan the appropriate FRE dump.
* g++.dg/tree-ssa/pr8781.C: Likewise.
* gcc.dg/ipa/ipa-pta-13.c: Likewise.
* gcc.dg/ipa/ipa-pta-3.c: Likewise.
* gcc.dg/ipa/ipa-pta-4.c: Likewise.
* gcc.dg/tree-ssa/20041122-1.c: Likewise.
* gcc.dg/tree-ssa/alias-18.c: Likewise.
* gcc.dg/tree-ssa/foldstring-1.c: Likewise.
* gcc.dg/tree-ssa/forwprop-10.c: Likewise.
* gcc.dg/tree-ssa/forwprop-9.c: Likewise.
* gcc.dg/tree-ssa/fre-vce-1.c: Likewise.
* gcc.dg/tree-ssa/loadpre6.c: Likewise.
* gcc.dg/tree-ssa/pr21574.c: Likewise.
* gcc.dg/tree-ssa/ssa-dom-cse-1.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-1.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-11.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-12.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-13.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-14.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-15.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-16.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-17.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-18.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-19.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-2.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-21.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-22.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-23.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-24.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-25.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-26.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-27.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-3.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-4.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-5.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-6.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-7.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-8.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-9.c: Likewise.
* gcc.dg/tree-ssa/ssa-pre-10.c: Likewise.
* gcc.dg/tree-ssa/ssa-pre-26.c: Likewise.
* gcc.dg/tree-ssa/ssa-pre-7.c: Likewise.
* gcc.dg/tree-ssa/ssa-pre-8.c: Likewise.
* gcc.dg/tree-ssa/ssa-pre-9.c: Likewise.
* gcc.dg/tree-ssa/ssa-sccvn-1.c: Likewise.
* gcc.dg/tree-ssa/ssa-sccvn-2.c: Likewise.
* gcc.dg/tree-ssa/ssa-sccvn-3.c: Likewise.
* gcc.dg/tree-ssa/ssa-sccvn-4.c: Likewise.
* gcc.dg/tree-ssa/struct-aliasing-1.c: Likewise.
* gcc.dg/tree-ssa/struct-aliasing-2.c: Likewise.
* c-c++-common/pr46562-2.c: Likewise.
* gfortran.dg/pr42108.f90: Likewise.
* gcc.dg/torture/pta-structcopy-1.c: Scan ealias dump, force
foo to be inlined even at -O1.
* gcc.dg/tree-ssa/ssa-dce-4.c: Disable FRE.
* gcc.dg/ipa/ipa-pta-14.c: Likewise.
* gcc.dg/tree-ssa/ssa-fre-1.c: Adjust.
* gcc.dg/matrix/matrix.exp: Disable FRE.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@171450
138bc75d-0

[Bug target/53659] New: ARM: Using -mcpu=cortex-a9 option results in bad performance for Cortex-A9 processor in C-Ray phoronix benchmark

2012-06-13 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53659

 Bug #: 53659
   Summary: ARM: Using -mcpu=cortex-a9 option results in bad
performance for Cortex-A9 processor in C-Ray phoronix
benchmark
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com


gcc version 4.7.0
--with-arch=armv7-a --with-float=hard --with-fpu=neon --with-mode=thumb

$ cd /tmp
$ wget http://www.phoronix-test-suite.com/benchmark-files/c-ray-1.1.tar.gz
$ tar -xzf c-ray-1.1.tar.gz
$ cd c-ray-1.1

$ make clean && make
gcc -O3 -ffast-math   -c -o c-ray-mt.o c-ray-mt.c
gcc -o c-ray-mt c-ray-mt.o -lm -lpthread
$ ./c-ray-mt -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm
c-ray-mt v1.1
Rendering took: 6 seconds (6683 milliseconds)

$ sed -i "s,-O3,-O3 -mcpu=cortex-a9,g" Makefile

$ make clean && make
gcc -O3 -mcpu=cortex-a9 -ffast-math   -c -o c-ray-mt.o c-ray-mt.c
gcc -o c-ray-mt c-ray-mt.o -lm -lpthread
$ ./c-ray-mt -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm
c-ray-mt v1.1
Rendering took: 7 seconds (7906 milliseconds)

Compared to the default -march=armv7-a configuration, -mcpu=cortex-a9 caused a
~18% slowdown (7906 milliseconds vs. 6683 milliseconds). The test was run on a
dual-core ARM Cortex-A9 @1.2GHz.

[Bug middle-end/32074] Optimizer does not exploit assertions

2012-03-29 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32074

--- Comment #8 from Siarhei Siamashka  
2012-03-29 10:11:39 UTC ---
(In reply to comment #5)
> (In reply to comment #4)
> > We have __builtin_unreachable() now which should allow for this 
> > optimization.
> 
> I've been using __builtin_unreachable() for some time now, and it's very nice
> for its intended purpose (telling gcc when it's safe to produce better code).
> I've noticed, though, that the ``x'' passed to assert(x) in already-existing
> code is often too expensive (or side effect-ful) to optimize away when
> converted to ``if(!(x)) { __builtin_unreachable(); }''

Based on your comment, it looks like asserts are a superset of
__builtin_unreachable(), because asserts give the compiler more freedom to
either evaluate the expression or optimize it out. It's easy to replace
__builtin_unreachable() with assert(0), but not the other way around, as you
have clearly demonstrated. Hence this enhancement request does not appear to be
fully resolved yet.
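
For readers who have not seen the idiom being discussed, here is a minimal sketch of it. ASSUME is a hypothetical helper macro, and scale_down / assume_demo are names invented for this illustration. With GCC the condition tells the optimizer that the divisor is always 8, so the division may be reduced to a shift. Calling the function with any other divisor is undefined behaviour.

```c
#include <assert.h>

/* Hypothetical helper: promise the compiler that 'cond' always holds.  */
#define ASSUME(cond) \
  do { if (!(cond)) __builtin_unreachable (); } while (0)

/* Divide by a value the caller guarantees to be 8.  */
unsigned int scale_down (unsigned int x, unsigned int divisor)
{
  ASSUME (divisor == 8);
  return x / divisor;
}

/* Self-check, honouring the promise made to the compiler.  */
void assume_demo (void)
{
  assert (scale_down (64, 8) == 8);
  assert (scale_down (7, 8) == 0);
}
```

Unlike assert(x), this form requires the compiler to be able to prove that evaluating the condition has no side effects before it can drop the check, which is the limitation described in the quoted comment.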

[Bug middle-end/32074] Optimizer does not exploit assertions

2012-03-29 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32074

--- Comment #7 from Siarhei Siamashka  
2012-03-29 09:31:37 UTC ---
(In reply to comment #6)
> Fixed by means of __builtin_unreachable ().

But __builtin_unreachable() is not a part of the C standard yet? Is there no way
to extract some useful information from asserts in NDEBUG mode, at least in some
simple cases when it is clearly beneficial for optimizations?

[Bug middle-end/52355] [4.7 regression] address difference between array elements is not considered to be a compile time constant anymore

2012-02-23 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52355

--- Comment #4 from Siarhei Siamashka  
2012-02-23 15:56:24 UTC ---
Now I wonder if a multidimensional array is still treated as the same array in
"When two pointers are subtracted, both shall point to elements of the same
array object, or one past the last element of the array object; the result is
the difference of the subscripts of the two array elements."

https://groups.google.com/group/comp.lang.c/browse_thread/thread/3a16b9b33cb0cdd0/c16065f5189a0348

[Bug c/52355] New: [4.7 regression] address difference between array elements is not considered to be a compile time constant anymore

2012-02-23 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52355

 Bug #: 52355
   Summary: [4.7 regression] address difference between array
elements is not considered to be a compile time
constant anymore
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com


Created attachment 26733
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=26733
test.c

gcc version 4.7.0 20120223 (experimental) (GCC)

$ cat test.c
void f(char a[16][16][16])
{
asm volatile ("" : : "i" (&a[1][0][0] - &a[0][0][0]));
}

int main(void)
{
char a[16][16][16];
f(a);
return 0;
}

$ gcc -O2 test.c
test.c: In function ‘f’:
test.c:3:5: warning: asm operand 0 probably doesn’t match constraints [enabled
by default]
test.c:3:5: error: impossible constraint in ‘asm’

[Bug target/50856] New: ARM: suboptimal code for absolute difference calculation

2011-10-24 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50856

 Bug #: 50856
   Summary: ARM: suboptimal code for absolute difference
calculation
Classification: Unclassified
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com


gcc generates suboptimal code on ARM for "abs(a - b)" type of operation, which
is used for example in the Paeth PNG filter: http://www.w3.org/TR/PNG-Filters.html

Given the following test code:


int absolute_difference1(unsigned char a, unsigned char b)
{
return a > b ? a - b : b - a;
}

int absolute_difference2(unsigned char a, unsigned char b)
{
int tmp = a;
if ((tmp -= b) < 0)
tmp = -tmp;
return tmp;
}


The current gcc svn trunk (r180383) generates the following code for -O2 and
-Os optimizations:

.cpu arm10tdmi
.eabi_attribute 27, 3
.eabi_attribute 28, 1
.fpu vfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 4
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file   "test.c"
.text
.align  2
.global absolute_difference1
.type   absolute_difference1, %function
absolute_difference1:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
cmp r0, r1
rsbhi   r0, r1, r0
rsbls   r0, r0, r1
bx  lr
.size   absolute_difference1, .-absolute_difference1
.align  2
.global absolute_difference2
.type   absolute_difference2, %function
absolute_difference2:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
rsb r0, r1, r0
cmp r0, #0
rsblt   r0, r0, #0
bx  lr
.size   absolute_difference2, .-absolute_difference2
.ident  "GCC: (GNU) 4.7.0 20111024 (experimental)"
.section .note.GNU-stack,"",%progbits

Even in the quite explicit second code variant ('absolute_difference2'
function), gcc does not generate the expected SUBS + NEGLT pair of
instructions. Also, on ARMv6-capable processors, even a single USAD8 instruction
could be used here if both operands are known to have values in the [0, 255]
range and if the high latency of this instruction can be hidden.
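
For reference, both variants are expected to return the same value as abs(a - b)
for every pair of 8-bit inputs. A minimal sketch that checks this exhaustively
(the helper name count_mismatches is made up for illustration and is not part
of the report):

```c
#include <stdlib.h>

/* Variant 1 from the report: conditional subtraction. */
int absolute_difference1(unsigned char a, unsigned char b)
{
    return a > b ? a - b : b - a;
}

/* Variant 2 from the report: subtract, then negate if negative. */
int absolute_difference2(unsigned char a, unsigned char b)
{
    int tmp = a;
    if ((tmp -= b) < 0)
        tmp = -tmp;
    return tmp;
}

/* Hypothetical helper: counts the input pairs for which the two variants
   disagree with each other or with abs(a - b). */
int count_mismatches(void)
{
    int mismatches = 0;
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            int d1 = absolute_difference1((unsigned char)a, (unsigned char)b);
            int d2 = absolute_difference2((unsigned char)a, (unsigned char)b);
            if (d1 != d2 || d1 != abs(a - b))
                mismatches++;
        }
    }
    return mismatches;
}
```

Since the results are identical, any difference in the generated code is purely
a code generation matter and not a difference in semantics.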

[Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics

2011-06-29 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #6 from Siarhei Siamashka  
2011-06-29 13:35:13 UTC ---
Created attachment 24630
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24630
test.c

Attached a slightly updated testcase, which can demonstrate unnecessary spills
to stack even with more recent versions of gcc as explained in comment 2
earlier (just slightly increased the number of uses for X() macro)

[Bug target/49526] ARM missed optimization: SMMUL instruction

2011-06-24 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49526

--- Comment #1 from Siarhei Siamashka  
2011-06-24 22:48:46 UTC ---
And clang 2.9 has no problems optimizing this code:

$ cat test.c

int smmul(int a, int b) { return ((long long)a * b) >> 32; }

$ clang -ccc-host-triple arm-none-linux -O2 -mcpu=cortex-a8 -S test.c
$ cat test.s
.syntax unified
.cpu cortex-a8
.eabi_attribute 6, 10
.eabi_attribute 7, 65
.eabi_attribute 8, 1
.eabi_attribute 9, 2
.fpu neon
.eabi_attribute 10, 3
.eabi_attribute 12, 1
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.file   "test.c"
.text
.globl  smmul
.align  2
.type   smmul,%function
smmul:
smmul   r0, r1, r0
bx  lr
.Ltmp0:
.size   smmul, .Ltmp0-smmul

[Bug target/49526] New: ARM missed optimization: SMMUL instruction

2011-06-24 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49526

   Summary: ARM missed optimization: SMMUL instruction
   Product: gcc
   Version: 4.7.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com


$ cat test.c

int smmul(int a, int b) { return ((long long)a * b) >> 32; }

$ arm-none-linux-gnueabi-gcc -O2 -S -mcpu=cortex-a8 test.c
$ cat test.s
.cpu cortex-a8
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 2
.eabi_attribute 18, 4
.file   "test.c"
.text
.align  2
.global smmul
.type   smmul, %function
smmul:
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
smull   r0, r1, r0, r1
mov r0, r1
bx  lr
.size   smmul, .-smmul
.ident  "GCC: (GNU) 4.7.0 20110624 (experimental)"
.section    .note.GNU-stack,"",%progbits
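
To spell out what is being asked for: the function only needs the upper 32 bits
of the signed 64-bit product, which is what a single SMMUL returns. A small
sketch of that equivalence, assuming arithmetic right shift of negative values
(this is what gcc does, but it is implementation-defined in C); high_word is a
name introduced here for illustration only:

```c
#include <stdint.h>

/* The function from the report. */
int smmul(int a, int b) { return ((long long)a * b) >> 32; }

/* Hypothetical reference: computes floor(a * b / 2^32) without shifting
   a negative value. */
int high_word(int a, int b)
{
    int64_t p = (int64_t)a * b;
    int64_t q = p / 4294967296LL;      /* division truncates toward zero */
    if (p % 4294967296LL < 0)
        q -= 1;                        /* adjust truncation to floor */
    return (int)q;
}
```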

[Bug target/48576] wrong code when accessing variables in a large stack frame

2011-04-12 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48576

Siarhei Siamashka  changed:

   What|Removed |Added

 CC||siarhei.siamashka at gmail
   ||dot com

--- Comment #1 from Siarhei Siamashka  
2011-04-12 15:23:02 UTC ---
This reminds me of bug 41074 (apparently the same hard-to-trigger issue related
to large stack frames).

[Bug target/47759] New: _mm_empty() intrinsic fails to serve as a boundary between MMX and x87 code due to optimizations

2011-02-15 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47759

   Summary: _mm_empty() intrinsic fails to serve as a boundary
between MMX and x87 code due to optimizations
   Product: gcc
   Version: 4.5.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com
  Host: i686-pc-linux-gnu


Created attachment 23355
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23355
mm_empty_testcase.c

The attached testcase fails when compiled with -O2 or -O3 optimizations, but
works with -O1.

I'm actually not precisely sure how this code is expected to behave, because
intrinsics are specific to the x86 architecture and the C standard can't be
used as a reference. But my guess is that if the optimizer were not allowed to
arbitrarily move code across the _mm_empty() boundary, then the problem would
disappear.
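
For context, the usage pattern in question looks roughly like the sketch below:
MMX intrinsics first, then _mm_empty(), then floating point code. This is not
the attached testcase, just an illustration with a made-up function name
(mmx_sum_then_scale). On x86-64 the double arithmetic is normally done in SSE
registers, so the hazard itself only concerns builds that use x87 floating
point.

```c
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical example: add two 16-bit values with MMX, then scale the
   result in floating point. */
double mmx_sum_then_scale(short x, short y, double scale)
{
    int sum;
#if defined(__MMX__)
    __m64 vx = _mm_set1_pi16(x);
    __m64 vy = _mm_set1_pi16(y);
    __m64 vs = _mm_add_pi16(vx, vy);
    /* The low 32 bits hold two copies of the 16-bit sum; keep one. */
    sum = (short)(_mm_cvtsi64_si32(vs) & 0xffff);
    /* Leave MMX state before any floating point code runs. */
    _mm_empty();
#else
    /* Fallback so the sketch still builds on targets without MMX. */
    sum = (short)(x + y);
#endif
    return sum * scale;
}
```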

[Bug target/45886] [ARM] support for __ARM_PCS_VFP predefined symbol in gcc 4.5.x would be very nice

2010-12-18 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45886

Siarhei Siamashka  changed:

   What|Removed |Added

 CC||toolchain at gentoo dot org

--- Comment #4 from Siarhei Siamashka  
2010-12-18 17:23:04 UTC ---
I'm sorry for asking again, but what is the status of this issue? Is it even
feasible to get it resolved in upstream gcc 4.5.x? Previously this issue had a
target milestone set to 4.5.2, but then it was simply removed.

While this is not strictly a regression compared to the previous releases, it
makes the use of the new major feature for arm (-mfloat-abi=hard option)
introduced in gcc 4.5 rather problematic. Just a few examples of the affected
software are the upcoming mozilla firefox4 and libffi library which use
__ARM_PCS_VFP define to identify floating point calling conventions.

The trivial few-line patch which fixes the issue has already been adopted by
linaro gcc [1]. It is also used by ubuntu [2] and probably some other linux
distributions for obvious pragmatic reasons.

So I wonder what is the current recommendation from your side? Make it a
responsibility of each linux distribution to patch gcc themselves? Or maybe add
some ugly hacks to the affected applications and libraries to identify floating
point calling conventions in some other way specifically for gcc 4.5.x?


1. https://wiki.linaro.org/WorkingGroups/ToolChain/Changelogs/LinaroGCC4.5
2. http://packages.ubuntu.com/natty/gcc-4.5-multilib

[Bug target/45094] [arm] wrong instructions for dword move in some cases

2010-12-18 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45094

--- Comment #10 from Siarhei Siamashka  
2010-12-18 15:47:12 UTC ---
(In reply to comment #9)
> see the link in comment 1

Sorry, I meant the link in the original report from Akos:
http://repo.or.cz/w/official-gcc.git/commitdiff/f1225f6f

[Bug target/45094] [arm] wrong instructions for dword move in some cases

2010-12-18 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45094

--- Comment #9 from Siarhei Siamashka  
2010-12-18 15:43:39 UTC ---
Can this bug get a "[4.5 regression]" header please?

Even though the bug existed in gcc sources since 2007 (see the link in comment
1), the reported wrong-code problem itself was apparently latent until gcc 4.5,
and is not reproducible with older gcc versions.

[Bug target/45886] [ARM] support for __ARM_PCS_VFP predefined symbol in gcc 4.5.x would be very nice

2010-11-12 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45886

--- Comment #3 from Siarhei Siamashka  
2010-11-12 15:30:24 UTC ---
Richard, what would be the appropriate target milestone to get this bug fixed?

This needs just a backport of a trivial patch from trunk to the 4.5 branch, but
delaying this fix increases the chances of the problematic gcc making its way
into various linux distributions.

[Bug target/46128] There is no mechanism for detecting VFP revisions in ARM GCC.

2010-10-25 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46128

--- Comment #2 from Siarhei Siamashka  
2010-10-25 14:43:52 UTC ---
(In reply to comment #1)
> Note that there may be problems clobbering D registers.  See bug 43440.  I 
> don't think Richard Earnshaw's patch 
>  ever got 
> reviewed or pinged - it probably needs pinging.  (In general, unreviewed 
> patches are best pinged about weekly.)

Yes, that's a very well known bug. But there should be no problems with D
registers, only Q registers are affected.

They say codesourcery already has it fixed (so I assume the patch has been at
least reviewed): http://www.beagleboard.org/irclogs/index.php?date=2010-06-27

# [11:19:58]  "raster: check gcc bugzilla -
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43440"
# [11:19:59]  "the gcc "design" makes it very hard to do this conversion"
# [11:20:01]  "unliek a cpu - gcc can be fixed and updated easily :)"
# [11:20:14]  "mru: seems codesourcery managed it"

> > More generally, it would be beneficial to be able to optimize routines using
> > specific VFPv3 instructions (such as VMOV's immediate-operand form), or to 
> > make
> > use of VFPv4's fused-mulitply-accumulate instructions.
> 
> For fused multiply-add, the best approach is to describe them in the ARM 
> .md files using the new fma: RTL facility, so that calls to fma / fmaf / 
> __builtin_fma / __builtin_fmaf use the instructions automatically as on 
> other targets whose .md files have been updated like this.

But still there are cases when performance is actually important and
builtins/intrinsics are ruled out because of this. Inline assembly is
convenient because it can be added directly to C sources, without any need to
tweak makefiles or build scripts. This makes inline assembly a good choice for
small non-intrusive performance patches.

Another inconvenience is that in order to check whether for example ARMv6
instructions are supported, one has to use constructs like this (identifiers
fished out from gcc sources):

#if defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || \
defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || \
defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) || \
defined(__ARM_ARCH_6M__) || defined(__ARM_ARCH_7__) || \
defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7R__) || \
defined(__ARM_ARCH_7M__)
[...]
#endif

And this is not very maintainable because future gcc versions may introduce
more predefined symbols for newer arm architecture variants. It would be much
nicer if it was possible to just do something like:

#if defined(__arm__) && (__ARM_ARCH__ >= 6)
[...]
#endif

It's basically the same problem as VFP variant identification.
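
Until something like that exists, one possible workaround is to fold the long
list into a single numeric macro in one header, so that the list only has to be
maintained in one place. A sketch under that assumption (MY_ARM_ARCH,
my_arm_arch_level and have_armv6_instructions are names made up here; the
__ARM_ARCH_*__ identifiers are the ones listed above, and the list is not
guaranteed to be complete):

```c
#if defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || \
    defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__)
#define MY_ARM_ARCH 7
#elif defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || \
    defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || \
    defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) || \
    defined(__ARM_ARCH_6M__)
#define MY_ARM_ARCH 6
#else
/* Not ARM, older than ARMv6, or a variant this list does not know about. */
#define MY_ARM_ARCH 0
#endif

/* Hypothetical accessor for the folded architecture level. */
int my_arm_arch_level(void)
{
    return MY_ARM_ARCH;
}

/* The check that user code would actually want to write. */
int have_armv6_instructions(void)
{
#if MY_ARM_ARCH >= 6
    return 1;
#else
    return 0;
#endif
}
```

This still has the maintainability problem described above, it just confines
it to a single place.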

[Bug middle-end/46164] Local variables in specified registers don't work correctly with inline asm operands

2010-10-25 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46164

Siarhei Siamashka  changed:

   What|Removed |Added

  Attachment #22144|0   |1
is obsolete||

--- Comment #2 from Siarhei Siamashka  
2010-10-25 12:32:01 UTC ---
Created attachment 22145
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22145
updated testcase (x86_64)

Actually the previous testcase was not very good. It tried to simulate
earlyclobber operand by specifying it both as input and output, but because
"p1" was actually not initialized, gcc may be allowed to optimize it and screw
up everything (without any kind of warnings, but that's another story).

So the problem is actually related to using specified registers for
earlyclobber output operands in such a way that they try to use the same
registers as function arguments.

[Bug middle-end/46164] Local variables in specified registers don't work correctly with inline asm operands

2010-10-25 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46164

--- Comment #1 from Siarhei Siamashka  
2010-10-25 10:37:13 UTC ---
Created attachment 22144
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=22144
proposed testcase for x86_64

[Bug middle-end/32820] optimizer malfunction when mixed with asm statements

2010-10-25 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32820

--- Comment #8 from Siarhei Siamashka  
2010-10-25 10:17:47 UTC ---
On second thought, this bug was about global variables. But my problem is
related to the use of local variables. So I have submitted a separate PR46164
about it.

[Bug middle-end/46164] New: Local variables in specified registers don't work correctly with inline asm operands

2010-10-25 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46164

   Summary: Local variables in specified registers don't work
correctly with inline asm operands
   Product: gcc
   Version: 4.5.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com


When testing with gcc 4.5.1

 ARM 

$ cat test.c

int f(int a)
{
  register int result asm("r0");
  asm (
    "add r0, %[a], #123\n"
    : [result] "=&r" (result)
    : [a]      "r"   (a)
  );
  return result;
}

$ gcc -O2 -c test.c
$ objdump -d test.o

00000000 <f>:
   0:   e280007b        add     r0, r0, #123    ; 0x7b
   4:   e1a00003        mov     r0, r3
   8:   e12fff1e        bx      lr

Here the local variable 'result' gets assigned to register r3 instead of r0,
causing all kinds of problems.

 x86-64 

$ cat test.c

int f(int a)
{
  register int result asm("edi");
  asm (
    "lea 0x7b(%[a]), %%edi\n"
    : [result] "=&r" (result)
    : [a]      "r"   (a)
  );
  return result;
}

$ gcc -O2 -c test.c
$ objdump -d test.o

0000000000000000 <f>:
   0:   67 8d 7f 7b             addr32 lea 0x7b(%edi),%edi
   4:   c3                      retq

=

And some final bits.

http://gcc.gnu.org/onlinedocs/gcc/Local-Reg-Vars.html#Local-Reg-Vars
http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

The documentation is a bit confusing, but it gives at least one example of
assigning variables to specified registers:

"Sometimes you need to make an asm operand be a specific register, but there's
no matching constraint letter for that register by itself. To force the operand
into that register, use a local variable for the operand and specify the
register in the variable declaration. See Explicit Reg Vars. Then for the asm
operand, use any register constraint letter that matches the register:

 register int *p1 asm ("r0") = ...;
 register int *p2 asm ("r1") = ...;
 register int *result asm ("r0");
 asm ("sysint" : "=r" (result) : "0" (p1), "r" (p2));"

Let's try to use something like that with x86-64:

//
void abort();

int __attribute__((noinline)) f(int a)
{
  register int p1 asm ("edi");
  register int result asm ("edi");
  asm (
    "mov %2, %0\n"
    "add %2, %0\n"
    "add %2, %0\n"
    : "=r" (result) : "0" (p1), "r" (a));
  return result;
}

int main()
{
  if (f(1) != 3)
    abort();
}

//

This testcase fails.

So is it a bug in gcc? Or is the documentation wrong? Or am I missing something?
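
For the x86-64 case specifically, one way to sidestep local register variables
is a single-register constraint letter: gcc documents "D" as the di register on
x86. A sketch assuming gcc or clang; the plain C fallback for other targets and
the name f_fixed_reg are additions for illustration only:

```c
int f_fixed_reg(int a)
{
#if defined(__x86_64__) && defined(__GNUC__)
    int result;
    /* "=&D": earlyclobber output pinned to edi.
       "r":   input in any general register other than the output. */
    __asm__ ("lea 0x7b(%1), %0"
             : "=&D" (result)
             : "r"   (a));
    return result;
#else
    /* Fallback so the sketch still builds on other targets. */
    return a + 0x7b;
#endif
}
```

This does not help on ARM, where there is no constraint letter for an
individual core register, so the question about local register variables still
stands there.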

[Bug middle-end/32820] optimizer malfunction when mixed with asm statements

2010-10-11 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32820

Siarhei Siamashka  changed:

   What|Removed |Added

 CC||siarhei.siamashka at gmail
   ||dot com

--- Comment #7 from Siarhei Siamashka  
2010-10-11 19:18:46 UTC ---
Looks like this or a similar "Variables in Specified Registers" bug is also
reproducible on ARM with gcc 4.5.1:

$ cat test.c

int f(int a)
{
  register int result asm("r0");
  asm (
    "add r0, %[a], #123\n"
    : [result] "=&r" (result)
    : [a]      "r"   (a)
  );
  return result;
}

$ gcc -O2 -c test.c
$ objdump -d test.o

00000000 <f>:
   0:   e280007b        add     r0, r0, #123    ; 0x7b
   4:   e1a00003        mov     r0, r3
   8:   e12fff1e        bx      lr

Here the local variable 'result' gets assigned to register r3 instead of r0,
causing all kinds of problems.

[Bug target/45886] [ARM] support for __ARM_PCS_VFP predefined symbol in gcc 4.5.x would be very nice

2010-10-11 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45886

--- Comment #2 from Siarhei Siamashka  
2010-10-11 14:31:29 UTC ---
(In reply to comment #1)
> Confirmed though I think this isn't an "enhancement" but more a bug because
> code can't identify whether -mfloat-abi=hard is chosen by use of a
> pre-processor directive.

Thanks. Can we expect this problem to be fixed in gcc 4.5.2 (backported from
trunk)?

For now I'm going to use the following guard code to make sure that using
unpatched gcc will result in compilation problem instead of runtime failure:

#if defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)) && \
    defined(__ARM_EABI__) && !defined(__ARM_PCS_VFP) && !defined(__ARM_PCS)
#error "Can't identify floating point calling conventions.\nPlease ensure that your toolchain defines __ARM_PCS or __ARM_PCS_VFP."
#endif
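
Once the symbols are available, selecting the calling convention at compile
time could look roughly like this. It is only a sketch: float_abi_name is a
made-up helper, the returned strings are just labels, and the mapping of
__ARM_PCS to the base (non-VFP) procedure call standard is my understanding of
the trunk patch rather than something verified here:

```c
/* Hypothetical helper: reports which floating point calling convention
   the current translation unit was compiled for. */
const char *float_abi_name(void)
{
#if defined(__ARM_PCS_VFP)
    /* Floating point arguments are passed in VFP registers. */
    return "hard";
#elif defined(__ARM_PCS)
    /* Base procedure call standard: arguments in core registers. */
    return "soft or softfp";
#elif defined(__ARM_EABI__)
    /* EABI target, but the toolchain does not say which variant. */
    return "unknown";
#else
    return "not ARM EABI";
#endif
}
```

A JIT would branch on the same preprocessor symbols when deciding where to
place floating point arguments for calls into compiled code.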

[Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics

2010-10-08 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #5 from Siarhei Siamashka  
2010-10-08 14:13:08 UTC ---
(In reply to comment #3)
> On Mon, 4 Oct 2010, siarhei.siamashka at gmail dot com wrote:
> 
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
> > 
> > --- Comment #2 from Siarhei Siamashka  
> > 2010-10-04 22:59:56 UTC ---
> > (In reply to comment #1)
> > > So the compiler is correct not to be using vld1 for this code.  The memory
> > > format of int32x4_t is defined to be the format of a neon register that 
> > > has
> > > been filled from an array of int32 values and then stored to memory using 
> > > VSTM
> > > (or equivalent sequence).  The implication of all this is that int32x4_t 
> > > does
> > > not (necessarily) have the same memory layout as int32_t[4].
> > 
> > Could you elaborate on this? Specifically about the case when memory format 
> > for
> > VSTM and VST1 may differ.
> 
> Big-endian.

OK, I see. Looks like VLDM/VSTM instructions could be replaced with VLD1/VST1
(by artificially forcing element size to 64) in almost all cases except when
SCTLR.A == 1 due to unwanted alignment traps potentially happening in this
case.

But the question is whether it is really necessary to suffer from a performance
penalty on little endian systems?

> I previously explained the issues with big-endian NEON vectors in GCC at 
> length:
> 
> http://gcc.gnu.org/ml/gcc-patches/2010-06/msg00409.html

Thanks for the link, something seems to be seriously overengineered. Looks like
you brought a problem upon yourself and now are trying to valiantly solve it.

Does (efficient) support of NEON intrinsics on big endian systems even have any
practical value? Maybe it makes sense to get a reasonable performance at least
on little endian systems first. To me it looks like you are just running after
two hares...

[Bug middle-end/37734] Missing optimization: gcc fails to reuse flags from already calculated expression for condition check with zero

2010-10-04 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=37734

Siarhei Siamashka  changed:

   What|Removed |Added

 CC||rearnsha at gcc dot gnu.org

--- Comment #3 from Siarhei Siamashka  
2010-10-04 23:19:26 UTC ---
So if I understand it correctly, there are 2 independent performance issues
here:
1. one in the middle-end (redundant comparison with -1) when -O2 optimization
is selected.
2. another in ARM target, because it fails to produce efficient code with -Os
optimizations, while x86 can.

I just remembered that Mozilla has been using -Os optimizations up until now
because it was providing the best performance for them:
http://gcc.gnu.org/ml/gcc/2010-06/msg00715.html
I wonder if this particular missed-optimization issue is contributing to the
occasional performance advantage of -Os over -O2 (other than smaller code size
and reduced pressure on the instructions cache). Anyway, when looking at any
code generated by gcc, simple loops and branches always tend to contain
redundant instructions.

[Bug target/43725] Poor instructions selection, scheduling and registers allocation for ARM NEON intrinsics

2010-10-04 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

--- Comment #2 from Siarhei Siamashka  
2010-10-04 22:59:56 UTC ---
(In reply to comment #1)
> So the compiler is correct not to be using vld1 for this code.  The memory
> format of int32x4_t is defined to be the format of a neon register that has
> been filled from an array of int32 values and then stored to memory using VSTM
> (or equivalent sequence).  The implication of all this is that int32x4_t does
> not (necessarily) have the same memory layout as int32_t[4].

Could you elaborate on this? Specifically about the case when memory format for
VSTM and VST1 may differ.

I thought that the VST1 instruction could always be used as a replacement for
VSTM; it is just a little bit less convenient in some cases because it lacks
some of the more advanced addressing modes. Moreover, VSTM is a VFP instruction
and VST1 is a NEON one. So I guess mixing VSTM with true NEON instructions may
additionally be a bad idea (for performance reasons on Cortex-A9 or other
processors?).

There also used to be FLDMX/FSTMX instructions, but they are deprecated now. I
believe they existed specifically to reserve the use of normal VFP load/store
instructions for floating point data formats only, but later this turned out to
be unnecessary.

> arm_neon.h provides intrinsics for filling neon registers from arrays in
> memory, and in this case I think you should be using these directly.  That is,
> your macro should be modified to contain:
> 
> #define X(n) {int32x4_t v; v = vld1q_s32((const int32_t*)&p[n]); v =
> vaddq_s32(v, a); v = vorrq_s32(v, b); vst1q_s32 ((int32_t*)&p[n], v);}

I'm sorry, but this looks like a completely unjustified limitation to me. Why
should intrinsics be so much more difficult and less intuitive to use than just
inline assembly? Additionally, gcc allows normal arithmetic operations to be
used on vector data types, something like:

void x(int32x4_t a, int32x4_t b, int32x4_t *p)
{
#define X(n) p[n] += a; p[n] |= b;
X(0); X(1); X(2); X(3); X(4); X(5); X(6); X(7);
X(8); X(9); X(10); X(11); X(12);
}

> There are still problems after doing this, however.  In particular the 
> compiler
> is not correctly tracking alias information for the load/store intrinsics,
> which means it is unable to move stores past loads to reduce stalls in the
> pipeline.

OK, thanks for the explanation.

> The stack wastage appears to be fixed in trunk gcc; at least I don't see any
> stack allocation for your testcase.

Yes, looks like it got a little bit better. Anyway stack allocation shows up
again after adding just a few more of these X() macros:
... X(13); X(14); X(15); X(16); ...
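
The vector arithmetic mentioned above is not NEON specific: gcc's generic
vector extension accepts the same operators on any target. A portable sketch,
where v4si, add_or and lane are illustrative names rather than anything from
arm_neon.h:

```c
/* Four 32-bit ints in a 16-byte vector (gcc/clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Same shape as the X(n) macro above: p[n] += a; p[n] |= b; */
void add_or(v4si a, v4si b, v4si *p, int count)
{
    for (int n = 0; n < count; n++) {
        p[n] += a;
        p[n] |= b;
    }
}

/* Hypothetical helper so the result can be inspected lane by lane. */
int lane(v4si v, int i)
{
    return v[i];
}
```

Compiling the same function for ARM with NEON enabled is a quick way to see
which load/store instructions gcc picks when no intrinsics are involved.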

[Bug target/45886] New: [ARM] support for __ARM_PCS_VFP predefined symbol in gcc 4.5.x would be very nice

2010-10-04 Thread siarhei.siamashka at gmail dot com

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45886

   Summary: [ARM] support for __ARM_PCS_VFP predefined symbol in
gcc 4.5.x would be very nice
   Product: gcc
   Version: 4.5.1
   URL: http://gcc.gnu.org/ml/gcc-patches/2010-07/msg02186.html
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: siarhei.siamas...@gmail.com
CC: rearn...@arm.com
  Host: arm-unknown-linux-gnueabi
Target: arm-unknown-linux-gnueabi
 Build: arm-unknown-linux-gnueabi


This is quite important for JIT code when we want to support all ABI variants
properly. Because gcc 4.5.x now supports -mfloat-abi=hard, being able to
identify its use is also needed.
