[Bug c++/98317] New: Vector Extensions aligned(1) not generating unaligned loads/stores

2020-12-16 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

            Bug ID: 98317
           Summary: Vector Extensions aligned(1) not generating unaligned
                    loads/stores
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

The position of aligned(1) in the attribute list determines whether GCC
generates movaps or movups.

typedef float   float128_tv1 __attribute__ ((aligned(1), vector_size(16)));
typedef float   float128_tv2 __attribute__ ((vector_size(16), aligned(1)));

float128_tv1 produces MOVAPS
float128_tv2 produces MOVUPS

It seems like the ordering of the arguments changes the assembly.

https://gcc.godbolt.org/z/5qs7e7

GCC 10.2 and 9.2 both seem to have this issue.
Unless this is already documented, it can cause serious problems when
memory is unaligned and an aligned load/store is used instead.
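
For comparison, here is a minimal sketch of an unaligned load/store that does not depend on attribute ordering at all. It swaps in a different technique from the typedefs above: the bytes are copied with std::memcpy, which compilers are generally able to lower to a single unaligned move. The names v4sf, load_unaligned and store_unaligned are hypothetical and not taken from the report.

```cpp
#include <cstring>  // std::memcpy

// 128-bit vector of four floats with its default alignment.
typedef float v4sf __attribute__((vector_size(16)));

// Hypothetical helper: copy 16 bytes from an arbitrarily aligned address
// into a vector value. No alignment is assumed for p.
inline v4sf load_unaligned(const void* p) {
    v4sf v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Hypothetical helper: the matching store to an arbitrarily aligned address.
inline void store_unaligned(void* p, v4sf v) {
    std::memcpy(p, &v, sizeof v);
}
```

Because the pointer is never dereferenced as a vector type, this version makes no alignment promise to the compiler, whatever order the attributes are written in.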

[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores

2020-12-16 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

Daniel Han-Chen  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |danielhanchen at gmail dot com

--- Comment #1 from Daniel Han-Chen  ---
https://gcc.godbolt.org/z/sGWevT

I also tried separating the __attribute__s


typedef float   float128_tv1 __attribute__ ((aligned(1), vector_size(16)));
typedef float   float128_tv2 __attribute__ ((vector_size(16), aligned(1)));
typedef float   float128_tv3 __attribute__((aligned(1))) __attribute__ ((vector_size(16)));
typedef float   float128_tv4 __attribute__ ((vector_size(16))) __attribute__((aligned(1)));


Placing aligned(1) as the first attribute still produces the aligned
instruction.

[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores

2020-12-16 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

--- Comment #3 from Daniel Han-Chen  ---
Oh ok then.

It's because I was trying to do unaligned loads by following:
https://stackoverflow.com/questions/9318115/loading-data-for-gccs-vector-extensions

It mentions using
typedef char __attribute__ ((vector_size (16), aligned (1))) unaligned_byte16;
which works, though the other ordering does not.

But I like your solution by declaring the type as aligned(1) separately.
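
A sketch of what "declaring the type as aligned(1) separately" might look like, assuming GCC's vector extensions. The exact code suggested in the earlier comment is not quoted in this thread, so the type names v4sf and v4sf_u and the helpers load_u and store_u are hypothetical, and the may_alias attribute is an addition of this sketch rather than something the thread mentions.

```cpp
// Step 1: the vector type itself, with its natural alignment.
typedef float v4sf __attribute__((vector_size(16)));

// Step 2: a separate typedef that only lowers the alignment requirement.
// may_alias is added here (an assumption of this sketch) so the type can be
// used to access memory that was declared as plain float.
typedef v4sf v4sf_u __attribute__((aligned(1), may_alias));

// Hypothetical helper: load four floats from an arbitrarily aligned address.
inline v4sf load_u(const void* p) {
    return *static_cast<const v4sf_u*>(p);
}

// Hypothetical helper: store four floats to an arbitrarily aligned address.
inline void store_u(void* p, v4sf v) {
    *static_cast<v4sf_u*>(p) = v;
}
```

Splitting the declaration into two typedefs means the alignment attribute is attached to an already complete vector type, so its effect no longer depends on where it sits relative to vector_size.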

[Bug c++/98348] New: GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-17 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

            Bug ID: 98348
           Summary: GCC 10.2 AVX512 Mask regression from GCC 9
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

In GCC 9, vector comparisons on 128 and 256bit vectors on an AVX512 machine used
vpcmpeqd without any masks.

In GCC 10, for 128bit and 256bit vectors, AVX512 mask instructions are used.
https://gcc.godbolt.org/z/1sPzM5

GCC 10 should follow GCC 9 for vector comparisons when a mask is not needed.

The reason is that https://uops.info/table.html shows that using mask registers
makes 128/256/512 operations have a throughput of 1 and a latency of 3.

However, using a vector comparison directly has a throughput of 2 and a latency
of 1.
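
For reference, a minimal sketch of the kind of comparison being discussed, written with GCC's vector extensions. The code behind the godbolt link is not reproduced in this message, so the names v4si and equal_lanes are hypothetical. The result is an ordinary vector rather than a mask value, which is why no mask register is strictly required.

```cpp
// 128-bit vector of four 32-bit integers.
typedef int v4si __attribute__((vector_size(16)));

// Lane-wise equality: each result lane is -1 (all bits set) where the inputs
// match and 0 where they differ.
inline v4si equal_lanes(v4si a, v4si b) {
    return a == b;
}
```

Which instruction sequence is chosen for this function depends on the compiler version and target flags, which is the subject of this report.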

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-17 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #1 from Daniel Han-Chen  ---
I also just noticed that in GCC 10, an extra movdqa is done, which is also not
necessary.

[Bug c++/98317] Vector Extensions aligned(1) not generating unaligned loads/stores

2020-12-18 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98317

Daniel Han-Chen  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WORKSFORME
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #4 from Daniel Han-Chen  ---
Jakub mentioned his solution, so all good now.

[Bug c++/98387] New: GCC >= 6 cannot inline _mm_cmp_ps on SSE targets

2020-12-18 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

            Bug ID: 98387
           Summary: GCC >= 6 cannot inline _mm_cmp_ps on SSE targets
           Product: gcc
           Version: 10.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: danielhanchen at gmail dot com
  Target Milestone: ---

https://gcc.godbolt.org/z/493ead

Since version 6.1, GCC cannot inline _mm_cmp_ps on targets supporting only SSE
(Nehalem, Tremont, etc.). From Sandy Bridge onward, everything inlines fine.

_mm_cmp_ps is called by passing it as a function argument (i.e. an auto function).

All SSE only machines use a jmp to _mm_cmp_ps, but it should be inlined.

-O3 -ffast-math is also used, and the function is declared inline.

[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets

2020-12-18 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #1 from Daniel Han-Chen  ---
Oh I just noticed _mm_cmp_ps isn't actually supported for SSE targets even in
Intel's Intrinsics Guide: [_mm_cmp_ps first was supported in AVX]

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=5236,827,33,5224,447,456,4085,3864,5224,4179,4118,4115,4115,4121,3864,3870,5579,2030,3319,2809,4127,5156,4179,4201,3536,3539,3533,2184,3505,3533,3542,3505,3533,1606,4174,2809,5576,5578,2063,3895,3893,2484,3864,4076,3864,687,689,689,3544,771,1648,1647,5878,5903,743&techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2&text=cmpps



error: inlining failed in call to always_inline '__m128 _mm_cmp_ps(__m128,
__m128, int)': target specific option mismatch
  390 | _mm_cmp_ps (__m128 __X, __m128 __Y, const int __P)


The _mm_cmp[*]_ps variants, i.e. _mm_cmpeq_ps and its relatives, inline successfully.
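
A minimal sketch of the SSE-only alternative, assuming an x86 target: _mm_cmpeq_ps performs the same ordered equality test that _mm_cmp_ps would with the _CMP_EQ_OQ predicate, but needs only SSE. The helper name equal_mask is hypothetical, and the scalar fallback exists only so the sketch also builds where SSE is unavailable.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_cmpeq_ps, _mm_movemask_ps
#endif

// Hypothetical helper: returns a 4-bit mask with bit i set when a[i] == b[i].
inline int equal_mask(const float* a, const float* b) {
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);  // unaligned loads, no alignment assumed
    __m128 vb = _mm_loadu_ps(b);
    // Compare lane-wise, then collect the sign bit of each lane into an int.
    return _mm_movemask_ps(_mm_cmpeq_ps(va, vb));
#else
    // Scalar fallback for targets without SSE.
    int mask = 0;
    for (int i = 0; i < 4; ++i)
        if (a[i] == b[i]) mask |= 1 << i;
    return mask;
#endif
}
```

For example, comparing {1, 2, 3, 4} with {1, 0, 3, 0} sets bits 0 and 2, giving a mask of 5.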

[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets

2020-12-19 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #3 from Daniel Han-Chen  ---
(In reply to H.J. Lu from comment #2)
> _mm_cmp_ps is an AVX intrinsic.

Yep noticed _mm_cmp_ps is only in AVX. The weird part is it actually causes no
errors when used on SSE only targets [ie Nehalem], and GCC continues compiling.

Is this supposed to be normal behavior?

[Bug target/98387] GCC >= 6 cannot inline _mm_cmp_ps on SSE targets

2020-12-19 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98387

--- Comment #5 from Daniel Han-Chen  ---
(In reply to H.J. Lu from comment #4)
> (In reply to Daniel Han-Chen from comment #3)
> > (In reply to H.J. Lu from comment #2)
> > > _mm_cmp_ps is an AVX intrinsic.
> > 
> > Yep noticed _mm_cmp_ps is only in AVX. The weird part is it actually causes
> > no errors when used on SSE only targets [ie Nehalem], and GCC continues
> > compiling.
> > 
> > Is this supposed to be normal behavior?
> 
> GCC treats it like an undefined function.

Thanks! Sorry I probably might have asked some really dumb questions. But also
thanks for taking your time in answering them! :) Appreciate it!