[Bug target/116004] PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code

2024-07-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

--- Comment #2 from Steven Munroe  ---
Actually:

  abnez  = (vui8_t) vec_cmpnez (vra, vrb);
  result = vec_cntlz_lsbb (abnez);

[Bug target/116004] PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code

2024-07-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

--- Comment #1 from Steven Munroe  ---
Compile test code examples:

int
test_intrn_first_mismatch_or_eos_index_PWR9 (vui8_t vra, vui8_t vrb)
{
  return vec_first_mismatch_or_eos_index (vra, vrb);
}

int
test_first_mismatch_byte_or_eos_index_PWR9 (vui8_t vra, vui8_t vrb)
{
  vui8_t abnez;
  int result;

  abnez  = vec_cmpnez (vra, vrb);
  result = vec_cntlz_lsbb (abnez);
  return result;
}

[Bug c/116004] New: PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code

2024-07-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

            Bug ID: 116004
           Summary: PPC64 vector Intrinsic vec_first_mismatch_or_eos_index
                    generates poor code
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

GCC13 generates the following code for the intrinsic
vec_first_mismatch_or_eos_index with -mcpu=power9 -O3:

00c0 :
  c0:   d1 02 00 f0 xxspltib vs32,0
  c4:   07 00 22 10 vcmpneb v1,v2,v0
  c8:   07 00 03 10 vcmpneb v0,v3,v0
  cc:   07 19 42 10 vcmpnezb v2,v2,v3
  d0:   17 04 21 f0 xxland  vs33,vs33,vs32
  d4:   57 0d 42 f0 xxlorc  vs34,vs34,vs33
  d8:   02 16 61 10 vctzlsbb r3,v2
  dc:   b4 07 63 7c extsw   r3,r3
  e0:   20 00 80 4e blr

The use of vcmpneb to compare for EOS is redundant with the vcmpnezb instruction
(which includes the EOS compares). The additional xxland/xxlorc logic is only
necessary because of the extra vcmpneb compares.

All you need is a single vcmpnezb as it already handles the a/b mismatch and
EOS tests for both operands. For example:

0070 :
  70:   07 19 42 10 vcmpnezb v2,v2,v3
  74:   02 16 61 10 vctzlsbb r3,v2
  78:   b4 07 63 7c extsw   r3,r3
  7c:   20 00 80 4e blr
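
The redundancy claim can be checked with a scalar model. This is a minimal
sketch in portable C, not PowerPC code. All function names are hypothetical,
and the per-byte behaviour of vcmpnezb, xxland and xxlorc is modelled from the
description above (an assumption, not taken from the GCC sources). Returning 16
when nothing is found is also an assumption, and byte order is simply array
order here.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte result of a single vcmpnezb-style compare: true when the
   bytes differ or either byte is zero (EOS).  */
static int
cmpnez_byte (uint8_t a, uint8_t b)
{
  return (a != b) || (a == 0) || (b == 0);
}

/* Per-byte result of the longer sequence shown above:
   two vcmpneb against zero, xxland, then xxlorc with the vcmpnezb result.  */
static int
long_sequence_byte (uint8_t a, uint8_t b)
{
  int a_nonzero = (a != 0);                     /* vcmpneb a,0 */
  int b_nonzero = (b != 0);                     /* vcmpneb b,0 */
  int both_nonzero = a_nonzero && b_nonzero;    /* xxland */
  return cmpnez_byte (a, b) || !both_nonzero;   /* xxlorc: x | ~y */
}

/* Count byte pairs where the two models disagree.  */
static int
count_disagreements (void)
{
  int bad = 0;
  for (int a = 0; a < 256; a++)
    for (int b = 0; b < 256; b++)
      if (cmpnez_byte ((uint8_t) a, (uint8_t) b)
          != long_sequence_byte ((uint8_t) a, (uint8_t) b))
        bad++;
  return bad;
}

/* Abort if the short and long sequences ever differ.  */
static void
self_check (void)
{
  assert (count_disagreements () == 0);
}

/* Scalar reference: index of the first mismatch or EOS, 16 if none.  */
static int
first_mismatch_or_eos_index (const uint8_t a[16], const uint8_t b[16])
{
  for (int i = 0; i < 16; i++)
    if (cmpnez_byte (a[i], b[i]))
      return i;
  return 16;
}
```

Calling self_check () walks all 65536 byte pairs. In this model the extra
compares never change the result, which is the point made above.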

[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-10-25 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #6 from Steven Munroe  ---
(In reply to Carl Love from comment #5)
> There are a couple of issues with the test case in the attachment.  For
> example one of the tests is:
> 
> 
> static inline vui64_t
> vec_vsldbi_64 (vui64_t vra, vui64_t vrb, const unsigned int shb)
> {
>  return vec_sldb (vra, vrb, shb);
> }
> 
> When I tried to compile it, it seemed to compile.  However if I take off the
> static inline, then I get an error about incompatible arguments.  The
> built-in requires an explicit integer be passed in the third argument.  The
> following worked for me:
> 
> 
> static inline vui64_t
> vec_vsldbi_64 (vui64_t vra, vui64_t vrb, const unsigned int shb)
> {
>  return vec_sldb (vra, vrb, 1);
> }
> 
> The compiler/assembler needs an explicit value for the third argument as it
> has to generate the instruction with the immediate shift value as part of
> the instruction.  Hence a variable for the third argument will not work.
> 
> Agreed that the __int128 arguments can and should be supported.  Patch to
> add that support is in progress but will require getting the LLVM/OpenXL
> team to agree to adding the __int128 variants as well.

Yes, I know. In the PVECLIB case these functions will always be static inline,
so this is not an issue for me.

[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-10-01 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #4 from Steven Munroe  ---
Actually, the shift/rotate intrinsics vec_rl, vec_rlmi, vec_rlnm, vec_sl, vec_sr,
and vec_sra support vector __int128, as required for the PowerISA 3.1 POWER
vector shift/rotate quadword instructions.

But vec_sld, vec_sldb, vec_sldw, vec_sll, vec_slo, vec_srdb, vec_srl, and vec_sro
do not.

There is no obvious reason for this inconsistency, as the target instructions are
effectively 128/256-bit operations returning a 128-bit result. The type of the
inputs is incidental to the operation.

Any restrictions imposed by the original Altivec.h PIM were broken long ago by
VSX and PowerISA 2.07.

Net: the Power Vector Intrinsic Programming Reference and the compilers should
support the vector __int128 type for any instruction where it makes sense as an
input or result.
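
To illustrate why the element type is incidental, here is a minimal scalar
sketch of the double-width shifts behind vec_sldb / vec_srdb. It assumes the
vsldbi/vsrdbi behaviour as I read it: concatenate the two inputs into 256 bits,
shift by 0..7 bits, and keep 128 bits. The names model_sldb and model_srdb are
hypothetical, and unsigned __int128 is a GCC/Clang extension used only to model
a 128-bit register.

```c
#include <assert.h>

typedef unsigned __int128 u128;   /* stand-in for one 128-bit register */

/* Treat a:b as one 256-bit value, shift left by sh bits (0..7) and
   keep the high 128 bits.  */
static u128
model_sldb (u128 a, u128 b, unsigned sh)
{
  assert (sh < 8);
  if (sh == 0)
    return a;                     /* avoid an undefined shift by 128 */
  return (a << sh) | (b >> (128 - sh));
}

/* Treat a:b as one 256-bit value, shift right by sh bits (0..7) and
   keep the low 128 bits.  */
static u128
model_srdb (u128 a, u128 b, unsigned sh)
{
  assert (sh < 8);
  if (sh == 0)
    return b;                     /* avoid an undefined shift by 128 */
  return (b >> sh) | (a << (128 - sh));
}
```

Nothing in the model depends on whether the 128 bits are viewed as bytes,
doublewords, or a single quadword.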

[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-09-30 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #3 from Steven Munroe  ---
(In reply to Peter Bergner from comment #1)
> I see that we have created built-in overloads for signed and unsigned vector
> char through vector long long.  That said, the rs6000-builtins.def only
> seems to support the signed vector types though, which is why you're seeing
> an error.  So confirmed.
> 
> That said, I believe your 3rd argument needs to be a real constant integer,
> since the vsldbi instruction requires that.  It doesn't allow for a const
> int variable.  I notice some older (not trunk) gcc versions are ICEing with
> that, so another bug to look at.
The original code is static inline, so the const int parm should transfer
intact to the builtin const.

It seems I over-simplified the reduced test case.
> 
> I do not see any documentation that says we support the vector __int128
> type.  Where exactly did you see that?  However, from the instruction
> description, it seems like the hw instruction could support that.

I stand corrected. The documentation only describes vector unsigned long long.
But the instruction is like vsldoi and does not really care what the type is.

[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-09-30 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

Steven Munroe  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #56018|0                           |1
        is obsolete|                            |

--- Comment #2 from Steven Munroe  ---
Created attachment 56019
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56019&action=edit
Updated test case with static inline functions

[Bug target/111645] New: Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-09-29 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

            Bug ID: 111645
           Summary: Intrinsics vec_sldb /vec_srdb fail with __vector
                    unsigned __int128
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 56018
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56018&action=edit
example of the problem. Compile with  gcc -m64 -O3 -mcpu=power10 -c sldbi.c

GCC 12 and 13 fail to compile the vector intrinsics vec_sldb / vec_srdb as required
by the Power Vector Intrinsic Programming Reference.

error: invalid parameter combination for AltiVec intrinsic ‘__builtin_vec_sldb’

Both the Programming Reference and the GCC documentation state that vector
(unsigned/signed) __int128 are valid operands. But they fail with:

error: invalid parameter combination for AltiVec intrinsic ‘__builtin_vec_sldb’
or
error: invalid parameter combination for AltiVec intrinsic ‘__builtin_vec_srdb’

[Bug target/110795] Bad code gen for vector compare booleans

2023-07-28 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

--- Comment #5 from Steven Munroe  ---
Thanks, sorry I missed the obvious.

[Bug target/110795] Bad code gen for vector compare booleans

2023-07-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

--- Comment #2 from Steven Munroe  ---
Also fails with gcc11/12, and with Advance Toolchain 10.0 GCC 6.4.1.

It might fail for all versions between GCC 6 and 13.

[Bug target/110795] Bad code gen for vector compare booleans

2023-07-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

--- Comment #1 from Steven Munroe  ---
Created attachment 55627
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55627&action=edit
Main and unit test. When compiled and linked with vec_divide.c, it verifies
whether the divide code is correct.

[Bug c/110795] New: Bad code gen for vector compare booleans

2023-07-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

            Bug ID: 110795
           Summary: Bad code gen for vector compare booleans
           Product: gcc
           Version: 13.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 55626
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55626&action=edit
Test examples for vector code combining vector compares with logical OR.

Combining a vec_cmplt and vec_cmpge with vector logical OR miscompiles.
For example:
  // Capture the carry t as a bool using signed compare
  t = vec_cmplt ((vi32_t) x, zeros);
  ge = vec_cmpge (x, z);
  // Combine t with (x >= z) for 33-bit compare
  t  = vec_or (ge, t);

This seems to work for the minimized example above but fails when used in the
more complex loop of the example vec_divduw_V1. At -O3 the compiler elides any
code generated for vec_cmplt.

With this bug the function vec_divduw_V1 (Vector_Divide double (words by)
unsigned word) fails the unit test.
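
For readers without the attachment, the 33-bit compare can be sketched for one
lane in scalar C. This is an assumed shape for the loop (a plain
shift-and-subtract divide), not the vec_divduw_V1 source. The name
model_divduw and its parameters are hypothetical. It assumes x < z so the
quotient fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Divide the 64-bit value (x:y) by z for one lane.
   Returns the 32-bit quotient and stores the remainder in *rem.  */
static uint32_t
model_divduw (uint32_t x, uint32_t y, uint32_t z, uint32_t *rem)
{
  assert (x < z);
  for (int i = 0; i < 32; i++)
    {
      /* Capture the bit about to be shifted out of x (the carry).  */
      uint32_t t = x >> 31;
      /* Shift (x:y) left by one bit.  */
      x = (x << 1) | (y >> 31);
      y <<= 1;
      /* 33-bit compare: (t:x) >= z when the carry is set or x >= z.  */
      if (t | (x >= z))
        {
          x -= z;   /* wraps modulo 2^32, which is correct when t == 1 */
          y |= 1;   /* record a quotient bit */
        }
    }
  if (rem)
    *rem = x;
  return y;
}
```

If the carry term t is dropped from the condition, any step where the shifted
value overflows 32 bits produces a wrong quotient bit. That is the kind of
failure a unit test sees when the vec_cmplt result is elided.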

[Bug target/104124] Poor optimization for vector splat DW with small consts

2023-06-28 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

--- Comment #5 from Steven Munroe  ---
Thanks

[Bug c/106755] New: Incorrect code gen for altivec intrinsics with constant inputs

2022-08-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106755

            Bug ID: 106755
           Summary: Incorrect code gen for altivec intrinsics with
                    constant inputs
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: blocker
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 53514
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53514&action=edit
Reduced test case for vec_muludq() with make

The PVECLIB project V1.0.4-4 fails its unit tests (make check) when
compiled with GCC 12 on Fedora 36/37.

Two unit tests fail, for:
Vector Multiply Unsigned Double Quadword. vec_muludq()
and
Vector Multiply-Add Unsigned Quadword. vec_madduq()

The tests that fail pass local vector constants to inlined instances of
these functions.

Current status: the PVECLIB package is blocked for Fedora 37 because it will
not compile with the default GCC-12 compiler.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2022-02-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #23 from Steven Munroe  ---
Ok, but I strongly recommend a compiler test that verifies that the compiler is
generating the expected code (for this and other cases).

We have a history of common code changes (accidental or deliberate) causing
regressions for POWER targets.

Best to find these early, before they impact customer performance.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2022-02-25 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #21 from Steven Munroe  ---
Yes, I was told by Peter Bergner that the fix from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085#c15 had been backported
to AT15.0-1.

But when I ran this test with AT15.0-1 I saw:

 :
   0:   20 00 20 39 li  r9,32
   4:   d0 ff 41 39 addi    r10,r1,-48
   8:   57 12 42 f0 xxswapd vs34,vs34
   c:   99 4f 4a 7c stxvd2x vs34,r10,r9
  10:   ce 48 4a 7c lvx v2,r10,r9
  14:   20 00 80 4e blr

0030 :
  30:   20 00 20 39 li  r9,32
  34:   d0 ff 41 39 addi    r10,r1,-48
  38:   57 12 42 f0 xxswapd vs34,vs34
  3c:   99 4f 4a 7c stxvd2x vs34,r10,r9
  40:   ce 48 4a 7c lvx v2,r10,r9
  44:   20 00 80 4e blr

0060 :
  60:   20 00 20 39 li  r9,32
  64:   d0 ff 41 39 addi    r10,r1,-48
  68:   57 12 42 f0 xxswapd vs34,vs34
  6c:   99 4f 4a 7c stxvd2x vs34,r10,r9
  70:   99 4e 4a 7c lxvd2x  vs34,r10,r9
  74:   57 12 42 f0 xxswapd vs34,vs34
  78:   20 00 80 4e blr

0090 :
  90:   57 12 42 f0 xxswapd vs34,vs34
  94:   20 00 40 39 li  r10,32
  98:   d0 ff 01 39 addi    r8,r1,-48
  9c:   f0 ff 21 39 addi    r9,r1,-16
  a0:   99 57 48 7c stxvd2x vs34,r8,r10
  a4:   00 00 69 e8 ld  r3,0(r9)
  a8:   08 00 89 e8 ld  r4,8(r9)
  ac:   20 00 80 4e blr

So either the patch for AT15.0-1 was not applied correctly, or it is
non-functional because of some difference between GCC11/GCC12, or it regressed
because of some other change/patch.

In my experience this part of GCC is fragile (based on the long/sad history of
IBM long double). So this needs to be monitored with each new update.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2022-02-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

Steven Munroe  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #17 from Steven Munroe  ---
I don't think this is fixed.

The fix was supposed to be back-ported to GCC11 for Advance Toolchain 15.

The updated test case shows that this is clearly not working as advertised.

Either the GCC12 fix has regressed due to subsequent updates or the AT15 GCC11
back-port fails due to some missing/different code between GCC11/12.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2022-02-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #16 from Steven Munroe  ---
Created attachment 52510
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52510&action=edit
Reduced tests for xfers from __float128 to vector or __int128

Cover more types including __int128 and vector __int128

[Bug target/104124] Poor optimization for vector splat DW with small consts

2022-01-27 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

Steven Munroe  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #52236|0                           |1
        is obsolete|                            |

--- Comment #2 from Steven Munroe  ---
Created attachment 52307
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52307&action=edit
Enhanced test case that also shows CSE failure

The original test case plus an example where CSE should common a splat immediate
or even a .rodata load, but fails to do even that.

[Bug target/104124] Poor optimization for vector splat DW with small consts

2022-01-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

Steven Munroe  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |munroesj at gcc dot gnu.org

--- Comment #1 from Steven Munroe  ---
Created attachment 52236
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52236&action=edit
Attempts to load small int consts to vector DW via splat

Multiple attempts to convince GCC to load small integer (-16 to 15) constants via
splat. Current GCC versions (9/10/11) convert vec_splats() and
explicit vec_splat_s32/vec_unpackl sequences into loads from .rodata. This
generates more instructions, takes more cycles, and causes register pressure
that results in unnecessary spill/reload and load-hit-store rejects.

[Bug target/104124] New: Poor optimization for vector splat DW with small consts

2022-01-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

            Bug ID: 104124
           Summary: Poor optimization for vector splat DW with small
                    consts
           Product: gcc
           Version: 11.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

It looks to me like the compiler is seeing register pressure caused by loading
all the vector long long constants I need in my code. This is leaf code of a
size where it can run out of volatiles (no stack frame). But this puts more
pressure on volatile VRs, VSRs, and GPRs, especially GPRs, because it is loading
from .rodata when it could (and should) use a vector immediate.

For example:

vui64_t
__test_splatudi_0_V0 (void)
{
  return vec_splats ((unsigned long long) 0);
}

vi64_t
__test_splatudi_1_V0 (void)
{
  return vec_splats ((signed long long) -1);
}

Generate:
01a0 <__test_splatudi_0_V0>:
 1a0:   8c 03 40 10 vspltisw v2,0
 1a4:   20 00 80 4e blr

01c0 <__test_splatudi_1_V0>:
 1c0:   8c 03 5f 10 vspltisw v2,-1
 1c4:   20 00 80 4e blr
...

But other cases that could use immediates like:

vui64_t
__test_splatudi_12_V0 (void)
{
  return vec_splats ((unsigned long long) 12);
}

GCC 9/10/11 Generates for power8:

0170 <__test_splatudi_12_V0>:
 170:   00 00 4c 3c addis   r2,r12,0
170: R_PPC64_REL16_HA   .TOC.
 174:   00 00 42 38 addi    r2,r2,0
174: R_PPC64_REL16_LO   .TOC.+0x4
 178:   00 00 22 3d addis   r9,r2,0
178: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 17c:   00 00 29 39 addi    r9,r9,0
17c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 180:   ce 48 40 7c lvx v2,0,r9
 184:   20 00 80 4e blr

and for Power9:
 <__test_splatisd_12_PWR9>:
   0:   d1 62 40 f0 xxspltib vs34,12
   4:   02 16 58 10 vextsb2d v2,v2
   8:   20 00 80 4e blr

So why can't the power8 target generate:

00f0 <__test_splatudi_12_V1>:
  f0:   8c 03 4c 10 vspltisw v2,12
  f4:   4e 16 40 10 vupkhsw v2,v2
  f8:   20 00 80 4e blr

This is 4 cycles vs 9 (best case; and it is always 9 cycles because GCC does
not exploit immediate fusion).
In fact GCC 8 (AT12) does this.

So I tried defining my own vec_splatudi:

vi64_t
__test_splatudi_12_V1 (void)
{
  vi32_t vwi = vec_splat_s32 (12);
  return vec_unpackl (vwi);
}

Which generates the <__test_splatudi_12_V1> sequence above for GCC 8. But for
GCC 9/10/11 it generates:

0110 <__test_splatudi_12_V1>:
 110:   00 00 4c 3c addis   r2,r12,0
110: R_PPC64_REL16_HA   .TOC.
 114:   00 00 42 38 addi    r2,r2,0
114: R_PPC64_REL16_LO   .TOC.+0x4
 118:   00 00 22 3d addis   r9,r2,0
118: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 11c:   00 00 29 39 addi    r9,r9,0
11c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 120:   ce 48 40 7c lvx v2,0,r9
 124:   20 00 80 4e blr

Again! GCC has gone out of its way to be this clever! Badly! While it can be
appropriately clever for power9!

I have tried many permutations of this and the only way I have found to prevent
this (GCC 9/10/11) cleverness is to use inline __asm (which has other bad side
effects).
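
As a reminder of why the splat-then-unpack sequence is sufficient for these
constants, here is a small scalar model. It is a sketch only. The function
names are hypothetical, and it assumes vspltisw takes a 5-bit signed immediate
and that the unpack step sign-extends each word to a doubleword.

```c
#include <assert.h>
#include <stdint.h>

/* True when v fits a 5-bit signed splat immediate (-16..15).  */
static int
fits_splat_imm5 (int64_t v)
{
  return v >= -16 && v <= 15;
}

/* Model one lane: splat the immediate into a 32-bit word, then
   sign-extend that word to 64 bits (the unpack step).  */
static int64_t
model_splat_then_unpack (int imm5)
{
  assert (fits_splat_imm5 (imm5));
  int32_t word = (int32_t) imm5;   /* vspltisw: sign-extended immediate */
  return (int64_t) word;           /* vupkhsw: word to doubleword */
}
```

Every value in -16..15 comes out unchanged as a 64-bit lane, so no load from
.rodata is needed to materialize it.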

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-06-10 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #13 from Steven Munroe  ---
"We want to use plain TImode instead of V1TImode on newer cpus."

Actually I disagree. We have vector __int128 in the ABI and with POWER10 a
complete set of arithmetic operations for 128-bit in VRs.

Also this issue is not restricted to TImode. It also affects _Float128
(KFmode), __ibm128 (TFmode) and Libmvec for vector float/double. The proper and
optimum handling of these "union transfers" has been broken in GCC for years.

And I have grave reservations about the vague plans of a small/fringe minority to
subset the PowerISA for their convenience.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-04-29 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #5 from Steven Munroe  ---
Any progress on this?

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-04-16 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #4 from Steven Munroe  ---
I am seeing a similar problem with union transfers from __float128 to
__int128.


 static inline unsigned __int128
 vec_xfer_bin128_2_int128t (__binary128 f128)
 {
   __VF_128 vunion;

   vunion.vf1 = f128;

   return (vunion.ui1);
 }

and 

unsigned __int128
test_xfer_bin128_2_int128 (__binary128 f128)
{
  return vec_xfer_bin128_2_int128t (f128);
}

generates:

0030 :
  30:   57 12 42 f0 xxswapd vs34,vs34
  34:   20 00 20 39 li  r9,32
  38:   d0 ff 41 39 addi    r10,r1,-48
  3c:   99 4f 4a 7c stxvd2x vs34,r10,r9
  40:   f0 ff 61 e8 ld  r3,-16(r1)
  44:   f8 ff 81 e8 ld  r4,-8(r1)
  48:   20 00 80 4e blr

For POWER8 this should use mfvsrd/xxpermdi/mfvsrd.

This looks like the root cause of poor performance for __float128 soft-float on
POWER8. I ran a simple benchmark using __float128 in C code, calling libgcc for
-mcpu=power8 and then hardware instructions for -mcpu=power9:

P8 target P8AT14, Uses libgcc __addkf3_sw and __mulkf3_sw:
test_time_f128 f128 CC  tb delta = 52589, sec = 0.000102713

P9 Target P8AT14, Uses libgcc __addkf3_hw and __mulkf3_hw:
test_time_f128 f128 CC  tb delta = 18762, sec = 3.66445e-05

P9 Target P9AT14, inline hardware binary128 float:
test_time_f128 f128 CC  tb delta = 3809, sec = 7.43945e-06

I used Valgrind Itrace and Sim-ppc and perfstat analysis. Every call to libgcc
__add/sub/mul/divkf3 takes a load-hit-store flush. This explains why
__float128 is 13.8x slower on P8 than on P9.

[Bug rtl-optimization/100085] Bad code for union transfer from __float128 to vector types

2021-04-14 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

Steven Munroe  changed:

   What|Removed |Added

 CC||munroesj at gcc dot gnu.org

--- Comment #1 from Steven Munroe  ---
Created attachment 50596
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50596&action=edit
Compile test case for xfer operation.

Compile for PowerPCle for both -mcpu=power8 -mfloat128 and -mcpu=power9
-mfloat128 and see the different asm generated.

[Bug rtl-optimization/100085] New: Bad code for union transfer from __float128 to vector types

2021-04-14 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

            Bug ID: 100085
           Summary: Bad code for union transfer from __float128 to vector
                    types
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50595&action=edit
Reduced example of union and __float128 to vector transfer.

GCC 10/9/8/7 will generate poor (-mcpu=power8) code when using a union to
transfer a __float128 scalar to any vector type. __float128 is a scalar type
and is not typecast compatible with any vector type, despite both being held in
vector registers.

But for runtime codes implementing __float128 operations for -mcpu=power8 it is
useful (and faster) to perform some (data_class, conversions, etc) operations
directly in vector registers. The only solution for this is to use a union to
transfer values between __float128/vector types. This should be a simple vector
register transfer and optimized as such.

But with GCC targeting PowerPCle and -mcpu=power8, we are consistently seeing
store/reload sequences. For Power8 this can cause load-hit-store and pipeline
rejects (33 cycles).

We don't see this when targeting -mcpu=power9, but power9 supports hardware
Float128 instructions. Also we don't see this when targeting BE.
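
For readers without the attachment, the union-transfer idiom looks roughly like
the sketch below. It is not the attached test case. The type and function names
are hypothetical, it uses a generic GCC vector_size type rather than an AltiVec
vector type, and it assumes a GCC-compatible compiler on a target where
__float128 is available (for example powerpc64le with -mfloat128, or x86-64).

```c
#include <assert.h>
#include <stdint.h>

/* Generic GCC vector of two 64-bit elements (not the AltiVec type).  */
typedef uint64_t v2u64_sketch __attribute__ ((vector_size (16)));

/* Hypothetical stand-in for a union used to transfer 128-bit values.  */
typedef union
{
  __float128 vf1;
  v2u64_sketch vx2;
} vf128_sketch_t;

/* Transfer a __float128 to a vector type through the union.  */
static inline v2u64_sketch
xfer_f128_to_vec (__float128 f128)
{
  vf128_sketch_t vunion;

  assert (sizeof (vf128_sketch_t) == 16);
  vunion.vf1 = f128;
  return vunion.vx2;
}

/* Extract the 15-bit biased exponent from the transferred value.  */
static unsigned
f128_biased_exponent (__float128 f128)
{
  v2u64_sketch v = xfer_f128_to_vec (f128);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  uint64_t high = v[1];   /* element 1 holds the high doubleword on LE */
#else
  uint64_t high = v[0];   /* element 0 holds the high doubleword on BE */
#endif
  return (unsigned) ((high >> 48) & 0x7fff);
}
```

Ideally xfer_f128_to_vec compiles to no store/reload at all, since both
members of the union occupy the same vector register.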

[Bug middle-end/99293] Built-in vec_splat generates sub-optimal code for -mcpu=power10

2021-02-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99293

--- Comment #1 from Steven Munroe  ---
Created attachment 50264
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50264&action=edit
Compile test for simplified test case

Download vec_dummy.c and vec_int128_ppc.h into a local directory and compile

gcc -O3 -mcpu=power10 -m64 -c vec_dummy.c

[Bug middle-end/99293] New: Built-in vec_splat generates sub-optimal code for -mcpu=power10

2021-02-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99293

            Bug ID: 99293
           Summary: Built-in vec_splat generates sub-optimal code for
                    -mcpu=power10
           Product: gcc
           Version: 10.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50263
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50263&action=edit
Simplified test case

While adding code to Power Vector Library (PVECLIB), for the POWER10 target, I
see strange code generation for Altivec built-in vec_splat for the vector long
long type. I would expect a xxpermdi (xxspltd) based on the "Power Vector
Intrinsic Programming Reference".

But I see the following generated:

0300 :
 300:   67 02 69 7c mfvsrld r9,vs35
 304:   67 4b 09 7c mtvsrdd vs32,r9,r9
 308:   05 00 42 10 vrlq    v2,v2,v0
 30c:   20 00 80 4e blr

While this seems to be functionally correct, the trip through the GPR seems
unnecessary. It requires two serially dependent instructions where a single
xxspltd would do. I expected:

0300 :
 300:   57 1b 63 f0 xxspltd vs35,vs35,1
 304:   05 18 42 10 vrlq    v2,v2,v3
 308:   20 00 80 4e blr


The compiler was:

Compiler: gcc version 10.2.1 20210104 (Advance-Toolchain 14.0-2) [2093e873bb6c]
(GCC)

[Bug target/98519] rs6000: @pcrel unsupported on this instruction error in pveclib

2021-01-04 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98519

--- Comment #7 from Steven Munroe  ---
Then you have a problem, as @pcrel is never valid for an instruction like lxsd%X1.

Seems like you will need a new constraint or modifier specific to @pcrel.

[Bug target/98519] rs6000: @pcrel unsupported on this instruction error in pveclib

2021-01-04 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98519

--- Comment #5 from Steven Munroe  ---
I would think you need to look at the instruction and the "m" constraint.

In this case lxsd%X1 would need to be converted to plxsd and the "m" constraint
would have to allow @pcrel. I would think a static variable would be valid, but
stack local or explicit pointer with (nonconst) offset/index would not.