[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-06-13 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

--- Comment #5 from Hongtao Liu  ---
It's fixed by r15-1100-gec985bc97a0157

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-05-20 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

--- Comment #4 from Hongtao Liu  ---
(In reply to Hu Lin from comment #3)
> I found that the compiler allocates memory to the third source operand of
> vpternlog in IRA after commit f55cdce3f8dd8503e080e35be59c5f5390f6d95e,
> which causes the generated code to be:
> 
>         .cfi_startproc
>         movl    $4, %eax
>         vpsraw  $5, %xmm0, %xmm2
>         vpbroadcastb    %eax, %xmm1
>         movl    $7, %eax
>         vpbroadcastb    %eax, %xmm3
>         vmovdqa %xmm1, %xmm0
>         vpternlogd      $120, %xmm3, %xmm2, %xmm0
>         vmovdqa %xmm3, -24(%rsp)
>         vpsubb  %xmm1, %xmm0, %xmm0
>         ret
> 
> And 6a67fdcb3f0cc8be47b49ddd246d0c50c3770800 changes the vector type from
> v16qi to v4si, so the movv4si can no longer be combined with the vpternlog
> in postreload, which gives the result you see now.

To clarify: the extra spill is caused by r14-4944-gf55cdce3f8dd85;
r14-7026-g6a67fdcb3f0cc8 only causes an extra mov instruction (which is not a
big deal).

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-05-20 Thread lin1.hu at intel dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

Hu Lin  changed:

   What|Removed |Added

 CC||lin1.hu at intel dot com

--- Comment #3 from Hu Lin  ---
I found that the compiler allocates memory to the third source operand of
vpternlog in IRA after commit f55cdce3f8dd8503e080e35be59c5f5390f6d95e, which
causes the generated code to be:

        .cfi_startproc
        movl    $4, %eax
        vpsraw  $5, %xmm0, %xmm2
        vpbroadcastb    %eax, %xmm1
        movl    $7, %eax
        vpbroadcastb    %eax, %xmm3
        vmovdqa %xmm1, %xmm0
        vpternlogd      $120, %xmm3, %xmm2, %xmm0
        vmovdqa %xmm3, -24(%rsp)
        vpsubb  %xmm1, %xmm0, %xmm0
        ret

And 6a67fdcb3f0cc8be47b49ddd246d0c50c3770800 changes the vector type from v16qi
to v4si, so the movv4si can no longer be combined with the vpternlog in
postreload, which gives the result you see now.
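
As an aside (a sketch of my own, not taken from the report): the two vector
types involved describe the same 128-bit all-7s constant, only the mode
differs, which per the comment above is why the move of that constant no
longer folds into the vpternlog in postreload.  Using hypothetical names:

typedef signed char v16qi __attribute__ ((__vector_size__ (16)));
typedef int v4si __attribute__ ((__vector_size__ (16)));

/* Identical bit patterns; only the mode (V16QI vs V4SI) differs.  */
static const v16qi sevens_as_bytes =
  { 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7 };
static const v4si sevens_as_dwords =
  { 0x07070707, 0x07070707, 0x07070707, 0x07070707 };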

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-05-10 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

--- Comment #2 from Roger Sayle  ---
Here's a reduced test case that should be unaffected by the pending changes to
how V8QI shifts are expanded.  Note that the final "t -= t4" is required to
convince the register allocator to "spill".

typedef signed char v16qi __attribute__ ((__vector_size__ (16)));
// sign-extend low 3 bits to a byte.
v16qi foo (v16qi x) {
  v16qi t7 = (v16qi){7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
  v16qi t4 = (v16qi){4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4};
  v16qi t = x & t7;
  t ^= t4;
  t -= t4;
  return t;
}

which produces:

foo:    movl    $67372036, %eax
        vmovdqa %xmm0, %xmm2
        vpbroadcastd    %eax, %xmm1
        movl    $117901063, %eax
        vpbroadcastd    %eax, %xmm3
        vmovdqa %xmm1, %xmm0
        vmovdqa %xmm3, -24(%rsp)
        vmovdqa -24(%rsp), %xmm4
        vpternlogd      $120, %xmm2, %xmm4, %xmm0
        vpsubb  %xmm1, %xmm0, %xmm0
        ret
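
A side note (mine, not part of the original comment): the two movl immediates
above are just the testcase's byte constants 4 and 7 replicated across each
32-bit lane, so vpbroadcastd of them rebuilds t4 and t7.  A tiny sanity check,
as a sketch:

#include <assert.h>

int main (void)
{
  /* 4 and 7 broadcast to every byte of a 32-bit lane.  */
  assert (67372036 == 0x04040404);   /* t4 */
  assert (117901063 == 0x07070707);  /* t7 */
  return 0;
}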

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-05-10 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

Roger Sayle  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |roger at nextmovesoftware dot com
   Last reconfirmed||2024-05-10
 CC||roger at nextmovesoftware dot com
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #1 from Roger Sayle  ---
I have a patch for x86 ternlog handling that changes the output for this
testcase (without the pending change to optimize V8QI shifts) to:
foo:    movl    $67372036, %eax
        vpsraw  $5, %xmm0, %xmm0
        vpbroadcastd    %eax, %xmm1
        vpternlogd      $108, .LC0(%rip), %xmm1, %xmm0
        vpsubb  %xmm1, %xmm0, %xmm0
        ret
.align 16
.LC0:
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7
.byte   7

which at least doesn't construct the vector with a broadcast and then "spill"
it to the stack before reading it back from memory.  I've no idea if this is
optimal, but it's certainly better than the current "spill".

I'm curious what has changed to make this code (register allocation) regress
since GCC 13.  It was a patch of mine that changed broadcastb to broadcastd,
but that shouldn't have affected reload/register preferencing.
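
For anyone decoding the immediates by hand, here is a scalar model of the
vpternlogd truth-table imm8 (my own sketch, not part of the patch): bit
(a<<2)|(b<<1)|c of the immediate selects the result bit, so the $120 (0x78)
in the earlier outputs computes a ^ (b & c) and the $108 (0x6c) here computes
b ^ (a & c); with the operand orders used in those outputs, both reduce to
the testcase's t4 ^ (x & t7).

/* Sketch only: scalar model of the vpternlogd imm8, checking that imm 120
   (with dest = 4s, src2 = x, src3 = 7s) and imm 108 (with dest = x,
   src2 = 4s, src3 = 7s) both compute 4 ^ (x & 7) bitwise.  */
#include <assert.h>
#include <stdint.h>

static uint32_t
ternlog (uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
  uint32_t r = 0;
  for (int i = 0; i < 32; i++)
    {
      /* Index bit 2 comes from the destination operand.  */
      unsigned idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1)
                     | ((c >> i) & 1);
      r |= (uint32_t) ((imm >> idx) & 1) << i;
    }
  return r;
}

int
main (void)
{
  for (uint32_t x = 0; x < 256; x++)
    {
      assert (ternlog (0x04040404, x, 0x07070707, 120)
              == (0x04040404 ^ (x & 0x07070707)));
      assert (ternlog (x, 0x04040404, 0x07070707, 108)
              == (0x04040404 ^ (x & 0x07070707)));
    }
  return 0;
}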

[Bug rtl-optimization/115021] [14/15 regression] unnecessary spill for vpternlog

2024-05-10 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115021

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.2
   Priority|P3  |P2