[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-23 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-04-23
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org
          Component|c                           |target
   Target Milestone|---                         |8.4
            Summary|[8 Regression] C code is    |[8/9 Regression] C code is
                   |optimized worse than C++    |optimized worse than C++
     Ever confirmed|0                           |1

--- Comment #1 from Jakub Jelinek  ---
Started with r257505.  A smaller regression had already happened earlier with
r254855.  Before the latter, we emitted:
pushq   %rbp
movq    %rdi, %rax
movq    %rsp, %rbp
andq    $-64, %rsp
vmovdqu32   16(%rbp), %zmm1
vpaddd  80(%rbp), %zmm1, %zmm0
vmovdqa64   %zmm0, -64(%rsp)
vmovdqa64   -64(%rsp), %xmm2
vmovdqa64   -48(%rsp), %xmm3
vmovdqa64   -32(%rsp), %xmm4
vmovdqa64   -16(%rsp), %xmm5
vmovups %xmm2, (%rdi)
vmovups %xmm3, 16(%rdi)
vmovups %xmm4, 32(%rdi)
vmovups %xmm5, 48(%rdi)
vzeroupper
leave
ret
r254855 then changed it into:
pushq   %rbp
movq    %rsp, %rbp
andq    $-32, %rsp
movq    %rdi, %rax
vmovdqu32   16(%rbp), %ymm2
vpaddd  80(%rbp), %ymm2, %ymm0
vmovq   %xmm0, %rdx
vmovdqa64   %ymm0, -64(%rsp)
vmovdqu32   48(%rbp), %ymm3
vpaddd  112(%rbp), %ymm3, %ymm0
vmovdqa64   %ymm0, -32(%rsp)
movq    %rdx, (%rdi)
movq    -56(%rsp), %rdx
movq    %rdx, 8(%rdi)
movq    -48(%rsp), %rdx
movq    %rdx, 16(%rdi)
movq    -40(%rsp), %rdx
movq    %rdx, 24(%rdi)
vmovq   %xmm0, 32(%rax)
movq    -24(%rsp), %rdx
movq    %rdx, 40(%rdi)
movq    -16(%rsp), %rdx
movq    %rdx, 48(%rdi)
movq    -8(%rsp), %rdx
movq    %rdx, 56(%rdi)
vzeroupper
leave
ret
After r257505 we seem to be versioning for alignment or something similar; that
can't be right for a loop with just 16 iterations.
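
For reference, a reduced testcase of the shape discussed here probably looks
like the following.  This is a hand-written reconstruction from the comments
(the type name v and the signature test(v, v) appear later in the thread; the
member name, element type and compile options are assumptions), not the
reporter's exact code:

typedef struct v { unsigned int a[16]; } v;

/* 64-byte struct returned by value; the 16-iteration loop is what the
   vectorizer turns into the zmm/ymm code above when AVX-512 is enabled.  */
v test (v x, v y)
{
  v res;
  for (int i = 0; i < 16; ++i)
    res.a[i] = x.a[i] + y.a[i];
  return res;
}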

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-23 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #2 from Hongtao.liu  ---
It seems such code generation is intentional in r254855.

      /* Use 256-bit AVX instructions instead of 512-bit AVX instructions
         in the auto-vectorizer.  */
      if (ix86_tune_features[X86_TUNE_AVX256_OPTIMAL]
          && !(opts_set->x_ix86_target_flags & OPTION_MASK_PREFER_AVX256))
        opts->x_ix86_target_flags |= OPTION_MASK_PREFER_AVX256;

I know there is a frequency reduction issue when many zmm registers are used,
but I don't know exactly what situation r254855 was meant to deal with.

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-23 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #3 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #2)
> It seems such code generation is intentional in r254855.
> 
>       /* Use 256-bit AVX instructions instead of 512-bit AVX instructions
>          in the auto-vectorizer.  */
>       if (ix86_tune_features[X86_TUNE_AVX256_OPTIMAL]
>           && !(opts_set->x_ix86_target_flags & OPTION_MASK_PREFER_AVX256))
>         opts->x_ix86_target_flags |= OPTION_MASK_PREFER_AVX256;
> 
> I know there is a frequency reduction issue when many zmm registers are
> used, but I don't know exactly what situation r254855 was meant to deal with.

But it should generate assembly like g++ does, which also uses ymm instead of
zmm.

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-24 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
The difference is that for C++ we directly use DECL_RESULT in the GIMPLE IL
while for C we end up with a copy to it.  The C++ FE does

;; Function v test(v, v) (null)
;; enabled by -tree-original


{
  struct v res [value-expr: <retval>];

and at the end

  <>>;
}

while the C FE uses plain res:

{
  struct v res;

...
  return res;
}

which in the end also results in try/finally processing for CLOBBERs.  Not sure
where the C++ FE decides that using <retval> for res is fine and whether the C
FE could do the same.  Certainly eliding this extra copy is beneficial.
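
To make the difference concrete, here is a rough hand-written sketch of the
two lowerings (pseudo-IL for illustration only, not actual GIMPLE dumps; the
member name a, the index i and the parameter names x, y are carried over from
the hypothetical testcase sketch in comment #1):

  /* C++ FE, after front-end NRV: res is just an alias for the return
     slot, so the vectorized stores go straight through the hidden
     return pointer.  */
  <retval>.a[i] = x.a[i] + y.a[i];
  return <retval>;

  /* C FE: res stays a separate local, and "return res" introduces a
     copy into the return slot -- the extra copy mentioned above.  */
  res.a[i] = x.a[i] + y.a[i];
  <retval> = res;
  return <retval>;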

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-24 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #5 from Jakub Jelinek  ---
That would likely be the NRV optimization in the C++ FE, but then why doesn't
the generic NRV optimization in the middle-end handle it later on?

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #6 from rguenther at suse dot de  ---
On Wed, 24 Apr 2019, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
> 
> --- Comment #5 from Jakub Jelinek  ---
> That would likely be the NRV optimization in the C++ FE, but then why doesn't
> the generic NRV optimization in the middle-end handle it later on?

Probably res is getting its address taken due to vectorization / IVOPTs.  NRV
runs pretty late.

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-25 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #7 from Hongtao.liu  ---
Yes, C++ does the NRV optimization, so the alignment of <retval> (res) is 4,
while the alignment of res is 16 in C.

g++/test.i.158t.vect:

../test.i:8:23: note:   recording new base alignment for &<retval>
  alignment:4
  misalignment: 0

gcc/test.i.158t.vect:

../test.i:8:5: note:   recording new base alignment for &res
  alignment:16
  misalignment: 0

When the alignment of res is known to be 16, that triggers loop peeling for
alignment in the vectorizer.

Refer to:
/* Function vect_enhance_data_refs_alignment

   This pass will use loop versioning and loop peeling in order to enhance
   the alignment of data references in the loop.
   .
*/

That's why there are more than 150 lines of assembly.
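
For intuition, a hand-written sketch of what peeling for alignment turns the
16-iteration loop into (an illustration of the transformation only, not the
vectorizer's actual output; the pointer parameters and the 64-byte target
alignment are assumptions):

#include <stdint.h>

void test_peeled (unsigned int *res, const unsigned int *x,
                  const unsigned int *y)
{
  int i = 0;

  /* Prologue: peel scalar iterations until the store address is 64-byte
     aligned (possible because res is known to be 16-byte aligned).  */
  while (i < 16 && ((uintptr_t) (res + i) & 63) != 0)
    {
      res[i] = x[i] + y[i];
      ++i;
    }

  /* Aligned vector body: one 64-byte (zmm-sized) step covers 16 ints, so
     at most a single iteration is left here.  */
  for (; i + 16 <= 16; i += 16)
    for (int j = 0; j < 16; ++j)        /* stands in for one aligned zmm op */
      res[i + j] = x[i + j] + y[i + j];

  /* Epilogue: leftover scalar iterations.  With only 16 iterations in
     total, the prologue and epilogue dominate the emitted code, which is
     where the 150+ lines of assembly come from.  */
  for (; i < 16; ++i)
    res[i] = x[i] + y[i];
}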

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-25 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #8 from Hongtao.liu  ---
The cost model for vect_enhance_data_refs_alignment is quite tunable.

More benchmarks are needed if we want to tune it.

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-25 Thread crazylht at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #9 from Hongtao.liu  ---
Also, which is better: aligned load/store of a smaller size vs. unaligned
load/store of a bigger size?

aligned load/store of smaller size:

movq    %rdx, (%rdi)
movq    -56(%rsp), %rdx
movq    %rdx, 8(%rdi)
movq    -48(%rsp), %rdx
movq    %rdx, 16(%rdi)
movq    -40(%rsp), %rdx
movq    %rdx, 24(%rdi)
vmovq   %xmm0, 32(%rax)
movq    -24(%rsp), %rdx
movq    %rdx, 40(%rdi)
movq    -16(%rsp), %rdx
movq    %rdx, 48(%rdi)
movq    -8(%rsp), %rdx
movq    %rdx, 56(%rdi)

unaligned load/store of bigger size:

vmovups %xmm2, (%rdi)
vmovups %xmm3, 16(%rdi)
vmovups %xmm4, 32(%rdi)
vmovups %xmm5, 48(%rdi)

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-25 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #10 from rguenther at suse dot de  ---
On Thu, 25 Apr 2019, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
> 
> --- Comment #9 from Hongtao.liu  ---
> Also, which is better: aligned load/store of a smaller size vs. unaligned
> load/store of a bigger size?
> 
> aligned load/store of smaller size:
> 
> movq    %rdx, (%rdi)
> movq    -56(%rsp), %rdx
> movq    %rdx, 8(%rdi)
> movq    -48(%rsp), %rdx
> movq    %rdx, 16(%rdi)
> movq    -40(%rsp), %rdx
> movq    %rdx, 24(%rdi)
> vmovq   %xmm0, 32(%rax)
> movq    -24(%rsp), %rdx
> movq    %rdx, 40(%rdi)
> movq    -16(%rsp), %rdx
> movq    %rdx, 48(%rdi)
> movq    -8(%rsp), %rdx
> movq    %rdx, 56(%rdi)
> 
> unaligned load/store of bigger size:
> 
> vmovups %xmm2, (%rdi)
> vmovups %xmm3, 16(%rdi)
> vmovups %xmm4, 32(%rdi)
> vmovups %xmm5, 48(%rdi)

Bigger stores are almost always a win, while bigger loads have the possibility
of running into store-to-load forwarding issues (and bigger stores eventually
mitigate them).  Based on CPU tuning we'd also eventually end up with
mov[lh]ps splitting unaligned loads/stores.

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-25 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #11 from H.J. Lu  ---
(In reply to Hongtao.liu from comment #7)
> Yes, C++ does the NRV optimization, so the alignment of <retval> (res) is 4,
> while the alignment of res is 16 in C.
> 
> g++/test.i.158t.vect:
> 
> ../test.i:8:23: note:   recording new base alignment for &<retval>
>   alignment:4
>   misalignment: 0
> 
> gcc/test.i.158t.vect:
> 
> ../test.i:8:5: note:   recording new base alignment for &res
>   alignment:16
>   misalignment: 0
> 
> When the alignment of res is known to be 16, that triggers loop peeling for
> alignment in the vectorizer.
> 

Why does struct v have different alignments in C and C++?

[Bug target/90204] [8/9 Regression] C code is optimized worse than C++

2019-04-25 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204

--- Comment #12 from rguenther at suse dot de  ---
On Thu, 25 Apr 2019, hjl.tools at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
> 
> --- Comment #11 from H.J. Lu  ---
> (In reply to Hongtao.liu from comment #7)
> > Yes, C++ does the NRV optimization, so the alignment of <retval> (res) is
> > 4, while the alignment of res is 16 in C.
> > 
> > g++/test.i.158t.vect:
> > 
> > ../test.i:8:23: note:   recording new base alignment for &<retval>
> >   alignment:4
> >   misalignment: 0
> > 
> > gcc/test.i.158t.vect:
> > 
> > ../test.i:8:5: note:   recording new base alignment for &res
> >   alignment:16
> >   misalignment: 0
> > 
> > When the alignment of res is known to be 16, that triggers loop peeling
> > for alignment in the vectorizer.
> > 
> 
> Why does struct v have different alignments in C and C++?

I think we re-align automatic variables but obviously cannot do the
same for incoming DECL_RESULT.
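
A small sketch of that point (hypothetical code; struct v and the member name
are assumptions carried over from the sketch in comment #1):

struct v { unsigned int a[16]; };   /* ABI alignment of struct v is only 4 */

struct v make (void)
{
  /* A local aggregate: the compiler is free to give it a larger stack
     alignment (the "andq $-64, %rsp" in comment #1 does exactly that),
     so the vectorizer can assume e.g. 16- or 64-byte alignment.  */
  struct v res = { { 0 } };
  /* ... compute res ...  */

  /* Once NRV replaces res with the caller-provided return slot
     (DECL_RESULT, addressed through the hidden return pointer), only the
     ABI alignment of struct v can be assumed for the stores into it.  */
  return res;
}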