Re: [PATCH 1/4] Relax COND_EXPR reduction vectorization SLP restriction

2024-06-07 Thread Kugan Vivekanandarajah
Thanks, Richard.
Created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115383

Thanks,
Kugan

On Fri, Jun 7, 2024 at 5:51 PM Richard Biener  wrote:
>
> On Fri, 7 Jun 2024, Kugan Vivekanandarajah wrote:
>
> > Hi Richard,
> >
> > This seems to have introduced a regression. I am seeing ICE while
> > building TSVC_2 for AARCH64
> > with -O3 -flto -mcpu=neoverse-v2 -msve-vector-bits=128
> >
> > tsvc.c: In function 's331':
> > tsvc.c:2744:8: internal compiler error: Segmentation fault
> >  2744 | real_t s331(struct args_t * func_args)
> >   |^
> > 0xdfc23b crash_signal
> > /var/jenkins/workspace/GCC_Nightly/gcc/toplev.cc:319
> > 0xa3a6f8 phi_nodes_ptr(basic_block_def*)
> > /var/jenkins/workspace/GCC_Nightly/gcc/gimple.h:4701
> > 0xa3a6f8 gsi_start_phis(basic_block_def*)
> > /var/jenkins/workspace/GCC_Nightly/gcc/gimple-iterator.cc:937
> > 0xa3a6f8 gsi_for_stmt(gimple*)
> > /var/jenkins/workspace/GCC_Nightly/gcc/gimple-iterator.cc:621
> > 0x1e5f22f vectorizable_condition
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-stmts.cc:12577
> > 0x1e7a027 vect_transform_stmt(vec_info*, _stmt_vec_info*,
> > gimple_stmt_iterator*, _slp_tree*, _slp_instance*)
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-stmts.cc:13467
> > 0x1112653 vect_schedule_slp_node
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-slp.cc:9729
> > 0x1127757 vect_schedule_slp_node
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-slp.cc:9522
> > 0x1127757 vect_schedule_scc
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-slp.cc:10017
> > 0x11285ff vect_schedule_slp(vec_info*, vec<_slp_instance*, va_heap,
> > vl_ptr> const&)
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-slp.cc:10110
> > 0x10f56b7 vect_transform_loop(_loop_vec_info*, gimple*)
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-loop.cc:12114
> > 0x1138c7f vect_transform_loops
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vectorizer.cc:1007
> > 0x1139307 try_vectorize_loop_1
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vectorizer.cc:1153
> > 0x1139307 try_vectorize_loop
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vectorizer.cc:1183
> > 0x113967b execute
> > /var/jenkins/workspace/GCC_Nightly/gcc/tree-vectorizer.cc:1299
> > Please submit a full bug report, with preprocessed source (by using
> > -freport-bug).
> >
> > Please let me know if you need a reduced testcase.
>
> Please open a bugzilla with a reduced testcase.
>
> Thanks,
> Richard.
>
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH 1/4] Relax COND_EXPR reduction vectorization SLP restriction

2024-06-07 Thread Kugan Vivekanandarajah
Hi Richard,

This seems to have introduced a regression. I am seeing an ICE while
building TSVC_2 for AArch64
with -O3 -flto -mcpu=neoverse-v2 -msve-vector-bits=128

tsvc.c: In function 's331':
tsvc.c:2744:8: internal compiler error: Segmentation fault
 2744 | real_t s331(struct args_t * func_args)
  |^
0xdfc23b crash_signal
/var/jenkins/workspace/GCC_Nightly/gcc/toplev.cc:319
0xa3a6f8 phi_nodes_ptr(basic_block_def*)
/var/jenkins/workspace/GCC_Nightly/gcc/gimple.h:4701
0xa3a6f8 gsi_start_phis(basic_block_def*)
/var/jenkins/workspace/GCC_Nightly/gcc/gimple-iterator.cc:937
0xa3a6f8 gsi_for_stmt(gimple*)
/var/jenkins/workspace/GCC_Nightly/gcc/gimple-iterator.cc:621
0x1e5f22f vectorizable_condition
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-stmts.cc:12577
0x1e7a027 vect_transform_stmt(vec_info*, _stmt_vec_info*,
gimple_stmt_iterator*, _slp_tree*, _slp_instance*)
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-stmts.cc:13467
0x1112653 vect_schedule_slp_node
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-slp.cc:9729
0x1127757 vect_schedule_slp_node
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-slp.cc:9522
0x1127757 vect_schedule_scc
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-slp.cc:10017
0x11285ff vect_schedule_slp(vec_info*, vec<_slp_instance*, va_heap,
vl_ptr> const&)
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-slp.cc:10110
0x10f56b7 vect_transform_loop(_loop_vec_info*, gimple*)
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vect-loop.cc:12114
0x1138c7f vect_transform_loops
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vectorizer.cc:1007
0x1139307 try_vectorize_loop_1
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vectorizer.cc:1153
0x1139307 try_vectorize_loop
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vectorizer.cc:1183
0x113967b execute
/var/jenkins/workspace/GCC_Nightly/gcc/tree-vectorizer.cc:1299
Please submit a full bug report, with preprocessed source (by using
-freport-bug).

Please let me know if you need a reduced testcase.

Thanks,
Kugan


Re: [PR47785] COLLECT_AS_OPTIONS

2019-11-07 Thread Kugan Vivekanandarajah
Hi Richard,
Thanks for the review.

On Tue, 5 Nov 2019 at 23:08, Richard Biener  wrote:
>
> On Tue, Nov 5, 2019 at 12:17 AM Kugan Vivekanandarajah
>  wrote:
> >
> > Hi,
> > Thanks for the review.
> >
> > On Tue, 5 Nov 2019 at 03:57, H.J. Lu  wrote:
> > >
> > > On Sun, Nov 3, 2019 at 6:45 PM Kugan Vivekanandarajah
> > >  wrote:
> > > >
> > > > Thanks for the reviews.
> > > >
> > > >
> > > > On Sat, 2 Nov 2019 at 02:49, H.J. Lu  wrote:
> > > > >
> > > > > On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, 30 Oct 2019 at 03:11, H.J. Lu  wrote:
> > > > > > >
> > > > > > > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hi Richard,
> > > > > > > >
> > > > > > > > Thanks for the review.
> > > > > > > >
> > > > > > > > On Wed, 23 Oct 2019 at 23:07, Richard Biener 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Richard,
> > > > > > > > > >
> > > > > > > > > > Thanks for the pointers.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Richard,
> > > > > > > > > > > > Thanks for the review.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > As mentioned in the PR, attached patch adds 
> > > > > > > > > > > > > > COLLECT_AS_OPTIONS for
> > > > > > > > > > > > > > passing assembler options specified with -Wa, to 
> > > > > > > > > > > > > > the link-time driver.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The proposed solution only works for uniform -Wa 
> > > > > > > > > > > > > > options across all
> > > > > > > > > > > > > > TUs. As mentioned by Richard Biener, supporting 
> > > > > > > > > > > > > > non-uniform -Wa flags
> > > > > > > > > > > > > > would require either adjusting partitioning 
> > > > > > > > > > > > > > according to flags or
> > > > > > > > > > > > > > emitting multiple object files  from a single 
> > > > > > > > > > > > > > LTRANS CU. We could
> > > > > > > > > > > > > > consider this as a follow up.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Bootstrapped and regression tests on  
> > > > > > > > > > > > > > arm-linux-gcc. Is this OK for trunk?
> > > > > > &

Re: [PR47785] COLLECT_AS_OPTIONS

2019-11-04 Thread Kugan Vivekanandarajah
Hi,
Thanks for the review.

On Tue, 5 Nov 2019 at 03:57, H.J. Lu  wrote:
>
> On Sun, Nov 3, 2019 at 6:45 PM Kugan Vivekanandarajah
>  wrote:
> >
> > Thanks for the reviews.
> >
> >
> > On Sat, 2 Nov 2019 at 02:49, H.J. Lu  wrote:
> > >
> > > On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah
> > >  wrote:
> > > >
> > > > On Wed, 30 Oct 2019 at 03:11, H.J. Lu  wrote:
> > > > >
> > > > > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah
> > > > >  wrote:
> > > > > >
> > > > > > Hi Richard,
> > > > > >
> > > > > > Thanks for the review.
> > > > > >
> > > > > > On Wed, 23 Oct 2019 at 23:07, Richard Biener 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hi Richard,
> > > > > > > >
> > > > > > > > Thanks for the pointers.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Richard,
> > > > > > > > > > Thanks for the review.
> > > > > > > > > >
> > > > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > As mentioned in the PR, attached patch adds 
> > > > > > > > > > > > COLLECT_AS_OPTIONS for
> > > > > > > > > > > > passing assembler options specified with -Wa, to the 
> > > > > > > > > > > > link-time driver.
> > > > > > > > > > > >
> > > > > > > > > > > > The proposed solution only works for uniform -Wa 
> > > > > > > > > > > > options across all
> > > > > > > > > > > > TUs. As mentioned by Richard Biener, supporting 
> > > > > > > > > > > > non-uniform -Wa flags
> > > > > > > > > > > > would require either adjusting partitioning according 
> > > > > > > > > > > > to flags or
> > > > > > > > > > > > emitting multiple object files  from a single LTRANS 
> > > > > > > > > > > > CU. We could
> > > > > > > > > > > > consider this as a follow up.
> > > > > > > > > > > >
> > > > > > > > > > > > Bootstrapped and regression tests on  arm-linux-gcc. Is 
> > > > > > > > > > > > this OK for trunk?
> > > > > > > > > > >
> > > > > > > > > > > While it works for your simple cases it is unlikely to 
> > > > > > > > > > > work in practice since
> > > > > > > > > > > your implementation needs the assembler options be 
> > > > > > > > > > > present at the link
> > > > > > > > > > > command line.  I agree that this might be the way for 
> > > > > > > > > > > people to go when
> > > > > > > > > > > they face the issue but then it needs to be documented 
> > > > > > > > > > > somewhere
> > > > > > > > > > > in the manual.
> > > > > > > > > > >
> > > > > 

Re: [PR47785] COLLECT_AS_OPTIONS

2019-11-03 Thread Kugan Vivekanandarajah
Thanks for the reviews.


On Sat, 2 Nov 2019 at 02:49, H.J. Lu  wrote:
>
> On Thu, Oct 31, 2019 at 6:33 PM Kugan Vivekanandarajah
>  wrote:
> >
> > On Wed, 30 Oct 2019 at 03:11, H.J. Lu  wrote:
> > >
> > > On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah
> > >  wrote:
> > > >
> > > > Hi Richard,
> > > >
> > > > Thanks for the review.
> > > >
> > > > On Wed, 23 Oct 2019 at 23:07, Richard Biener 
> > > >  wrote:
> > > > >
> > > > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah
> > > > >  wrote:
> > > > > >
> > > > > > Hi Richard,
> > > > > >
> > > > > > Thanks for the pointers.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hi Richard,
> > > > > > > > Thanks for the review.
> > > > > > > >
> > > > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > As mentioned in the PR, attached patch adds 
> > > > > > > > > > COLLECT_AS_OPTIONS for
> > > > > > > > > > passing assembler options specified with -Wa, to the 
> > > > > > > > > > link-time driver.
> > > > > > > > > >
> > > > > > > > > > The proposed solution only works for uniform -Wa options 
> > > > > > > > > > across all
> > > > > > > > > > TUs. As mentioned by Richard Biener, supporting non-uniform 
> > > > > > > > > > -Wa flags
> > > > > > > > > > would require either adjusting partitioning according to 
> > > > > > > > > > flags or
> > > > > > > > > > emitting multiple object files  from a single LTRANS CU. We 
> > > > > > > > > > could
> > > > > > > > > > consider this as a follow up.
> > > > > > > > > >
> > > > > > > > > > Bootstrapped and regression tests on  arm-linux-gcc. Is 
> > > > > > > > > > this OK for trunk?
> > > > > > > > >
> > > > > > > > > While it works for your simple cases it is unlikely to work 
> > > > > > > > > in practice since
> > > > > > > > > your implementation needs the assembler options be present at 
> > > > > > > > > the link
> > > > > > > > > command line.  I agree that this might be the way for people 
> > > > > > > > > to go when
> > > > > > > > > they face the issue but then it needs to be documented 
> > > > > > > > > somewhere
> > > > > > > > > in the manual.
> > > > > > > > >
> > > > > > > > > That is, with COLLECT_AS_OPTION (why singular?  I'd expected
> > > > > > > > > COLLECT_AS_OPTIONS) available to cc1 we could stream this 
> > > > > > > > > string
> > > > > > > > > to lto_options and re-materialize it at link time (and 
> > > > > > > > > diagnose mismatches
> > > > > > > > > even if we like).
> > > > > > > > OK. I will try to implement this. So the idea is if we provide
> > > > > > > > -Wa,options as part of the lto compile, this should be available
> > > > > > > > during link time. Like in:
> > > > > > > >
> > > > > > > > arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto
> > > > &

Re: [PR47785] COLLECT_AS_OPTIONS

2019-10-31 Thread Kugan Vivekanandarajah
On Wed, 30 Oct 2019 at 03:11, H.J. Lu  wrote:
>
> On Sun, Oct 27, 2019 at 6:33 PM Kugan Vivekanandarajah
>  wrote:
> >
> > Hi Richard,
> >
> > Thanks for the review.
> >
> > On Wed, 23 Oct 2019 at 23:07, Richard Biener  
> > wrote:
> > >
> > > On Mon, Oct 21, 2019 at 10:04 AM Kugan Vivekanandarajah
> > >  wrote:
> > > >
> > > > Hi Richard,
> > > >
> > > > Thanks for the pointers.
> > > >
> > > >
> > > >
> > > > On Fri, 11 Oct 2019 at 22:33, Richard Biener 
> > > >  wrote:
> > > > >
> > > > > On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
> > > > >  wrote:
> > > > > >
> > > > > > Hi Richard,
> > > > > > Thanks for the review.
> > > > > >
> > > > > > On Wed, 2 Oct 2019 at 20:41, Richard Biener 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > As mentioned in the PR, attached patch adds COLLECT_AS_OPTIONS 
> > > > > > > > for
> > > > > > > > passing assembler options specified with -Wa, to the link-time 
> > > > > > > > driver.
> > > > > > > >
> > > > > > > > The proposed solution only works for uniform -Wa options across 
> > > > > > > > all
> > > > > > > > TUs. As mentioned by Richard Biener, supporting non-uniform -Wa 
> > > > > > > > flags
> > > > > > > > would require either adjusting partitioning according to flags 
> > > > > > > > or
> > > > > > > > emitting multiple object files  from a single LTRANS CU. We 
> > > > > > > > could
> > > > > > > > consider this as a follow up.
> > > > > > > >
> > > > > > > > Bootstrapped and regression tests on  arm-linux-gcc. Is this OK 
> > > > > > > > for trunk?
> > > > > > >
> > > > > > > While it works for your simple cases it is unlikely to work in 
> > > > > > > practice since
> > > > > > > your implementation needs the assembler options be present at the 
> > > > > > > link
> > > > > > > command line.  I agree that this might be the way for people to 
> > > > > > > go when
> > > > > > > they face the issue but then it needs to be documented somewhere
> > > > > > > in the manual.
> > > > > > >
> > > > > > > That is, with COLLECT_AS_OPTION (why singular?  I'd expected
> > > > > > > COLLECT_AS_OPTIONS) available to cc1 we could stream this string
> > > > > > > to lto_options and re-materialize it at link time (and diagnose 
> > > > > > > mismatches
> > > > > > > even if we like).
> > > > > > OK. I will try to implement this. So the idea is if we provide
> > > > > > -Wa,options as part of the lto compile, this should be available
> > > > > > during link time. Like in:
> > > > > >
> > > > > > arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto
> > > > > > -Wa,-mimplicit-it=always,-mthumb -c test.c
> > > > > > arm-linux-gnueabihf-gcc  -flto  test.o
> > > > > >
> > > > > > I am not sure where should we stream this. Currently, 
> > > > > > cl_optimization
> > > > > > has all the optimization flag provided for compiler and it is
> > > > > > autogenerated and all the flags are integer values. Do you have any
> > > > > > preference or example where this should be done.
> > > > >
> > > > > In lto_write_options, I'd simply append the contents of 
> > > > > COLLECT_AS_OPTIONS
> > > > > (with -Wa, prepended to each of them), then recover them in 
> > > > > lto-wrapper
> > > > > for each TU and pass them down to the LTRANS compiles (if they agree
> > > > > for all TUs, otherwise I'd warn and drop them).
> > >

Re: [PR47785] COLLECT_AS_OPTIONS

2019-10-28 Thread Kugan Vivekanandarajah
Hi Bernhard,

Thanks for the review.

On Tue, 29 Oct 2019 at 08:52, Bernhard Reutner-Fischer
 wrote:
>
> On Mon, 28 Oct 2019 11:53:06 +1100
> Kugan Vivekanandarajah  wrote:
>
> > On Wed, 23 Oct 2019 at 23:07, Richard Biener  
> > wrote:
>
> > > Did you try this with multiple assembler options?  I see you stream
> > > them as -Wa,-mfpu=xyz,-mthumb but then compare the whole
> > > option strings so a mismatch with -Wa,-mthumb,-mfpu=xyz would be
>
> indeed, i'd have expected some kind of sorting, but i don't see it in
> the proposed patch?
Let me try to work out the best way to handle this. If we look at
Richard Earnshaw's comment in the next email, there are cases where
handling this would not be straightforward. I am happy to do whatever is
acceptable here.
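
One possibility might be to canonicalize each -Wa, string by sorting its
sub-options before comparing, although, as Richard Earnshaw's comment
shows, sub-option order can matter to the assembler, so this is not safe
in general. A rough standalone sketch (all names are mine, not from the
patch):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int
cmp_str (const void *a, const void *b)
{
  return strcmp (*(const char *const *) a, *(const char *const *) b);
}

/* Rewrite "-Wa,x,y" so that the same set of sub-options always yields
   the same string, whatever order the user gave them in.  */
static char *
canonicalize_wa (const char *wa)
{
  char *body = strdup (wa + strlen ("-Wa,"));
  const char *subopt[64];
  size_t n = 0;
  for (char *p = strtok (body, ","); p != NULL && n < 64;
       p = strtok (NULL, ","))
    subopt[n++] = p;
  qsort (subopt, n, sizeof *subopt, cmp_str);

  /* Reordering never changes the length, so the input length suffices.  */
  char *out = malloc (strlen (wa) + 1);
  strcpy (out, "-Wa");
  for (size_t i = 0; i < n; i++)
    {
      strcat (out, ",");
      strcat (out, subopt[i]);
    }
  free (body);
  return out;
}

int
main (void)
{
  char *s = canonicalize_wa ("-Wa,-mthumb,-mfpu=xyz");
  puts (s);  /* prints -Wa,-mfpu=xyz,-mthumb */
  free (s);
  return 0;
}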


>
> > > diagnosed.  If there's a spec induced -Wa option do we get to see
> > > that as well?  I can imagine -march=xyz enabling a -Wa option
> > > for example.
> > >
> > > + *collect_as = XNEWVEC (char, strlen (args_text) + 1);
> > > + strcpy (*collect_as, args_text);
> > >
> > > there's strdup.  Btw, I'm not sure why you don't simply leave
> > > the -Wa option in the merged options [individually] and match
> > > them up but go the route of comparing strings and carrying that
> > > along separately.  I think that would be much better.
> >
> > Is attached patch which does this is OK?
>
> > +  obstack_init (&collect_obstack);
> > +  obstack_grow (&collect_obstack, "COLLECT_AS_OPTIONS=",
> > + sizeof ("COLLECT_AS_OPTIONS=") - 1);
> > +  obstack_grow (&collect_obstack, "-Wa,", strlen ("-Wa,"));
>
> Why don't you grow once, including the "-Wa," ?

I will change this.
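
That is, grow the env var name and the leading "-Wa," in one go. A minimal
standalone sketch of the idea (using plain malloc for the obstack chunks,
whereas gcc.c uses xmalloc):

#include <obstack.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define obstack_chunk_alloc malloc
#define obstack_chunk_free free

int
main (void)
{
  struct obstack ob;
  obstack_init (&ob);
  /* One grow for the whole prefix instead of two.  */
  obstack_grow (&ob, "COLLECT_AS_OPTIONS=-Wa,",
		sizeof ("COLLECT_AS_OPTIONS=-Wa,") - 1);
  obstack_grow (&ob, "-mimplicit-it=always",
		strlen ("-mimplicit-it=always"));
  obstack_1grow (&ob, '\0');
  puts ((char *) obstack_finish (&ob));
  obstack_free (&ob, NULL);
  return 0;
}
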
>
> > +/* Append options OPTS from -Wa, options to ARGV_OBSTACK.  */
> > +
> > +static void
> > +append_compiler_wa_options (obstack *argv_obstack,
> > + struct cl_decoded_option *opts,
> > + unsigned int count)
> > +{
> > +  static const char *collect_as;
> > +  for (unsigned int j = 1; j < count; ++j)
> > +{
> > +  struct cl_decoded_option *option = &opts[j];
>
> Instead of the switch below, why not just
>
> if (option->opt_index != OPT_Wa_)
>   continue;

I will change this.
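
That is, the loop becomes something like this (untested fragment, just the
patch's own loop restructured; GCC-internal types as in the patch):

static void
append_compiler_wa_options (obstack *argv_obstack,
			    struct cl_decoded_option *opts,
			    unsigned int count)
{
  for (unsigned int j = 1; j < count; ++j)
    {
      struct cl_decoded_option *option = &opts[j];
      /* Only -Wa, options are of interest here.  */
      if (option->opt_index != OPT_Wa_)
	continue;
      const char *args_text = option->orig_option_with_args_text;
      /* ... append ARGS_TEXT to ARGV_OBSTACK as before ...  */
    }
}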

>
> here?
>
> > +  if (j == 1)
> > + collect_as = NULL;
>
> or at least here?
>
> (why's collect_as static in the first place? wouldn't that live in the parent 
> function?)
I am keeping the -Wa options that come from the last TU here and making
sure they are the same. If we get -Wa options with different,
incompatible options, handling them is tricky. So in this patch I want
to handle only the case where they are the same and flag an error
otherwise. It again goes back to your first comment. I am happy to work
out what an acceptable solution is here.
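
For instance, following Richard's earlier suggestion to warn and drop
rather than error, the check could look roughly like this (hypothetical
fragment, not what the attached patch does; it assumes the usual GCC
diagnostic helpers are usable at this point):

      /* COLLECT_AS holds the -Wa, string seen in earlier TUs, ARGS_TEXT
	 the one from the current TU.  */
      if (collect_as == NULL)
	collect_as = args_text;
      else if (strcmp (collect_as, args_text) != 0)
	{
	  warning (0, "%<-Wa%> options differ between translation units; "
		   "dropping %qs", args_text);
	  collect_as = NULL;
	  break;
	}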

Thanks,
Kugan

>
> > +  const char *args_text = option->orig_option_with_args_text;
> > +  switch (option->opt_index)
> > + {
> > + case OPT_Wa_:
> > +   break;
> > + default:
> > +   continue;
> > + }


Re: [PR47785] COLLECT_AS_OPTIONS

2019-10-21 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the pointers.



On Fri, 11 Oct 2019 at 22:33, Richard Biener  wrote:
>
> On Fri, Oct 11, 2019 at 6:15 AM Kugan Vivekanandarajah
>  wrote:
> >
> > Hi Richard,
> > Thanks for the review.
> >
> > On Wed, 2 Oct 2019 at 20:41, Richard Biener  
> > wrote:
> > >
> > > On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > As mentioned in the PR, attached patch adds COLLECT_AS_OPTIONS for
> > > > passing assembler options specified with -Wa, to the link-time driver.
> > > >
> > > > The proposed solution only works for uniform -Wa options across all
> > > > TUs. As mentioned by Richard Biener, supporting non-uniform -Wa flags
> > > > would require either adjusting partitioning according to flags or
> > > > emitting multiple object files  from a single LTRANS CU. We could
> > > > consider this as a follow up.
> > > >
> > > > Bootstrapped and regression tests on  arm-linux-gcc. Is this OK for 
> > > > trunk?
> > >
> > > While it works for your simple cases it is unlikely to work in practice 
> > > since
> > > your implementation needs the assembler options be present at the link
> > > command line.  I agree that this might be the way for people to go when
> > > they face the issue but then it needs to be documented somewhere
> > > in the manual.
> > >
> > > That is, with COLLECT_AS_OPTION (why singular?  I'd expected
> > > COLLECT_AS_OPTIONS) available to cc1 we could stream this string
> > > to lto_options and re-materialize it at link time (and diagnose mismatches
> > > even if we like).
> > OK. I will try to implement this. So the idea is if we provide
> > -Wa,options as part of the lto compile, this should be available
> > during link time. Like in:
> >
> > arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto
> > -Wa,-mimplicit-it=always,-mthumb -c test.c
> > arm-linux-gnueabihf-gcc  -flto  test.o
> >
> > I am not sure where should we stream this. Currently, cl_optimization
> > has all the optimization flag provided for compiler and it is
> > autogenerated and all the flags are integer values. Do you have any
> > preference or example where this should be done.
>
> In lto_write_options, I'd simply append the contents of COLLECT_AS_OPTIONS
> (with -Wa, prepended to each of them), then recover them in lto-wrapper
> for each TU and pass them down to the LTRANS compiles (if they agree
> for all TUs, otherwise I'd warn and drop them).

The attached patch streams it and also makes sure that the options are the
same for all the TUs. Maybe it is a bit restrictive.
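
The write side is then essentially just an append in lto_write_options;
an abridged sketch of the idea (the obstack name is from memory and may
not match the attached patch exactly):

  /* Append the -Wa, options recorded by the driver so that each TU's
     LTO IL carries them.  */
  const char *collect_as = getenv ("COLLECT_AS_OPTIONS");
  if (collect_as)
    {
      obstack_grow (&temporary_obstack, " ", 1);
      obstack_grow (&temporary_obstack, collect_as, strlen (collect_as));
    }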

What is the best place to document COLLECT_AS_OPTIONS? We don't seem
to document COLLECT_GCC_OPTIONS anywhere.

The attached patch passes regression testing and also fixes the original ARM
kernel build issue with thumb2.

Thanks,
Kugan
>
> Richard.
>
> > Thanks,
> > Kugan
> >
> >
> >
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Kugan
> > > >
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 2019-10-02  kugan.vivekanandarajah  
> > > >
> > > > PR lto/78353
> > > > * gcc.c (putenv_COLLECT_AS_OPTION): New to set COLLECT_AS_OPTION in env.
> > > > (driver::main): Call putenv_COLLECT_AS_OPTION.
> > > > * lto-wrapper.c (run_gcc): use COLLECT_AS_OPTION from env.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > 2019-10-02  kugan.vivekanandarajah  
> > > >
> > > > PR lto/78353
> > > > * gcc.target/arm/pr78353-1.c: New test.
> > > > * gcc.target/arm/pr78353-2.c: New test.
gcc/testsuite/ChangeLog:

2019-10-21  Kugan Vivekanandarajah  

PR lto/78353
* gcc.target/arm/pr78353-1.c: New test.
* gcc.target/arm/pr78353-2.c: New test.


gcc/ChangeLog:

2019-10-21  Kugan Vivekanandarajah  

PR lto/78353
* gcc.c (putenv_COLLECT_AS_OPTIONS): New to set COLLECT_AS_OPTIONS in 
env.
(driver::main): Call putenv_COLLECT_AS_OPTIONS.
* lto-opts.c (lto_write_options): Stream COLLECT_AS_OPTIONS.
* lto-wrapper.c (merge_and_complain): Get COLLECT_AS_OPTIONS and
  make sure they are the same from all TUs.
(find_and_merge_options): Get COLLECT_AS_OPTIONS.
(run_gcc): Likewise.

From 120d61790236cbde5a5d0cb8455b0e3583dd90b2 Mon Sep 17 00:00:00 2001
From: Kugan 
Date: Sat, 28 Sep 2019 02:11:49 +1000
Subject: [PAT

Re: [PR47785] COLLECT_AS_OPTIONS

2019-10-10 Thread Kugan Vivekanandarajah
Hi Richard,
Thanks for the review.

On Wed, 2 Oct 2019 at 20:41, Richard Biener  wrote:
>
> On Wed, Oct 2, 2019 at 10:39 AM Kugan Vivekanandarajah
>  wrote:
> >
> > Hi,
> >
> > As mentioned in the PR, attached patch adds COLLECT_AS_OPTIONS for
> > passing assembler options specified with -Wa, to the link-time driver.
> >
> > The proposed solution only works for uniform -Wa options across all
> > TUs. As mentioned by Richard Biener, supporting non-uniform -Wa flags
> > would require either adjusting partitioning according to flags or
> > emitting multiple object files  from a single LTRANS CU. We could
> > consider this as a follow up.
> >
> > Bootstrapped and regression tests on  arm-linux-gcc. Is this OK for trunk?
>
> While it works for your simple cases it is unlikely to work in practice since
> your implementation needs the assembler options be present at the link
> command line.  I agree that this might be the way for people to go when
> they face the issue but then it needs to be documented somewhere
> in the manual.
>
> That is, with COLLECT_AS_OPTION (why singular?  I'd expected
> COLLECT_AS_OPTIONS) available to cc1 we could stream this string
> to lto_options and re-materialize it at link time (and diagnose mismatches
> even if we like).
OK, I will try to implement this. So the idea is that if we provide
-Wa,options as part of the LTO compile, they should be available
at link time. Like in:

arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto
-Wa,-mimplicit-it=always,-mthumb -c test.c
arm-linux-gnueabihf-gcc  -flto  test.o

I am not sure where we should stream this. Currently, cl_optimization
has all the optimization flags provided for the compiler, and it is
autogenerated with all the flags being integer values. Do you have any
preference or example of where this should be done?

Thanks,
Kugan



>
> Richard.
>
> > Thanks,
> > Kugan
> >
> >
> > gcc/ChangeLog:
> >
> > 2019-10-02  kugan.vivekanandarajah  
> >
> > PR lto/78353
> > * gcc.c (putenv_COLLECT_AS_OPTION): New to set COLLECT_AS_OPTION in env.
> > (driver::main): Call putenv_COLLECT_AS_OPTION.
> > * lto-wrapper.c (run_gcc): use COLLECT_AS_OPTION from env.
> >
> > gcc/testsuite/ChangeLog:
> >
> > 2019-10-02  kugan.vivekanandarajah  
> >
> > PR lto/78353
> > * gcc.target/arm/pr78353-1.c: New test.
> > * gcc.target/arm/pr78353-2.c: New test.


[ARM] Enable DF only when TARGET_VFP_DOUBLE

2019-10-09 Thread Kugan Vivekanandarajah
As reported in a Linaro bug report
(https://bugs.linaro.org/show_bug.cgi?id=4636; there is no
reproducible testcase provided), for some applications we see

(insn 126 125 127 9 (set (reg:DF 189)
(fma:DF (reg:DF 126 [ _74 ])
(reg:DF 190)
(reg:DF 191))) "ops.c":30 -1
 (nil))

This insn is invalid when the target lacks double-precision VFP; it looks
like it is due to a typo in the md patterns. The attached patch fixes
this. Bootstrapped and regression tested on arm-linux-gnueabihf without
any regressions. Is this OK for trunk?
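
For reference, <vfp_double_cond> is the existing vfp.md idiom for making
the DF variant of a pattern conditional (quoted from memory; check the
exact definition in vfp.md):

(define_mode_attr vfp_double_cond [(SF "") (DF "&& TARGET_VFP_DOUBLE")])

With that attribute appended to the insn condition, the SF patterns are
unaffected while the DF ones additionally require TARGET_VFP_DOUBLE.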

Thanks,
Kugan

gcc/ChangeLog:

2019-10-10  kugan.vivekanandarajah  

* config/arm/vfp.md (fma<SDF:mode>4): Enable DF only when
TARGET_VFP_DOUBLE.
(*fmsub<SDF:mode>4): Likewise.
(*fnmsub<SDF:mode>4): Likewise.
(*fnmadd<SDF:mode>4): Likewise.
diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md
index 661919e2357..1979aa6fdb4 100644
--- a/gcc/config/arm/vfp.md
+++ b/gcc/config/arm/vfp.md
@@ -1321,7 +1321,7 @@
 (fma:SDF (match_operand:SDF 1 "register_operand" "")
 (match_operand:SDF 2 "register_operand" "")
 (match_operand:SDF 3 "register_operand" "0")))]
-  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA"
+  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA "
   "vfma%?.\\t%0, %1, %2"
   [(set_attr "predicable" "yes")
(set_attr "type" "ffma")]
@@ -1357,7 +1357,7 @@
 ""))
 (match_operand:SDF 2 "register_operand" "")
 (match_operand:SDF 3 "register_operand" "0")))]
-  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA"
+  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA "
   "vfms%?.\\t%0, %1, %2"
   [(set_attr "predicable" "yes")
(set_attr "type" "ffma")]
@@ -1379,7 +1379,7 @@
(fma:SDF (match_operand:SDF 1 "register_operand" "")
 (match_operand:SDF 2 "register_operand" "")
 (neg:SDF (match_operand:SDF 3 "register_operand" "0"))))]
-  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA"
+  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA "
   "vfnms%?.\\t%0, %1, %2"
   [(set_attr "predicable" "yes")
(set_attr "type" "ffma")]
@@ -1402,7 +1402,7 @@
   ""))
 (match_operand:SDF 2 "register_operand" "")
 (neg:SDF (match_operand:SDF 3 "register_operand" "0"))))]
-  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA"
+  "TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_FMA "
   "vfnma%?.\\t%0, %1, %2"
   [(set_attr "predicable" "yes")
(set_attr "type" "ffma")]


[PR47785] COLLECT_AS_OPTIONS

2019-10-02 Thread Kugan Vivekanandarajah
Hi,

As mentioned in the PR, attached patch adds COLLECT_AS_OPTIONS for
passing assembler options specified with -Wa, to the link-time driver.

The proposed solution only works for uniform -Wa options across all
TUs. As mentioned by Richard Biener, supporting non-uniform -Wa flags
would require either adjusting partitioning according to flags or
emitting multiple object files  from a single LTRANS CU. We could
consider this as a follow up.

Bootstrapped and regression tested on arm-linux-gcc. Is this OK for trunk?
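
As a usage sketch: with this version the -Wa options have to be given on
the link line too, since the driver exports them at link time (command
lines as in the PR78353 discussion):

arm-linux-gnueabihf-gcc -march=armv7-a -mthumb -O2 -flto \
    -Wa,-mimplicit-it=always,-mthumb -c test.c
arm-linux-gnueabihf-gcc -flto -Wa,-mimplicit-it=always,-mthumb test.o

The driver then puts COLLECT_AS_OPTION=-Wa,-mimplicit-it=always,-mthumb
into the environment, and lto-wrapper adds it to the LTRANS compile lines.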

Thanks,
Kugan


gcc/ChangeLog:

2019-10-02  kugan.vivekanandarajah  

PR lto/78353
* gcc.c (putenv_COLLECT_AS_OPTION): New to set COLLECT_AS_OPTION in env.
(driver::main): Call putenv_COLLECT_AS_OPTION.
* lto-wrapper.c (run_gcc): Use COLLECT_AS_OPTION from env.

gcc/testsuite/ChangeLog:

2019-10-02  kugan.vivekanandarajah  

PR lto/78353
* gcc.target/arm/pr78353-1.c: New test.
* gcc.target/arm/pr78353-2.c: New test.
From 6968d4343b2442736946a07df4eca969c916ccd3 Mon Sep 17 00:00:00 2001
From: Kugan 
Date: Sat, 28 Sep 2019 02:11:49 +1000
Subject: [PATCH] COLLECT_AS support

---
 gcc/gcc.c                                | 29 ++++++++++++++++++++++++++++
 gcc/lto-wrapper.c                        |  6 +++++-
 gcc/testsuite/gcc.target/arm/pr78353-1.c |  9 +++++++++
 gcc/testsuite/gcc.target/arm/pr78353-2.c |  9 +++++++++
 4 files changed, 52 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/pr78353-1.c
 create mode 100644 gcc/testsuite/gcc.target/arm/pr78353-2.c

diff --git a/gcc/gcc.c b/gcc/gcc.c
index 1216cdd505a..058f612d1f4 100644
--- a/gcc/gcc.c
+++ b/gcc/gcc.c
@@ -5239,6 +5239,34 @@ do_specs_vec (vec<const char *> vec)
 }
 }
 
+/* Store switches specified for as with -Wa in COLLECT_AS_OPTIONS
+   and place that in the environment.  */
+static void
+putenv_COLLECT_AS_OPTION (vec<const char *> vec)
+{
+  unsigned ix;
+  const char *opt;
+  int len = vec.length ();
+
+  if (!len)
+    return;
+
+  obstack_init (&collect_obstack);
+  obstack_grow (&collect_obstack, "COLLECT_AS_OPTION=",
+		sizeof ("COLLECT_AS_OPTION=") - 1);
+  obstack_grow (&collect_obstack, "-Wa,", strlen ("-Wa,"));
+
+  FOR_EACH_VEC_ELT (vec, ix, opt)
+    {
+      obstack_grow (&collect_obstack, opt, strlen (opt));
+      --len;
+      if (len)
+	obstack_grow (&collect_obstack, ",", strlen (","));
+    }
+
+  xputenv (XOBFINISH (&collect_obstack, char *));
+}
+
 /* Process the sub-spec SPEC as a portion of a larger spec.
This is like processing a whole spec except that we do
not initialize at the beginning and we do not supply a
@@ -7360,6 +7388,7 @@ driver::main (int argc, char **argv)
   global_initializations ();
   build_multilib_strings ();
   set_up_specs ();
+  putenv_COLLECT_AS_OPTION (assembler_options);
   putenv_COLLECT_GCC (argv[0]);
   maybe_putenv_COLLECT_LTO_WRAPPER ();
   maybe_putenv_OFFLOAD_TARGETS ();
diff --git a/gcc/lto-wrapper.c b/gcc/lto-wrapper.c
index 9a7bbd0c022..64dfabc202a 100644
--- a/gcc/lto-wrapper.c
+++ b/gcc/lto-wrapper.c
@@ -1250,7 +1250,7 @@ run_gcc (unsigned argc, char *argv[])
   const char **argv_ptr;
   char *list_option_full = NULL;
   const char *linker_output = NULL;
-  const char *collect_gcc, *collect_gcc_options;
+  const char *collect_gcc, *collect_gcc_options, *collect_as_option;
   int parallel = 0;
   int jobserver = 0;
   int auto_parallel = 0;
@@ -1283,6 +1283,7 @@ run_gcc (unsigned argc, char *argv[])
   get_options_from_collect_gcc_options (collect_gcc, collect_gcc_options,
 					&decoded_options,
 					&decoded_options_count);
+  collect_as_option = getenv ("COLLECT_AS_OPTION");
 
   /* Allocate array for input object files with LTO IL,
  and for possible preceding arguments.  */
@@ -1345,6 +1346,9 @@ run_gcc (unsigned argc, char *argv[])
   obstack_init (&argv_obstack);
   obstack_ptr_grow (&argv_obstack, collect_gcc);
   obstack_ptr_grow (&argv_obstack, "-xlto");
+  if (collect_as_option)
+    obstack_ptr_grow (&argv_obstack, collect_as_option);
+
   obstack_ptr_grow (&argv_obstack, "-c");
 
   append_compiler_options (&argv_obstack, fdecoded_options,
diff --git a/gcc/testsuite/gcc.target/arm/pr78353-1.c b/gcc/testsuite/gcc.target/arm/pr78353-1.c
new file mode 100644
index 000..bba81ee50c3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr78353-1.c
@@ -0,0 +1,9 @@
+/* { dg-do compile }  */
+/* { dg-options "-march=armv7-a -mthumb -O2 -flto -Wa,-mimplicit-it=always" }  */
+
+int main(int x)
+{
+  asm("teq %0, #0; addne %0, %0, #1" : "=r" (x));
+  return x;
+}
+
diff --git a/gcc/testsuite/gcc.target/arm/pr78353-2.c b/gcc/testsuite/gcc.target/arm/pr78353-2.c
new file mode 100644
index 000..776eb64b8c7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr78353-2.c
@@ -0,0 +1,9 @@
+/* { dg-do compile }  */
+/* { dg-options "-march=armv7-a -mthumb -O2 -flto -Wa,-mimplicit-it=always,-mthumb" }  */
+
+int main(int x)
+{
+  asm("teq %0, #0; addne %0, %0, #1" : "=r" (x));
+  return x;
+}
+
-- 
2.17.1



Re: AARCH64 configure check for gas -mabi support

2019-06-20 Thread Kugan Vivekanandarajah
Hi Thomas,

On Thu, 20 Jun 2019 at 20:04, Thomas Schwinge  wrote:
>
> Hi!
>
> I was just building an aarch64 cross-compiler (indeed compiler only:
> 'make all-gcc'), and then wanted to check something in gimplification
> ('-S -fdump-tree-gimple'), with '-mabi=ilp32', which told me: "cc1:
> error: assembler does not support '-mabi=ilp32'".  That's unexpected, as
> for '-S' GCC isn't even going to invoke the assembler.  It's coming from
> this change:
>
> On Wed, 11 Dec 2013 13:57:59 +0100, Christophe Lyon 
>  wrote:
> > Committed on Kugan's behalf as rev 205891.
> >
> > On 11 December 2013 13:27, Marcus Shawcroft  
> > wrote:
> > > On 10/12/13 20:23, Kugan wrote:
> > >
> > >> gcc/
> > >>
> > >> +2013-12-11  Kugan Vivekanandarajah  
> > >> +   * configure.ac: Add check for aarch64 assembler -mabi support.
> > >> +   * configure: Regenerate.
> > >> +   * config.in: Regenerate.
> > >> +   * config/aarch64/aarch64-elf.h (ASM_MABI_SPEC): New define.
> > >> +   (ASM_SPEC): Update to substitute -mabi with ASM_MABI_SPEC.
> > >> +   * config/aarch64/aarch64.h (aarch64_override_options):  Issue
> > >> error if
> > >> +   assembler does not support -mabi and option ilp32 is selected.
> > >> +   * doc/install.texi: Added note that building gcc 4.9 and after
> > >> with pre
> > >> +   2.24 binutils will not support -mabi=ilp32.
> > >> +
> > >>
> > >
> > > Kugan, Thanks for sorting this out. OK to commit.
> > >
> > > /Marcus
>
> Specifically:
>
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -5187,6 +5187,13 @@ aarch64_override_options (void)
>aarch64_parse_tune ();
>  }
>
> +#ifndef HAVE_AS_MABI_OPTION
> +  /* The compiler may have been configured with 2.23.* binutils, which 
> does
> + not have support for ILP32.  */
> +  if (TARGET_ILP32)
> +error ("Assembler does not support -mabi=ilp32");
> +#endif
>
> Why is that necessary?  Won't the assembler itself tell the user that it
> "does not support -mabi=ilp32", thus this check can be removed?  If not,
> can a condition simply be added here to only emit this error if we're
> indeed going to invoke the assembler?
Current binutils will, but binutils 2.23 and earlier did not.
Specifically, with 2.23.2, bootstrap was failing; that is why we
needed this.
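
For reference, the configure side is along these lines in gcc/configure.ac
(abridged and from memory; the exact macro arguments may differ):

gcc_GAS_CHECK_FEATURE([-mabi option], gcc_cv_as_aarch64_mabi,,
  [-mabi=lp64], [.text])
if test x$gcc_cv_as_aarch64_mabi = xyes; then
  AC_DEFINE(HAVE_AS_MABI_OPTION, 1,
	    [Define if your assembler supports the -mabi option.])
fi

When HAVE_AS_MABI_OPTION is not defined, aarch64_override_options emits
the error you quoted unconditionally, even for -S, which is what you hit.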

Thanks,
Kugan


>
> (For my own testing, I just locally disabled that, of course.)
>
>
> Grüße
>  Thomas


Re: [PATCH 0/2][RFC][PR88836][AARCH64] Fix redundant ptest instruction

2019-06-19 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for your comments.

On Thu, 16 May 2019 at 18:13, Richard Sandiford
 wrote:
>
> kugan.vivekanandara...@linaro.org writes:
> > From: Kugan Vivekanandarajah 
> >
> > In order to fix this PR:
> >  * We need to change the whilelo pattern in the backend
> >  * Change RTL CSE such that:
> >- Add support for VEC_DUPLICATE
> >- When handling a PARALLEL rtx in cse_insn, we kill the CSEs defined
> >  by all the parallel rtx sets at the end.
> >
> > For example, with patch1, we now have rtl insn as follows:
> >
> > (insn 19 18 20 3 (parallel [
> > (set (reg:VNx4BI 93 [ next_mask_18 ])
> > (unspec:VNx4BI [
> > (const_int 0 [0])
> > (reg:DI 95 [ _33 ])
> > ] UNSPEC_WHILE_LO))
> > (set (reg:CC 66 cc)
> > (compare:CC (unspec:SI [
> > (vec_duplicate:VNx4BI (const_int 1 [0x1]))
> > (reg:VNx4BI 93 [ next_mask_18 ])
> > ] UNSPEC_PTEST_PTRUE)
> > (const_int 0 [0])))
> > ]) 4244 {while_ultdivnx4bi}
> >
> > When cse_insn processes the first, it records the CSE set in reg 93.
> > Then, after processing both instructions in the parallel rtx, we
> > invalidate all expressions using reg 93, which means the expression in
> > the second instruction is invalidated for CSE. The attached patch
> > relaxes this by invalidating before processing the second.
>
> As far as patch 1 goes: the traditional reason for using clobbers
> to start with is that:
>
> - setting CC places a requirement on what CC must be after that instruction.
>   We then have to rely on REG_UNUSED notes to tell whether that value
>   actually matters or not.
>
>   This was a bigger deal before df though.  It might not matter as much now.
>
> - many passes find it harder to deal with multiple sets rather than
>   single sets, so it's better to keep a single_set unless we know
>   that both results are needed.
>
> It's currently combine's job to create a multiple set in cases
> where both results are useful.  The pattern for that already exists
> (*while_ult_cc), so if we do go for something
> like patch 1, we should simply expand to that insn rather than adding a
> new one.  Note that:
>
>   (vec_duplicate:PRED_ALL (const_int 1))
>
> shouldn't appear in the insn stream.  It should always be a const_vector
> instead.
>
> From a quick check, I haven't yet found a case where setting CC up-front
> hurts though, so maybe the above is out-of-date best practice and we
> should set the register up-front after all, if only to reduce the number
> of patterns.
>
> However, if possible, I think we should fix the PR in a way that works
> for instructions that only optionally set the flags (which for AArch64
> is the common case).  So it would be good if we could fix the PR without
> needing patch 1.

Do you think that combine should be able to set this? Sorry, I don't
understand how we can let other passes know that this instruction will
set the flags needed.

Thanks,
Kugan

>
> Thanks,
> Richard
>
> >
> > Bootstrap and regression testing for the current version is ongoing.
> >
> > Thanks,
> > Kugan
> >
> > Kugan Vivekanandarajah (2):
> >   [PR88836][aarch64] Set CC_REGNUM instead of clobber
> >   [PR88836][aarch64] Fix CSE to process parallel rtx dest one by one
> >
> >  gcc/config/aarch64/aarch64-sve.md  |  9 +++-
> >  gcc/cse.c  | 67 
> > ++
> >  gcc/testsuite/gcc.target/aarch64/pr88836.c | 14 +++
> >  3 files changed, 80 insertions(+), 10 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr88836.c


Fix ICE due to commit for PR88834

2019-06-16 Thread Kugan Vivekanandarajah
Hi All,

As pointed out to me by Jeff, after committing the patch to fix PR88834,
some tests are failing for target rx-elf. This is because in
preferred_mem_scale_factor we end up with a mem_mode that is BLKmode,
for which GET_MODE_UNIT_SIZE returns zero.

I have fixed this by checking for BLKmode, which I believe is the only
way we can get a GET_MODE_UNIT_SIZE of 0; alternatively, we could check
directly for GET_MODE_UNIT_SIZE being zero.

Bootstrapped and regression tested the attached patch on x86_64-linux-gnu
with no new regressions. Is this OK for trunk?

Thanks,
Kugan

gcc/ChangeLog:

2019-06-17  Kugan Vivekanandarajah  

* tree-ssa-address.c (preferred_mem_scale_factor): Handle when
mem_mode is BLKmode.
From 5cd4ac35ce8006a6c407a2386175382f053dcdd3 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Sun, 16 Jun 2019 21:02:59 +1000
Subject: [PATCH] Fix ICE for rx-elf

Change-Id: I503b6b8316e7d11d63ec7749ff44dbc641078539
---
 gcc/tree-ssa-address.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/gcc/tree-ssa-address.c b/gcc/tree-ssa-address.c
index cdd432a..1dca779 100644
--- a/gcc/tree-ssa-address.c
+++ b/gcc/tree-ssa-address.c
@@ -1138,6 +1138,10 @@ preferred_mem_scale_factor (tree base, machine_mode mem_mode,
   addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (base));
   unsigned int fact = GET_MODE_UNIT_SIZE (mem_mode);
 
+  /* For BLKmode, we can't do anything, so return 1.  */
+  if (mem_mode == BLKmode)
+return 1;
+
   /* Addressing mode "base + index".  */
   parts.index = integer_one_node;
   parts.base = integer_one_node;
-- 
2.7.4



Re: [AARCH64] Fix typo in comment

2019-06-12 Thread Kugan Vivekanandarajah
Hi Kyrill,

Thanks for the comments. Committed as you suggested.

Thanks,
Kugan

On Wed, 12 Jun 2019 at 18:07, Kyrill Tkachov
 wrote:
>
> Hi Kugan,
>
> On 6/12/19 4:59 AM, Kugan Vivekanandarajah wrote:
> > The AArch64 comment for the ADDSUB iterator is a typo or copy-and-paste
> > error. The attached patch fixes this. I believe this falls under the
> > obvious category. I will commit it in 48 hours unless there are comments
> > that it should be better worded.
> >
> > Thanks,
> > Kugan
> >
> >
> > gcc/ChangeLog:
> >
> > 2019-06-12  Kugan Vivekanandarajah 
> >
> > * config/aarch64/iterators.md (ADDSUB): Fix typo in comment.
>
> diff --git a/gcc/config/aarch64/iterators.md
> b/gcc/config/aarch64/iterators.md
> index 2179e6f..49c8146 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -1215,7 +1215,7 @@
>   ;; Signed and unsigned max operations.
>   (define_code_iterator USMAX [smax umax])
>
> -;; Code iterator for variants of vector max and min.
> +;; Code iterator for variants of vector plus and minus.
>
> I'd remove the "variants" and have it "Code iterator for plus and minus"
>
> I do agree such a change is obvious.
>
> Thanks,
>
> Kyrill
>
>   (define_code_iterator ADDSUB [plus minus])
>
>   ;; Code iterator for variants of vector saturating binary ops.
>


[AARCH64] Fix typo in comment

2019-06-11 Thread Kugan Vivekanandarajah
The AArch64 comment for the ADDSUB iterator is a typo or copy-and-paste
error. The attached patch fixes this. I believe this falls under the
obvious category. I will commit it in 48 hours unless there are comments
that it should be better worded.

Thanks,
Kugan


gcc/ChangeLog:

2019-06-12  Kugan Vivekanandarajah  

* config/aarch64/iterators.md (ADDSUB): Fix typo in comment.
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 2179e6f..49c8146 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -1215,7 +1215,7 @@
 ;; Signed and unsigned max operations.
 (define_code_iterator USMAX [smax umax])
 
-;; Code iterator for variants of vector max and min.
+;; Code iterator for variants of vector plus and minus.
 (define_code_iterator ADDSUB [plus minus])
 
 ;; Code iterator for variants of vector saturating binary ops.


Re: [RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode

2019-06-06 Thread Kugan Vivekanandarajah
Hi Richard,

On Thu, 6 Jun 2019 at 22:07, Richard Sandiford
 wrote:
>
> Kugan Vivekanandarajah  writes:
> > Hi Richard,
> >
> > On Thu, 6 Jun 2019 at 19:35, Richard Sandiford
> >  wrote:
> >>
> >> Kugan Vivekanandarajah  writes:
> >> > Hi Richard,
> >> >
> >> > Thanks for the review. Attached is the latest patch.
> >> >
> >> > For testcase like cond_arith_1.c, with the patch, gcc ICE in fwprop. I
> >> > am limiting fwprop in cases like this. Is there a better fix for this?
> >> > index cf2c9de..2c99285 100644
> >> > --- a/gcc/fwprop.c
> >> > +++ b/gcc/fwprop.c
> >> > @@ -1358,6 +1358,15 @@ forward_propagate_and_simplify (df_ref use,
> >> > rtx_insn *def_insn, rtx def_set)
> >> >else
> >> >  mode = GET_MODE (*loc);
> >> >
> >> > +  /* TODO. We can't get the mode for
> >> > + (set (reg:VNx16BI 109)
> >> > +  (unspec:VNx16BI [
> >> > +(reg:SI 131)
> >> > +(reg:SI 106)
> >> > +   ] UNSPEC_WHILE_LO))
> >> > + Thus, bailout when it is UNSPEC and MODEs are not compatible.  */
> >> > +  if (GET_MODE_CLASS (mode) != GET_MODE_CLASS (GET_MODE (reg)))
> >> > +return false;
> >> >new_rtx = propagate_rtx (*loc, mode, reg, src,
> >> >   optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn)));
> >>
> >> What specifically goes wrong?  The unspec above isn't that unusual --
> >> many unspecs have different modes from their inputs.
> >
> > cond_arith_1.c:38:1: internal compiler error: in paradoxical_subreg_p,
> > at rtl.h:3130
> > 0x135f1d3 paradoxical_subreg_p(machine_mode, machine_mode)
> > ../../88838/gcc/rtl.h:3130
> > 0x135f1d3 propagate_rtx
> > ../../88838/gcc/fwprop.c:683
> > 0x135f4a3 forward_propagate_and_simplify
> > ../../88838/gcc/fwprop.c:1371
> > 0x135f4a3 forward_propagate_into
> > ../../88838/gcc/fwprop.c:1430
> > 0x135fdcb fwprop
> > ../../88838/gcc/fwprop.c:1519
> > 0x135fdcb execute
> > ../../88838/gcc/fwprop.c:1550
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> >
> >
> > in forward_propagate_and_simplify
> >
> > use_set:
> > (set (reg:VNx16BI 96 [ loop_mask_52 ])
> > (unspec:VNx16BI [
> > (reg:SI 92 [ _3 ])
> > (reg:SI 95 [ niters.36 ])
> > ] UNSPEC_WHILE_LO))
> >
> > reg:
> > (reg:SI 92 [ _3 ])
> >
> > *loc:
> > (unspec:VNx16BI [
> > (reg:SI 92 [ _3 ])
> > (reg:SI 95 [ niters.36 ])
> > ] UNSPEC_WHILE_LO)
> >
> > src:
> > (subreg:SI (reg:DI 136 [ ivtmp_101 ]) 0)
> >
> > use_insn:
> > (insn 87 86 88 4 (parallel [
> > (set (reg:VNx16BI 96 [ loop_mask_52 ])
> > (unspec:VNx16BI [
> > (reg:SI 92 [ _3 ])
> > (reg:SI 95 [ niters.36 ])
> > ] UNSPEC_WHILE_LO))
> > (clobber (reg:CC 66 cc))
> > ]) 4255 {while_ultsivnx16bi}
> >  (expr_list:REG_UNUSED (reg:CC 66 cc)
> > (nil)))
> >
> > I think we calculate the mode to be VNx16BI, which is wrong; because
> > of this, the !paradoxical_subreg_p (mode, GET_MODE (SUBREG_REG
> > (new_rtx))) check in propagate_rtx ICEs.
>
> Looks like something I hit on the ACLE branch, but didn't have a
> non-ACLE reproducer for (see 065881acf0de35ff7818c1fc92769e1c106e1028).
>
> Does the attached work?  The current call is wrong because "mode"
> is the mode of "x", not the mode of "new_rtx".

Yes, the attached patch works for this testcase. Are you planning to
commit it to trunk? I will wait for it.

Thanks,
Kugan
>
> Thanks,
> Richard
>
>
> 2019-06-06  Richard Sandiford  
>
> gcc/
> * fwprop.c (propagate_rtx): Fix call to paradoxical_subreg_p.
>
> Index: gcc/fwprop.c
> ===
> --- gcc/fwprop.c2019-03-08 18:14:25.333011645 +
> +++ gcc/fwprop.c2019-06-06 13:04:34.423476690 +0100
> @@ -680,7 +680,7 @@ propagate_rtx (rtx x, machine_mode mode,
>|| CONSTANT_P (new_rtx)
>|| (GET_CODE (new_rtx) == SUBREG
>   && REG_P (SUBREG_REG (new_rtx))
> - && !paradoxical_subreg_p (mode, GET_MODE (SUBREG_REG (new_rtx)))))
> + && !paradoxical_subreg_p (new_rtx)))
>  flags |= PR_CAN_APPEAR;
>if (!varying_mem_p (new_rtx))
>  flags |= PR_HANDLE_MEM;


Re: [RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode

2019-06-06 Thread Kugan Vivekanandarajah
Hi Richard,

On Thu, 6 Jun 2019 at 19:35, Richard Sandiford
 wrote:
>
> Kugan Vivekanandarajah  writes:
> > Hi Richard,
> >
> > Thanks for the review. Attached is the latest patch.
> >
> > For testcase like cond_arith_1.c, with the patch, gcc ICE in fwprop. I
> > am limiting fwprop in cases like this. Is there a better fix for this?
> > index cf2c9de..2c99285 100644
> > --- a/gcc/fwprop.c
> > +++ b/gcc/fwprop.c
> > @@ -1358,6 +1358,15 @@ forward_propagate_and_simplify (df_ref use,
> > rtx_insn *def_insn, rtx def_set)
> >else
> >  mode = GET_MODE (*loc);
> >
> > +  /* TODO. We can't get the mode for
> > + (set (reg:VNx16BI 109)
> > +  (unspec:VNx16BI [
> > +(reg:SI 131)
> > +(reg:SI 106)
> > +   ] UNSPEC_WHILE_LO))
> > + Thus, bailout when it is UNSPEC and MODEs are not compatible.  */
> > +  if (GET_MODE_CLASS (mode) != GET_MODE_CLASS (GET_MODE (reg)))
> > +return false;
> >new_rtx = propagate_rtx (*loc, mode, reg, src,
> >   optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn)));
>
> What specifically goes wrong?  The unspec above isn't that unusual --
> many unspecs have different modes from their inputs.

cond_arith_1.c:38:1: internal compiler error: in paradoxical_subreg_p,
at rtl.h:3130
0x135f1d3 paradoxical_subreg_p(machine_mode, machine_mode)
../../88838/gcc/rtl.h:3130
0x135f1d3 propagate_rtx
../../88838/gcc/fwprop.c:683
0x135f4a3 forward_propagate_and_simplify
../../88838/gcc/fwprop.c:1371
0x135f4a3 forward_propagate_into
../../88838/gcc/fwprop.c:1430
0x135fdcb fwprop
../../88838/gcc/fwprop.c:1519
0x135fdcb execute
../../88838/gcc/fwprop.c:1550
Please submit a full bug report,
with preprocessed source if appropriate.


in forward_propagate_and_simplify

use_set:
(set (reg:VNx16BI 96 [ loop_mask_52 ])
(unspec:VNx16BI [
(reg:SI 92 [ _3 ])
(reg:SI 95 [ niters.36 ])
] UNSPEC_WHILE_LO))

reg:
(reg:SI 92 [ _3 ])

*loc:
(unspec:VNx16BI [
(reg:SI 92 [ _3 ])
(reg:SI 95 [ niters.36 ])
] UNSPEC_WHILE_LO)

src:
(subreg:SI (reg:DI 136 [ ivtmp_101 ]) 0)

use_insn:
(insn 87 86 88 4 (parallel [
(set (reg:VNx16BI 96 [ loop_mask_52 ])
(unspec:VNx16BI [
(reg:SI 92 [ _3 ])
(reg:SI 95 [ niters.36 ])
] UNSPEC_WHILE_LO))
(clobber (reg:CC 66 cc))
]) 4255 {while_ultsivnx16bi}
 (expr_list:REG_UNUSED (reg:CC 66 cc)
(nil)))

I think we calculate the mode to be VNx16BI, which is wrong; because of
this, the !paradoxical_subreg_p (mode, GET_MODE (SUBREG_REG (new_rtx)))
check in propagate_rtx ICEs.

Thanks,
Kugan

>
> Thanks,
> Richard


Re: [RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode

2019-06-05 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review. Attached is the latest patch.

For a testcase like cond_arith_1.c, with the patch, gcc ICEs in fwprop.
I am limiting fwprop in cases like this. Is there a better fix for this?
index cf2c9de..2c99285 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1358,6 +1358,15 @@ forward_propagate_and_simplify (df_ref use,
rtx_insn *def_insn, rtx def_set)
   else
 mode = GET_MODE (*loc);

+  /* TODO. We can't get the mode for
+ (set (reg:VNx16BI 109)
+  (unspec:VNx16BI [
+(reg:SI 131)
+(reg:SI 106)
+   ] UNSPEC_WHILE_LO))
+ Thus, bailout when it is UNSPEC and MODEs are not compatible.  */
+  if (GET_MODE_CLASS (mode) != GET_MODE_CLASS (GET_MODE (reg)))
+return false;
   new_rtx = propagate_rtx (*loc, mode, reg, src,
  optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn)));

Thanks,
Kugan

On Mon, 3 Jun 2019 at 19:08, Richard Sandiford
 wrote:
>
> Kugan Vivekanandarajah  writes:
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index b3fae5b..ad838dd 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -415,6 +415,7 @@ vect_set_loop_masks_directly (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> > bool might_wrap_p)
> >  {
> >tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
> > +  tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
> >tree mask_type = rgm->mask_type;
> >unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
> >poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
> > @@ -445,11 +446,16 @@ vect_set_loop_masks_directly (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >tree index_before_incr, index_after_incr;
> >gimple_stmt_iterator incr_gsi;
> >bool insert_after;
> > -  tree zero_index = build_int_cst (compare_type, 0);
> >    standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> > -  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
> > +
> > +  tree zero_index = build_int_cst (iv_type, 0);
> > +  tree step = build_int_cst (iv_type,
> > +  LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> > +  /* Creating IV of iv_type.  */
>
> s/Creating/Create/
>
> > +  create_iv (zero_index, step, NULL_TREE, loop, &incr_gsi,
> >    insert_after, &index_before_incr, &index_after_incr);
> >
> > +  zero_index = build_int_cst (compare_type, 0);
> >tree test_index, test_limit, first_limit;
> >gimple_stmt_iterator *test_gsi;
> >if (might_wrap_p)
> > [...]
> > @@ -1066,11 +1077,17 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
> > if (this_type
> > && can_produce_all_loop_masks_p (loop_vinfo, this_type))
> >   {
> > -   /* Although we could stop as soon as we find a valid mode,
> > -  it's often better to continue until we hit Pmode, since the
> > +   /* See whether zero-based IV would ever generate all-false masks
> > +  before wrapping around.  */
> > +   bool might_wrap_p = (iv_precision > cmp_bits);
> > +   /* Stop as soon as we find a valid mode.  If we decided to use
> > +  cmp_type which is less than Pmode precision, it is often 
> > better
> > +  to use iv_type corresponding to Pmode, since the
> >operands to the WHILE are more likely to be reusable in
> > -  address calculations.  */
> > -   cmp_type = this_type;
> > +  address calculations in this case.  */
>
> We're not stopping as soon as we find a valid mode though.  Any type
> that satisfies the if condition above is valid, but we pick wider
> cmp_types and iv_types for optimisation reasons.  How about:
>
>   /* Although we could stop as soon as we find a valid mode,
>  there are at least two reasons why that's not always the
>  best choice:
>
>  - An IV that's Pmode or wider is more likely to be reusable
>in address calculations than an IV that's narrower than
>Pmode.
>
>  - Doing the comparison in IV_PRECISION or wider allows
>a natural 0-based IV, whereas using a narrower comparison
>type requires mitigations against wrap-around.
>
>  Conversely, if the IV limit is variable, doing the comparison
>  in a wider type than the original type can introduce
>  unnecessary extensions, so picking the widest valid mode
> 

Re: [RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode

2019-06-02 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review,

On Fri, 31 May 2019 at 19:43, Richard Sandiford
 wrote:
>
> Kugan Vivekanandarajah  writes:
> > @@ -609,8 +615,14 @@ vect_set_loop_masks_directly (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >
> >/* Get the mask value for the next iteration of the loop.  */
> >next_mask = make_temp_ssa_name (mask_type, NULL, "next_mask");
> > -  gcall *call = vect_gen_while (next_mask, test_index, 
> > this_test_limit);
> > -  gsi_insert_before (test_gsi, call, GSI_SAME_STMT);
> > +  tree test_index_cmp_type = make_ssa_name (compare_type);
> > +  gimple *conv_stmt = gimple_build_assign (test_index_cmp_type,
> > +NOP_EXPR,
> > +test_index);
> > +  gsi_insert_before (test_gsi, conv_stmt, GSI_NEW_STMT);
>
> We only need to convert once, so this should happen before the
> FOR_EACH_VEC_ELT_REVERSE loop.  Would be better as:
>
>   gimple_seq test_seq = NULL;
>   test_index = gimple_convert (&test_seq, compare_type, test_index);
>   gimple_insert_seq_before (test_gsi, test_seq, GSI_SAME_STMT);

Ok.
>
> so that we don't generate unnecessary converts.
>
> > +  gcall *call = vect_gen_while (next_mask, test_index_cmp_type,
> > + this_test_limit);
> > +  gsi_insert_after (test_gsi, call, GSI_SAME_STMT);
> >
> >vect_set_loop_mask (loop, mask, init_mask, next_mask);
> >  }
> > @@ -637,12 +649,12 @@ vect_set_loop_condition_masked (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >
> >tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
> >unsigned int compare_precision = TYPE_PRECISION (compare_type);
> > -  unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo);
> >tree orig_niters = niters;
> >
> >/* Type of the initial value of NITERS.  */
> >tree ni_actual_type = TREE_TYPE (niters);
> >unsigned int ni_actual_precision = TYPE_PRECISION (ni_actual_type);
> > +  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> >
> >/* Convert NITERS to the same size as the compare.  */
> >if (compare_precision > ni_actual_precision
> > @@ -661,33 +673,7 @@ vect_set_loop_condition_masked (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >else
> >  niters = gimple_convert (_seq, compare_type, niters);
> >
> > -  /* Convert skip_niters to the right type.  */
> > -  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> > -
> > -  /* Now calculate the value that the induction variable must be able
> > - to hit in order to ensure that we end the loop with an all-false mask.
> > - This involves adding the maximum number of inactive trailing scalar
> > - iterations.  */
> > -  widest_int iv_limit;
> > -  bool known_max_iters = max_loop_iterations (loop, &iv_limit);
> > -  if (known_max_iters)
> > -{
> > -  if (niters_skip)
> > - {
> > -   /* Add the maximum number of skipped iterations to the
> > -  maximum iteration count.  */
> > -   if (TREE_CODE (niters_skip) == INTEGER_CST)
> > - iv_limit += wi::to_widest (niters_skip);
> > -   else
> > - iv_limit += max_vf - 1;
> > - }
> > -  /* IV_LIMIT is the maximum number of latch iterations, which is also
> > -  the maximum in-range IV value.  Round this value down to the previous
> > -  vector alignment boundary and then add an extra full iteration.  */
> > -  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > -  iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf;
> > -}
> > -
> > +  widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo);
> >/* Get the vectorization factor in tree form.  */
> >tree vf = build_int_cst (compare_type,
> >  LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> > @@ -717,7 +703,7 @@ vect_set_loop_condition_masked (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >   /* See whether zero-based IV would ever generate all-false masks
> >  before wrapping around.  */
> >   bool might_wrap_p
> > -   = (!known_max_iters
> > +   = (iv_limit == UINT_MAX
>
> Shouldn't this be == -1?
>
> >|| (wi::min_precision (iv_limit * rgm->max_nscalars_per_iter,
> >   UNSIGNED)
> >> compare_precision));
> > diff --git a/gcc/tree-vect-loop.c b/g

Re: [RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode

2019-05-30 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.

On Tue, 28 May 2019 at 20:44, Richard Sandiford
 wrote:
>
> Kugan Vivekanandarajah  writes:
> > [...]
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index b3fae5b..c15b8a2 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -415,10 +415,16 @@ vect_set_loop_masks_directly (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> > bool might_wrap_p)
> >  {
> >tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo);
> > +  tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo);
> >tree mask_type = rgm->mask_type;
> >unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
> >poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
> > +  bool convert = false;
> >
> > +  /* If the compare_type is not iv_type, we will create an IV with
> > + iv_type with truncated use (i.e. converted to the correct type).  */
> > +  if (compare_type != iv_type)
> > +convert = true;
> >/* Calculate the maximum number of scalar values that the rgroup
> >   handles in total, the number that it handles for each iteration
> >   of the vector loop, and the number that it should skip during the
> > @@ -444,12 +450,43 @@ vect_set_loop_masks_directly (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >   processed.  */
> >tree index_before_incr, index_after_incr;
> >gimple_stmt_iterator incr_gsi;
> > +  gimple_stmt_iterator incr_gsi2;
> >bool insert_after;
> > -  tree zero_index = build_int_cst (compare_type, 0);
> > +  tree zero_index;
> >standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> > -  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
> > -  insert_after, &index_before_incr, &index_after_incr);
> >
> > +  if (convert)
> > +{
> > +  /* If we are creating IV of iv_type and then converting.  */
> > +  zero_index = build_int_cst (iv_type, 0);
> > +  tree step = build_int_cst (iv_type,
> > +  LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> > +  /* Creating IV of iv_type.  */
> > +  create_iv (zero_index, step, NULL_TREE, loop, &incr_gsi,
> > +  insert_after, &index_before_incr, &index_after_incr);
> > +  /* Create truncated index_before and index_after increment.  */
> > +  tree index_before_incr_trunc = make_ssa_name (compare_type);
> > +  tree index_after_incr_trunc = make_ssa_name (compare_type);
> > +  gimple *incr_before_stmt = gimple_build_assign 
> > (index_before_incr_trunc,
> > +   NOP_EXPR,
> > +   index_before_incr);
> > +  gimple *incr_after_stmt = gimple_build_assign 
> > (index_after_incr_trunc,
> > +  NOP_EXPR,
> > +  index_after_incr);
> > +  incr_gsi2 = incr_gsi;
> > +  gsi_insert_before (&incr_gsi2, incr_before_stmt, GSI_NEW_STMT);
> > +  gsi_insert_after (&incr_gsi, incr_after_stmt, GSI_NEW_STMT);
> > +  index_before_incr = index_before_incr_trunc;
> > +  index_after_incr = index_after_incr_trunc;
> > +  zero_index = build_int_cst (compare_type, 0);
> > +}
> > +  else
> > +{
> > +  /* If the IV is of compare_type, no conversion needed.  */
> > +  zero_index = build_int_cst (compare_type, 0);
> > +  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
> > +  insert_after, &index_before_incr, &index_after_incr);
> > +}
> >tree test_index, test_limit, first_limit;
> >gimple_stmt_iterator *test_gsi;
> >if (might_wrap_p)
>
> Now that we have an explicit iv_type, there shouldn't be any need to
> treat this as two special cases.  I think we should just convert the
> IV to the comparison type before passing it to the WHILE.

Changed it.
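Something along these lines (a minimal sketch only -- variable names are
taken from vect_set_loop_masks_directly, and the exact insertion point of
the conversion is an assumption):

  /* Build the IV in iv_type unconditionally...  */
  tree zero_index = build_int_cst (iv_type, 0);
  tree step = build_int_cst (iv_type, LOOP_VINFO_VECT_FACTOR (loop_vinfo));
  create_iv (zero_index, step, NULL_TREE, loop, &incr_gsi,
	     insert_after, &index_before_incr, &index_after_incr);

  /* ...and convert the tested value to compare_type once, just before
     the WHILE.  */
  gimple_seq test_seq = NULL;
  test_index = gimple_convert (&test_seq, compare_type, test_index);
  gsi_insert_seq_before (test_gsi, test_seq, GSI_SAME_STMT);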
>
> > @@ -617,6 +654,41 @@ vect_set_loop_masks_directly (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >return next_mask;
> >  }
> >
> > +/* Return the iv_limit for fully masked loop LOOP with LOOP_VINFO.
> > +   If it is not possible to calculate iv_limit, return -1.  */
>
> Maybe:
>
> /* Decide whether it is possible to use a zero-based induction variable
>when vectorizing LOOP_VINFO with a fully-masked loop.  If it is,
>return the value that the induction variable must be able to hold
>in order to ensure that the loop ends with an all

Re: [RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode

2019-05-27 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.

On Sat, 25 May 2019 at 19:41, Richard Sandiford
 wrote:
>
> Kugan Vivekanandarajah  writes:
> > diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> > index 77d3dac..d6452a1 100644
> > --- a/gcc/tree-vect-loop-manip.c
> > +++ b/gcc/tree-vect-loop-manip.c
> > @@ -418,7 +418,20 @@ vect_set_loop_masks_directly (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >tree mask_type = rgm->mask_type;
> >unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
> >poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
> > -
> > +  bool convert = false;
> > +  tree iv_type = NULL_TREE;
> > +
> > +  /* If the compare_type is not with Pmode size, we will create an IV with
> > + Pmode size with truncated use (i.e. converted to the correct type).
> > + This is because using Pmode allows ivopts to reuse the IV for indices
> > + (in the loads and store).  */
> > +  if (known_lt (GET_MODE_BITSIZE (TYPE_MODE (compare_type)),
> > + GET_MODE_BITSIZE (Pmode)))
> > +{
> > +  iv_type = build_nonstandard_integer_type (GET_MODE_BITSIZE (Pmode),
> > + true);
> > +  convert = true;
> > +}
> >/* Calculate the maximum number of scalar values that the rgroup
> >   handles in total, the number that it handles for each iteration
> >   of the vector loop, and the number that it should skip during the
> > @@ -444,12 +457,43 @@ vect_set_loop_masks_directly (struct loop *loop, 
> > loop_vec_info loop_vinfo,
> >   processed.  */
> >tree index_before_incr, index_after_incr;
> >gimple_stmt_iterator incr_gsi;
> > +  gimple_stmt_iterator incr_gsi2;
> >bool insert_after;
> > -  tree zero_index = build_int_cst (compare_type, 0);
> > +  tree zero_index;
> >standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> > -  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
> > -  insert_after, &index_before_incr, &index_after_incr);
> >
> > +  if (convert)
> > +{
> > +  /* If we are creating IV of Pmode type and converting.  */
> > +  zero_index = build_int_cst (iv_type, 0);
> > +  tree step = build_int_cst (iv_type,
> > +  LOOP_VINFO_VECT_FACTOR (loop_vinfo));
> > +  /* Creating IV of Pmode type.  */
> > +  create_iv (zero_index, step, NULL_TREE, loop, &incr_gsi,
> > +  insert_after, &index_before_incr, &index_after_incr);
> > +  /* Create truncated index_before and index_after increment.  */
> > +  tree index_before_incr_trunc = make_ssa_name (compare_type);
> > +  tree index_after_incr_trunc = make_ssa_name (compare_type);
> > +  gimple *incr_before_stmt = gimple_build_assign 
> > (index_before_incr_trunc,
> > +   NOP_EXPR,
> > +   index_before_incr);
> > +  gimple *incr_after_stmt = gimple_build_assign 
> > (index_after_incr_trunc,
> > +  NOP_EXPR,
> > +  index_after_incr);
> > +  incr_gsi2 = incr_gsi;
> > +  gsi_insert_before (&incr_gsi2, incr_before_stmt, GSI_NEW_STMT);
> > +  gsi_insert_after (&incr_gsi, incr_after_stmt, GSI_NEW_STMT);
> > +  index_before_incr = index_before_incr_trunc;
> > +  index_after_incr = index_after_incr_trunc;
> > +  zero_index = build_int_cst (compare_type, 0);
> > +}
> > +  else
> > +{
> > +  /* If the IV is of Pmode compare_type, no conversion needed.  */
> > +  zero_index = build_int_cst (compare_type, 0);
> > +  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
> > +  insert_after, &index_before_incr, &index_after_incr);
> > +}
> >tree test_index, test_limit, first_limit;
> >gimple_stmt_iterator *test_gsi;
> >if (might_wrap_p)
>
> Instead of hard-coding Pmode as a special case here, I think we should
> record the IV type in vect_verify_full_masking in addition to the comparison
> type.  (With the IV type always being at least as wide as the comparison
> type.)
Ok.
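As a rough sketch of the recording side (the LOOP_VINFO_MASK_IV_TYPE field
is the one the revised patch adds; the exact wiring here is an assumption):

  /* In vect_verify_full_masking, once a valid cmp_bits is found, record
     both the comparison type and a possibly wider IV type, instead of
     hard-coding Pmode at the use site.  */
  tree this_type = build_nonstandard_integer_type (cmp_bits, true);
  tree iv_type = this_type;
  if (known_lt (GET_MODE_BITSIZE (TYPE_MODE (this_type)),
		GET_MODE_BITSIZE (Pmode)))
    iv_type = build_nonstandard_integer_type (GET_MODE_BITSIZE (Pmode), true);
  LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo) = this_type;
  LOOP_VINFO_MASK_IV_TYPE (loop_vinfo) = iv_type;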

>
> > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> > index bd81193..2769c86 100644
> > --- a/gcc/tree-vect-loop.c
> > +++ b/gcc/tree-vect-loop.c
> > @@ -1035,6 +1035,30 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
> >/* Find a scalar mode for which WHILE_ULT is supported.  */
> >opt_scalar_int_m

Re: [PATCH 1/2] Add support for IVOPT

2019-05-21 Thread Kugan Vivekanandarajah
Hi Richard,


On Fri, 17 May 2019 at 18:47, Richard Sandiford
 wrote:
>
> Kugan Vivekanandarajah  writes:
> > [...]
> >> > +{
> >> > +  struct mem_address parts = {NULL_TREE, integer_one_node,
> >> > +   NULL_TREE, NULL_TREE, NULL_TREE};
> >>
> >> Might be better to use "= {}" and initialise the fields that matter by
> >> assignment.  As it stands this uses integer_one_node as the base, but I
> >> couldn't tell if that was deliberate.
> >
> > I just copied this part from get_address_cost, similar to what is done
> > there.
>
> Ah, sorry :-)
>
> > I have now changed the way you suggested but using the values
> > used in get_address_cost.
>
> Thanks.
>
> > [...]
> > @@ -3479,6 +3481,35 @@ add_iv_candidate_derived_from_uses (struct 
> > ivopts_data *data)
> >data->iv_common_cands.truncate (0);
> >  }
> >
> > +/* Return the preferred mem scale factor for accessing MEM_MODE
> > +   of BASE in LOOP.  */
> > +static unsigned int
> > +preferred_mem_scale_factor (struct loop *loop,
> > + tree base, machine_mode mem_mode)
>
> IMO this should live in tree-ssa-address.c instead.
>
> The only use of "loop" is to test for size vs. speed, but other callers
> might want to make that decision based on individual blocks, so I think
> it would make sense to pass a "speed" bool instead.  Probably also worth
> making it the last parameter, so that the order is consistent with
> address_cost (though probably then inconsistent with something else :-)).
>
> > [...]
> > @@ -3500,6 +3531,28 @@ add_iv_candidate_for_use (struct ivopts_data *data, 
> > struct iv_use *use)
> >  basetype = sizetype;
> >record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
> >
> > +  /* Compare the cost of an address with an unscaled index with the cost of
> > +an address with a scaled index and add candidate if useful. */
> > +  if (use != NULL
> > +  && poly_int_tree_p (iv->step)
> > +  && tree_fits_poly_int64_p (iv->step)
> > +  && address_p (use->type))
> > +{
> > +  poly_int64 new_step;
> > +  poly_int64 poly_step = tree_to_poly_int64 (iv->step);
>
> This should be:
>
>   poly_int64 step;
>   if (use != NULL
>   && poly_int_tree_p (iv->step, &step)
>   && address_p (use->type))
> {
>   poly_int64 new_step;
>
> > +  unsigned int fact
> > + = preferred_mem_scale_factor (data->current_loop,
> > +use->iv->base,
> > +TYPE_MODE (use->mem_type));
> > +
> > +  if ((fact != 1)
> > +   && multiple_p (poly_step, fact, &new_step))
>
> Should be no brackets around "fact != 1".
>
> > [...]
>
> Looks really good to me otherwise, thanks.  Bin, any comments?
Revised patch which handles the above review comments is attached.

Thanks,
Kugan

> Richard
From 6a146662fab39de876de332bacbb1a3300caefb8 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Wed, 15 May 2019 09:16:43 +1000
Subject: [PATCH 1/2] Add support for IVOPT

gcc/ChangeLog:

2019-05-15  Kugan Vivekanandarajah  

	PR target/88834
	* tree-ssa-loop-ivopts.c (get_mem_type_for_internal_fn): Handle
	IFN_MASK_LOAD_LANES and IFN_MASK_STORE_LANES.
	(get_alias_ptr_type_for_ptr_address): Likewise.
	(add_iv_candidate_for_use): Add scaled index candidate if useful.
	* tree-ssa-address.c (preferred_mem_scale_factor): New.

Change-Id: Ie47b1722dc4fb430f07dadb8a58385759e75df58
---
 gcc/tree-ssa-address.c | 28 
 gcc/tree-ssa-address.h |  3 +++
 gcc/tree-ssa-loop-ivopts.c | 26 +-
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-address.c b/gcc/tree-ssa-address.c
index 1c17e93..fdb6619 100644
--- a/gcc/tree-ssa-address.c
+++ b/gcc/tree-ssa-address.c
@@ -1127,6 +1127,34 @@ maybe_fold_tmr (tree ref)
   return new_ref;
 }
 
+/* Return the preferred mem scale factor for accessing MEM_MODE
+   of BASE which is optimized for SPEED.  */
+unsigned int
+preferred_mem_scale_factor (tree base, machine_mode mem_mode,
+			bool speed)
+{
+  struct mem_address parts = {};
+  addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (base));
+  unsigned int fact = GET_MODE_UNIT_SIZE (mem_mode);
+
+  /* Addressing mode "base + index".  */
+  parts.index = integer_one_node;
+  parts.base = integer_one_node;
+  rtx addr = addr_for_mem_ref (&parts, as, false);
+  unsigned cost = address_cost (addr, mem_mode, as, speed)

[RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode

2019-05-21 Thread Kugan Vivekanandarajah
Hi,

Attached RFC patch attempts to use 32-bit WHILELO in LP64 mode to fix
the PR. Bootstrap and regression testing are ongoing. In earlier testing,
I ran into an issue related to fwprop. I will tackle that based on the
feedback for the patch.

Thanks,
Kugan
From 4e9837ff9c0c080923f342e83574a6fdba2b3d92 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Tue, 5 Mar 2019 10:01:45 +1100
Subject: [PATCH] pr88838[v2]

As mentioned in PR88838, this patch avoids the SXTW by using WHILELO on W
registers instead of X registers.

As mentioned in the PR, vect_verify_full_masking checks which IV widths are
supported for WHILELO but prefers to go to Pmode width.  This is because
using Pmode allows ivopts to reuse the IV for indices (as in the loads
and store above).  However, it would be better to use a 32-bit WHILELO
with a truncated 64-bit IV if:

(a) the limit is extended from 32 bits.
(b) the detection loop in vect_verify_full_masking detects that using a
32-bit IV would be correct.

gcc/ChangeLog:

2019-05-22  Kugan Vivekanandarajah  

	* tree-vect-loop-manip.c (vect_set_loop_masks_directly): If the
	compare_type is not with Pmode size, we will create an IV with
	Pmode size with truncated use (i.e. converted to the correct type).
	* tree-vect-loop.c (vect_verify_full_masking): Find which IV
	widths are supported for WHILELO.

gcc/testsuite/ChangeLog:

2019-05-22  Kugan Vivekanandarajah  

	* gcc.target/aarch64/pr88838.c: New test.
	* gcc.target/aarch64/sve/while_1.c: Adjust.

Change-Id: Iff52946c28d468078f2cc0868d53edb05325b8ca
---
 gcc/fwprop.c   | 13 +++
 gcc/testsuite/gcc.target/aarch64/pr88838.c | 11 ++
 gcc/testsuite/gcc.target/aarch64/sve/while_1.c | 16 
 gcc/tree-vect-loop-manip.c | 52 --
 gcc/tree-vect-loop.c   | 39 ++-
 5 files changed, 117 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr88838.c

diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index cf2c9de..5275ad3 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1358,6 +1358,19 @@ forward_propagate_and_simplify (df_ref use, rtx_insn *def_insn, rtx def_set)
   else
 mode = GET_MODE (*loc);
 
+  /* TODO.  */
+  if (GET_MODE_CLASS (mode) != GET_MODE_CLASS (GET_MODE (reg)))
+return false;
+  /* TODO. We can't get the mode for
+ (set (reg:VNx16BI 109)
+  (unspec:VNx16BI [
+	(reg:SI 131)
+	(reg:SI 106)
+   ] UNSPEC_WHILE_LO))
+ Thus, bailout when it is UNSPEC and MODEs are not compatible.  */
+  if (GET_MODE_CLASS (mode) != GET_MODE_CLASS (GET_MODE (reg))
+  && GET_CODE (SET_SRC (use_set)) == UNSPEC)
+return false;
   new_rtx = propagate_rtx (*loc, mode, reg, src,
   			   optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn)));
 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr88838.c b/gcc/testsuite/gcc.target/aarch64/pr88838.c
new file mode 100644
index 000..9d03c0a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr88838.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-S -O3 -march=arm8.2-a+sve" } */
+
+void
+f (int *restrict x, int *restrict y, int *restrict z, int n)
+{
+for (int i = 0; i < n; i += 1)
+  x[i] = y[i] + z[i];
+}
+
+/* { dg-final { scan-assembler-not "sxtw" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/while_1.c b/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
index a93a04b..05a4860 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
@@ -26,14 +26,14 @@
 TEST_ALL (ADD_LOOP)
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, wzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, w[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, wzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, w[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, wzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, w[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, wzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, w[0-9]+,} 3 } } */
 /* { dg-final { scan-assembler-time

Re: [PATCH v3 2/3] Add predict_doloop_p target hook

2019-05-16 Thread Kugan Vivekanandarajah
> +default_predict_doloop_p (struct loop *loop ATTRIBUTE_UNUSED)
> +{
> +  return false;
> +}
> +
>  /* NULL if INSN insn is valid within a low-overhead loop, otherwise returns
> an error message.
>
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index 5943627..70bfb17 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -85,6 +85,7 @@ extern bool default_fixed_point_supported_p (void);
>
>  extern bool default_has_ifunc_p (void);
>
> +extern bool default_predict_doloop_p (struct loop *);
>  extern const char * default_invalid_within_doloop (const rtx_insn *);
>
>  extern tree default_builtin_vectorized_function (unsigned int, tree, tree);
> diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
> index a44b4cb..ed7f2a5 100644
> --- a/gcc/tree-ssa-loop-ivopts.c
> +++ b/gcc/tree-ssa-loop-ivopts.c
> @@ -3776,6 +3776,111 @@ prepare_decl_rtl (tree *expr_p, int *ws, void *data)
>return NULL_TREE;
>  }
>
> +/* Check whether number of iteration computation is too costly for doloop
> +   transformation.  It expands the gimple sequence to equivalent RTL insn
> +   sequence, then evaluate the cost.
> +
> +   Return true if it's costly, otherwise return false.  */
> +
> +static bool
> +costly_iter_for_doloop_p (struct loop *loop, tree niters)
> +{
> +  tree type = TREE_TYPE (niters);
> +  unsigned cost = 0;
> +  bool speed = optimize_loop_for_speed_p (loop);
> +  int regno = LAST_VIRTUAL_REGISTER + 1;
> > +  walk_tree (&niters, prepare_decl_rtl, &regno, NULL);
> +  start_sequence ();
> +  expand_expr (niters, NULL_RTX, TYPE_MODE (type), EXPAND_NORMAL);
> +  rtx_insn *seq = get_insns ();
> +  end_sequence ();
> +
> +  for (; seq; seq = NEXT_INSN (seq))
> +{
> +  if (!INSN_P (seq))
> +   continue;
> +  rtx body = PATTERN (seq);
> +  if (GET_CODE (body) == SET)
> +   {
> + rtx set_val = XEXP (body, 1);
> + enum rtx_code code = GET_CODE (set_val);
> + enum rtx_class cls = GET_RTX_CLASS (code);
> + /* For now, we only consider these two RTX classes, to match what we
> +get in doloop_optimize, excluding operations like zero/sign extend.  
> */
> + if (cls == RTX_BIN_ARITH || cls == RTX_COMM_ARITH)
> +   cost += set_src_cost (set_val, GET_MODE (set_val), speed);
Can't you have a PARALLEL with a SET here?
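For a PARALLEL the walk would need something like this (just a sketch of
the shape, not tested):

  if (GET_CODE (body) == PARALLEL)
    for (int j = 0; j < XVECLEN (body, 0); j++)
      {
	rtx elt = XVECEXP (body, 0, j);
	/* Cost each SET_SRC the same way as the plain SET case,
	   with the same RTX-class filter as above.  */
	if (GET_CODE (elt) == SET)
	  cost += set_src_cost (XEXP (elt, 1), GET_MODE (XEXP (elt, 1)),
				speed);
      }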

> +   }
> +}
> +  unsigned max_cost
> += COSTS_N_INSNS (PARAM_VALUE (PARAM_MAX_ITERATIONS_COMPUTATION_COST));
> +  if (cost > max_cost)
> +return true;
Maybe it is better to bail out early when the limit is reached instead of
doing the check outside the loop?
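I.e. something like this (sketch only, reusing the names from the quoted
hunk):

  unsigned max_cost
    = COSTS_N_INSNS (PARAM_VALUE (PARAM_MAX_ITERATIONS_COMPUTATION_COST));
  for (; seq; seq = NEXT_INSN (seq))
    {
      if (!INSN_P (seq))
	continue;
      rtx body = PATTERN (seq);
      if (GET_CODE (body) != SET)
	continue;
      rtx set_val = XEXP (body, 1);
      enum rtx_class cls = GET_RTX_CLASS (GET_CODE (set_val));
      if (cls == RTX_BIN_ARITH || cls == RTX_COMM_ARITH)
	{
	  cost += set_src_cost (set_val, GET_MODE (set_val), speed);
	  if (cost > max_cost)
	    return true;	/* Bail out as soon as the limit is hit.  */
	}
    }
  return false;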

Thanks,
Kugan

> +
> +  return false;
> +}
> +
> +/* Predict whether the given loop will be transformed in the RTL
> +   doloop_optimize pass.  Attempt to duplicate as many doloop_optimize checks
> +   as possible.  This is only for target independent checks, see
> +   targetm.predict_doloop_p for the target dependent ones.
> +
> +   Some RTL specific checks seems unable to be checked in gimple, if any new
> +   checks or easy checks _are_ missing here, please add them.  */
> +
> +static bool
> +generic_predict_doloop_p (struct ivopts_data *data)
> +{
> +  struct loop *loop = data->current_loop;
> +
> +  /* Call target hook for target dependent checks.  */
> +  if (!targetm.predict_doloop_p (loop))
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +   fprintf (dump_file, "predict doloop failure due to"
> +   "target specific checks.\n");
> +  return false;
> +}
> +
> +  /* Similar to doloop_optimize, check iteration description to know it's
> + suitable or not.  */
> +  edge exit = loop_latch_edge (loop);
> +  struct tree_niter_desc *niter_desc = niter_for_exit (data, exit);
> +  if (niter_desc == NULL)
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +   fprintf (dump_file, "predict doloop failure due to"
> +   "unexpected niters.\n");
> +  return false;
> +}
> +
> +  /* Similar to doloop_optimize, check whether iteration count too small
> + and not profitable.  */
> +  HOST_WIDE_INT est_niter = get_estimated_loop_iterations_int (loop);
> +  if (est_niter == -1)
> +est_niter = get_likely_max_loop_iterations_int (loop);
> +  if (est_niter >= 0 && est_niter < 3)
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +   fprintf (dump_file,
> +"predict doloop failure due to"
> +"too few iterations (%u).\n",
> +(unsigned int) est_niter);
> +  return false;
> +}
> +
> +  /* Similar to doloop_optimize, check whether number of iterations too 
> costly
> + to compute.  */
> +  if (costly_iter_for_doloop_p (loop, niter_desc->niter))
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +   fprintf (dump_file, "predict doloop failure due to"
> +   "costly niter computation.\n");
> +  return false;
> +}
> +
> +  return true;
> +}
> +
>  /* Determines cost of the computation of EXPR.  */
>
>  static unsigned
> --
> 2.7.4
>


Re: [PATCH 1/2] Add support for IVOPT

2019-05-16 Thread Kugan Vivekanandarajah
Hi Richard,

On Thu, 16 May 2019 at 21:14, Richard Biener  wrote:
>
> On Wed, May 15, 2019 at 4:40 AM  wrote:
> >
> > From: Kugan Vivekanandarajah 
> >
> > gcc/ChangeLog:
> >
> > 2019-05-15  Kugan Vivekanandarajah  
> >
> > PR target/88834
> > * tree-ssa-loop-ivopts.c (get_mem_type_for_internal_fn): Handle
> > IFN_MASK_LOAD_LANES and IFN_MASK_STORE_LANES.
> > (find_interesting_uses_stmt): Likewise.
> > (get_alias_ptr_type_for_ptr_address): Likewise.
> > (add_iv_candidate_for_use): Add scaled index candidate if useful.
> >
> > Change-Id: I8e8151fe2dde2845dedf38b090103694da6fc9d1
> > ---
> >  gcc/tree-ssa-loop-ivopts.c | 60 
> > +-
> >  1 file changed, 59 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
> > index 9864b59..115a70c 100644
> > --- a/gcc/tree-ssa-loop-ivopts.c
> > +++ b/gcc/tree-ssa-loop-ivopts.c
> > @@ -2451,11 +2451,13 @@ get_mem_type_for_internal_fn (gcall *call, tree 
> > *op_p)
> >switch (gimple_call_internal_fn (call))
> >  {
> >  case IFN_MASK_LOAD:
> > +case IFN_MASK_LOAD_LANES:
> >if (op_p == gimple_call_arg_ptr (call, 0))
> > return TREE_TYPE (gimple_call_lhs (call));
> >return NULL_TREE;
> >
> >  case IFN_MASK_STORE:
> > +case IFN_MASK_STORE_LANES:
> >if (op_p == gimple_call_arg_ptr (call, 0))
> > return TREE_TYPE (gimple_call_arg (call, 3));
> >return NULL_TREE;
> > @@ -2545,7 +2547,7 @@ find_interesting_uses_stmt (struct ivopts_data *data, 
> > gimple *stmt)
> >   return;
> > }
> >
> > -  /* TODO -- we should also handle address uses of type
> > +  /* TODO -- we should also handle all address uses of type
> >
> >  memory = call (whatever);
> >
> > @@ -2553,6 +2555,27 @@ find_interesting_uses_stmt (struct ivopts_data 
> > *data, gimple *stmt)
> >
> >  call (memory).  */
> >  }
> > +  else if (is_gimple_call (stmt))
> > +{
> > +  gcall *call = dyn_cast <gcall *> (stmt);
> > +  if (call
>
> that's testing things twice, just do
>
>else if (gcall *call = dyn_cast <gcall *> (stmt))
>  {
> ...
>
> no other comments besides why do you need _LANES handling here where
> the w/o _LANES handling didn't need anything.
Right,  I have now changed this in the revised patch.

Thanks,
Kugan

>
> > + && gimple_call_internal_p (call)
> > + && (gimple_call_internal_fn (call) == IFN_MASK_LOAD_LANES
> > + || gimple_call_internal_fn (call) == IFN_MASK_STORE_LANES))
> > +   {
> > + tree *arg = gimple_call_arg_ptr (call, 0);
> > + struct iv *civ = get_iv (data, *arg);
> > + tree mem_type = get_mem_type_for_internal_fn (call, arg);
> > + if (civ && mem_type)
> > +   {
> > + civ = alloc_iv (data, civ->base, civ->step);
> > + record_group_use (data, arg, civ, stmt, USE_PTR_ADDRESS,
> > +   mem_type);
> > + return;
> > +   }
> > +   }
> > +}
> > +
> >
> >if (gimple_code (stmt) == GIMPLE_PHI
> >&& gimple_bb (stmt) == data->current_loop->header)
> > @@ -3500,6 +3523,39 @@ add_iv_candidate_for_use (struct ivopts_data *data, 
> > struct iv_use *use)
> >  basetype = sizetype;
> >record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
> >
> > +  /* Compare the cost of an address with an unscaled index with the cost of
> > +an address with a scaled index and add candidate if useful. */
> > +  if (use != NULL && use->type == USE_PTR_ADDRESS)
> > +{
> > +  struct mem_address parts = {NULL_TREE, integer_one_node,
> > + NULL_TREE, NULL_TREE, NULL_TREE};
> > +  poly_uint64 temp;
> > +  poly_int64 fact;
> > +  bool speed = optimize_loop_for_speed_p (data->current_loop);
> > +  poly_int64 poly_step = tree_to_poly_int64 (iv->step);
> > +  machine_mode mem_mode = TYPE_MODE (use->mem_type);
> > +  addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (use->iv->base));
> > +
> > +  fact = GET_MODE_SIZE (GET_MODE_INNER (TYPE_MODE (use->mem_type)));
> > +  parts.index = integer_one_node;
> > +
> > +  if (fac

Re: [PATCH 1/2] Add support for IVOPT

2019-05-16 Thread Kugan Vivekanandarajah
Hi Richard,

On Wed, 15 May 2019 at 16:57, Richard Sandiford
 wrote:
>
> Thanks for doing this.
>
> kugan.vivekanandara...@linaro.org writes:
> > From: Kugan Vivekanandarajah 
> >
> > gcc/ChangeLog:
> >
> > 2019-05-15  Kugan Vivekanandarajah  
> >
> >   PR target/88834
> >   * tree-ssa-loop-ivopts.c (get_mem_type_for_internal_fn): Handle
> >   IFN_MASK_LOAD_LANES and IFN_MASK_STORE_LANES.
> >   (find_interesting_uses_stmt): Likewise.
> >   (get_alias_ptr_type_for_ptr_address): Likewise.
> >   (add_iv_candidate_for_use): Add scaled index candidate if useful.
> >
> > Change-Id: I8e8151fe2dde2845dedf38b090103694da6fc9d1
> > ---
> >  gcc/tree-ssa-loop-ivopts.c | 60 
> > +-
> >  1 file changed, 59 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
> > index 9864b59..115a70c 100644
> > --- a/gcc/tree-ssa-loop-ivopts.c
> > +++ b/gcc/tree-ssa-loop-ivopts.c
> > @@ -2451,11 +2451,13 @@ get_mem_type_for_internal_fn (gcall *call, tree 
> > *op_p)
> >switch (gimple_call_internal_fn (call))
> >  {
> >  case IFN_MASK_LOAD:
> > +case IFN_MASK_LOAD_LANES:
> >if (op_p == gimple_call_arg_ptr (call, 0))
> >   return TREE_TYPE (gimple_call_lhs (call));
> >return NULL_TREE;
> >
> >  case IFN_MASK_STORE:
> > +case IFN_MASK_STORE_LANES:
> >if (op_p == gimple_call_arg_ptr (call, 0))
> >   return TREE_TYPE (gimple_call_arg (call, 3));
> >return NULL_TREE;
> > @@ -2545,7 +2547,7 @@ find_interesting_uses_stmt (struct ivopts_data *data, 
> > gimple *stmt)
> > return;
> >   }
> >
> > -  /* TODO -- we should also handle address uses of type
> > +  /* TODO -- we should also handle all address uses of type
> >
> >memory = call (whatever);
> >
> > @@ -2553,6 +2555,27 @@ find_interesting_uses_stmt (struct ivopts_data 
> > *data, gimple *stmt)
> >
> >call (memory).  */
> >  }
> > +  else if (is_gimple_call (stmt))
> > +{
> > +  gcall *call = dyn_cast <gcall *> (stmt);
> > +  if (call
> > +   && gimple_call_internal_p (call)
> > +   && (gimple_call_internal_fn (call) == IFN_MASK_LOAD_LANES
> > +   || gimple_call_internal_fn (call) == IFN_MASK_STORE_LANES))
> > + {
> > +   tree *arg = gimple_call_arg_ptr (call, 0);
> > +   struct iv *civ = get_iv (data, *arg);
> > +   tree mem_type = get_mem_type_for_internal_fn (call, arg);
> > +   if (civ && mem_type)
> > + {
> > +   civ = alloc_iv (data, civ->base, civ->step);
> > +   record_group_use (data, arg, civ, stmt, USE_PTR_ADDRESS,
> > + mem_type);
> > +   return;
> > + }
> > + }
> > +}
> > +
>
> Why do you need to handle this specially?  Does:
>
>   FOR_EACH_PHI_OR_STMT_USE (use_p, stmt, iter, SSA_OP_USE)
> {
>   op = USE_FROM_PTR (use_p);
>
>   if (TREE_CODE (op) != SSA_NAME)
> continue;
>
>   iv = get_iv (data, op);
>   if (!iv)
> continue;
>
>   if (!find_address_like_use (data, stmt, use_p->use, iv))
> find_interesting_uses_op (data, op);
> }
>
> not do the right thing for the load/store lane case?
Right, I initially thought load lanes should be handled differently
but turned out they can be done the same way. I should have removed
it. Done now.

>
> > @@ -3500,6 +3523,39 @@ add_iv_candidate_for_use (struct ivopts_data *data, 
> > struct iv_use *use)
> >  basetype = sizetype;
> >record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
> >
> > +  /* Compare the cost of an address with an unscaled index with the cost of
> > +an address with a scaled index and add candidate if useful. */
> > +  if (use != NULL && use->type == USE_PTR_ADDRESS)
>
> I think we want this for all address uses.  E.g. for SVE, masked and
> unmasked accesses would both benefit.
OK.
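The guard then becomes roughly (as in the later revision of this patch,
shown here only for reference):

  poly_int64 step;
  if (use != NULL
      && poly_int_tree_p (iv->step, &step)
      && address_p (use->type))
    {
      /* ...consider the scaled-index candidate for any address use...  */
    }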

>
> > +{
> > +  struct mem_address parts = {NULL_TREE, integer_one_node,
> > +   NULL_TREE, NULL_TREE, NULL_TREE};
>
> Might be better to use "= {}" and initialise the fields that matter by
> assignment.  As it stands this uses integer_one_node as the base, but I
> couldn't tell if that was deliberate.

I just copied this part from get_address

Re: [PATCH 2/2] aarch64 back-end changes

2019-05-15 Thread Kugan Vivekanandarajah
Hi Richard,

On Wed, 15 May 2019 at 23:24, Richard Earnshaw (lists)
 wrote:
>
> On 15/05/2019 13:48, Richard Earnshaw (lists) wrote:
> > On 15/05/2019 03:39, kugan.vivekanandara...@linaro.org wrote:
> >> From: Kugan Vivekanandarajah 
> >>
> >
> > The subject line to this email is not helpful.  Why should I be
> > interested in reviewing this patch?  Also, why does it claim to be 2/2
> > when there's no 1/2 to go with it?
>
> Ah, just noticed that there is a part 1/2 (and a part 0/2) but they hit
> a different filter in my mailer so ended up in a different folder.  That
> doesn't affect the other comments, though: each patch should be
> self-contained, even if there's an overall covering letter.

My bad. I will do as you suggested from now on.

Thanks,
Kugan

>
> R.
>
> >
> > Please include with all patches a justification giving background to why
> > you believe the patch is correct.  All patches need this sort of
> > description - don't assume that the reviewer is familiar with the code
> > or will just accept your word for it.
> >
> > R.
> >


[PATCH 2/2] [PR88836][aarch64] Fix CSE to process parallel rtx dest one by one

2019-05-15 Thread kugan . vivekanandarajah
From: Kugan Vivekanandarajah 

This patch changes cse_insn to process parallel rtx one by one such that
any destination rtx in cse list is invalidated before processing the
next.

gcc/ChangeLog:

2019-05-16  Kugan Vivekanandarajah  

PR target/88836
* cse.c (safe_hash): Handle VEC_DUPLICATE.
(exp_equiv_p): Likewise.
(hash_rtx_cb): Change to accept const_rtx.
(struct set): Add field to record if uses of dest is invalidated.
(cse_insn): For parallel rtx, invalidate register set by first rtx
before processing the next.

gcc/testsuite/ChangeLog:

2019-05-16  Kugan Vivekanandarajah  

PR target/88836
* gcc.target/aarch64/pr88836.c: New test.

Change-Id: I7c3a61f034128f38abe0c2b7dab5d81dec28146c
---
 gcc/cse.c  | 67 ++
 gcc/testsuite/gcc.target/aarch64/pr88836.c | 14 +++
 2 files changed, 73 insertions(+), 8 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr88836.c

diff --git a/gcc/cse.c b/gcc/cse.c
index 6c9cda1..9dc31f5 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -570,7 +570,7 @@ static void invalidate_for_call (void);
 static rtx use_related_value (rtx, struct table_elt *);
 
 static inline unsigned canon_hash (rtx, machine_mode);
-static inline unsigned safe_hash (rtx, machine_mode);
+static inline unsigned safe_hash (const_rtx, machine_mode);
 static inline unsigned hash_rtx_string (const char *);
 
 static rtx canon_reg (rtx, rtx_insn *);
@@ -2369,6 +2369,11 @@ hash_rtx_cb (const_rtx x, machine_mode mode,
   hash += fixed_hash (CONST_FIXED_VALUE (x));
   return hash;
 
+case VEC_DUPLICATE:
+  return hash_rtx_cb (XEXP (x, 0), VOIDmode,
+ do_not_record_p, hash_arg_in_memory_p,
+ have_reg_qty, cb);
+
 case CONST_VECTOR:
   {
int units;
@@ -2599,7 +2604,7 @@ canon_hash (rtx x, machine_mode mode)
and hash_arg_in_memory are not changed.  */
 
 static inline unsigned
-safe_hash (rtx x, machine_mode mode)
+safe_hash (const_rtx x, machine_mode mode)
 {
   int dummy_do_not_record;
  return hash_rtx (x, mode, &dummy_do_not_record, NULL, true);
@@ -2630,6 +2635,16 @@ exp_equiv_p (const_rtx x, const_rtx y, int validate, 
bool for_gcse)
 return x == y;
 
   code = GET_CODE (x);
+  if ((code == CONST_VECTOR && GET_CODE (y) == VEC_DUPLICATE)
+   || (code == VEC_DUPLICATE && GET_CODE (y) == CONST_VECTOR))
+{
+  if (code == VEC_DUPLICATE)
+   std::swap (x, y);
+  if (const_vector_encoded_nelts (x) != 1)
+   return 0;
+  return exp_equiv_p (CONST_VECTOR_ENCODED_ELT (x, 0), XEXP (y, 0),
+ validate, for_gcse);
+}
   if (code != GET_CODE (y))
 return 0;
 
@@ -4192,7 +4207,8 @@ struct set
   char src_in_memory;
   /* Nonzero if the SET_SRC contains something
  whose value cannot be predicted and understood.  */
-  char src_volatile;
+  char src_volatile : 1;
+  char invalidate_dest_p : 1;
   /* Original machine mode, in case it becomes a CONST_INT.
  The size of this field should match the size of the mode
  field of struct rtx_def (see rtl.h).  */
@@ -4639,7 +4655,7 @@ cse_insn (rtx_insn *insn)
   for (i = 0; i < n_sets; i++)
 {
   bool repeat = false;
-  bool mem_noop_insn = false;
+  bool noop_insn = false;
   rtx src, dest;
   rtx src_folded;
   struct table_elt *elt = 0, *p;
@@ -4736,6 +4752,7 @@ cse_insn (rtx_insn *insn)
   sets[i].src = src;
   sets[i].src_hash = HASH (src, mode);
   sets[i].src_volatile = do_not_record;
+  sets[i].invalidate_dest_p = 1;
   sets[i].src_in_memory = hash_arg_in_memory;
 
   /* If SRC is a MEM, there is a REG_EQUIV note for SRC, and DEST is
@@ -5365,7 +5382,7 @@ cse_insn (rtx_insn *insn)
   || insn_nothrow_p (insn)))
{
  SET_SRC (sets[i].rtl) = trial;
- mem_noop_insn = true;
+ noop_insn = true;
  break;
}
 
@@ -5418,6 +5435,19 @@ cse_insn (rtx_insn *insn)
  src_folded_cost = constant_pool_entries_cost;
  src_folded_regcost = constant_pool_entries_regcost;
}
+ else if (n_sets == 1
+  && REG_P (trial)
+  && REG_P (SET_DEST (sets[i].rtl))
+  && GET_MODE_CLASS (mode) == MODE_CC
+  && REGNO (trial) == REGNO (SET_DEST (sets[i].rtl))
+  && !side_effects_p (dest)
+  && (cfun->can_delete_dead_exceptions
+  || insn_nothrow_p (insn)))
+   {
+ SET_SRC (sets[i].rtl) = trial;
+ noop_insn = true;
+ break;
+   }
}
 
   /* If we changed the insn too much, handle this set from scratch.  */
@@ -5588,7 +5618,7 @@ cse_insn (rtx_insn *insn)
}
 
   /* Similar

[PATCH 1/2] [PR88836][aarch64] Set CC_REGNUM instead of clobber

2019-05-15 Thread kugan . vivekanandarajah
From: Kugan Vivekanandarajah 

For the aarch64 SVE while_ult pattern, set CC_REGNUM instead of clobbering it.

gcc/ChangeLog:

2019-05-16  Kugan Vivekanandarajah  

PR target/88836
* config/aarch64/aarch64-sve.md (while_ult): Set CC_REGNUM instead
of clobbering.

Change-Id: I96f16b8f81140fb4a6897d31d427c62bcc1e7997
---
 gcc/config/aarch64/aarch64-sve.md | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve.md 
b/gcc/config/aarch64/aarch64-sve.md
index 3f39c4c..a18eb80 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -1331,13 +1331,18 @@
 )
 
 ;; Set element I of the result if operand1 + J < operand2 for all J in [0, I].
-;; with the comparison being unsigned.
;; with the comparison being unsigned.  Also set CC_REGNUM with the flags.
 (define_insn "while_ult"
   [(set (match_operand:PRED_ALL 0 "register_operand" "=Upa")
(unspec:PRED_ALL [(match_operand:GPI 1 "aarch64_reg_or_zero" "rZ")
  (match_operand:GPI 2 "aarch64_reg_or_zero" "rZ")]
 UNSPEC_WHILE_LO))
-   (clobber (reg:CC CC_REGNUM))]
+   (set (reg:CC CC_REGNUM)
+   (compare:CC
+ (unspec:SI [(vec_duplicate:PRED_ALL (const_int 1))
+ (match_dup 0)]
+UNSPEC_PTEST_PTRUE)
+ (const_int 0)))]
   "TARGET_SVE"
   "whilelo\t%0., %1, %2"
 )
-- 
2.7.4



[PATCH 0/2][RFC][PR88836][AARCH64] Fix redundant ptest instruction

2019-05-15 Thread kugan . vivekanandarajah
From: Kugan Vivekanandarajah 

In order to fix this PR:
 * We need to change the whilelo pattern in the backend.
 * Change RTL CSE such that:
   - Add support for VEC_DUPLICATE.
   - When handling a PARALLEL rtx in cse_insn, we kill the CSEs defined by
     all the parallel rtxs at the end.

For example, with patch1, we now have rtl insn as follows:

(insn 19 18 20 3 (parallel [
(set (reg:VNx4BI 93 [ next_mask_18 ])
(unspec:VNx4BI [
(const_int 0 [0])
(reg:DI 95 [ _33 ])
] UNSPEC_WHILE_LO))
(set (reg:CC 66 cc)
(compare:CC (unspec:SI [
(vec_duplicate:VNx4BI (const_int 1 [0x1]))
(reg:VNx4BI 93 [ next_mask_18 ])
] UNSPEC_PTEST_PTRUE)
(const_int 0 [0])))
]) 4244 {while_ultdivnx4bi}

When cse_insn processes the first set, it records the CSE in reg 93.  Then,
after processing both instructions in the parallel rtx, we invalidate all
expressions involving reg 93, which means the expression in the second
instruction is invalidated for CSE.  The attached patch relaxes this by
invalidating before processing the second set.
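In cse_insn terms, the intent is roughly the following (a sketch;
invalidate_dest_p is the flag added in patch 2/2):

  for (i = 0; i < n_sets; i++)
    {
      /* ...existing processing of sets[i]...  */

      /* Invalidate this destination now, so that the lookup for the
	 next set in the PARALLEL still sees the equivalence recorded
	 above (e.g. the WHILELO result consumed by the PTEST).  */
      if (sets[i].invalidate_dest_p)
	invalidate (SET_DEST (sets[i].rtl), VOIDmode);
    }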

Bootstrap and regression testing for the current version is ongoing.

Thanks,
Kugan

Kugan Vivekanandarajah (2):
  [PR88836][aarch64] Set CC_REGNUM instead of clobber
  [PR88836][aarch64] Fix CSE to process parallel rtx dest one by one

 gcc/config/aarch64/aarch64-sve.md  |  9 +++-
 gcc/cse.c  | 67 ++
 gcc/testsuite/gcc.target/aarch64/pr88836.c | 14 +++
 3 files changed, 80 insertions(+), 10 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr88836.c

-- 
2.7.4



[PATCH 2/2] aarch64 back-end changes

2019-05-14 Thread kugan . vivekanandarajah
From: Kugan Vivekanandarajah 

gcc/ChangeLog:

2019-05-15  Kugan Vivekanandarajah  

PR target/88834
* config/aarch64/aarch64.c (aarch64_classify_address): Relax
allow_reg_index_p.

gcc/testsuite/ChangeLog:

2019-05-15  Kugan Vivekanandarajah  

PR target/88834
* gcc.target/aarch64/pr88834.c: New test.
* gcc.target/aarch64/sve/struct_vect_1.c: Adjust.
* gcc.target/aarch64/sve/struct_vect_14.c: Likewise.
* gcc.target/aarch64/sve/struct_vect_15.c: Likewise.
* gcc.target/aarch64/sve/struct_vect_16.c: Likewise.
* gcc.target/aarch64/sve/struct_vect_17.c: Likewise.
* gcc.target/aarch64/sve/struct_vect_7.c: Likewise.

Change-Id: I840d08dc89a845b3913204228bae1bed40601d07
---
 gcc/config/aarch64/aarch64.c  |  2 +-
 gcc/testsuite/gcc.target/aarch64/pr88834.c| 15 +++
 gcc/testsuite/gcc.target/aarch64/sve/struct_vect_1.c  |  8 
 gcc/testsuite/gcc.target/aarch64/sve/struct_vect_14.c |  8 
 gcc/testsuite/gcc.target/aarch64/sve/struct_vect_15.c |  8 
 gcc/testsuite/gcc.target/aarch64/sve/struct_vect_16.c |  8 
 gcc/testsuite/gcc.target/aarch64/sve/struct_vect_17.c |  8 
 gcc/testsuite/gcc.target/aarch64/sve/struct_vect_7.c  |  8 
 8 files changed, 40 insertions(+), 25 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr88834.c

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1f90467..34292eb 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -6592,7 +6592,7 @@ aarch64_classify_address (struct aarch64_address_info 
*info,
   bool allow_reg_index_p = (!load_store_pair_p
&& (known_lt (GET_MODE_SIZE (mode), 16)
|| vec_flags == VEC_ADVSIMD
-   || vec_flags == VEC_SVE_DATA));
+   || vec_flags & VEC_SVE_DATA));
 
   /* For SVE, only accept [Rn], [Rn, Rm, LSL #shift] and
  [Rn, #offset, MUL VL].  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr88834.c 
b/gcc/testsuite/gcc.target/aarch64/pr88834.c
new file mode 100644
index 000..ea00967
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr88834.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-S -O3 -march=armv8.2-a+sve" } */
+
+void
+f (int *restrict x, int *restrict y, int *restrict z, int n)
+{
+  for (int i = 0; i < n; i += 2)
+{
+  x[i] = y[i] + z[i];
+  x[i + 1] = y[i + 1] - z[i + 1];
+}
+}
+
+/* { dg-final { scan-assembler-times {\tld2w\t{z[0-9]+.s - z[0-9]+.s}, 
p[0-7]/z, \[x[0-9]+, x[0-9]+, lsl 2\]\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tst2w\t{z[0-9]+.s - z[0-9]+.s}, p[0-7], 
\[x[0-9]+, x[0-9]+, lsl 2\]\n} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_1.c 
b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_1.c
index 6e3c889..918a581 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_1.c
@@ -83,9 +83,9 @@ NAME(g4) (TYPE *__restrict a, TYPE *__restrict b, TYPE 
*__restrict c,
 }
 }
 
-/* { dg-final { scan-assembler {\tld2b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7]/z, 
\[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler {\tld2b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7]/z, 
\[x[0-9]+, x[0-9]+\]\n} } } */
 /* { dg-final { scan-assembler {\tld3b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7]/z, 
\[x[0-9]+\]\n} } } */
-/* { dg-final { scan-assembler {\tld4b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7]/z, 
\[x[0-9]+\]\n} } } */
-/* { dg-final { scan-assembler {\tst2b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7], 
\[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler {\tld4b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7]/z, 
\[x[0-9]+, x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler {\tst2b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7], 
\[x[0-9]+, x[0-9]+\]\n} } } */
 /* { dg-final { scan-assembler {\tst3b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7], 
\[x[0-9]+\]\n} } } */
-/* { dg-final { scan-assembler {\tst4b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7], 
\[x[0-9]+\]\n} } } */
+/* { dg-final { scan-assembler {\tst4b\t{z[0-9]+.b - z[0-9]+.b}, p[0-7], 
\[x[0-9]+, x[0-9]+\]\n} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_14.c 
b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_14.c
index 45644b6..a16a79e 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_14.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_14.c
@@ -43,12 +43,12 @@
 #undef NAME
 #undef TYPE
 
-/* { dg-final { scan-assembler-times {\tld2b\t{z[0-9]+.b - z[0-9]+.b}, 
p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tld2b\t{z[0-9]+.b - z[0-9]+.b}, 
p[0-7]/z, \[x[0-9]+, x[0-9]+\]\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tld3b\t{z[0-9]+.b - z[0-9]+.b}, 
p[0-7]/z, \[x[0-9]+\]\n} 1 } } */
-/* { dg-final { scan-assembler-times {\tld4b\t{z[0-9]+.b - z[0-9]+.b}, 
p[0-7]/z, \[x[0-9]+\]\n} 1 } }

[PATCH 0/2] [RFC][PR88834]

2019-05-14 Thread kugan . vivekanandarajah
From: Kugan Vivekanandarajah 

In PR88834, IVOPT is not selecting the right addressing mode. In order to
fix this, we need to add support for IV uses of IFN_MASK_LOAD_LANES and
IFN_MASK_STORE_LANES. In addition, we also need to add an IV candidate
scaled by the element or access size when that is useful. Richard Sandiford
has provided some feedback in the PR, and I tried to incorporate it in
PATCH 1.

PATCH 2 contains the changes needed in the aarch64 backend and the test
adjustments.

Bootstrap and regression testing for the current version is ongoing.

Thanks,
Kugan

Kugan Vivekanandarajah (2):
  Add support for IVOPT
  aarch64 back-end changes

 gcc/config/aarch64/aarch64.c   |  2 +-
 gcc/testsuite/gcc.target/aarch64/pr88834.c | 15 ++
 .../gcc.target/aarch64/sve/struct_vect_1.c |  8 +--
 .../gcc.target/aarch64/sve/struct_vect_14.c|  8 +--
 .../gcc.target/aarch64/sve/struct_vect_15.c|  8 +--
 .../gcc.target/aarch64/sve/struct_vect_16.c|  8 +--
 .../gcc.target/aarch64/sve/struct_vect_17.c|  8 +--
 .../gcc.target/aarch64/sve/struct_vect_7.c |  8 +--
 gcc/tree-ssa-loop-ivopts.c | 60 +-
 9 files changed, 99 insertions(+), 26 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr88834.c

-- 
2.7.4



[PATCH 1/2] Add support for IVOPT

2019-05-14 Thread kugan . vivekanandarajah
From: Kugan Vivekanandarajah 

gcc/ChangeLog:

2019-05-15  Kugan Vivekanandarajah  

PR target/88834
* tree-ssa-loop-ivopts.c (get_mem_type_for_internal_fn): Handle
IFN_MASK_LOAD_LANES and IFN_MASK_STORE_LANES.
(find_interesting_uses_stmt): Likewise.
(get_alias_ptr_type_for_ptr_address): Likewise.
(add_iv_candidate_for_use): Add scaled index candidate if useful.

Change-Id: I8e8151fe2dde2845dedf38b090103694da6fc9d1
---
 gcc/tree-ssa-loop-ivopts.c | 60 +-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 9864b59..115a70c 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -2451,11 +2451,13 @@ get_mem_type_for_internal_fn (gcall *call, tree *op_p)
   switch (gimple_call_internal_fn (call))
 {
 case IFN_MASK_LOAD:
+case IFN_MASK_LOAD_LANES:
   if (op_p == gimple_call_arg_ptr (call, 0))
return TREE_TYPE (gimple_call_lhs (call));
   return NULL_TREE;
 
 case IFN_MASK_STORE:
+case IFN_MASK_STORE_LANES:
   if (op_p == gimple_call_arg_ptr (call, 0))
return TREE_TYPE (gimple_call_arg (call, 3));
   return NULL_TREE;
@@ -2545,7 +2547,7 @@ find_interesting_uses_stmt (struct ivopts_data *data, 
gimple *stmt)
  return;
}
 
-  /* TODO -- we should also handle address uses of type
+  /* TODO -- we should also handle all address uses of type
 
 memory = call (whatever);
 
@@ -2553,6 +2555,27 @@ find_interesting_uses_stmt (struct ivopts_data *data, 
gimple *stmt)
 
 call (memory).  */
 }
+  else if (is_gimple_call (stmt))
+{
+  gcall *call = dyn_cast <gcall *> (stmt);
+  if (call
+ && gimple_call_internal_p (call)
+ && (gimple_call_internal_fn (call) == IFN_MASK_LOAD_LANES
+ || gimple_call_internal_fn (call) == IFN_MASK_STORE_LANES))
+   {
+ tree *arg = gimple_call_arg_ptr (call, 0);
+ struct iv *civ = get_iv (data, *arg);
+ tree mem_type = get_mem_type_for_internal_fn (call, arg);
+ if (civ && mem_type)
+   {
+ civ = alloc_iv (data, civ->base, civ->step);
+ record_group_use (data, arg, civ, stmt, USE_PTR_ADDRESS,
+   mem_type);
+ return;
+   }
+   }
+}
+
 
   if (gimple_code (stmt) == GIMPLE_PHI
   && gimple_bb (stmt) == data->current_loop->header)
@@ -3500,6 +3523,39 @@ add_iv_candidate_for_use (struct ivopts_data *data, 
struct iv_use *use)
 basetype = sizetype;
   record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
 
+  /* Compare the cost of an address with an unscaled index with the cost of
+an address with a scaled index and add candidate if useful. */
+  if (use != NULL && use->type == USE_PTR_ADDRESS)
+{
+  struct mem_address parts = {NULL_TREE, integer_one_node,
+ NULL_TREE, NULL_TREE, NULL_TREE};
+  poly_uint64 temp;
+  poly_int64 fact;
+  bool speed = optimize_loop_for_speed_p (data->current_loop);
+  poly_int64 poly_step = tree_to_poly_int64 (iv->step);
+  machine_mode mem_mode = TYPE_MODE (use->mem_type);
+  addr_space_t as = TYPE_ADDR_SPACE (TREE_TYPE (use->iv->base));
+
+  fact = GET_MODE_SIZE (GET_MODE_INNER (TYPE_MODE (use->mem_type)));
+  parts.index = integer_one_node;
+
+  if (fact.is_constant ()
+ && can_div_trunc_p (poly_step, fact, &temp))
+   {
+ /* Addressing mode "base + index".  */
+ rtx addr = addr_for_mem_ref (, as, false);
+ unsigned cost = address_cost (addr, mem_mode, as, speed);
+ tree step = wide_int_to_tree (sizetype,
+   exact_div (poly_step, fact));
+ parts.step = wide_int_to_tree (sizetype, fact);
+ /* Addressing mode "base + index << scale".  */
+ addr = addr_for_mem_ref (, as, false);
+ unsigned new_cost = address_cost (addr, mem_mode, as, speed);
+ if (new_cost < cost)
+   add_candidate (data, size_int (0), step, true, NULL);
+   }
+}
+
   /* Record common candidate with constant offset stripped in base.
  Like the use itself, we also add candidate directly for it.  */
   base = strip_offset (iv->base, );
@@ -7112,6 +7168,8 @@ get_alias_ptr_type_for_ptr_address (iv_use *use)
 {
 case IFN_MASK_LOAD:
 case IFN_MASK_STORE:
+case IFN_MASK_LOAD_LANES:
+case IFN_MASK_STORE_LANES:
   /* The second argument contains the correct alias type.  */
   gcc_assert (use->op_p = gimple_call_arg_ptr (call, 0));
   return TREE_TYPE (gimple_call_arg (call, 1));
-- 
2.7.4



Re: [aarch64][RFA][rtl-optimization/87763] Fix insv_1 and insv_2 for aarch64

2019-04-22 Thread Kugan Vivekanandarajah
Hi Jeff,

[...]

+  "#"
+  "&& 1"
+  [(const_int 0)]
+  "{
+ /* If we do not have an RMW operand, then copy the input
+ to the output before this insn.  Also modify the existing
+ insn in-place so we can have make_field_assignment actually
+ generate a suitable extraction.  */
+ if (!rtx_equal_p (operands[0], operands[1]))
+   {
+ emit_move_insn (operands[0], operands[1]);
+ XEXP (XEXP (SET_SRC (PATTERN (curr_insn)), 0), 0) = copy_rtx (operands[0]);
+   }
+
+ rtx make_field_assignment (rtx);
+ rtx newpat = make_field_assignment (PATTERN (curr_insn));
+ gcc_assert (newpat);
+ emit_insn (newpat);
+ DONE;

It seems that make_field_assignment returns a new pattern only  if it
succeeds and returns the same pattern otherwise. So I am wondering if
it is worth simplifying the above. Like removing the assert and
checking/inserting move only when new pattern is returned?
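I.e. something along these lines (untested sketch):

  if (!rtx_equal_p (operands[0], operands[1]))
    {
      emit_move_insn (operands[0], operands[1]);
      XEXP (XEXP (SET_SRC (PATTERN (curr_insn)), 0), 0)
	= copy_rtx (operands[0]);
    }
  rtx newpat = make_field_assignment (PATTERN (curr_insn));
  /* make_field_assignment returns its argument unchanged on failure,
     so test for that instead of asserting.  */
  if (newpat == PATTERN (curr_insn))
    FAIL;
  emit_insn (newpat);
  DONE;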

Thanks,
Kugan



>
> Jeff


[PR89862] Fix ARM lto bootstrap

2019-03-28 Thread Kugan Vivekanandarajah
Hi All,
LTO bootstrap for ARM fails with the commit

commit 67c18bce7054934528ff5930cca283b4ac967dca
* combine.c (record_dead_and_set_regs_1): Record the source unmodified
for a paradoxical SUBREG on a WORD_REGISTER_OPERATIONS target.

It fails with an internal compiler error: in operator+=, at
profile-count.h:792.

With this commit we no longer generate gen_lowpart for CONST_INT, as in
(set (subreg:SI (reg:QI 1434) 0)
(const_int 224 [0xe0])) and the like.

As discussed in the PR, the attached patch fixes this and the bootstrap
failure. I was not able to create a reduced testcase for this; however,
the patch has been tested with an LTO bootstrap for ARM, so I believe
it is OK.
I have also tested the patch with x86_64-linux-gnu with no new regressions.
Is this OK for trunk?

Thanks,
Kugan
diff --git a/gcc/rtl.h b/gcc/rtl.h
index f991919..52ecd5a 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -4401,6 +4401,7 @@ word_register_operation_p (const_rtx x)
 {
   switch (GET_CODE (x))
 {
+case CONST_INT:
 case ROTATE:
 case ROTATERT:
 case SIGN_EXTRACT:




[SVE ACLE] svbic implementation

2019-03-19 Thread Kugan Vivekanandarajah
I have committed attached patch to aarch64/sve-acle-branch branch
which implements svbic.

Thanks,
Kugan
From 182bd15334874844bef5e317f55a6497f77e12ff Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Thu, 24 Jan 2019 20:57:19 +1100
Subject: [PATCH 1/3] svbic

Change-Id: I819490ec63ee38b9cdc7c5e342436b7afdee2973
---
 gcc/config/aarch64/aarch64-sve-builtins.c  |  30 ++
 gcc/config/aarch64/aarch64-sve-builtins.def|   1 +
 gcc/config/aarch64/aarch64-sve.md  |  54 ++-
 .../gcc.target/aarch64/sve-acle/asm/bic_s16.c  | 398 +
 .../gcc.target/aarch64/sve-acle/asm/bic_s32.c  | 394 
 .../gcc.target/aarch64/sve-acle/asm/bic_s64.c  | 394 
 .../gcc.target/aarch64/sve-acle/asm/bic_s8.c   | 317 
 .../gcc.target/aarch64/sve-acle/asm/bic_u16.c  | 398 +
 .../gcc.target/aarch64/sve-acle/asm/bic_u32.c  | 394 
 .../gcc.target/aarch64/sve-acle/asm/bic_u64.c  | 394 
 .../gcc.target/aarch64/sve-acle/asm/bic_u8.c   | 317 
 11 files changed, 3087 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/bic_s16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/bic_s32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/bic_s64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/bic_s8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/bic_u16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/bic_u32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/bic_u64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/bic_u8.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins.c b/gcc/config/aarch64/aarch64-sve-builtins.c
index 0e3db66..106f21e 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.c
+++ b/gcc/config/aarch64/aarch64-sve-builtins.c
@@ -166,6 +166,7 @@ enum function {
   FUNC_svadd,
   FUNC_svand,
   FUNC_svasrd,
+  FUNC_svbic,
   FUNC_svdiv,
   FUNC_svdivr,
   FUNC_svdot,
@@ -474,6 +475,7 @@ private:
   rtx expand_add (unsigned int);
   rtx expand_and ();
   rtx expand_asrd ();
+  rtx expand_bic ();
   rtx expand_div (bool);
   rtx expand_dot ();
   rtx expand_dup ();
@@ -1214,6 +1216,7 @@ arm_sve_h_builder::get_attributes (const function_instance )
 case FUNC_svadd:
 case FUNC_svand:
 case FUNC_svasrd:
+case FUNC_svbic:
 case FUNC_svdiv:
 case FUNC_svdivr:
 case FUNC_svdot:
@@ -1887,6 +1890,7 @@ gimple_folder::fold ()
 case FUNC_svadd:
 case FUNC_svand:
 case FUNC_svasrd:
+case FUNC_svbic:
 case FUNC_svdiv:
 case FUNC_svdivr:
 case FUNC_svdot:
@@ -1990,6 +1994,9 @@ function_expander::expand ()
 case FUNC_svdot:
   return expand_dot ();
 
+case FUNC_svbic:
+  return expand_bic ();
+
 case FUNC_svdup:
   return expand_dup ();
 
@@ -2133,6 +2140,29 @@ function_expander::expand_dot ()
 return expand_via_unpred_direct_optab (sdot_prod_optab);
 }
 
+/* Expand a call to svbic.  */
+rtx
+function_expander::expand_bic ()
+{
+  if (CONST_INT_P (m_args[2]))
+{
+  machine_mode mode = GET_MODE_INNER (get_mode (0));
+  m_args[2] = simplify_unary_operation (NOT, mode, m_args[2], mode);
+  return expand_and ();
+}
+
+  if (m_fi.pred == PRED_x)
+{
+  insn_code icode = code_for_aarch64_bic (get_mode (0));
+  return expand_via_unpred_insn (icode);
+}
+  else
+{
+  insn_code icode = code_for_cond_bic (get_mode (0));
+  return expand_via_pred_insn (icode);
+}
+}
+
 /* Expand a call to svdup.  */
 rtx
 function_expander::expand_dup ()
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.def b/gcc/config/aarch64/aarch64-sve-builtins.def
index 8322c4b..4af06ac 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.def
+++ b/gcc/config/aarch64/aarch64-sve-builtins.def
@@ -65,6 +65,7 @@ DEF_SVE_FUNCTION (svabs, unary, all_signed_and_float, mxz)
 DEF_SVE_FUNCTION (svadd, binary_opt_n, all_data, mxz)
 DEF_SVE_FUNCTION (svand, binary_opt_n, all_integer, mxz)
 DEF_SVE_FUNCTION (svasrd, shift_right_imm, all_signed, mxz)
+DEF_SVE_FUNCTION (svbic, binary_opt_n, all_integer, mxz)
 DEF_SVE_FUNCTION (svdiv, binary_opt_n, all_sdi_and_float, mxz)
 DEF_SVE_FUNCTION (svdivr, binary_opt_n, all_sdi_and_float, mxz)
 DEF_SVE_FUNCTION (svdot, ternary_qq_opt_n, sdi, none)
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index d480289..5e629de 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -1360,6 +1360,52 @@
   [(set_attr "movprfx" "*,yes,*")]
 )
 
+;; Predicated BIC with select.
+(define_expand "@cond_bic"
+  [(set (match_operand:SVE_I 0 "register_operand")
+	(unspec:SVE_I
+	  [(match_operand: 1 "register_operand")
+	   (
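For reference, svbic computes x & ~y lane-wise, and the expand_bic hook
above folds a constant second operand through NOT so the call can reuse
the AND path.  A minimal sketch of the ACLE-level usage (assuming the
final ACLE intrinsic spellings, which may differ slightly on this
branch):

#include <arm_sve.h>

/* Register form: emits BIC.  */
svint32_t
bic_reg (svbool_t pg, svint32_t x, svint32_t y)
{
  return svbic_s32_x (pg, x, y);
}

/* Immediate form: expand_bic folds ~0xff at compile time and expands
   this as an AND instead.  */
svint32_t
bic_imm (svbool_t pg, svint32_t x)
{
  return svbic_n_s32_x (pg, x, 0xff);
}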

[SVE ACLE] Implements svdot

2019-01-17 Thread Kugan Vivekanandarajah
I committed the following patch, which implements svdot, to the
aarch64/sve-acle-branch branch.

Thanks,
Kugan
From b75cd8ba8f911c137380677b85882c22a6467bf6 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 18 Jan 2019 09:07:10 +1100
Subject: [PATCH] [SVE ACLE] Implements svdot

Change-Id: I9d9f77f814a62e03db2ccd749f41bd35fea16035
---
 gcc/config/aarch64/aarch64-sve-builtins.c  | 148 +
 gcc/config/aarch64/aarch64-sve-builtins.def|   1 +
 gcc/config/aarch64/aarch64-sve.md  |  14 ++
 gcc/config/aarch64/iterators.md|   6 +-
 .../aarch64/sve-acle/general-c++/dot_1.C   |   9 ++
 .../aarch64/sve-acle/general-c++/dot_2.C   |  17 +++
 .../gcc.target/aarch64/sve-acle/asm/dot_s32.c  | 111 
 .../gcc.target/aarch64/sve-acle/asm/dot_s64.c  | 111 
 .../gcc.target/aarch64/sve-acle/asm/dot_u32.c  | 111 
 .../gcc.target/aarch64/sve-acle/asm/dot_u64.c  | 111 
 .../aarch64/sve-acle/asm/test_sve_acle.h   |  48 +++
 .../gcc.target/aarch64/sve-acle/general-c/dot_1.c  |  13 ++
 .../gcc.target/aarch64/sve-acle/general-c/dot_2.c  |  15 +++
 13 files changed, 714 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.target/aarch64/sve-acle/general-c++/dot_1.C
 create mode 100644 gcc/testsuite/g++.target/aarch64/sve-acle/general-c++/dot_2.C
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/dot_s32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/dot_s64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/dot_u32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/dot_u64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/general-c/dot_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/general-c/dot_2.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins.c b/gcc/config/aarch64/aarch64-sve-builtins.c
index 35ed531..f080a67 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.c
+++ b/gcc/config/aarch64/aarch64-sve-builtins.c
@@ -115,6 +115,10 @@ enum function_shape {
  sv_t svfoo[_n_t0](sv_t, sv_t, _t).  */
   SHAPE_ternary_opt_n,
 
+  /* sv_t svfoo[_t0](sv_t, sv_t, sv_t)
+ sv_t svfoo[_n_t0](sv_t, sv_t, _t).  */
+  SHAPE_ternary_qq_opt_n,
+
   /* sv_t svfoo[_n_t0])(sv_t, uint64_t)
 
  The final argument must be an integer constant expression in the
@@ -161,6 +165,7 @@ enum function {
   FUNC_svasrd,
   FUNC_svdiv,
   FUNC_svdivr,
+  FUNC_svdot,
   FUNC_svdup,
   FUNC_sveor,
   FUNC_svindex,
@@ -261,6 +266,8 @@ struct GTY(()) function_instance {
 
   tree scalar_type (unsigned int) const;
   tree vector_type (unsigned int) const;
+  tree quarter_vector_type (unsigned int i) const;
+  tree quarter_scalar_type (unsigned int i) const;
 
   /* The explicit "enum"s are required for gengtype.  */
   enum group_id group;
@@ -321,7 +328,9 @@ private:
   void sig_000 (const function_instance &, vec &);
   void sig_n_000 (const function_instance &, vec &);
   void sig_ (const function_instance &, vec &);
+  void sig_qq_ (const function_instance &, vec &);
   void sig_n_ (const function_instance &, vec &);
+  void sig_qq_n_ (const function_instance &, vec &);
   void sig_n_00i (const function_instance &, vec &);
 
   void apply_predication (const function_instance &, vec &);
@@ -361,6 +370,7 @@ public:
 
 private:
   tree resolve_uniform (unsigned int);
+  tree resolve_dot ();
   tree resolve_uniform_imm (unsigned int, unsigned int);
 
   bool check_first_vector_argument (unsigned int, unsigned int &,
@@ -459,6 +469,7 @@ private:
   rtx expand_and ();
   rtx expand_asrd ();
   rtx expand_div (bool);
+  rtx expand_dot ();
   rtx expand_dup ();
   rtx expand_eor ();
   rtx expand_index ();
@@ -618,6 +629,7 @@ DEF_SVE_TYPES_ARRAY (all_integer);
 DEF_SVE_TYPES_ARRAY (all_data);
 DEF_SVE_TYPES_ARRAY (all_sdi_and_float);
 DEF_SVE_TYPES_ARRAY (all_signed_and_float);
+DEF_SVE_TYPES_ARRAY (sdi);
 
 /* Used by functions in aarch64-sve-builtins.def that have no governing
predicate.  */
@@ -668,6 +680,43 @@ find_vector_type (const_tree type)
   return NUM_VECTOR_TYPES;
 }
 
+/* Return the type suffix associated with integer elements that have
+   ELEM_BITS bits and the signedness given by UNSIGNED_P.  Return
+   NUM_TYPE_SUFFIXES if no such element exists.  */
+static type_suffix
+maybe_find_integer_type_suffix (bool unsigned_p, unsigned int elem_bits)
+{
+  for (unsigned int i = 0; i < NUM_TYPE_SUFFIXES; ++i)
+{
+  if (type_suffixes[i].integer_p
+	  && type_suffixes[i].unsigned_p == unsigned_p
+	  && type_suffixes[i].elem_bits == elem_bits)
+	return type_suffix (i);
+}
+  return NUM_TYPE_SUFFIXES;
+}
+
+/* Return the type suffix for elements that are a quarter the size of integer
+   type suffix TYPE.  Return NUM
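To make the quarter-width ("qq") relationship concrete, here is a sketch
of the call shape svdot implements (assuming the final ACLE intrinsic
spellings, which may differ slightly on this branch):

#include <arm_sve.h>

/* SHAPE_ternary_qq_opt_n: the accumulator elements are four times the
   width of the multiplicands, so each s32 lane of the result adds the
   dot product of four adjacent s8 lanes of a and b (SDOT).  */
svint32_t
dot_example (svint32_t acc, svint8_t a, svint8_t b)
{
  return svdot_s32 (acc, a, b);
}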

[SVE ACLE] Implements svmulh

2019-01-17 Thread Kugan Vivekanandarajah
I committed the following patch, which implements svmulh, to the
aarch64/sve-acle-branch branch.

Thanks,
Kugan
From 33b76de8ef5f370dfacba0addef2fe0b1f2a61db Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 18 Jan 2019 07:33:26 +1100
Subject: [PATCH] [SVE ACLE] Implements svmulh

Change-Id: Iaf4bd9898f46a53950e574750f68bdc709adbc1d
---
 gcc/config/aarch64/aarch64-sve-builtins.c  |  14 ++
 gcc/config/aarch64/aarch64-sve.md  |  50 +++-
 gcc/config/aarch64/iterators.md|   2 +
 .../gcc.target/aarch64/sve-acle/asm/mulh_s16.c | 254 +
 .../gcc.target/aarch64/sve-acle/asm/mulh_s32.c | 254 +
 .../gcc.target/aarch64/sve-acle/asm/mulh_s64.c | 254 +
 .../gcc.target/aarch64/sve-acle/asm/mulh_s8.c  | 254 +
 .../gcc.target/aarch64/sve-acle/asm/mulh_u16.c | 254 +
 .../gcc.target/aarch64/sve-acle/asm/mulh_u32.c | 254 +
 .../gcc.target/aarch64/sve-acle/asm/mulh_u64.c | 254 +
 .../gcc.target/aarch64/sve-acle/asm/mulh_u8.c  | 254 +
 gcc/tree-core.h|   8 +-
 12 files changed, 2101 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/mulh_s16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/mulh_s32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/mulh_s64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/mulh_s8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/mulh_u16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/mulh_u32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/mulh_u64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/mulh_u8.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins.c b/gcc/config/aarch64/aarch64-sve-builtins.c
index c039ceb..b1deee9 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.c
+++ b/gcc/config/aarch64/aarch64-sve-builtins.c
@@ -169,6 +169,7 @@ enum function {
   FUNC_svmls,
   FUNC_svmsb,
   FUNC_svmul,
+  FUNC_svmulh,
   FUNC_svneg,
   FUNC_svnot,
   FUNC_svptrue,
@@ -463,6 +464,7 @@ private:
   rtx expand_mla ();
   rtx expand_mls ();
   rtx expand_mul ();
+  rtx expand_mulh ();
   rtx expand_neg ();
   rtx expand_not ();
   rtx expand_ptrue ();
@@ -1088,6 +1090,7 @@ arm_sve_h_builder::get_attributes (const function_instance )
 case FUNC_svmls:
 case FUNC_svmsb:
 case FUNC_svmul:
+case FUNC_svmulh:
 case FUNC_svneg:
 case FUNC_svnot:
 case FUNC_svqadd:
@@ -1700,6 +1703,7 @@ gimple_folder::fold ()
 case FUNC_svmls:
 case FUNC_svmsb:
 case FUNC_svmul:
+case FUNC_svmulh:
 case FUNC_svneg:
 case FUNC_svnot:
 case FUNC_svqadd:
@@ -1808,6 +1812,9 @@ function_expander::expand ()
 case FUNC_svmul:
   return expand_mul ();
 
+case FUNC_svmulh:
+  return expand_mulh ();
+
 case FUNC_svneg:
   return expand_neg ();
 
@@ -2033,6 +2040,13 @@ function_expander::expand_mul ()
 return expand_via_pred_direct_optab (cond_smul_optab);
 }
 
+/* Expand a call to svmulh.  */
+rtx
+function_expander::expand_mulh ()
+{
+  return expand_signed_pred_op (UNSPEC_SMUL_HIGHPART, UNSPEC_UMUL_HIGHPART, 0);
+}
+
 /* Expand a call to svneg.  */
 rtx
 function_expander::expand_neg ()
diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 944de82..6944d2b 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -1210,7 +1210,7 @@
 )
 
 ;; Predicated highpart multiplication.
-(define_insn "*mul3_highpart"
+(define_insn "@aarch64_pred_"
   [(set (match_operand:SVE_I 0 "register_operand" "=w, ?")
 	(unspec:SVE_I
 	  [(match_operand: 1 "register_operand" "Upl, Upl")
@@ -1225,6 +1225,54 @@
   [(set_attr "movprfx" "*,yes")]
 )
 
+;; Predicated MULH with select.
+(define_expand "@cond_"
+  [(set (match_operand:SVE_I 0 "register_operand")
+	(unspec:SVE_I
+	  [(match_operand: 1 "register_operand")
+	   (unspec:SVE_I
+	 [(match_operand:SVE_I 2 "register_operand")
+	  (match_operand:SVE_I 3 "register_operand")]
+	 MUL_HIGHPART)
+	   (match_operand:SVE_I 4 "aarch64_simd_reg_or_zero")]
+	  UNSPEC_SEL))]
+  "TARGET_SVE"
+)
+
+;; Predicated MULH with select matching the first input.
+(define_insn "*cond__2"
+  [(set (match_operand:SVE_I 0 "register_operand" "=w, ?")
+	(unspec:SVE_I
+	  [(match_operand: 1 "register_operand" "Upl, Upl")
+	   (unspec:SVE_I
+	 [(match_operand:SVE_I 2 "register_operand" "0, w")
+	  (match_operand:SVE_I 3 "register_operand" "w, w")]
+	 MUL_HIGHP
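For reference, a scalar model of what each lane of the highpart multiply
computes (my own sketch of the 32-bit case; the 64-bit case would need
128-bit intermediate arithmetic):

#include <stdint.h>

/* SMULH per lane: the high half of the widened product.  */
static inline int32_t
smulh_s32 (int32_t a, int32_t b)
{
  return (int32_t) (((int64_t) a * (int64_t) b) >> 32);
}

/* UMULH per lane.  */
static inline uint32_t
umulh_u32 (uint32_t a, uint32_t b)
{
  return (uint32_t) (((uint64_t) a * (uint64_t) b) >> 32);
}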

[SVE ACLE] Implements svabs, svnot, svneg and svsqrt

2019-01-15 Thread Kugan Vivekanandarajah
I committed the following patch, which implements svabs, svnot, svneg
and svsqrt, to the aarch64/sve-acle-branch branch.

Thanks,
Kugan
From 2af9609a58cf7efbed93f15413224a2552b9696d Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Wed, 16 Jan 2019 07:45:52 +1100
Subject: [PATCH] [SVE ACLE] svabs, svnot, svneg and svsqrt implementation

Change-Id: Iec1e9491e4a84a351702550babedd0f17968617e
---
 gcc/config/aarch64/aarch64-sve-builtins.c  | 126 -
 gcc/config/aarch64/aarch64-sve-builtins.def|   4 +
 gcc/config/aarch64/aarch64-sve.md  |  52 +++--
 gcc/config/aarch64/iterators.md|  16 ++-
 .../gcc.target/aarch64/sve-acle/asm/abs_f16.c  | 122 
 .../gcc.target/aarch64/sve-acle/asm/abs_f32.c  | 122 
 .../gcc.target/aarch64/sve-acle/asm/abs_f64.c  | 122 
 .../gcc.target/aarch64/sve-acle/asm/abs_s16.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/abs_s32.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/abs_s64.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/abs_s8.c   |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/neg_f16.c  | 122 
 .../gcc.target/aarch64/sve-acle/asm/neg_f32.c  | 122 
 .../gcc.target/aarch64/sve-acle/asm/neg_f64.c  | 122 
 .../gcc.target/aarch64/sve-acle/asm/neg_s16.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/neg_s32.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/neg_s64.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/neg_s8.c   |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/not_s16.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/not_s32.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/not_s64.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/not_s8.c   |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/not_u16.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/not_u32.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/not_u64.c  |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/not_u8.c   |  83 ++
 .../gcc.target/aarch64/sve-acle/asm/sqrt_f16.c | 122 
 .../gcc.target/aarch64/sve-acle/asm/sqrt_f32.c | 122 
 .../gcc.target/aarch64/sve-acle/asm/sqrt_f64.c | 122 
 29 files changed, 2610 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/abs_f16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/abs_f32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/abs_f64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/abs_s16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/abs_s32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/abs_s64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/abs_s8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/neg_f16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/neg_f32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/neg_f64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/neg_s16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/neg_s32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/neg_s64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/neg_s8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/not_s16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/not_s32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/not_s64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/not_s8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/not_u16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/not_u32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/not_u64.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/not_u8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/sqrt_f16.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/sqrt_f32.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve-acle/asm/sqrt_f64.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins.c b/gcc/config/aarch64/aarch64-sve-builtins.c
index c300957..d663de4 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.c
+++ b/gcc/config/aarch64/aarch64-sve-builtins.c
@@ -101,6 +101,9 @@ enum function_shape {
   /* sv_t svfoo[_n]_t0(_t).  */
   SHAPE_unary_n,
 
+  /* sv_t svfoo[_t0](sv_t).  */
+  SHAPE_unary,
+
   /* sv_t svfoo_t0(_t, _t).  */
   SHAPE_binary_scalar,
 
@@ -151,6 +154,7 @@ typedef enum type_suffix type_suffix_pair
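A sketch of what the new SHAPE_unary functions look like at the source
level, with the mxz predication forms these builtins take (assuming the
final ACLE intrinsic spellings, which may differ slightly on this
branch):

#include <arm_sve.h>

/* Merging form: inactive lanes are taken from `inactive'.  */
svfloat32_t
abs_m (svfloat32_t inactive, svbool_t pg, svfloat32_t x)
{
  return svabs_f32_m (inactive, pg, x);
}

/* Zeroing form: inactive lanes are set to zero.  */
svint32_t
neg_z (svbool_t pg, svint32_t x)
{
  return svneg_s32_z (pg, x);
}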

Re: [RFC][PR87528][PR86677] Disable builtin popcount detection when back-end does not define it

2018-11-11 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.
On Thu, 8 Nov 2018 at 00:03, Richard Biener  wrote:
>
> On Fri, Nov 2, 2018 at 10:02 AM Kugan Vivekanandarajah
>  wrote:
> >
> > Hi Richard,
> > Thanks for the review.
> > On Tue, 30 Oct 2018 at 01:25, Richard Biener  
> > wrote:
> > >
> > > On Mon, Oct 29, 2018 at 2:06 AM Kugan Vivekanandarajah
> > >  wrote:
> > > >
> > > > Hi Richard and Jeff,
> > > >
> > > > Thanks for your comments.
> > > >
> > > > On Fri, 26 Oct 2018 at 19:40, Richard Biener 
> > > >  wrote:
> > > > >
> > > > > On Fri, Oct 26, 2018 at 4:55 AM Jeff Law  wrote:
> > > > > >
> > > > > > On 10/25/18 4:33 PM, Kugan Vivekanandarajah wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > PR87528 showed a case where libgcc generated popcount is causing
> > > > > > > regression for Skylake.
> > > > > > > We also have PR86677 where kernel build is failing because the 
> > > > > > > kernel
> > > > > > > does not use the libgcc (when backend is not defining popcount
> > > > > > > pattern).  While I agree that the kernel should implement its own
> > > > > > > functionality when it is not using the libgcc, I am afraid that 
> > > > > > > the
> > > > > > > implementation can have the same performance issues reported for
> > > > > > > Skylake in PR87528.
> > > > > > >
> > > > > > > Therefore, I would like to propose that we disable popcount 
> > > > > > > detection
> > > > > > > when we don't have a pattern for that. The attached patch (based 
> > > > > > > on
> > > > > > > previous discussions) does this.
> > > > > > >
> > > > > > > Bootstrapped and regression tested on x86_64-linux-gnu with no new
> > > > > > > regressions. We need to disable the popcount* testcases. I will 
> > > > > > > have
> > > > > > > to define a effective_target_with_popcount in
> > > > > > > gcc/testsuite/lib/target-supports.exp if this patch is OK?
> > > > > > > Thanks,
> > > > > > > Kugan
> > > > > > >
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > > 2018-10-25  Kugan Vivekanandarajah  
> > > > > > >
> > > > > > > * tree-scalar-evolution.c (expression_expensive_p): Make 
> > > > > > > BUILTIN POPCOUNT
> > > > > > > as expensive when backend does not define it.
> > > > > > >
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > > 2018-10-25  Kugan Vivekanandarajah  
> > > > > > >
> > > > > > > * gcc.target/aarch64/popcount4.c: New test.
> > > > > > >
> > > > > > FWIW, I've been disabling by checking direct_optab_handler elsewhere
> > > > > > (number_of_iterations_popcount) in my tester.  It may in fact be an 
> > > > > > old
> > > > > > patch from you.
> > > > > >
> > > > > > Richi argued that it's the kernel team's responsibility to provide a
> > > > > > popcount since they don't link with libgcc.  And I'm generally in
> > > > > > agreement with that position, though it does tend to generate some
> > > > > > friction with the kernel developers.  We also run the real risk of 
> > > > > > GCC 9
> > > > > > not being able to build the kernel which, IMHO, would be a disaster 
> > > > > > from
> > > > > > a PR standpoint.
> > > > > >
> > > > > > I'd like to hear from others here.  I fully realize we're beyond the
> > > > > > realm of what is strictly technically correct here from a review 
> > > > > > standpoint.
> > > > >
> > > > > As said final value replacement to a library call is probably not 
> > > > > wanted
> > > > > for optimization purpose, so adjusting expression_expensive_p is OK 
> > > > > with

Re: [RFC][PR87528][PR86677] Disable builtin popcount detection when back-end does not define it

2018-11-02 Thread Kugan Vivekanandarajah
Hi Richard,
Thanks for the review.
On Tue, 30 Oct 2018 at 01:25, Richard Biener  wrote:
>
> On Mon, Oct 29, 2018 at 2:06 AM Kugan Vivekanandarajah
>  wrote:
> >
> > Hi Richard and Jeff,
> >
> > Thanks for your comments.
> >
> > On Fri, 26 Oct 2018 at 19:40, Richard Biener  
> > wrote:
> > >
> > > On Fri, Oct 26, 2018 at 4:55 AM Jeff Law  wrote:
> > > >
> > > > On 10/25/18 4:33 PM, Kugan Vivekanandarajah wrote:
> > > > > Hi,
> > > > >
> > > > > PR87528 showed a case where libgcc generated popcount is causing
> > > > > regression for Skylake.
> > > > > We also have PR86677 where kernel build is failing because the kernel
> > > > > does not use the libgcc (when backend is not defining popcount
> > > > > pattern).  While I agree that the kernel should implement its own
> > > > > functionality when it is not using the libgcc, I am afraid that the
> > > > > implementation can have the same performance issues reported for
> > > > > Skylake in PR87528.
> > > > >
> > > > > Therefore, I would like to propose that we disable popcount detection
> > > > > when we don't have a pattern for that. The attached patch (based on
> > > > > previous discussions) does this.
> > > > >
> > > > > Bootstrapped and regression tested on x86_64-linux-gnu with no new
> > > > > regressions. We need to disable the popcount* testcases. I will have
> > > > > to define a effective_target_with_popcount in
> > > > > gcc/testsuite/lib/target-supports.exp if this patch is OK?
> > > > > Thanks,
> > > > > Kugan
> > > > >
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > 2018-10-25  Kugan Vivekanandarajah  
> > > > >
> > > > > * tree-scalar-evolution.c (expression_expensive_p): Make BUILTIN 
> > > > > POPCOUNT
> > > > > as expensive when backend does not define it.
> > > > >
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > > 2018-10-25  Kugan Vivekanandarajah  
> > > > >
> > > > > * gcc.target/aarch64/popcount4.c: New test.
> > > > >
> > > > FWIW, I've been disabling by checking direct_optab_handler elsewhere
> > > > (number_of_iterations_popcount) in my tester.  It may in fact be an old
> > > > patch from you.
> > > >
> > > > Richi argued that it's the kernel team's responsibility to provide a
> > > > popcount since they don't link with libgcc.  And I'm generally in
> > > > agreement with that position, though it does tend to generate some
> > > > friction with the kernel developers.  We also run the real risk of GCC 9
> > > > not being able to build the kernel which, IMHO, would be a disaster from
> > > > a PR standpoint.
> > > >
> > > > I'd like to hear from others here.  I fully realize we're beyond the
> > > > realm of what is strictly technically correct here from a review 
> > > > standpoint.
> > >
> > > As said final value replacement to a library call is probably not wanted
> > > for optimization purpose, so adjusting expression_expensive_p is OK with
> > > me.  It might not fully solve the (non-)issue in case another 
> > > optimization pass
> > > chooses to materialize niter computation result.
> > >
> > > Few comments on the patch:
> > >
> > > +  tree fndecl = get_callee_fndecl (expr);
> > > +
> > > +  if (fndecl && DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL)
> > > +   {
> > > + combined_fn cfn = as_combined_fn (DECL_FUNCTION_CODE (fndecl));
> > >
> > >   combined_fn cfn = gimple_call_combined_fn (expr);
> > >   switch (cfn)
> > > {
> >
> > Did you mean:
> > combined_fn cfn = get_call_combined_fn (expr);
>
> Yes.
>
> > > ...
> > >
> > > cfn will be CFN_LAST for a non-builtin/internal call.  I know Richard is 
> > > mostly
> > > offline but eventually he knows whether there is a better way to query
> > >
> > > +   CASE_CFN_POPCOUNT:
> > > + /* Check if opcode for popcount is available.  */
> > > + if (optab_handler (popcount_optab,
> > > +

Re: [RFC][PR87528][PR86677] Disable builtin popcount detection when back-end does not define it

2018-10-28 Thread Kugan Vivekanandarajah
Hi Richard and Jeff,

Thanks for your comments.

On Fri, 26 Oct 2018 at 19:40, Richard Biener  wrote:
>
> On Fri, Oct 26, 2018 at 4:55 AM Jeff Law  wrote:
> >
> > On 10/25/18 4:33 PM, Kugan Vivekanandarajah wrote:
> > > Hi,
> > >
> > > PR87528 showed a case where libgcc generated popcount is causing
> > > regression for Skylake.
> > > We also have PR86677 where kernel build is failing because the kernel
> > > does not use the libgcc (when backend is not defining popcount
> > > pattern).  While I agree that the kernel should implement its own
> > > functionality when it is not using the libgcc, I am afraid that the
> > > implementation can have the same performance issues reported for
> > > Skylake in PR87528.
> > >
> > > Therefore, I would like to propose that we disable popcount detection
> > > when we don't have a pattern for that. The attached patch (based on
> > > previous discussions) does this.
> > >
> > > Bootstrapped and regression tested on x86_64-linux-gnu with no new
> > > regressions. We need to disable the popcount* testcases. I will have
> > > to define a effective_target_with_popcount in
> > > gcc/testsuite/lib/target-supports.exp if this patch is OK?
> > > Thanks,
> > > Kugan
> > >
> > >
> > > gcc/ChangeLog:
> > >
> > > 2018-10-25  Kugan Vivekanandarajah  
> > >
> > > * tree-scalar-evolution.c (expression_expensive_p): Make BUILTIN 
> > > POPCOUNT
> > > as expensive when backend does not define it.
> > >
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > 2018-10-25  Kugan Vivekanandarajah  
> > >
> > > * gcc.target/aarch64/popcount4.c: New test.
> > >
> > FWIW, I've been disabling by checking direct_optab_handler elsewhere
> > (number_of_iterations_popcount) in my tester.  It may in fact be an old
> > patch from you.
> >
> > Richi argued that it's the kernel team's responsibility to provide a
> > popcount since they don't link with libgcc.  And I'm generally in
> > agreement with that position, though it does tend to generate some
> > friction with the kernel developers.  We also run the real risk of GCC 9
> > not being able to build the kernel which, IMHO, would be a disaster from
> > a PR standpoint.
> >
> > I'd like to hear from others here.  I fully realize we're beyond the
> > realm of what is strictly technically correct here from a review standpoint.
>
> As said final value replacement to a library call is probably not wanted
> for optimization purpose, so adjusting expression_expensive_p is OK with
> me.  It might not fully solve the (non-)issue in case another optimization 
> pass
> chooses to materialize niter computation result.
>
> Few comments on the patch:
>
> +  tree fndecl = get_callee_fndecl (expr);
> +
> +  if (fndecl && DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL)
> +   {
> + combined_fn cfn = as_combined_fn (DECL_FUNCTION_CODE (fndecl));
>
>   combined_fn cfn = gimple_call_combined_fn (expr);
>   switch (cfn)
> {

Did you mean:
combined_fn cfn = get_call_combined_fn (expr);

> ...
>
> cfn will be CFN_LAST for a non-builtin/internal call.  I know Richard is 
> mostly
> offline but eventually he knows whether there is a better way to query
>
> +   CASE_CFN_POPCOUNT:
> + /* Check if opcode for popcount is available.  */
> + if (optab_handler (popcount_optab,
> +TYPE_MODE (TREE_TYPE (CALL_EXPR_ARG
> (expr, 0
> + == CODE_FOR_nothing)
> +   return true;
>
> note that we currently generate builtin calls rather than IFN calls
> (when a direct
> optab is supported).
>
> Another comment on the patch is that you probably have to adjust existing
> popcount testcases to add architecture specific flags enabling suport for
> the instructions, otherwise you won't see loop replacement.
Indeed.
In lib/target-supports.exp, I will try to add support for
check_effective_target_popcount_long.
When I grep for the popcount pattern in md files, I see it is defined for:

tilegx
tilepro
alpha
aarch64  when TARGET_SIMD
ia64
rs6000
i386  when TARGET_POPCOUNT
powerpcspe  when TARGET_POPCNTB || TARGET_POPCNTD
s390  when TARGET_Z196 && TARGET_64BIT
sparc when TARGET_POPC
arm when TARGET_NEON
mips when ISA_HAS_POP
spu
avr

I can check these targets with the condition.
Another possibility is to check with sample code and see whether we are
getting a libcall in the asm. Not sure if that is straightforward. Are
th
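The sample-code probe could be as simple as this (a sketch; the
assumption is that compiling it at -O2 and scanning the asm for a
__popcount*2 call tells us whether the target expands popcount
natively):

/* With a popcount pattern this becomes a single instruction (CNT on
   AArch64 SIMD, POPCNT on x86); without one it becomes a call to
   libgcc's __popcountdi2.  */
int
probe (unsigned long x)
{
  return __builtin_popcountl (x);
}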

[PR87469] ICE in record_estimate, at tree-ssa-loop-niter.c

2018-10-27 Thread Kugan Vivekanandarajah
Hi,

In the testcase provided in the bug report, the max value for the niter
estimation is off by one when the popcount call folds to an INTEGER_CST.
As a result, it trips the assert where the values are checked for equality.
Attached patch fixes this. Bootstrapped and regression tested on
x86_64-linux-gnu with no new regressions. Is this OK?

Thanks,
Kugan

gcc/testsuite/ChangeLog:

2018-10-26  Kugan Vivekanandarajah  

PR middle-end/87469
* g++.dg/pr87469.C: New test.

gcc/ChangeLog:

2018-10-26  Kugan Vivekanandarajah  

PR middle-end/87469
* tree-ssa-loop-niter.c (number_of_iterations_popcount): Fix niter
max value.
From 359f6aa2d603784b900feedb7ad450523695e191 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 26 Oct 2018 09:04:47 +1100
Subject: [PATCH] pr87469 V2

Change-Id: If1f7da7450ae27e24baf638861c97ff416f8484a
---
 gcc/testsuite/g++.dg/pr87469.C | 15 +++
 gcc/tree-ssa-loop-niter.c  |  8 +++-
 2 files changed, 18 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/pr87469.C

diff --git a/gcc/testsuite/g++.dg/pr87469.C b/gcc/testsuite/g++.dg/pr87469.C
new file mode 100644
index 000..2f6de97
--- /dev/null
+++ b/gcc/testsuite/g++.dg/pr87469.C
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-c -w -O2"  } */
+long a;
+struct c {
+void d(unsigned f) {
+	long e = f;
+	while (e & (e - 1))
+	  e &= e - 1;
+	a = e;
+}
+};
+void g() {
+c b;
+b.d(4 + 2);
+}
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index e2bc936..e763b35 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -2589,11 +2589,9 @@ number_of_iterations_popcount (loop_p loop, edge exit,
   if (TREE_CODE (call) == INTEGER_CST)
 max = tree_to_uhwi (call);
   else
-{
-  max = TYPE_PRECISION (TREE_TYPE (src));
-  if (adjust)
-	max = max - 1;
-}
+max = TYPE_PRECISION (TREE_TYPE (src));
+  if (adjust)
+max = max - 1;
 
   niter->niter = iter;
   niter->assumptions = boolean_true_node;
-- 
2.7.4
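A worked example of the off-by-one (my own illustration based on the
testcase above): the loop condition tests b & (b - 1), so the recorded
bound has to be popcount (src) - 1 whether or not the popcount call
folds to a constant.

#include <assert.h>

int
main (void)
{
  long b = 6;                   /* the b.d (4 + 2) value from the test  */
  int iters = 0;
  while (b & (b - 1))
    {
      b &= b - 1;
      iters++;
    }
  /* popcount (6) == 2, but the latch runs only once, so the same
     "adjust" must be applied on the INTEGER_CST path too.  */
  assert (iters == __builtin_popcountl (6) - 1);
  return 0;
}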



[ABSU_EXPR] Add some of the missing patterns in match.pd

2018-10-25 Thread Kugan Vivekanandarajah
Hi,

This patch adds some of the missing patterns in match.pd for ABSU_EXPR.
It is a revised version based on the review at
https://gcc.gnu.org/ml/gcc-patches/2018-07/msg00046.html
Bootstrapped and regression tested on x86_64-linux-gnu with no new
regressions. Is this OK for trunk?

Thanks,
Kugan

gcc/testsuite/ChangeLog:

2018-10-25  Kugan Vivekanandarajah  

* gcc.dg/gimplefe-30.c: New test.
* gcc.dg/gimplefe-31.c: New test.
* gcc.dg/gimplefe-32.c: New test.
* gcc.dg/gimplefe-33.c: New test.


gcc/ChangeLog:

2018-10-25  Kugan Vivekanandarajah  

* doc/generic.texi (ABSU_EXPR): Document.
* match.pd (absu(x)*absu(x) -> x*x): Handle.
(absu(absu(X)) -> absu(X)): Likewise.
(absu(-X) -> absu(X)): Likewise.
(absu(X)  where X is nonnegative -> X): Likewise.
From de3cc3764cc38e4a28fd4edf138e77c310513eba Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Wed, 24 Oct 2018 20:43:54 +1100
Subject: [PATCH] absu pattern

Change-Id: I1f8fab31a33790f2266683230d38a2172f710f4d
---
 gcc/doc/generic.texi   |  6 ++
 gcc/match.pd   | 24 
 gcc/testsuite/gcc.dg/gimplefe-30.c | 16 
 gcc/testsuite/gcc.dg/gimplefe-31.c | 16 
 gcc/testsuite/gcc.dg/gimplefe-32.c | 14 ++
 gcc/testsuite/gcc.dg/gimplefe-33.c | 16 
 6 files changed, 92 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/gimplefe-30.c
 create mode 100644 gcc/testsuite/gcc.dg/gimplefe-31.c
 create mode 100644 gcc/testsuite/gcc.dg/gimplefe-32.c
 create mode 100644 gcc/testsuite/gcc.dg/gimplefe-33.c

diff --git a/gcc/doc/generic.texi b/gcc/doc/generic.texi
index cf4bcf5..41a4062 100644
--- a/gcc/doc/generic.texi
+++ b/gcc/doc/generic.texi
@@ -1274,6 +1274,7 @@ the byte offset of the field, but should not be used directly; call
 @subsection Unary and Binary Expressions
 @tindex NEGATE_EXPR
 @tindex ABS_EXPR
+@tindex ABSU_EXPR
 @tindex BIT_NOT_EXPR
 @tindex TRUTH_NOT_EXPR
 @tindex PREDECREMENT_EXPR
@@ -1371,6 +1372,11 @@ or complex abs of a complex value, use the @code{BUILT_IN_CABS},
 to implement the C99 @code{cabs}, @code{cabsf} and @code{cabsl}
 built-in functions.
 
+@item ABSU_EXPR
+These nodes represent the absolute value of the single operand,
+computed in the equivalent unsigned type, so that @code{ABSU_EXPR}
+of TYPE_MIN is well defined.
+
 @item BIT_NOT_EXPR
 These nodes represent bitwise complement, and will always have integral
 type.  The only operand is the value to be complemented.
diff --git a/gcc/match.pd b/gcc/match.pd
index b36d7cc..1c1f225 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -590,6 +590,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (mult (abs@1 @0) @1)
  (mult @0 @0))
 
+/* Convert absu(x)*absu(x) -> x*x.  */
+(simplify
+ (mult (absu@1 @0) @1)
+ (mult (convert@2 @0) @2))
+
 /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
 (for coss (COS COSH)
  copysigns (COPYSIGN)
@@ -1121,16 +1126,35 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   && tree_nop_conversion_p (type, TREE_TYPE (@2)))
   (bit_xor (convert @1) (convert @2
 
+/* Convert abs (abs (X)) into abs (X),
+   and absu (absu (X)) into absu (X).  */
 (simplify
  (abs (abs@1 @0))
  @1)
+
+(simplify
+ (absu (convert@2 (absu@1 @0)))
+ (if (tree_nop_conversion_p (TREE_TYPE (@2), TREE_TYPE (@1)))
+  @1))
+
+/* Convert abs[u] (-X) -> abs[u] (X).  */
 (simplify
  (abs (negate @0))
  (abs @0))
+
+(simplify
+ (absu (negate @0))
+ (absu @0))
+
+/* Convert abs[u] (X)  where X is nonnegative -> (X).  */
 (simplify
  (abs tree_expr_nonnegative_p@0)
  @0)
 
+(simplify
+ (absu tree_expr_nonnegative_p@0)
+ (convert @0))
+
 /* A few cases of fold-const.c negate_expr_p predicate.  */
 (match negate_expr_p
  INTEGER_CST
diff --git a/gcc/testsuite/gcc.dg/gimplefe-30.c b/gcc/testsuite/gcc.dg/gimplefe-30.c
new file mode 100644
index 000..6c25106
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gimplefe-30.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fgimple -fdump-tree-optimized" } */
+
+unsigned int __GIMPLE() f(int a)
+{
+  unsigned int t0;
+  unsigned int t1;
+  unsigned int t2;
+  t0 = __ABSU a;
+  t1 = __ABSU a;
+  t2 = t0 * t1;
+  return t2;
+}
+
+
+/* { dg-final { scan-tree-dump-times "ABSU" 0 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/gimplefe-31.c b/gcc/testsuite/gcc.dg/gimplefe-31.c
new file mode 100644
index 000..a97d0dd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gimplefe-31.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O -fgimple -fdump-tree-optimized" } */
+
+
+unsigned int __GIMPLE() f(int a)
+{
+  unsigned int t0;
+  int t1;
+  unsigned int t2;
+  t0 = __ABSU a;
+  t1 = (int) t0;
+  t2 = __ABSU t1;
+  return t2;
+}
+
+/* { dg-final { scan-tree-dump-times "ABSU" 1 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/gimplefe-32.c b/gcc/testsuite/gcc.dg/gimplefe-32.c
new fi
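A scalar model of the ABSU_EXPR semantics these patterns rely on (my own
sketch): the absolute value is computed in the equivalent unsigned type,
so even TYPE_MIN is well defined.

#include <limits.h>

static unsigned int
absu (int x)
{
  return x < 0 ? 0u - (unsigned int) x : (unsigned int) x;
}

/* absu (INT_MIN) == 2147483648u for 32-bit int, whereas plain signed
   abs (INT_MIN) overflows.  This is also why absu (x) * absu (x) can
   fold to (unsigned) x * (unsigned) x: the product is taken in the
   unsigned type, where wrapping is defined.  */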

[RFC][PR87528][PR86677] Disable builtin popcount detection when back-end does not define it

2018-10-25 Thread Kugan Vivekanandarajah
Hi,

PR87528 showed a case where libgcc generated popcount is causing
regression for Skylake.
We also have PR86677 where kernel build is failing because the kernel
does not use the libgcc (when backend is not defining popcount
pattern).  While I agree that the kernel should implement its own
functionality when it is not using the libgcc, I am afraid that the
implementation can have the same performance issues reported for
Skylake in PR87528.

Therefore, I would like to propose that we disable popcount detection
when we don't have a pattern for that. The attached patch (based on
previous discussions) does this.

Bootstrapped and regression tested on x86_64-linux-gnu with no new
regressions. We need to disable the popcount* testcases. I will have
to define a effective_target_with_popcount in
gcc/testsuite/lib/target-supports.exp if this patch is OK?

Thanks,
Kugan


gcc/ChangeLog:

2018-10-25  Kugan Vivekanandarajah  

* tree-scalar-evolution.c (expression_expensive_p): Make BUILTIN POPCOUNT
as expensive when backend does not define it.


gcc/testsuite/ChangeLog:

2018-10-25  Kugan Vivekanandarajah  

* gcc.target/aarch64/popcount4.c: New test.
From 1cf48663a678def7eb7f464ca4dbadd7e7311155 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Wed, 24 Oct 2018 20:33:50 +1100
Subject: [PATCH] fix kernel build

Change-Id: I1ac6d419419c1e87981f7c15916c313a11a23d97
---
 gcc/testsuite/gcc.target/aarch64/popcount4.c | 14 ++
 gcc/tree-scalar-evolution.c  | 20 
 2 files changed, 34 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/popcount4.c

diff --git a/gcc/testsuite/gcc.target/aarch64/popcount4.c b/gcc/testsuite/gcc.target/aarch64/popcount4.c
new file mode 100644
index 000..ee55b2e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/popcount4.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized -mgeneral-regs-only" } */
+
+int PopCount (long b) {
+int c = 0;
+
+while (b) {
+	b &= b - 1;
+	c++;
+}
+return c;
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_popcount" 0 "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 6475743..3dcb0d5 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -257,7 +257,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "rtl.h"
+#include "optabs-query.h"
 #include "tree.h"
 #include "gimple.h"
 #include "ssa.h"
@@ -282,6 +284,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "tree-into-ssa.h"
 #include "builtins.h"
+#include "case-cfn-macros.h"
 
 static tree analyze_scalar_evolution_1 (struct loop *, tree);
 static tree analyze_scalar_evolution_for_address_of (struct loop *loop,
@@ -3500,6 +3503,23 @@ expression_expensive_p (tree expr)
 {
   tree arg;
   call_expr_arg_iterator iter;
+  tree fndecl = get_callee_fndecl (expr);
+
+  if (fndecl && DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL)
+	{
+	  combined_fn cfn = as_combined_fn (DECL_FUNCTION_CODE (fndecl));
+	  switch (cfn)
+	{
+	CASE_CFN_POPCOUNT:
+	  /* Check if opcode for popcount is available.  */
+	  if (optab_handler (popcount_optab,
+ TYPE_MODE (TREE_TYPE (CALL_EXPR_ARG (expr, 0
+		  == CODE_FOR_nothing)
+		return true;
+	default:
+	  break;
+	}
+	}
 
   if (!is_inexpensive_builtin (get_callee_fndecl (expr)))
 	return true;
-- 
2.7.4



[SVE ACLE] Implements ACLE svdup, svindex, svqadd/svqsub, svabd and svmul

2018-10-15 Thread Kugan Vivekanandarajah
Hi,
Attached patch implements the ACLE svdup, svindex, svqadd/svqsub, svabd
and svmul built-ins.
Committed to the ACLE branch.
Thanks,
Kugan


0001-svdup-svindex-svqad-qsub-svabd-and-svmul.patch.gz
Description: application/gzip


Re: [RFC] Fix recent popcount change is breaking

2018-07-27 Thread Kugan Vivekanandarajah
Hi,

On 28 July 2018 at 01:13, Richard Biener  wrote:
> On July 27, 2018 3:33:59 PM GMT+02:00, "Martin Liška"  wrote:
>>On 07/11/2018 02:31 PM, Richard Biener wrote:
>>> Why not simply make popcountdi available in the kernel?  They do have
>>> implementations for other libgcc functions IIRC.
>>
>>Can you please Kugan create Linux kernel bug for that? So that
>>discussion
>>can happen?
>
> There's no discussion necessary, libgcc is the core compiler runtime. If you 
> choose not to use it you have to provide your own implementation.

Created a bug against the kernel at
https://bugzilla.kernel.org/show_bug.cgi?id=200671

Thanks,
Kugan
>
> Richard.
>
>>Thanks,
>>Martin
>


[PR86544] Fix Popcount detection generates different code on C and C++

2018-07-17 Thread Kugan Vivekanandarajah
Attached patch fixes phi-opt not optimizing the C++ testcase where we have

if (b_4(D) == 0) instead of if (b_4(D) != 0) as shown in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86544

Patch bootstrapped and regression tested on x86_64-linux-gnu with no
new regressions.

Is this OK for trunk?

Thanks,
Kugan

gcc/ChangeLog:

2018-07-18  Kugan Vivekanandarajah  

PR middle-end/86544
* tree-ssa-phiopt.c (cond_removal_in_popcount_pattern): Handle
comparison with EQ_EXPR
in last stmt.

gcc/testsuite/ChangeLog:

2018-07-18  Kugan Vivekanandarajah  

PR middle-end/86544
* g++.dg/tree-ssa/pr86544.C: New test.
From c482a4225764e0b338abafb7ccea4553f273f5d5 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Wed, 18 Jul 2018 08:14:16 +1000
Subject: [PATCH] fix cpp testcase

Change-Id: Icd59b31faef2ac66beb42990cb69cbbe38c238aa
---
 gcc/testsuite/g++.dg/tree-ssa/pr86544.C | 15 +++
 gcc/tree-ssa-phiopt.c   | 26 +++---
 2 files changed, 30 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/tree-ssa/pr86544.C

diff --git a/gcc/testsuite/g++.dg/tree-ssa/pr86544.C b/gcc/testsuite/g++.dg/tree-ssa/pr86544.C
new file mode 100644
index 000..8a90089
--- /dev/null
+++ b/gcc/testsuite/g++.dg/tree-ssa/pr86544.C
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-phiopt3 -fdump-tree-optimized" } */
+
+int PopCount (long b) {
+int c = 0;
+
+while (b) {
+	b &= b - 1;
+	c++;
+}
+return c;
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_popcount" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "if" 0 "phiopt3" } } */
diff --git a/gcc/tree-ssa-phiopt.c b/gcc/tree-ssa-phiopt.c
index 656f840..1667bad 100644
--- a/gcc/tree-ssa-phiopt.c
+++ b/gcc/tree-ssa-phiopt.c
@@ -1614,8 +1614,22 @@ cond_removal_in_popcount_pattern (basic_block cond_bb, basic_block middle_bb,
   arg = gimple_assign_rhs1 (cast);
 }
 
+  cond = last_stmt (cond_bb);
+
+  /* Cond_bb has a check for b_4 [!=|==] 0 before calling the popcount
+ builtin.  */
+  if (gimple_code (cond) != GIMPLE_COND
+  || (gimple_cond_code (cond) != NE_EXPR
+	  && gimple_cond_code (cond) != EQ_EXPR)
+  || !integer_zerop (gimple_cond_rhs (cond))
+  || arg != gimple_cond_lhs (cond))
+return false;
+
   /* Canonicalize.  */
-  if (e2->flags & EDGE_TRUE_VALUE)
+  if ((e2->flags & EDGE_TRUE_VALUE
+   && gimple_cond_code (cond) == NE_EXPR)
+  || (e1->flags & EDGE_TRUE_VALUE
+	  && gimple_cond_code (cond) == EQ_EXPR))
 {
   std::swap (arg0, arg1);
   std::swap (e1, e2);
@@ -1625,16 +1639,6 @@ cond_removal_in_popcount_pattern (basic_block cond_bb, basic_block middle_bb,
   if (lhs != arg0 || !integer_zerop (arg1))
 return false;
 
-  cond = last_stmt (cond_bb);
-
-  /* Cond_bb has a check for b_4 != 0 before calling the popcount
- builtin.  */
-  if (gimple_code (cond) != GIMPLE_COND
-  || gimple_cond_code (cond) != NE_EXPR
-  || !integer_zerop (gimple_cond_rhs (cond))
-  || arg != gimple_cond_lhs (cond))
-return false;
-
   /* And insert the popcount builtin and cast stmt before the cond_bb.  */
   gsi = gsi_last_bb (cond_bb);
   if (cast)
-- 
2.7.4
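The C-vs-C++ difference comes down to which comparison the front end
leaves in cond_bb; a sketch of the two equivalent shapes the pattern now
accepts (the PR's C++ testcase produces the EQ_EXPR variant):

int
f_ne (long b)
{
  return b != 0 ? __builtin_popcountl (b) : 0;
}

int
f_eq (long b)	/* Same semantics; previously not recognized.  */
{
  return b == 0 ? 0 : __builtin_popcountl (b);
}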



Re: [RFC] Fix recent popcount change is breaking

2018-07-11 Thread Kugan Vivekanandarajah
Hi Andrew,

On 11 July 2018 at 15:43, Andrew Pinski  wrote:
> On Tue, Jul 10, 2018 at 6:35 PM Kugan Vivekanandarajah
>  wrote:
>>
>> Hi Andrew,
>>
>> On 11 July 2018 at 11:19, Andrew Pinski  wrote:
>> > On Tue, Jul 10, 2018 at 6:14 PM Kugan Vivekanandarajah
>> >  wrote:
>> >>
>> >> On 10 July 2018 at 23:17, Richard Biener  
>> >> wrote:
>> >> > On Tue, Jul 10, 2018 at 3:06 PM Kugan Vivekanandarajah
>> >> >  wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Jeff told me that the recent popcount built-in detection is causing
>> >> >> kernel build issues as
>> >> >> ERROR: "__popcountsi2"
>> >> >> [drivers/net/wireless/broadcom/brcm80211/brcmfmac/brcmfmac.ko] 
>> >> >> undefined!
>> >> >>
>> >> >> I could also reproduce this. AFAIK, we should check if the libfunc is
>> >> >> defined while checking popcount?
>> >> >>
>> >> >> I am testing the attached RFC patch. Is this reasonable?
>> >> >
>> >> > It doesn't work that way, all targets have this libfunc in libgcc.  
>> >> > This means
>> >> > the kernel has to provide it.  The only thing you could do is restrict
>> >> > replacement of CALL_EXPRs (in SCEV cprop) to those the target
>> >> > natively supports.
>> >>
>> >> How about restricting it in expression_expensive_p ? Is that what you
>> >> wanted. Attached patch does this.
>> >> Bootstrap and regression testing progressing.
>> >
>> > Seems like that should go into is_inexpensive_builtin  instead which
>> > is just tested right below.
>>
>> I thought about that. is_inexpensive_builtin is used in various other
>> places, including some inlining decisions, so I wasn't sure if it is
>> the right thing. Happy to change it if that is the right thing to do.
>
> I audited all of the users (and their users if it is used in a
> wrapper) and found that is_inexpensive_builtin should return false for
> this builtin if it is a function call in the end; there are other
> builtins which should be checked the similar way but I think we should
> not going to force you to do the similar thing for those builtins.

Attached patch does this. Testing is progressing. Is This OK if no regression.

Thanks,
Kugan


>
> Thanks,
> Andrew
>
>>
>> Thanks,
>> Kugan
>> >
>> > Thanks,
>> > Andrew
>> >
>> >>
>> >> Thanks,
>> >> Kugan
>> >>
>> >> >
>> >> > Richard.
>> >> >
>> >> >> Thanks,
>> >> >> Kugan
>> >> >>
>> >> >> gcc/ChangeLog:
>> >> >>
>> >> >> 2018-07-10  Kugan Vivekanandarajah  
>> >> >>
>> >> >> * tree-ssa-loop-niter.c (number_of_iterations_popcount): Check
>> >> >> if libfunc for popcount is available.
diff --git a/gcc/builtins.c b/gcc/builtins.c
index 820d6c2..59cf567 100644
--- a/gcc/builtins.c
+++ b/gcc/builtins.c
@@ -10619,6 +10619,18 @@ is_inexpensive_builtin (tree decl)
   else if (DECL_BUILT_IN_CLASS (decl) == BUILT_IN_NORMAL)
 switch (DECL_FUNCTION_CODE (decl))
   {
+  case BUILT_IN_POPCOUNT:
+  case BUILT_IN_POPCOUNTL:
+  case BUILT_IN_POPCOUNTLL:
+ {
+   tree arg = TYPE_ARG_TYPES (TREE_TYPE (decl));
+   /* Check if opcode for popcount is available.  */
+   if (optab_handler (popcount_optab,
+  TYPE_MODE (TREE_VALUE (arg)))
+   == CODE_FOR_nothing)
+ return false;
+ }
+   return true;
   case BUILT_IN_ABS:
   CASE_BUILT_IN_ALLOCA:
   case BUILT_IN_BSWAP16:
@@ -10670,10 +10682,7 @@ is_inexpensive_builtin (tree decl)
   case BUILT_IN_VA_COPY:
   case BUILT_IN_TRAP:
   case BUILT_IN_SAVEREGS:
-  case BUILT_IN_POPCOUNTL:
-  case BUILT_IN_POPCOUNTLL:
   case BUILT_IN_POPCOUNTIMAX:
-  case BUILT_IN_POPCOUNT:
   case BUILT_IN_PARITYL:
   case BUILT_IN_PARITYLL:
   case BUILT_IN_PARITYIMAX:
diff --git a/gcc/testsuite/gcc.target/aarch64/popcount4.c 
b/gcc/testsuite/gcc.target/aarch64/popcount4.c
index e69de29..ee55b2e 100644
--- a/gcc/testsuite/gcc.target/aarch64/popcount4.c
+++ b/gcc/testsuite/gcc.target/aarch64/popcount4.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized -mgeneral-regs-only" } */
+
+int PopCount (long b) {
+int c = 0;
+
+while (b) {
+   b &= b - 1;
+   c++;
+}
+return c;
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_popcount" 0 "optimized" } } */


Re: [RFC] Fix recent popcount change is breaking

2018-07-10 Thread Kugan Vivekanandarajah
Hi Andrew,

On 11 July 2018 at 11:19, Andrew Pinski  wrote:
> On Tue, Jul 10, 2018 at 6:14 PM Kugan Vivekanandarajah
>  wrote:
>>
>> On 10 July 2018 at 23:17, Richard Biener  wrote:
>> > On Tue, Jul 10, 2018 at 3:06 PM Kugan Vivekanandarajah
>> >  wrote:
>> >>
>> >> Hi,
>> >>
>> >> Jeff told me that the recent popcount built-in detection is causing
>> >> kernel build issues as
>> >> ERROR: "__popcountsi2"
>> >> [drivers/net/wireless/broadcom/brcm80211/brcmfmac/brcmfmac.ko] undefined!
>> >>
>> >> I could also reproduce this. AFAIK, we should check if the libfunc is
>> >> defined while checking popcount?
>> >>
>> >> I am testing the attached RFC patch. Is this reasonable?
>> >
>> > It doesn't work that way, all targets have this libfunc in libgcc.  This 
>> > means
>> > the kernel has to provide it.  The only thing you could do is restrict
>> > replacement of CALL_EXPRs (in SCEV cprop) to those the target
>> > natively supports.
>>
>> How about restricting it in expression_expensive_p ? Is that what you
>> wanted. Attached patch does this.
>> Bootstrap and regression testing progressing.
>
> Seems like that should go into is_inexpensive_builtin  instead which
> is just tested right below.

I thought about that. is_inexpensive_builtin is used in various other
places, including some inlining decisions, so I wasn't sure if it is
the right thing. Happy to change it if that is the right thing to do.

Thanks,
Kugan
>
> Thanks,
> Andrew
>
>>
>> Thanks,
>> Kugan
>>
>> >
>> > Richard.
>> >
>> >> Thanks,
>> >> Kugan
>> >>
>> >> gcc/ChangeLog:
>> >>
>> >> 2018-07-10  Kugan Vivekanandarajah  
>> >>
>> >> * tree-ssa-loop-niter.c (number_of_iterations_popcount): Check
>> >> if libfunc for popcount is available.


Re: [RFC] Fix recent popcount change is breaking

2018-07-10 Thread Kugan Vivekanandarajah
On 10 July 2018 at 23:17, Richard Biener  wrote:
> On Tue, Jul 10, 2018 at 3:06 PM Kugan Vivekanandarajah
>  wrote:
>>
>> Hi,
>>
>> Jeff told me that the recent popcount built-in detection is causing
>> kernel build issues as
>> ERROR: "__popcountsi2"
>> [drivers/net/wireless/broadcom/brcm80211/brcmfmac/brcmfmac.ko] undefined!
>>
>> I could also reproduce this. AFAIK, we should check if the libfunc is
>> defined while checking popcount?
>>
>> I am testing the attached RFC patch. Is this reasonable?
>
> It doesn't work that way, all targets have this libfunc in libgcc.  This means
> the kernel has to provide it.  The only thing you could do is restrict
> replacement of CALL_EXPRs (in SCEV cprop) to those the target
> natively supports.

How about restricting it in expression_expensive_p ? Is that what you
wanted. Attached patch does this.
Bootstrap and regression testing progressing.

Thanks,
Kugan

>
> Richard.
>
>> Thanks,
>> Kugan
>>
>> gcc/ChangeLog:
>>
>> 2018-07-10  Kugan Vivekanandarajah  
>>
>> * tree-ssa-loop-niter.c (number_of_iterations_popcount): Check
>> if libfunc for popcount is available.
diff --git a/gcc/testsuite/gcc.target/aarch64/popcount4.c 
b/gcc/testsuite/gcc.target/aarch64/popcount4.c
index e69de29..ee55b2e 100644
--- a/gcc/testsuite/gcc.target/aarch64/popcount4.c
+++ b/gcc/testsuite/gcc.target/aarch64/popcount4.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized -mgeneral-regs-only" } */
+
+int PopCount (long b) {
+int c = 0;
+
+while (b) {
+   b &= b - 1;
+   c++;
+}
+return c;
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_popcount" 0 "optimized" } } */
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 69122f2..ec8e4ec 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -257,7 +257,9 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "rtl.h"
+#include "optabs-query.h"
 #include "tree.h"
 #include "gimple.h"
 #include "ssa.h"
@@ -282,6 +284,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-fold.h"
 #include "tree-into-ssa.h"
 #include "builtins.h"
+#include "case-cfn-macros.h"
 
 static tree analyze_scalar_evolution_1 (struct loop *, tree);
 static tree analyze_scalar_evolution_for_address_of (struct loop *loop,
@@ -3500,6 +3503,23 @@ expression_expensive_p (tree expr)
 {
   tree arg;
   call_expr_arg_iterator iter;
+  tree fndecl = get_callee_fndecl (expr);
+
+  if (fndecl && DECL_BUILT_IN_CLASS (fndecl) == BUILT_IN_NORMAL)
+   {
+ combined_fn cfn = as_combined_fn (DECL_FUNCTION_CODE (fndecl));
+ switch (cfn)
+   {
+   CASE_CFN_POPCOUNT:
+ /* Check if opcode for popcount is available.  */
+ if (optab_handler (popcount_optab,
+TYPE_MODE (TREE_TYPE (CALL_EXPR_ARG (expr, 
0
+ == CODE_FOR_nothing)
+   return true;
+   default:
+ break;
+   }
+   }
 
   if (!is_inexpensive_builtin (get_callee_fndecl (expr)))
return true;


[RFC] Fix recent popcount change is breaking

2018-07-10 Thread Kugan Vivekanandarajah
Hi,

Jeff told me that the recent popcount built-in detection is causing
kernel build issues as
ERROR: "__popcountsi2"
[drivers/net/wireless/broadcom/brcm80211/brcmfmac/brcmfmac.ko] undefined!

I could also reproduce this. AFAIK, we should check if the libfunc is
defined while checking popcount?

I am testing the attached RFC patch. Is this reasonable?

Thanks,
Kugan

gcc/ChangeLog:

2018-07-10  Kugan Vivekanandarajah  

* tree-ssa-loop-niter.c (number_of_iterations_popcount): Check
if libfunc for popcount is available.
diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index f6fa2f7..2e2b9c6 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -21,6 +21,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
+#include "target.h"
 #include "rtl.h"
 #include "tree.h"
 #include "gimple.h"
@@ -42,6 +43,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-chrec.h"
 #include "tree-scalar-evolution.h"
 #include "params.h"
+#include "optabs-libfuncs.h"
 #include "tree-dfa.h"
 
 
@@ -2570,6 +2572,10 @@ number_of_iterations_popcount (loop_p loop, edge exit,
   (long_long_integer_type_node))
 fn = builtin_decl_implicit (BUILT_IN_POPCOUNTLL);
 
+  /* Check if libfunc for popcount is available.  */
+  if (!optab_libfunc (popcount_optab,
+ TYPE_MODE (TREE_TYPE (src
+return false;
   /* ??? Support promoting char/short to int.  */
   if (!fn)
 return false;
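For context, a sketch of the transform this RFC wants to gate: niter
analysis plus final value replacement turn the counting loop into a
builtin call, and that call is what lowers to the libgcc routine when
the target has no popcount pattern.

/* Before: the idiom the niter analysis recognizes.  */
int
popcount_loop (long b)
{
  int c = 0;
  while (b)
    {
      b &= b - 1;
      c++;
    }
  return c;
}

/* After (conceptually): lowers to a __popcountdi2 libcall if there is
   no popcount instruction.  */
int
popcount_replaced (long b)
{
  return __builtin_popcountl (b);
}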


Re: [PATCH 1/3][POPCOUNT] Handle COND_EXPR in expression_expensive_p

2018-07-09 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.

On 6 July 2018 at 20:17, Richard Biener  wrote:
> On Fri, Jul 6, 2018 at 11:45 AM Kugan Vivekanandarajah
>  wrote:
>>
>> Hi Richard,
>>
>> > It was rewrite_to_non_trapping_overflow available  in tree.h.  Thus
>> > final value replacement
>> > could use that before gimplifying instead of using 
>> > rewrite_to_defined_overflow
>> Thanks.
>>
>> Is the attached patch OK? I am testing this on x86_64-linux-gnu and if
>> there is no new regressions.
>
> Please clean up the control flow to
>
>   if (...)
> def = rewrite_to_non_trapping_overflow (def);
>   def = force_gimple_operand_gsi (, def, false, NULL_TREE,
> true, GSI_SAME_STMT);

I also had to add flag_trapv like we do in other places (for
flag_wrapv) when calling rewrite_to_non_trapping_overflow. The attached
patch bootstraps and there are no new regressions on x86_64-linux-gnu.
Is this OK?

Thanks,
Kugan
>
> OK with that change.
> Richard.
>
>> Thanks,
>> Kugan
>>
>> gcc/ChangeLog:
>>
>> 2018-07-06  Kugan Vivekanandarajah  
>>
>> * tree-scalar-evolution.c (final_value_replacement_loop): Use
>> rewrite_to_non_trapping_overflow instead of rewrite_to_defined_overflow.
From f3ecde5ff57d361e452965550aca94560629e784 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 6 Jul 2018 13:34:41 +1000
Subject: [PATCH] rewrite with rewrite_to_non_trapping_overflow

Change-Id: I18bb9713b4562cd3f3954c3997bb376969d8ce9b
---
 gcc/tree-scalar-evolution.c | 29 +++--
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 4b0ec02..4feb4b1 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -267,6 +267,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-iterator.h"
 #include "gimplify-me.h"
 #include "tree-cfg.h"
+#include "tree-eh.h"
 #include "tree-ssa-loop-ivopts.h"
 #include "tree-ssa-loop-manip.h"
 #include "tree-ssa-loop-niter.h"
@@ -3615,30 +3616,14 @@ final_value_replacement_loop (struct loop *loop)
   if (folded_casts && ANY_INTEGRAL_TYPE_P (TREE_TYPE (def))
 	  && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (def)))
 	{
-	  gimple_seq stmts;
-	  gimple_stmt_iterator gsi2;
-	  def = force_gimple_operand (def, , true, NULL_TREE);
-	  gsi2 = gsi_start (stmts);
-	  while (!gsi_end_p (gsi2))
-	{
-	  gimple *stmt = gsi_stmt (gsi2);
-	  gimple_stmt_iterator gsi3 = gsi2;
-	  gsi_next ();
-	  gsi_remove (, false);
-	  if (is_gimple_assign (stmt)
-		  && arith_code_with_undefined_signed_overflow
-		  (gimple_assign_rhs_code (stmt)))
-		gsi_insert_seq_before (,
-   rewrite_to_defined_overflow (stmt),
-   GSI_SAME_STMT);
-	  else
-		gsi_insert_before (, stmt, GSI_SAME_STMT);
-	}
+	  bool saved_flag_trapv = flag_trapv;
+	  flag_trapv = 1;
+	  def = rewrite_to_non_trapping_overflow (def);
+	  flag_trapv = saved_flag_trapv;
 	}
-  else
-	def = force_gimple_operand_gsi (, def, false, NULL_TREE,
-	true, GSI_SAME_STMT);
 
+  def = force_gimple_operand_gsi (, def, false, NULL_TREE,
+	true, GSI_SAME_STMT);
   gassign *ass = gimple_build_assign (rslt, def);
   gsi_insert_before (, ass, GSI_SAME_STMT);
   if (dump_file)
-- 
2.7.4
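To illustrate the effect of the rewrite (my own sketch, not the helper's
literal output): arithmetic whose signed overflow is undefined, or would
trap under -ftrapv, is carried out in the corresponding unsigned type,
where wrapping is defined.

/* Conceptual result of rewriting `a + b' on int operands.  */
int
add_rewritten (int a, int b)
{
  return (int) ((unsigned int) a + (unsigned int) b);
}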



Re: [PATCH 1/3][POPCOUNT] Handle COND_EXPR in expression_expensive_p

2018-07-06 Thread Kugan Vivekanandarajah
Hi Richard,

> It was rewrite_to_non_trapping_overflow available  in tree.h.  Thus
> final value replacement
> could use that before gimplifying instead of using rewrite_to_defined_overflow
Thanks.

Is the attached patch OK? I am testing this on x86_64-linux-gnu and if
there is no new regressions.

Thanks,
Kugan

gcc/ChangeLog:

2018-07-06  Kugan Vivekanandarajah  

* tree-scalar-evolution.c (final_value_replacement_loop): Use
rewrite_to_non_trapping_overflow instead of rewrite_to_defined_overflow.
From 68a4f232f6cde68751f6785059121fe116363886 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 6 Jul 2018 13:34:41 +1000
Subject: [PATCH] rewrite with rewrite_to_non_trapping_overflow

Change-Id: Ica4407eab1c2b6f4190d8c0df6308154ffad2c1f
---
 gcc/tree-scalar-evolution.c | 20 +++-
 1 file changed, 3 insertions(+), 17 deletions(-)

diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 4b0ec02..3b4f0aa 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -267,6 +267,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gimple-iterator.h"
 #include "gimplify-me.h"
 #include "tree-cfg.h"
+#include "tree-eh.h"
 #include "tree-ssa-loop-ivopts.h"
 #include "tree-ssa-loop-manip.h"
 #include "tree-ssa-loop-niter.h"
@@ -3616,24 +3617,9 @@ final_value_replacement_loop (struct loop *loop)
 	  && TYPE_OVERFLOW_UNDEFINED (TREE_TYPE (def)))
 	{
 	  gimple_seq stmts;
-	  gimple_stmt_iterator gsi2;
+	  def = rewrite_to_non_trapping_overflow (def);
 	  def = force_gimple_operand (def, , true, NULL_TREE);
-	  gsi2 = gsi_start (stmts);
-	  while (!gsi_end_p (gsi2))
-	{
-	  gimple *stmt = gsi_stmt (gsi2);
-	  gimple_stmt_iterator gsi3 = gsi2;
-	  gsi_next (&gsi2);
-	  gsi_remove (&gsi3, false);
-	  if (is_gimple_assign (stmt)
-		  && arith_code_with_undefined_signed_overflow
-		  (gimple_assign_rhs_code (stmt)))
-		gsi_insert_seq_before (&gsi,
-				       rewrite_to_defined_overflow (stmt),
-				       GSI_SAME_STMT);
-	  else
-		gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
-	}
+	  gsi_insert_seq_before (&gsi, stmts, GSI_SAME_STMT);
 	}
   else
 	def = force_gimple_operand_gsi (&gsi, def, false, NULL_TREE,
-- 
2.7.4



Re: [PATCH 0/3][POPCOUNT]

2018-07-06 Thread Kugan Vivekanandarajah
Hi Jeff,

Thanks for looking into it.

On 6 July 2018 at 08:03, Jeff Law  wrote:
> On 06/24/2018 08:41 PM, Kugan Vivekanandarajah wrote:
>> Hi Jeff,
>>
>> Thanks for the comments.
>>
>> On 23 June 2018 at 02:06, Jeff Law  wrote:
>>> On 06/22/2018 03:11 AM, Kugan Vivekanandarajah wrote:
>>>> When we set niter with may_be_zero, currently final_value_replacement
>>>> will not happen because expression_expensive_p does not handle
>>>> COND_EXPR.  Patch 1 adds this.
>>>>
>>>> With that we have the following optimized gimple.
>>>>
>>>>    <bb 2> [local count: 118111601]:
>>>>   if (b_4(D) != 0)
>>>>     goto <bb 3>; [89.00%]
>>>>   else
>>>>     goto <bb 4>; [11.00%]
>>>>
>>>>    <bb 3> [local count: 105119324]:
>>>>   _2 = (unsigned long) b_4(D);
>>>>   _9 = __builtin_popcountl (_2);
>>>>   c_3 = b_4(D) != 0 ? _9 : 1;
>>>>
>>>>    <bb 4> [local count: 118111601]:
>>>>   # c_12 = PHI <0(2), c_3(3)>
>>>>
>>>> I assume that the 1 in b_4(D) != 0 ? _9 : 1 is OK (?) because the
>>>> latch executing zero times for b_4 == 0 means that the body will
>>>> execute once.
>>> ISTM that DOM ought to have simplified the conditional, unless there's
>>> some other way to get to bb3.  We know that b_4 is nonzero and thus c_3
>>> must have the value _9.
>> As of now, dom is not optimizing it. With the attached hack, it can be made 
>> to.
> What's strange is I'm not getting the c_3 = (b_4 != 0) ... in any of the
> dumps I'm looking at.  Instead it's c_3 = _9, which is what I would
> expect since we know that b_4 != 0
>
>
> My tests have been on x86_64 and aarch64 linux targets.  I've tried with
> patch#1 installed as well as with patch #1 and patch #2 together.
>
> What target, what flags and what patches do I need to see this?
You need patch 1 (attached) to get that.  With patch 2 in this
series, it will be optimized.

I haven't committed the patches yet, as I am still testing all three.
I will commit after testing on current trunk.

Thanks,
Kugan


>
> Jeff
From 12263df77931aa55d205b9db470436848d762684 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 22 Jun 2018 14:10:26 +1000
Subject: [PATCH 1/3] generate popcount when checked for zero

Change-Id: I951e6d487268b757cbdaa8dcf671ab1377490db6
---
 gcc/gimplify.c  |  2 +-
 gcc/gimplify.h  |  1 +
 gcc/testsuite/gcc.dg/tree-ssa/pr64183.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr85073.c |  2 +-
 gcc/tree-scalar-evolution.c | 12 
 5 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 48ac92e..c86ad1a 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -3878,7 +3878,7 @@ gimplify_pure_cond_expr (tree *expr_p, gimple_seq *pre_p)
EXPR is GENERIC, while tree_could_trap_p can be called
only on GIMPLE.  */
 
-static bool
+bool
 generic_expr_could_trap_p (tree expr)
 {
   unsigned i, n;
diff --git a/gcc/gimplify.h b/gcc/gimplify.h
index dd0e4c0..62ca869 100644
--- a/gcc/gimplify.h
+++ b/gcc/gimplify.h
@@ -83,6 +83,7 @@ extern enum gimplify_status gimplify_arg (tree *, gimple_seq *, location_t,
 extern void gimplify_function_tree (tree);
 extern enum gimplify_status gimplify_va_arg_expr (tree *, gimple_seq *,
 		  gimple_seq *);
+extern bool generic_expr_could_trap_p (tree expr);
 gimple *gimplify_assign (tree, tree, gimple_seq *);
 
 #endif /* GCC_GIMPLIFY_H */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr64183.c b/gcc/testsuite/gcc.dg/tree-ssa/pr64183.c
index 7a854fc..50d0c5a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr64183.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr64183.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O3 -fno-tree-vectorize -fdump-tree-cunroll-details" } */
+/* { dg-options "-O3 -fno-tree-vectorize -fdisable-tree-sccp -fdump-tree-cunroll-details" } */
 
 int bits;
 unsigned int size;
diff --git a/gcc/testsuite/gcc.target/i386/pr85073.c b/gcc/testsuite/gcc.target/i386/pr85073.c
index 187102d..71a5d23 100644
--- a/gcc/testsuite/gcc.target/i386/pr85073.c
+++ b/gcc/testsuite/gcc.target/i386/pr85073.c
@@ -1,6 +1,6 @@
 /* PR target/85073 */
 /* { dg-do compile } */
-/* { dg-options "-O2 -mbmi" } */
+/* { dg-options "-O2 -mbmi -fdisable-tree-sccp" } */
 
 int
 foo (unsigned x)
diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 4b0ec02..8e29005 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -3508,6 +3508,18 @@ expression_expensive_p (tree expr)
   return false;
 }
 
+  if (code == COND_EXPR)
+return (expression_expensive_p (TREE_OPERAND (expr, 0))
+	|| (EXPR_P (TREE_OP

Re: [PATCH 1/3][POPCOUNT] Handle COND_EXPR in expression_expensive_p

2018-07-05 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.

On 28 June 2018 at 21:26, Richard Biener  wrote:
> On Wed, Jun 27, 2018 at 7:00 AM Kugan Vivekanandarajah
>  wrote:
>>
>> Hi Richard,
>>
>> Thanks for the review.
>>
>> On 25 June 2018 at 20:01, Richard Biener  wrote:
>> > On Fri, Jun 22, 2018 at 11:13 AM Kugan Vivekanandarajah
>> >  wrote:
>> >>
>> >> [PATCH 1/3][POPCOUNT] Handle COND_EXPR in expression_expensive_p
>> >
>> > This says that COND_EXPR itself isn't expensive.  I think we should
>> > constrain that a bit.
>> > I think a good default would be to only allow a single COND_EXPR which
>> > you can achieve
>> > by adding a bool in_cond_expr_p = false argument to the function, pass
>> > in_cond_expr_p
>> > down and pass true down from the COND_EXPR handling itself.
>> >
>> > I'm not sure if we should require either COND_EXPR arm (operand 1 or
>> > 2) to be constant
>> > or !EXPR_P (then multiple COND_EXPRs might be OK).
>> >
>> > The main idea is to avoid evaluating many expressions but only
>> > choosing one in the end.
>> >
>> > The simplest patch achieving that is sth like
>> >
>> > +  if (code == COND_EXPR)
>> > +return (expression_expensive_p (TREE_OPERAND (expr, 0))
>> >   || (EXPR_P (TREE_OPERAND (expr, 1)) && EXPR_P
>> > (TREE_OPERAND (expr, 2)))
>> > +   || expression_expensive_p (TREE_OPERAND (expr, 1))
>> > +   || expression_expensive_p (TREE_OPERAND (expr, 2)));
>> >
>> > OK with that change.
>>
>> Is || (EXPR_P (TREE_OPERAND (expr, 1)) || EXPR_P (TREE_OPERAND (expr,
>> 2))) slightly better ?
>> Attaching  with the change. Is this OK?
>
> Well, it won't allow x != 0 ? popcount (x) : 1 because popcount(x) is 
> CALL_EXPR.
>
>>
>>
>> Because, for pr81661.c, we now allow the following as not expensive:
>>   [debug_tree dump of the COND_EXPR snipped -- mangled beyond recovery in the list archive]
>>
>> Which also leads to an ICE in gimplify_modify_expr. I think this is a
>> latent issue and I am trying to find the source
>
> Well, I think that's because some COND_EXPRs only gimplify to
> conditional code.  See gimplify_cond_expr:
>
>   if (gimplify_ctxp->allow_rhs_cond_expr
>   /* If either branch has side effects or could trap, it can't be
>  evaluated unconditionally.  */
>   && !TREE_SIDE_EFFECTS (then_)
>   && !generic_expr_could_trap_p (then_)
>   && !TREE_SIDE_EFFECTS (else_)
>   && !generic_expr_could_trap_p (else_))
> return gimplify_pure_cond_expr (expr_p, pre_p);
>
> so we probably have to treat TREE_SIDE_EFFECTS / generic_expr_could_trap_p as
> "expensive" as well for the purpose of final value replacement unless we are
> going to support a code-generation way different from gimplification.

Is the attached patch, which does this, OK?  I had to fix a couple of
testcases: the final value replacement now removes the loop in
pr64183.c, and pr85073.c matches the popcount pattern, so I disabled
the sccp pass for both (-fdisable-tree-sccp).
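
To make the "could trap" case concrete, this is the kind of COND_EXPR
arm the check guards against (an invented example, not from the patch):

/* Sketch: gimplifying this as straight-line code would evaluate
   100 / x unconditionally, which may trap when x == 0, so such an
   arm must be treated as expensive (and left un-replaced).  */
int
cond_with_trapping_arm (int x)
{
  return x != 0 ? 100 / x : 0;
}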

Re: [PATCH 3/3][POPCOUNT] Remove unnecessary if condition in phiopt

2018-07-01 Thread Kugan Vivekanandarajah
Hi Richard,

On 29 June 2018 at 18:45, Richard Biener  wrote:
> On Wed, Jun 27, 2018 at 7:09 AM Kugan Vivekanandarajah
>  wrote:
>>
>> Hi Richard,
>>
>> Thanks for the review,
>>
>> On 25 June 2018 at 20:20, Richard Biener  wrote:
>> > On Fri, Jun 22, 2018 at 11:16 AM Kugan Vivekanandarajah
>> >  wrote:
>> >>
>> >> gcc/ChangeLog:
>> >
>> > @@ -1516,6 +1521,114 @@ minmax_replacement (basic_block cond_bb,
>> > basic_block middle_bb,
>> >
>> >return true;
>> >  }
>> > +/* Convert
>> > +
>> > +   <bb 2>:
>> > +   if (b_4(D) != 0)
>> > +   goto <bb 3>;
>> >
>> > vertical space before the comment.
>> Done.
>>
>> >
>> > + edge e0 ATTRIBUTE_UNUSED, edge e1
>> > ATTRIBUTE_UNUSED,
>> >
>> > why pass them if they are unused?
>> Removed.
>>
>> >
>> > +  if (stmt_count != 2)
>> > +return false;
>> >
>> > so what about the case when there is no conversion?
>> Done.
>>
>> >
>> > +  /* Check that we have a popcount builtin.  */
>> > +  if (!is_gimple_call (popcount)
>> > +  || !gimple_call_builtin_p (popcount, BUILT_IN_NORMAL))
>> > +return false;
>> > +  tree fndecl = gimple_call_fndecl (popcount);
>> > +  if ((DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNT)
>> > +  && (DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNTL)
>> > +  && (DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNTLL))
>> > +return false;
>> >
>> > look at popcount handling in tree-vrp.c how to properly also handle
>> > IFN_POPCOUNT.
>> > (CASE_CFN_POPCOUNT)
>> Done.
>> >
>> > +  /* Cond_bb has a check for b_4 != 0 before calling the popcount
>> > + builtin.  */
>> > +  if (gimple_code (cond) != GIMPLE_COND
>> > +  || gimple_cond_code (cond) != NE_EXPR
>> > +  || TREE_CODE (gimple_cond_lhs (cond)) != SSA_NAME
>> > +  || rhs != gimple_cond_lhs (cond))
>> > +return false;
>> >
>> > The check for SSA_NAME is redundant.
>> > You fail to check that gimple_cond_rhs is zero.
>> Done.
>>
>> >
>> > +  /* Remove the popcount builtin and cast stmt.  */
>> > +  gsi = gsi_for_stmt (popcount);
>> > +  gsi_remove (&gsi, true);
>> > +  gsi = gsi_for_stmt (cast);
>> > +  gsi_remove (&gsi, true);
>> > +
>> > +  /* And insert the popcount builtin and cast stmt before the cond_bb.  */
>> > +  gsi = gsi_last_bb (cond_bb);
>> > +  gsi_insert_before (&gsi, popcount, GSI_NEW_STMT);
>> > +  gsi_insert_before (&gsi, cast, GSI_NEW_STMT);
>> >
>> > use gsi_move_before ().  You need to reset flow sensitive info on the
>> > LHS of the popcount call as well as on the LHS of the cast.
>> Done.
>>
>> >
>> > You fail to check the PHI operand on the false edge.  Consider
>> >
>> >  if (b != 0)
>> >    res = __builtin_popcount (b);
>> >  else
>> >    res = 1;
>> >
>> > You fail to check the PHI operand on the true edge.  Consider
>> >
>> >  res = 0;
>> >  if (b != 0)
>> >    {
>> >      __builtin_popcount (b);
>> >      res = 2;
>> >    }
>> >
>> > and using -fno-tree-dce and whatever you need to keep the
>> > popcount call in the IL.  A gimple testcase for phiopt will do.
>> >
>> > Your testcase relies on popcount detection.  Please write it
>> > using __builtin_popcount instead.  Write one with a cast and
>> > one without.
>> Added the testcases.
>>
Is this OK now?
>
>> > +  for (gsi = gsi_start_bb (middle_bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +{
>
> use gsi_after_labels (middle_bb)
>
> +  popcount = last_stmt (middle_bb);
> +  if (popcount == NULL)
> +return false;
>
> after the counting this test is always false, remove it.
>
> +  /* We have a cast stmt feeding popcount builtin.  */
> +  cast = first_stmt (middle_bb);
>
> looking at the implementation of first_stmt this will
> give you a label in case the BB has one.  I think it's better
> to merge this and the above with the "counting" like
>
> gsi = gsi_start_nondebug_after_labels_bb (middle_bb);
> if (gsi_end_p (gsi))
>   return false;
> cast = gsi_stmt (gsi);
> gsi_next_nondebug (&gsi);
> if (!gsi_end_p (gsi))
>   {
> popcount = g

Re: [ABSU_EXPR] Add some of the missing patterns in match,pd

2018-06-28 Thread Kugan Vivekanandarajah
Hi Marc,

Thanks for the review.

On 28 June 2018 at 14:11, Marc Glisse  wrote:
> (why is there no mention of ABSU_EXPR in doc/* ?)

I will fix this in a separate patch.
>
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -571,10 +571,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
> (copysigns (op @0) @1)
> (copysigns @0 @1
>
> -/* abs(x)*abs(x) -> x*x.  Should be valid for all types.  */
> -(simplify
> - (mult (abs@1 @0) @1)
> - (mult @0 @0))
> +/* abs(x)*abs(x) -> x*x.  Should be valid for all types.
> +   also for absu(x)*absu(x) -> x*x.  */
> +(for op (abs absu)
> + (simplify
> +  (mult (op@1 @0) @1)
> +  (mult @0 @0)))
>
> 1) the types are wrong, it should be (convert (mult @0 @0)), or maybe
> view_convert if vectors also use absu.
> 2) this is only valid with -fwrapv (TYPE_OVERFLOW_WRAPS(TREE_TYPE(@0))),
> otherwise you are possibly replacing a multiplication on unsigned with a
> multiplication on signed that may overflow. So maybe it is actually supposed
> to be
> (mult (convert @0) (convert @0))
Indeed, I missed it. I have changed it in the attached patch.
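
For the record, a small sketch (not part of the patch) of why squaring
the converted operand is safe: absu yields the absolute value in the
unsigned type, and modular arithmetic gives (-X) * (-X) == X * X, so
absu (x) * absu (x) equals (unsigned) x * (unsigned) x even for
x == INT_MIN, where a signed x * x would be undefined:

/* ABSU_EXPR sketched in C: signed operand, unsigned result.  */
static unsigned int
absu (int x)
{
  return x < 0 ? -(unsigned int) x : (unsigned int) x;
}

/* Equal to absu (x) * absu (x) modulo 2^N; never undefined.  */
static unsigned int
absu_square (int x)
{
  return (unsigned int) x * (unsigned int) x;
}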
>
>  /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
>  (for coss (COS COSH)
> @@ -1013,15 +1015,24 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>&& tree_nop_conversion_p (type, TREE_TYPE (@2)))
>(bit_xor (convert @1) (convert @2
>
> -(simplify
> - (abs (abs@1 @0))
> - @1)
> -(simplify
> - (abs (negate @0))
> - (abs @0))
> -(simplify
> - (abs tree_expr_nonnegative_p@0)
> - @0)
> +/* Convert abs (abs (X)) into abs (X).
> +   also absu (absu (X)) into absu (X).  */
> +(for op (abs absu)
> + (simplify
> +  (op (op@1 @0))
> +  @1))
>
> You cannot have (absu (absu @0)) since absu takes a signed integer and
> returns an unsigned one. (absu (abs @0)) may indeed be equivalent to
> (convert (abs @0)). Possibly (op (convert (absu @0))) could also be
> simplified if convert is a NOP.
>
> +/* Convert abs[u] (-X) -> abs[u] (X).  */
> +(for op (abs absu)
> + (simplify
> +  (op (negate @0))
> +  (op @0)))
> +
> +/* Convert abs[u] (X)  where X is nonnegative -> (X).  */
> +(for op (abs absu)
> + (simplify
> +  (op tree_expr_nonnegative_p@0)
> +  @0))
>
> Missing convert again.
>
> Where are the testcases?
I have fixed the above and added test-cases.

>
>> Bootstrap and regression testing on x86_64-linux-gnu. Is this OK if no
>> regressions.
>
>
> Does it mean you have run the tests or intend to run them in the future? It
> is confusing.
Sorry for not being clear.

I have bootstrapped and regression tested with no new regressions.

Is this OK?

Thanks,
Kugan

>
> --
> Marc Glisse
From a2606af5d9ed814cbbbce19dafbb780405ca8a06 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 29 Jun 2018 10:52:46 +1000
Subject: [PATCH] add absu patterns V2

Change-Id: I002312bf23a5fb225b7ff18e98ad22822baa6bef
---
 gcc/match.pd   | 23 +++
 gcc/testsuite/gcc.dg/gimplefe-30.c | 16 
 gcc/testsuite/gcc.dg/gimplefe-31.c | 16 
 gcc/testsuite/gcc.dg/gimplefe-32.c | 14 ++
 gcc/testsuite/gcc.dg/gimplefe-33.c | 16 
 5 files changed, 85 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/gimplefe-30.c
 create mode 100644 gcc/testsuite/gcc.dg/gimplefe-31.c
 create mode 100644 gcc/testsuite/gcc.dg/gimplefe-32.c
 create mode 100644 gcc/testsuite/gcc.dg/gimplefe-33.c

diff --git a/gcc/match.pd b/gcc/match.pd
index c1e0963..7959787 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -576,6 +576,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (mult (abs@1 @0) @1)
  (mult @0 @0))
 
+/* Convert absu(x)*absu(x) -> x*x.  */
+(simplify
+ (mult (absu@1 @0) @1)
+ (mult (convert @0) (convert @0)))
+
 /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
 (for coss (COS COSH)
  copysigns (COPYSIGN)
@@ -1013,16 +1018,34 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   && tree_nop_conversion_p (type, TREE_TYPE (@2)))
   (bit_xor (convert @1) (convert @2
 
+/* Convert abs (abs (X)) into abs (X).
+   also absu (absu (X)) into absu (X).  */
 (simplify
  (abs (abs@1 @0))
  @1)
+
+(simplify
+ (absu (nop_convert (absu@1 @0)))
+ @1)
+
+/* Convert abs[u] (-X) -> abs[u] (X).  */
 (simplify
  (abs (negate @0))
  (abs @0))
+
+(simplify
+ (absu (negate @0))
+ (absu @0))
+
+/* Convert abs[u] (X)  where X is nonnegative -> (X).  */
 (simplify
  (abs tree_expr_nonnegative_p@0)
  @0)
 
+(simplify
+ (absu tree_expr_nonnegative_p@0)
+ (convert @0))
+
 /* A few cases of fold-const.c negate_expr_p predicate.  */
 (match negate_expr_p
  INTEGER_CST
diff --git a/gcc/testsuite/gcc.dg/gimplefe-30.c b/gcc/testsuite/gcc.dg/gimplefe-30.c
new file mode 100644
index 000..6c25106
--- /dev/null
++

[ABSU_EXPR] Add some of the missing patterns in match,pd

2018-06-27 Thread Kugan Vivekanandarajah
Hi,

This patch adds some of the missing patterns in match.pd for ABSU_EXPR.
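
For reference, ABSU_EXPR takes a signed operand and produces its
absolute value in the corresponding unsigned type.  A sketch of the
folds below in executable C (invented helper name; assumes 32-bit int):

#include <assert.h>
#include <limits.h>

static unsigned int
absu (int x)
{
  return x < 0 ? -(unsigned int) x : (unsigned int) x;
}

int
main (void)
{
  assert (absu (-5) == absu (5));          /* absu (-X) -> absu (X) */
  assert (absu (7) == 7u);                 /* nonnegative X -> (convert X) */
  assert (absu (INT_MIN) == 2147483648u);  /* well defined, unlike abs */
  return 0;
}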

Bootstrapped and regression tested on x86_64-linux-gnu.  Is this OK if
there are no regressions?

Thanks,
Kugan

gcc/ChangeLog:

2018-06-28  Kugan Vivekanandarajah  

* match.pd (absu(x)*absu(x) -> x*x): Handle.
(absu(absu(X)) -> absu(X)): Likewise.
(absu(-X) -> absu(X)): Likewise.
(absu(X)  where X is nonnegative -> X): Likewise.
From 374ee7928039c16cb091bd02d5efd4c493aab86e Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Mon, 18 Jun 2018 10:51:06 +1000
Subject: [PATCH] add absu patterns

Change-Id: Ied504be83f00041a6c815d23e16a394b71445f27
---
 gcc/match.pd | 37 -
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/gcc/match.pd b/gcc/match.pd
index c1e0963..a356a92 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -571,10 +571,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(copysigns (op @0) @1)
(copysigns @0 @1
 
-/* abs(x)*abs(x) -> x*x.  Should be valid for all types.  */
-(simplify
- (mult (abs@1 @0) @1)
- (mult @0 @0))
+/* abs(x)*abs(x) -> x*x.  Should be valid for all types.
+   also for absu(x)*absu(x) -> x*x.  */
+(for op (abs absu)
+ (simplify
+  (mult (op@1 @0) @1)
+  (mult @0 @0)))
 
 /* cos(copysign(x, y)) -> cos(x).  Similarly for cosh.  */
 (for coss (COS COSH)
@@ -1013,15 +1015,24 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   && tree_nop_conversion_p (type, TREE_TYPE (@2)))
   (bit_xor (convert @1) (convert @2
 
-(simplify
- (abs (abs@1 @0))
- @1)
-(simplify
- (abs (negate @0))
- (abs @0))
-(simplify
- (abs tree_expr_nonnegative_p@0)
- @0)
+/* Convert abs (abs (X)) into abs (X).
+   also absu (absu (X)) into absu (X).  */
+(for op (abs absu)
+ (simplify
+  (op (op@1 @0))
+  @1))
+
+/* Convert abs[u] (-X) -> abs[u] (X).  */
+(for op (abs absu)
+ (simplify
+  (op (negate @0))
+  (op @0)))
+
+/* Convert abs[u] (X)  where X is nonnegative -> (X).  */
+(for op (abs absu)
+ (simplify
+  (op tree_expr_nonnegative_p@0)
+  @0))
 
 /* A few cases of fold-const.c negate_expr_p predicate.  */
 (match negate_expr_p
-- 
2.7.4



Re: [PATCH 3/3][POPCOUNT] Remove unnecessary if condition in phiopt

2018-06-26 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review,

On 25 June 2018 at 20:20, Richard Biener  wrote:
> On Fri, Jun 22, 2018 at 11:16 AM Kugan Vivekanandarajah
>  wrote:
>>
>> gcc/ChangeLog:
>
> @@ -1516,6 +1521,114 @@ minmax_replacement (basic_block cond_bb,
> basic_block middle_bb,
>
>return true;
>  }
> +/* Convert
> +
> +   <bb 2>:
> +   if (b_4(D) != 0)
> +   goto <bb 3>;
>
> vertical space before the comment.
Done.

>
> + edge e0 ATTRIBUTE_UNUSED, edge e1
> ATTRIBUTE_UNUSED,
>
> why pass them if they are unused?
Removed.

>
> +  if (stmt_count != 2)
> +return false;
>
> so what about the case when there is no conversion?
Done.

>
> +  /* Check that we have a popcount builtin.  */
> +  if (!is_gimple_call (popcount)
> +  || !gimple_call_builtin_p (popcount, BUILT_IN_NORMAL))
> +return false;
> +  tree fndecl = gimple_call_fndecl (popcount);
> +  if ((DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNT)
> +  && (DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNTL)
> +  && (DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNTLL))
> +return false;
>
> look at popcount handling in tree-vrp.c how to properly also handle
> IFN_POPCOUNT.
> (CASE_CFN_POPCOUNT)
Done.
>
> +  /* Cond_bb has a check for b_4 != 0 before calling the popcount
> + builtin.  */
> +  if (gimple_code (cond) != GIMPLE_COND
> +  || gimple_cond_code (cond) != NE_EXPR
> +  || TREE_CODE (gimple_cond_lhs (cond)) != SSA_NAME
> +  || rhs != gimple_cond_lhs (cond))
> +return false;
>
> The check for SSA_NAME is redundant.
> You fail to check that gimple_cond_rhs is zero.
Done.

>
> +  /* Remove the popcount builtin and cast stmt.  */
> +  gsi = gsi_for_stmt (popcount);
> +  gsi_remove (&gsi, true);
> +  gsi = gsi_for_stmt (cast);
> +  gsi_remove (&gsi, true);
> +
> +  /* And insert the popcount builtin and cast stmt before the cond_bb.  */
> +  gsi = gsi_last_bb (cond_bb);
> +  gsi_insert_before (&gsi, popcount, GSI_NEW_STMT);
> +  gsi_insert_before (&gsi, cast, GSI_NEW_STMT);
>
> use gsi_move_before ().  You need to reset flow sensitive info on the
> LHS of the popcount call as well as on the LHS of the cast.
Done.

>
> You fail to check the PHI operand on the false edge.  Consider
>
>  if (b != 0)
>    res = __builtin_popcount (b);
>  else
>    res = 1;
>
> You fail to check the PHI operand on the true edge.  Consider
>
>  res = 0;
>  if (b != 0)
>    {
>      __builtin_popcount (b);
>      res = 2;
>    }
>
> and using -fno-tree-dce and whatever you need to keep the
> popcount call in the IL.  A gimple testcase for phiopt will do.
>
> Your testcase relies on popcount detection.  Please write it
> using __builtin_popcount instead.  Write one with a cast and
> one without.
Added the testcases.

Is this OK now?

Thanks,
Kugan
>
> Thanks,
> Richard.
>
>
>> 2018-06-22  Kugan Vivekanandarajah  
>>
>> * tree-ssa-phiopt.c (cond_removal_in_popcount_pattern): New.
>> (tree_ssa_phiopt_worker): Call cond_removal_in_popcount_pattern.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2018-06-22  Kugan Vivekanandarajah  
>>
>> * gcc.dg/tree-ssa/popcount3.c: New test.
From 5b776871d99161653dbb7a7fc353268ab3be6880 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 22 Jun 2018 14:16:21 +1000
Subject: [PATCH 3/3] improve phiopt for builtin popcount

Change-Id: Iab8861cc15590cc2603be9967ca9477a223c90a8
---
 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-16.c |  12 +++
 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-17.c |  12 +++
 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-18.c |  14 
 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-19.c |  15 
 gcc/testsuite/gcc.dg/tree-ssa/popcount3.c  |  15 
 gcc/tree-ssa-phiopt.c  | 127 +
 6 files changed, 195 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-16.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-17.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-18.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/phi-opt-19.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/popcount3.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-16.c b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-16.c
new file mode 100644
index 000..e7a2711
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-16.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int foo (int a)
+{
+  int c = 0;
+  if (a != 0)
+c = __builtin_popcount (a);
+  return c;
+}
+
+/* { dg-final { scan-tree-dump-not "if" "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/phi-opt-17.c b/gcc/testsuite/gcc.dg/tr

Re: [PATCH 2/3][POPCOUNT] Check if zero check is done before entering the loop

2018-06-26 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.

On 25 June 2018 at 20:02, Richard Biener  wrote:
> On Fri, Jun 22, 2018 at 11:14 AM Kugan Vivekanandarajah
>  wrote:
>>
>> gcc/ChangeLog:
>
> The canonical way is calling simplify_using_initial_conditions on the
> may_be_zero condition.
>
> Richard.
>
>> 2018-06-22  Kugan Vivekanandarajah  
>>
>> * tree-ssa-loop-niter.c (number_of_iterations_popcount): If popcount
>> argument is checked for zero before entering loop, avoid checking again.
Does the attached patch, which does this, look OK?
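
To spell out the intended effect (a sketch with invented names): when
the loop is only reached under a dominating non-zero guard, the
may_be_zero condition the niter analysis builds can be folded away:

/* may_be_zero = (b == 0) is simplified against the dominating
   b != 0 guard to boolean_false_node, so no runtime zero check
   survives in the final value replacement.  */
int
popcount_guarded (unsigned long b)
{
  int c = 0;
  if (b != 0)                 /* dominating initial condition */
    while (b)
      {
        b &= b - 1;           /* clears one set bit per iteration */
        c++;
      }
  return c;
}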

Thanks,
Kugan
From 78cb0ea3d058f1d1db73f259825b8bb07eb1ca30 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 22 Jun 2018 14:11:28 +1000
Subject: [PATCH 2/3] in niter don't check for zero when it is already checked

Change-Id: Ie94a35a1a3c2d8bdffd3dc54a94684c032efc7e0
---
 gcc/tree-ssa-loop-niter.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index f5ffc0f..be0cff5 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -2596,10 +2596,15 @@ number_of_iterations_popcount (loop_p loop, edge exit,
 
   niter->niter = iter;
   niter->assumptions = boolean_true_node;
+
   if (adjust)
-niter->may_be_zero = fold_build2 (EQ_EXPR, boolean_type_node, src,
+{
+  tree may_be_zero = fold_build2 (EQ_EXPR, boolean_type_node, src,
   build_zero_cst
   (TREE_TYPE (src)));
+  niter->may_be_zero =
+	simplify_using_initial_conditions (loop, may_be_zero);
+}
   else
 niter->may_be_zero = boolean_false_node;
 
-- 
2.7.4



Re: [PATCH 1/3][POPCOUNT] Handle COND_EXPR in expression_expensive_p

2018-06-26 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.

On 25 June 2018 at 20:01, Richard Biener  wrote:
> On Fri, Jun 22, 2018 at 11:13 AM Kugan Vivekanandarajah
>  wrote:
>>
>> [PATCH 1/3][POPCOUNT] Handle COND_EXPR in expression_expensive_p
>
> This says that COND_EXPR itself isn't expensive.  I think we should
> constrain that a bit.
> I think a good default would be to only allow a single COND_EXPR which
> you can achieve
> by adding a bool in_cond_expr_p = false argument to the function, pass
> in_cond_expr_p
> down and pass true down from the COND_EXPR handling itself.
>
> I'm not sure if we should require either COND_EXPR arm (operand 1 or
> 2) to be constant
> or !EXPR_P (then multiple COND_EXPRs might be OK).
>
> The main idea is to avoid evaluating many expressions but only
> choosing one in the end.
>
> The simplest patch achieving that is sth like
>
> +  if (code == COND_EXPR)
> +return (expression_expensive_p (TREE_OPERAND (expr, 0))
>   || (EXPR_P (TREE_OPERAND (expr, 1)) && EXPR_P
> (TREE_OPERAND (expr, 2)))
> +   || expression_expensive_p (TREE_OPERAND (expr, 1))
> +   || expression_expensive_p (TREE_OPERAND (expr, 2)));
>
> OK with that change.

Is || (EXPR_P (TREE_OPERAND (expr, 1)) || EXPR_P (TREE_OPERAND (expr,
2))) slightly better?
Attaching it with the change.  Is this OK?


Because, for pr81661.c, we now allow the following as not expensive:

  [debug_tree dump of the COND_EXPR snipped -- mangled beyond recovery in the list archive]

Which also leads to an ICE in gimplify_modify_expr. I think this is a
latent issue and I am trying to find the source.

the expr in gimplify_modify_expr is

  [debug_tree dump of the expression snipped -- mangled beyond recovery in the list archive]

and the *to_p is neither an SSA_NAME nor a VAR_DECL.

Thanks,
Kugan



>
> Richard.
>
>> gcc/ChangeLog:
>>
>> 2018-06-22  Kugan Vivekanandarajah  
>>
>> * tree-scalar-evolution.c (expression_expensive_p): Handle COND_EXPR.
From 8f59f05ad21ac218834547593a7f308b4f837b1e Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 22 Jun 2018 14:10:26 +1000
Subject: [PATCH 1/3] generate popcount when checked for zero

Change-Id: Iff93a1bd58d12e5e6951bc15ebf5db2ec2c85c2e
---
 gcc/tree-scalar-evolution.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 4b0ec02..d992582 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -3508,6 +3508,13 @@ expression_expensive_p (tree expr)
   return false;
 }
 
+  if (code == COND_EXPR)
+return (expression_expensive_p (TREE_OPERAND (expr, 0))
+	|| (EXPR_P (TREE_OPERAND (expr, 1))
+		|| EXPR_P (TREE_OPERAND (expr, 2)))
+	|| expression_expensive_p (TREE_OPERAND (expr, 1))
+	|| expression_expensive_p (TREE_OPERAND (expr, 2)));
+
   switch (TREE_CODE_CLASS (code))
 {
 case tcc_binary:
-- 
2.7.4



Re: [PATCH 0/3][POPCOUNT]

2018-06-24 Thread Kugan Vivekanandarajah
Hi Bin,

On 25 June 2018 at 13:56, Bin.Cheng  wrote:
> On Mon, Jun 25, 2018 at 11:37 AM, Kugan Vivekanandarajah
>  wrote:
>> Hi Bin,
>>
>> Thanks for your comments.
>>
>> On 25 June 2018 at 11:15, Bin.Cheng  wrote:
>>> On Fri, Jun 22, 2018 at 5:11 PM, Kugan Vivekanandarajah
>>>  wrote:
>>>> When we set niter with may_be_zero, currently final_value_replacement
>>>> will not happen because expression_expensive_p does not handle
>>>> COND_EXPR.  Patch 1 adds this.
>>>>
>>>> With that we have the following optimized gimple.
>>>>
>>>>    <bb 2> [local count: 118111601]:
>>>>   if (b_4(D) != 0)
>>>>     goto <bb 3>; [89.00%]
>>>>   else
>>>>     goto <bb 4>; [11.00%]
>>>>
>>>>    <bb 3> [local count: 105119324]:
>>>>   _2 = (unsigned long) b_4(D);
>>>>   _9 = __builtin_popcountl (_2);
>>>>   c_3 = b_4(D) != 0 ? _9 : 1;
>>>>
>>>>    <bb 4> [local count: 118111601]:
>>>>   # c_12 = PHI <0(2), c_3(3)>
>>>>
>>>> I assume that the 1 in b_4(D) != 0 ? _9 : 1 is OK (?) because when the
>>> No, it doesn't make much sense.  when b_4(D) == 0, the popcount
>>> computed should be 0.  Point is you can never get b_4(D) == 0 with
>>> guard condition in basic block 2.  So the result should simply be:
>>
>> When we calculate niter (for the copy-header case), we set
>> may_be_zero as follows (which I think is correct):
>> niter->may_be_zero = fold_build2 (EQ_EXPR, boolean_type_node, src,
>>   build_zero_cst
>>   (TREE_TYPE (src)));
>>
>> Then in final_value_replacement_loop (struct loop *loop)
>>
>> for the PHI stmt we are going to replace, we call
>> analyze_scalar_evolution_in_loop, which returns a POLYNOMIAL_CHREC.
>>
>> then we do
>> compute_overall_effect_of_inner_loop (struct loop *loop, tree evolution_fn)
>>
>> where chrec_apply, applied to the polynomial chrec with the niter from
>> popcount (which also carries may_be_zero), ends up giving the 1.
>> Looking at this, I am not sure this is wrong.  Maybe I am missing
>> something.
> I think it is wrong.  How could you get popcount == 1 when b_4(D) ==
> 0?  Though it never happens in this case.

We don't set popcount = 1. We set niter for the popcount pattern with
niter->may_be_zero = fold_build2 (EQ_EXPR, boolean_type_node, src,
  build_zero_cst (TREE_TYPE (src)));

Because of that, the niter we have in final_value_replacement is:
(gdb) p debug_tree (niter)
 
  [debug_tree dump of niter snipped -- mangled beyond recovery in the list archive]

Then from there we do compute_overall_effect_of_inner_loop on the
scalar evolution of the PHI with this niter, and we get the 1.

>>
>> In this testcase, before we enter the loop we have a check for (b_4(D)
>>> 0). Thus, setting niter->may_be_zero is not strictly necessary but
>> conservatively correct (?).
> Yes, but not necessarily.  Setting maybe_zero could confuse following
> optimizations and we should avoid doing that whenever possible.  If
> any pass goes wrong because it's not set conservatively, it is that
> pass' responsibility and should be fixed accordingly.  Here IMHO, we
> don't need to set it.

My patch 2 is for not setting this when we know b_4(D) is not
zero on this path.

Thanks,
Kugan




>
> Thanks,
> bin
>>
>> Thanks,
>> Kugan
>>
>>>
>>>>    <bb 2> [local count: 118111601]:
>>>>   if (b_4(D) != 0)
>>>>     goto <bb 3>; [89.00%]
>>>>   else
>>>>     goto <bb 4>; [11.00%]
>>>>
>>>>    <bb 3> [local count: 105119324]:
>>>>   _2 = (unsigned long) b_4(D);
>>>>   c_3 = __builtin_popcountl (_2);
>>>>
>>>>    <bb 4> [local count: 118111601]:
>>>>   # c_12 = PHI <0(2), c_3(3)>
>>>
>>> I think this is the code generated if maybe_zero is not set?  which it
>>> should not be set here.
>>> For the same reason, it can be further optimized into:
>>>
>>>>    <bb 2> [local count: 118111601]:
>

Re: [PATCH 0/3][POPCOUNT]

2018-06-24 Thread Kugan Vivekanandarajah
Hi Bin,

Thanks for your comments.

On 25 June 2018 at 11:15, Bin.Cheng  wrote:
> On Fri, Jun 22, 2018 at 5:11 PM, Kugan Vivekanandarajah
>  wrote:
>> When we set niter with may_be_zero, currently final_value_replacement
>> will not happen because expression_expensive_p does not handle
>> COND_EXPR.  Patch 1 adds this.
>>
>> With that we have the following optimized gimple.
>>
>>    <bb 2> [local count: 118111601]:
>>   if (b_4(D) != 0)
>>     goto <bb 3>; [89.00%]
>>   else
>>     goto <bb 4>; [11.00%]
>>
>>    <bb 3> [local count: 105119324]:
>>   _2 = (unsigned long) b_4(D);
>>   _9 = __builtin_popcountl (_2);
>>   c_3 = b_4(D) != 0 ? _9 : 1;
>>
>>    <bb 4> [local count: 118111601]:
>>   # c_12 = PHI <0(2), c_3(3)>
>>
>> I assume that the 1 in b_4(D) != 0 ? _9 : 1 is OK (?) because when the
> No, it doesn't make much sense.  when b_4(D) == 0, the popcount
> computed should be 0.  Point is you can never get b_4(D) == 0 with
> guard condition in basic block 2.  So the result should simply be:

When we calculate niter (for the copy-header case), we set
may_be_zero as follows (which I think is correct):
niter->may_be_zero = fold_build2 (EQ_EXPR, boolean_type_node, src,
  build_zero_cst
  (TREE_TYPE (src)));

Then, in final_value_replacement_loop (struct loop *loop),

for the PHI stmt we are going to replace we call
analyze_scalar_evolution_in_loop, which returns a POLYNOMIAL_CHREC.

Then we call
compute_overall_effect_of_inner_loop (struct loop *loop, tree evolution_fn),

where chrec_apply, applied to the polynomial chrec with the niter from
popcount (which also carries may_be_zero), ends up giving the 1.
Looking at this, I am not sure this is wrong.  Maybe I am missing something.
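
Spelling out where the 1 comes from (my reading of the above, so treat
it as a sketch):

  c evolves as {0, +, 1}: one increment per execution of the body.
  With the exit test at the bottom (header copied), the body runs
  latch-executions + 1 times, and

    niter       = popcount (b_4) - 1
    may_be_zero = (b_4 == 0)

  so applying the chrec yields

    c = (b_4 == 0) ? 0 + 1 : (popcount (b_4) - 1) + 1

  which is exactly the  b_4(D) != 0 ? _9 : 1  in the dump.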

In this testcase, before we enter the loop we have a check for (b_4(D)
!= 0). Thus, setting niter->may_be_zero is not strictly necessary but
conservatively correct (?).

Thanks,
Kugan

>
>>    <bb 2> [local count: 118111601]:
>>   if (b_4(D) != 0)
>>     goto <bb 3>; [89.00%]
>>   else
>>     goto <bb 4>; [11.00%]
>>
>>    <bb 3> [local count: 105119324]:
>>   _2 = (unsigned long) b_4(D);
>>   c_3 = __builtin_popcountl (_2);
>>
>>    <bb 4> [local count: 118111601]:
>>   # c_12 = PHI <0(2), c_3(3)>
>
> I think this is the code generated if maybe_zero is not set?  which it
> should not be set here.
> For the same reason, it can be further optimized into:
>
>>    <bb 2> [local count: 118111601]:
>>   _2 = (unsigned long) b_4(D);
>>   c_12 = __builtin_popcountl (_2);
>>
>
>> latch executing zero times for b_4 == 0 means that the body will
>> execute once.
> You never get a zero-times latch here with the aforementioned guard condition.
>
> BTW, I didn't look at following patches which could be wanted optimizations.
>
> Thanks,
> bin
>>
>> The issue here is, since we are checking if (b_4(D) != 0) before
>> entering the loop means we don't need to set maybe_zero. Patch 2
>> handles this.
>>
>> With that we have
>>    <bb 2> [local count: 118111601]:
>>   if (b_4(D) != 0)
>>     goto <bb 3>; [89.00%]
>>   else
>>     goto <bb 4>; [11.00%]
>>
>>    <bb 3> [local count: 105119324]:
>>   _2 = (unsigned long) b_4(D);
>>   _9 = __builtin_popcountl (_2);
>>
>>    <bb 4> [local count: 118111601]:
>>   # c_12 = PHI <0(2), _9(3)>
>>
>> As advised earlier, patch 3 adds phiopt support to remove this.
>>
>> Bootstrap and regression testing are ongoing.
>>
>> Is this OK for trunk.
>>
>> Thanks,
>> Kugan


Re: [PATCH 0/3][POPCOUNT]

2018-06-24 Thread Kugan Vivekanandarajah
Hi Jeff,

Thanks for the comments.

On 23 June 2018 at 02:06, Jeff Law  wrote:
> On 06/22/2018 03:11 AM, Kugan Vivekanandarajah wrote:
>> When we set niter with may_be_zero, currently final_value_replacement
>> will not happen because expression_expensive_p does not handle
>> COND_EXPR.  Patch 1 adds this.
>>
>> With that we have the following optimized gimple.
>>
>>    <bb 2> [local count: 118111601]:
>>   if (b_4(D) != 0)
>>     goto <bb 3>; [89.00%]
>>   else
>>     goto <bb 4>; [11.00%]
>>
>>    <bb 3> [local count: 105119324]:
>>   _2 = (unsigned long) b_4(D);
>>   _9 = __builtin_popcountl (_2);
>>   c_3 = b_4(D) != 0 ? _9 : 1;
>>
>>    <bb 4> [local count: 118111601]:
>>   # c_12 = PHI <0(2), c_3(3)>
>>
>> I assume that the 1 in b_4(D) != 0 ? _9 : 1 is OK (?) because the
>> latch executing zero times for b_4 == 0 means that the body will
>> execute once.
> ISTM that DOM ought to have simplified the conditional, unless there's
> some other way to get to bb3.  We know that b_4 is nonzero and thus c_3
> must have the value _9.
As of now, dom is not optimizing it. With the attached hack, it can be made to.

>
>
>>
>> The issue here is, since we are checking if (b_4(D) != 0) before
>> entering the loop means we don't need to set maybe_zero. Patch 2
>> handles this.
>>
>> With that we have
>>    <bb 2> [local count: 118111601]:
>>   if (b_4(D) != 0)
>>     goto <bb 3>; [89.00%]
>>   else
>>     goto <bb 4>; [11.00%]
>>
>>    <bb 3> [local count: 105119324]:
>>   _2 = (unsigned long) b_4(D);
>>   _9 = __builtin_popcountl (_2);
>>
>>    <bb 4> [local count: 118111601]:
>>   # c_12 = PHI <0(2), _9(3)>
>>
>> As advised earlier, patch 3 adds phiopt support to remove this.
> So if DOM or some other pass fixed up the assignment to c_3 I'd hope we
> wouldn't set maybe_zero.
>
> So I'd like to start by first determining if we should have already
> simplified the COND_EXPR in bb3.  Do you have a testcase for that?
Sorry, it is hidden in patch 3 (attached now). You will need patch 1 as well.

Thanks,
Kugan

>
>
> jeff
diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c
index a6f176c..77ae7d1b 100644
--- a/gcc/tree-ssa-dom.c
+++ b/gcc/tree-ssa-dom.c
@@ -1991,6 +1991,25 @@ dom_opt_dom_walker::optimize_stmt (basic_block bb, gimple_stmt_iterator si)
}
}
 
+  if (gimple_code (stmt) == GIMPLE_ASSIGN
+ && gimple_assign_rhs_code (stmt) == COND_EXPR)
+   {
+ /* If this is a conditional stmt, see if we can optimize the
+condition.  */
+ x_vr_values = evrp_range_analyzer.get_vr_values ();
+ tree exp = gimple_assign_rhs1 (stmt);
+ tree lhs = TREE_OPERAND (exp, 0);
+ tree rhs = TREE_OPERAND (exp, 1);
+ tree val = x_vr_values->vrp_evaluate_conditional (TREE_CODE (exp),
+   lhs, rhs, stmt);
+ if (is_gimple_min_invariant (val))
+   {
+ gimple_assign_set_rhs1 (stmt, val);
+ update_stmt (stmt);
+   }
+ x_vr_values = NULL;
+   }
+
   if (gimple_code (stmt) == GIMPLE_COND)
{
  tree lhs = gimple_cond_lhs (stmt);
int PopCount (long b) {
int c = 0;

while (b) {
	b &= b - 1;
	c++;
}
return c;
}


[PATCH 3/3][POPCOUNT] Remove unnecessary if condition in phiopt

2018-06-22 Thread Kugan Vivekanandarajah
gcc/ChangeLog:

2018-06-22  Kugan Vivekanandarajah  

* tree-ssa-phiopt.c (cond_removal_in_popcount_pattern): New.
(tree_ssa_phiopt_worker): Call cond_removal_in_popcount_pattern.

gcc/testsuite/ChangeLog:

2018-06-22  Kugan Vivekanandarajah  

* gcc.dg/tree-ssa/popcount3.c: New test.
From fa2cca6b186b70668a3334c23ea4b906dac454d4 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 22 Jun 2018 14:16:21 +1000
Subject: [PATCH 3/3] improve phiopt for builtin popcount

Change-Id: Id1a5997c78fc3ceded3ed7fb0c544ce2bd1a2b34
---
 gcc/testsuite/gcc.dg/tree-ssa/popcount3.c |  15 
 gcc/tree-ssa-phiopt.c | 113 ++
 2 files changed, 128 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/popcount3.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c b/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c
new file mode 100644
index 000..293beb9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/popcount3.c
@@ -0,0 +1,15 @@
+/* { dg-do run } */
+/* { dg-options "-O2 -fdump-tree-phiopt3 -fdump-tree-optimized" } */
+
+int PopCount (long b) {
+int c = 0;
+
+while (b) {
+	b &= b - 1;
+	c++;
+}
+return c;
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_popcount" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "if" 0 "phiopt3" } } */
diff --git a/gcc/tree-ssa-phiopt.c b/gcc/tree-ssa-phiopt.c
index 8e94f6a..1db5226 100644
--- a/gcc/tree-ssa-phiopt.c
+++ b/gcc/tree-ssa-phiopt.c
@@ -57,6 +57,8 @@ static bool minmax_replacement (basic_block, basic_block,
 edge, edge, gimple *, tree, tree);
 static bool abs_replacement (basic_block, basic_block,
 			 edge, edge, gimple *, tree, tree);
+static bool cond_removal_in_popcount_pattern (basic_block, basic_block,
+	  edge, edge, gimple *, tree, tree);
 static bool cond_store_replacement (basic_block, basic_block, edge, edge,
 hash_set<gimple *> *);
 static bool cond_if_else_store_replacement (basic_block, basic_block, basic_block);
@@ -332,6 +334,9 @@ tree_ssa_phiopt_worker (bool do_store_elim, bool do_hoist_loads)
 	cfgchanged = true;
 	  else if (abs_replacement (bb, bb1, e1, e2, phi, arg0, arg1))
 	cfgchanged = true;
+	  else if (cond_removal_in_popcount_pattern (bb, bb1, e1, e2,
+		 phi, arg0, arg1))
+	cfgchanged = true;
 	  else if (minmax_replacement (bb, bb1, e1, e2, phi, arg0, arg1))
 	cfgchanged = true;
 	}
@@ -1516,6 +1521,114 @@ minmax_replacement (basic_block cond_bb, basic_block middle_bb,
 
   return true;
 }
+/* Convert
+
+   <bb 2>:
+   if (b_4(D) != 0)
+     goto <bb 3>;
+   else
+     goto <bb 4>;
+
+   <bb 3>:
+   _2 = (unsigned long) b_4(D);
+   _9 = __builtin_popcountl (_2);
+
+   <bb 4>:
+   c_12 = PHI <0(2), _9(3)>
+
+   Into
+
+   <bb 2>:
+   _2 = (unsigned long) b_4(D);
+   _9 = __builtin_popcountl (_2);
+
+   <bb 4>:
+   c_12 = PHI <_9(2)>
+*/
+
+static bool
+cond_removal_in_popcount_pattern (basic_block cond_bb, basic_block middle_bb,
+  edge e0 ATTRIBUTE_UNUSED, edge e1 ATTRIBUTE_UNUSED,
+  gimple *phi, tree arg0, tree arg1)
+{
+  gimple *cond;
+  gimple_stmt_iterator gsi;
+  gimple *popcount;
+  gimple *cast;
+  tree rhs, lhs, arg;
+  unsigned stmt_count = 0;
+
+  /* Check that
+   _2 = (unsigned long) b_4(D);
+   _9 = __builtin_popcountl (_2);
+   are the only stmts in the middle_bb.  */
+
+  for (gsi = gsi_start_bb (middle_bb); !gsi_end_p (gsi); gsi_next (&gsi))
+{
+  gimple *stmt = gsi_stmt (gsi);
+  if (is_gimple_debug (stmt))
+	continue;
+  stmt_count++;
+}
+  if (stmt_count != 2)
+return false;
+
+  cast = first_stmt (middle_bb);
+  popcount = last_stmt (middle_bb);
+  if (popcount == NULL || cast == NULL)
+return false;
+
+  /* Check that we have a popcount builtin.  */
+  if (!is_gimple_call (popcount)
+  || !gimple_call_builtin_p (popcount, BUILT_IN_NORMAL))
+return false;
+  tree fndecl = gimple_call_fndecl (popcount);
+  if ((DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNT)
+  && (DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNTL)
+  && (DECL_FUNCTION_CODE (fndecl) != BUILT_IN_POPCOUNTLL))
+return false;
+
+  /* Check that we have a cast prior to that.  */
+  if (gimple_code (cast) != GIMPLE_ASSIGN
+  || gimple_assign_rhs_code (cast) != NOP_EXPR)
+return false;
+
+  rhs = gimple_assign_rhs1 (cast);
+  lhs = gimple_get_lhs (popcount);
+  arg = gimple_call_arg (popcount, 0);
+
+  /* Result of the cast stmt is the argument to the builtin.  */
+  if (arg != gimple_assign_lhs (cast))
+return false;
+
+  if (lhs != arg0
+  && lhs != arg1)
+return false;
+
+  cond = last_stmt (cond_bb);
+
+  /* Cond_bb has a check for b_4 != 0 before calling the popcount
+ builtin.  */
+  if (gimple_code (cond) != GIMPLE_COND
+  || gimple_cond_code (cond) != NE_EXPR
+  || TREE_CODE (gimple_cond_lhs (cond)) != SSA_NAME
+  || rhs != gimple_cond_lhs (cond))
+return false;
+
+  /* Remove the p

[PATCH 2/3][POPCOUNT] Check if zero check is done before entering the loop

2018-06-22 Thread Kugan Vivekanandarajah
gcc/ChangeLog:

2018-06-22  Kugan Vivekanandarajah  

* tree-ssa-loop-niter.c (number_of_iterations_popcount): If popcount
argument is checked for zero before entering loop, avoid checking again.
From 4f2a6ad5a49eec0a1cae15e033329f889f9137b9 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 22 Jun 2018 14:11:28 +1000
Subject: [PATCH 2/3] in niter don't check for zero when it is already checked

Change-Id: I98982537bca14cb99a85d0da70d33a6c044385fd
---
 gcc/tree-ssa-loop-niter.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-ssa-loop-niter.c b/gcc/tree-ssa-loop-niter.c
index 9365915..2299aca 100644
--- a/gcc/tree-ssa-loop-niter.c
+++ b/gcc/tree-ssa-loop-niter.c
@@ -2503,6 +2503,7 @@ number_of_iterations_popcount (loop_p loop, edge exit,
   HOST_WIDE_INT max;
   adjust = true;
   tree fn = NULL_TREE;
+  bool check_zero = true;
 
   /* Check loop terminating branch is like
  if (b != 0).  */
@@ -2590,7 +2591,37 @@ number_of_iterations_popcount (loop_p loop, edge exit,
 
   niter->niter = iter;
   niter->assumptions = boolean_true_node;
-  if (adjust)
+  if (adjust
+  && EDGE_COUNT (loop->header->preds) == 2)
+{
+  /* Sometimes, src of the popcount is checked for
+	 zero before entering the loop.  In this case we
+	 don't need to check for zero again.  */
+  edge pred_edge = EDGE_PRED (loop->header, 0);
+  gimple *stmt = last_stmt (pred_edge->src);
+
+  /* If there is an empty pre-header, go one block
+	 above.  */
+  if (!stmt
+	  && EDGE_COUNT (pred_edge->src->preds) == 1)
+	{
+	  pred_edge = EDGE_PRED (pred_edge->src, 0);
+	  stmt = last_stmt (pred_edge->src);
+	}
+
+  /* If we have the src != 0 check and if we are entering
+	 the loop when the condition is true, we can skip zero
+	 check.  */
+  if (stmt
+	  && gimple_code (stmt) == GIMPLE_COND
+	  && gimple_cond_code (stmt) == NE_EXPR
+	  && pred_edge->flags & EDGE_TRUE_VALUE
+	  && TREE_CODE (gimple_cond_lhs (stmt)) == SSA_NAME
+	  && (gimple_phi_arg_def (phi, loop_preheader_edge (loop)->dest_idx)
+	  == gimple_cond_lhs (stmt)))
+	check_zero = false;
+}
+  if (adjust && check_zero)
 niter->may_be_zero = fold_build2 (EQ_EXPR, boolean_type_node, src,
   build_zero_cst
   (TREE_TYPE (src)));
-- 
2.7.4
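
For reference, the control flow the new check walks (a sketch with
invented block names):

  bb_guard:      if (b_4 != 0) goto bb_preheader; else goto bb_join;
  bb_preheader:  (possibly empty; the code looks one block above)
  bb_header:     the popcount loop over b_4

On the true edge of bb_guard we know b_4 != 0, so the would-be
may_be_zero = (b_4 == 0) condition is statically false and the zero
check can be skipped.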



[PATCH 1/3][POPCOUNT] Handle COND_EXPR in expression_expensive_p

2018-06-22 Thread Kugan Vivekanandarajah
[PATCH 1/3][POPCOUNT] Handle COND_EXPR in expression_expensive_p

gcc/ChangeLog:

2018-06-22  Kugan Vivekanandarajah  

* tree-scalar-evolution.c (expression_expensive_p): Handle COND_EXPR.
From aa38b98dd97567c6032c261f19b3705abc2233b0 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Fri, 22 Jun 2018 14:10:26 +1000
Subject: [PATCH 1/3] generate popcount when checked for zero

Change-Id: I7255bf35e28222f7418852cb232246edf1fb5a39
---
 gcc/tree-scalar-evolution.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/gcc/tree-scalar-evolution.c b/gcc/tree-scalar-evolution.c
index 4b0ec02..db419a4 100644
--- a/gcc/tree-scalar-evolution.c
+++ b/gcc/tree-scalar-evolution.c
@@ -3508,6 +3508,11 @@ expression_expensive_p (tree expr)
   return false;
 }
 
+  if (code == COND_EXPR)
+return (expression_expensive_p (TREE_OPERAND (expr, 0))
+	|| expression_expensive_p (TREE_OPERAND (expr, 1))
+	|| expression_expensive_p (TREE_OPERAND (expr, 2)));
+
   switch (TREE_CODE_CLASS (code))
 {
 case tcc_binary:
-- 
2.7.4
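
As examples of the distinction at stake (invented code, not from the
patch): a COND_EXPR selecting between already-computed values is cheap,
while one whose arms are full expressions forces all of them to be
evaluated before one result is chosen -- the case the review upthread
restricts with the EXPR_P checks:

/* Cheap: the operands are SSA names or constants.  */
int
cheap (int b, int v)
{
  return b != 0 ? v : 1;
}

/* Worth rejecting: both arms are expression trees, yet only one
   value is used after gimplification to straight-line code.  */
int
costly (int b, int x, int y, int z, int w, int v)
{
  return b != 0 ? x * y + z : w - v;
}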



[PATCH 0/3][POPCOUNT]

2018-06-22 Thread Kugan Vivekanandarajah
When we set niter with may_be_zero, currently final_value_replacement
will not happen because expression_expensive_p does not handle
COND_EXPR.  Patch 1 adds this.

With that we have the following optimized gimple.

   <bb 2> [local count: 118111601]:
  if (b_4(D) != 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

   <bb 3> [local count: 105119324]:
  _2 = (unsigned long) b_4(D);
  _9 = __builtin_popcountl (_2);
  c_3 = b_4(D) != 0 ? _9 : 1;

   <bb 4> [local count: 118111601]:
  # c_12 = PHI <0(2), c_3(3)>

I assume that the 1 in b_4(D) != 0 ? _9 : 1 is OK (?) because the
latch executing zero times for b_4 == 0 means that the body will
execute once.

The issue here is, since we are checking if (b_4(D) != 0) before
entering the loop means we don't need to set maybe_zero. Patch 2
handles this.

With that we have
   <bb 2> [local count: 118111601]:
  if (b_4(D) != 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

   <bb 3> [local count: 105119324]:
  _2 = (unsigned long) b_4(D);
  _9 = __builtin_popcountl (_2);

   <bb 4> [local count: 118111601]:
  # c_12 = PHI <0(2), _9(3)>

As advised earlier, patch 3 adds phiopt support to remove this.

Bootstrap and regression testing are ongoing.

Is this OK for trunk?

Thanks,
Kugan


Re: [RFC][PR64946] "abs" vectorization fails for char/short types

2018-06-11 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review and sorry for getting back to you late.

On 4 June 2018 at 18:38, Richard Biener  wrote:
> On Mon, Jun 4, 2018 at 10:18 AM Kugan Vivekanandarajah
>  wrote:
>>
>> Hi Richard,
>>
>> Thanks for the review.
>>
>> On 1 June 2018 at 22:20, Richard Biener  wrote:
>> > On Fri, Jun 1, 2018 at 4:12 AM Kugan Vivekanandarajah
>> >  wrote:
>> >>
>> >> Hi Richard,
>> >>
>> >> This is the revised patch based on the review and the discussion in
>> >> https://gcc.gnu.org/ml/gcc/2018-05/msg00179.html.
>> >>
>> >> In summary:
>> >> - I skipped  (element_precision (type) < element_precision (TREE_TYPE
>> >> (@0))) in the match.pd pattern as this would prevent transformation
>> >> for the case in PR.
>> >> that is, I am interested in is something like:
>> >>   char t = (char) ABS_EXPR <(int) x>
>> >> and I want to generate
>> char t = (char) ABSU_EXPR <x>
>> >>
>> >> - I also haven't added all the necessary match.pd changes for
>> >> ABSU_EXPR. I have a patch for that but will submit separately based on
>> >> this reveiw.
>> >>
>> >> - I also tried to add ABSU_EXPRsupport  in the places as necessary by
>> >> grepping for ABS_EXPR.
>> >>
>> >> - I also had to add special casing in vectorizer for ABSU_EXP as its
>> >> result is unsigned type.
>> >>
>> >> Is this OK. Patch bootstraps and the regression test is ongoing.
>> >
>> > The c/c-typeck.c:build_unary_op change looks unnecessary - the
>> > C FE should never generate this directly (the c-common one might
>> > be triggered by early folding I guess).
>>
>> The Gimple FE testcase is running into this.
>
> Ah, OK then.
>
>> >
>> > @@ -1761,6 +1762,9 @@ const_unop (enum tree_code code, tree type, tree 
>> > arg0)
>> >if (TREE_CODE (arg0) == INTEGER_CST || TREE_CODE (arg0) == REAL_CST)
>> > return fold_abs_const (arg0, type);
>> >break;
>> > +case ABSU_EXPR:
>> > +   return fold_convert (type, fold_abs_const (arg0,
>> > +  signed_type_for 
>> > (type)));
>> >
>> >  case CONJ_EXPR:
>> >
>> > I think this will get you bogus TREE_OVERFLOW flags set on ABSU (-INT_MIN).
>> >
>> > I think you want to change fold_abs_const to properly deal with arg0 being
>> > signed and type unsigned.  That is, sth like
>> >
>> > diff --git a/gcc/fold-const.c b/gcc/fold-const.c
>> > index 6f80f1b1d69..f60f9c77e91 100644
>> > --- a/gcc/fold-const.c
>> > +++ b/gcc/fold-const.c
>> > @@ -13843,18 +13843,19 @@ fold_abs_const (tree arg0, tree type)
>> >{
>> >  /* If the value is unsigned or non-negative, then the absolute 
>> > value
>> >is the same as the ordinary value.  */
>> > -   if (!wi::neg_p (wi::to_wide (arg0), TYPE_SIGN (type)))
>> > - t = arg0;
>> > +   wide_int val = wi::to_wide (arg0);
>> > +   bool overflow = false;
>> > +   if (!wi::neg_p (val, TYPE_SIGN (TREE_TYPE (arg0))))
>> > +	 ;
>> >
>> > /* If the value is negative, then the absolute value is
>> >its negation.  */
>> > else
>> > - {
>> > -   bool overflow;
>> > -   wide_int val = wi::neg (wi::to_wide (arg0), &overflow);
>> > -   t = force_fit_type (type, val, -1,
>> > -   overflow | TREE_OVERFLOW (arg0));
>> > - }
>> > + wide_int val = wi::neg (val, &overflow);
>> > +
>> > +   /* Force to the destination type, set TREE_OVERFLOW for signed
>> > +  TYPE only.  */
>> > +   t = force_fit_type (type, val, 1, overflow | TREE_OVERFLOW (arg0));
>> >}
>> >break;
>> >
>> > and then simply share the const_unop code with ABS_EXPR.
>>
>> Done.
>>
>> > diff --git a/gcc/match.pd b/gcc/match.pd
>> > index 14386da..7d7c132 100644
>> > --- a/gcc/match.pd
>> > +++ b/gcc/match.pd
>> > @@ -102,6 +102,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>> >  (match (nop_convert @0)
>> >   @0)
>> >
>> > +(simplify (abs (convert @0))
>> > + (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0

Re: [RFC][PR64946] "abs" vectorization fails for char/short types

2018-06-04 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.

On 1 June 2018 at 22:20, Richard Biener  wrote:
> On Fri, Jun 1, 2018 at 4:12 AM Kugan Vivekanandarajah
>  wrote:
>>
>> Hi Richard,
>>
>> This is the revised patch based on the review and the discussion in
>> https://gcc.gnu.org/ml/gcc/2018-05/msg00179.html.
>>
>> In summary:
>> - I skipped  (element_precision (type) < element_precision (TREE_TYPE
>> (@0))) in the match.pd pattern as this would prevent transformation
>> for the case in PR.
>> that is, I am interested in is something like:
>>   char t = (char) ABS_EXPR <(int) x>
>> and I want to generate
>> char t = (char) ABSU_EXPR <x>
>>
>> - I also haven't added all the necessary match.pd changes for
>> ABSU_EXPR. I have a patch for that but will submit separately based on
>> this reveiw.
>>
>> - I also tried to add ABSU_EXPRsupport  in the places as necessary by
>> grepping for ABS_EXPR.
>>
>> - I also had to add special casing in vectorizer for ABSU_EXP as its
>> result is unsigned type.
>>
>> Is this OK. Patch bootstraps and the regression test is ongoing.
>
> The c/c-typeck.c:build_unary_op change looks unnecessary - the
> C FE should never generate this directly (the c-common one might
> be triggered by early folding I guess).

The Gimple FE testcase is running into this.

>
> @@ -1761,6 +1762,9 @@ const_unop (enum tree_code code, tree type, tree arg0)
>if (TREE_CODE (arg0) == INTEGER_CST || TREE_CODE (arg0) == REAL_CST)
> return fold_abs_const (arg0, type);
>break;
> +case ABSU_EXPR:
> +   return fold_convert (type, fold_abs_const (arg0,
> +  signed_type_for (type)));
>
>  case CONJ_EXPR:
>
> I think this will get you bogus TREE_OVERFLOW flags set on ABSU (-INT_MIN).
>
> I think you want to change fold_abs_const to properly deal with arg0 being
> signed and type unsigned.  That is, sth like
>
> diff --git a/gcc/fold-const.c b/gcc/fold-const.c
> index 6f80f1b1d69..f60f9c77e91 100644
> --- a/gcc/fold-const.c
> +++ b/gcc/fold-const.c
> @@ -13843,18 +13843,19 @@ fold_abs_const (tree arg0, tree type)
>{
>  /* If the value is unsigned or non-negative, then the absolute value
>is the same as the ordinary value.  */
> -   if (!wi::neg_p (wi::to_wide (arg0), TYPE_SIGN (type)))
> - t = arg0;
> +   wide_int val = wi::to_wide (arg0);
> +   bool overflow = false;
> +   if (!wi::neg_p (val, TYPE_SIGN (TREE_TYPE (arg0))))
> +	 ;
>
> /* If the value is negative, then the absolute value is
>its negation.  */
> else
> - {
> -   bool overflow;
> -   wide_int val = wi::neg (wi::to_wide (arg0), &overflow);
> -   t = force_fit_type (type, val, -1,
> -   overflow | TREE_OVERFLOW (arg0));
> - }
> + wide_int val = wi::neg (val, &overflow);
> +
> +   /* Force to the destination type, set TREE_OVERFLOW for signed
> +  TYPE only.  */
> +   t = force_fit_type (type, val, 1, overflow | TREE_OVERFLOW (arg0));
>}
>break;
>
> and then simply share the const_unop code with ABS_EXPR.

Done.

> diff --git a/gcc/match.pd b/gcc/match.pd
> index 14386da..7d7c132 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -102,6 +102,14 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (match (nop_convert @0)
>   @0)
>
> +(simplify (abs (convert @0))
> + (if (ANY_INTEGRAL_TYPE_P (TREE_TYPE (@0))
> +  && !TYPE_UNSIGNED (TREE_TYPE (@0))
> +  && !TYPE_UNSIGNED (type))
> +  (with { tree utype = unsigned_type_for (TREE_TYPE (@0)); }
> +   (convert (absu:utype @0))))
> +
> +
>
> please put a comment before the pattern.  I believe there's no
> need to check for !TYPE_UNSIGNED (type).  Note this
> also converts abs ((char)int-var) to (char)absu(int-var) which
> doesn't make sense.  The original issue you want to address
> here is the case where TYPE_PRECISION of @0 is less than
> the precision of type.  That is, you want to remove language
> introduced integer promotion of @0 which only is possible
> with ABSU.  So please do add such precision check
> (I simply suggested the bogus direction of the test).

Done.
>
> diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
> index 68f4fd3..9b62583 100644
> --- a/gcc/tree-cfg.c
> +++ b/gcc/tree-cfg.c
> @@ -3685,6 +3685,12 @@ verify_gimple_assign_unary (gassign *stmt)
>  case PAREN_EXPR:
>  case CONJ_EXPR:
>break;
> +case ABSU_EXPR:
> +  if (!TYPE_UNSIGNED (lhs_type)
> + || !ANY_IN

Re: [RFC][PR82479] missing popcount builtin detection

2018-06-01 Thread Kugan Vivekanandarajah
Hi Bin,

Thanks a lot for the review.

On 1 June 2018 at 03:45, Bin.Cheng  wrote:
> On Thu, May 31, 2018 at 3:51 AM, Kugan Vivekanandarajah
>  wrote:
>> Hi Bin,
>>
>> Thanks for the review. Please find the revised patch based on the
>> review comments.
>>
>> Thanks,
>> Kugan
>>
>> On 17 May 2018 at 19:56, Bin.Cheng  wrote:
>>> On Thu, May 17, 2018 at 2:39 AM, Kugan Vivekanandarajah
>>>  wrote:
>>>> Hi Richard,
>>>>
>>>> On 6 March 2018 at 02:24, Richard Biener  
>>>> wrote:
>>>>> On Thu, Feb 8, 2018 at 1:41 AM, Kugan Vivekanandarajah
>>>>>  wrote:
>>>>>> Hi Richard,
>>>>>>
>
> Hi,
> Thanks very much for working.
>
>> +/* Utility function to check if OP is defined by a stmt
>> +   that is a val - 1.  If that is the case, set it to STMT.  */
>> +
>> +static bool
>> +ssa_defined_by_and_minus_one_stmt_p (tree op, tree val, gimple **stmt)
> This is checking if op is defined as val - 1, so name it as
> ssa_defined_by_minus_one_stmt_p?
>
>> +{
>> +  if (TREE_CODE (op) == SSA_NAME
>> +  && (*stmt = SSA_NAME_DEF_STMT (op))
>> +  && is_gimple_assign (*stmt)
>> +  && (gimple_assign_rhs_code (*stmt) == PLUS_EXPR)
>> +  && val == gimple_assign_rhs1 (*stmt)
>> +  && integer_minus_onep (gimple_assign_rhs2 (*stmt)))
>> +return true;
>> +  else
>> +return false;
> You can simply return the boolean condition.
Done.

>
>> +}
>> +
>> +/* See if LOOP is a popcount implementation of the form
> ...
>> +  rhs1 = gimple_assign_rhs1 (and_stmt);
>> +  rhs2 = gimple_assign_rhs2 (and_stmt);
>> +
>> +  if (ssa_defined_by_and_minus_one_stmt_p (rhs1, rhs2, &and_minus_one))
>> +rhs1 = rhs2;
>> +  else if (ssa_defined_by_and_minus_one_stmt_p (rhs2, rhs1, &and_minus_one))
>> +;
>> +  else
>> +return false;
>> +
>> +  /* Check the recurrence.  */
>> +  phi = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (and_minus_one));
> So gimple_assign_rhs1 (and_minus_one) == rhs1 is always true?  Please
> use rhs1 directly.

Done.
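
(For context on the loop being matched in this function, a minimal C
sketch of the popcount idiom from PR82479; my illustration, not part of
the patch:

  int popcount_loop (unsigned int b)
  {
    int n = 0;
    while (b)
      {
        b &= b - 1;  /* clears the lowest set bit */
        n++;
      }
    return n;
  }

The loop iterates once per set bit, so the niter analysis can express
its iteration count as __builtin_popcount (b), which final value
replacement then substitutes for n.)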
>> +  gimple *src_phi = SSA_NAME_DEF_STMT (rhs2);
> I think this is checking wrong thing and is redundant.  Either rhs2
> equals to rhs1 or is defined as (rhs1 - 1).  For (rhs2 == rhs1) case,
> the check duplicates checking on phi; for the latter, it's never a PHI
> stmt and shouldn't be checked.
>
>> +  if (gimple_code (phi) != GIMPLE_PHI
>> +  || gimple_code (src_phi) != GIMPLE_PHI)
>> +return false;
>> +
>> +  dest = gimple_assign_lhs (count_stmt);
>> +  tree fn = builtin_decl_implicit (BUILT_IN_POPCOUNT);
>> +  tree src = gimple_phi_arg_def (src_phi, loop_preheader_edge 
>> (loop)->dest_idx);
>> +  if (adjust)
>> +iter = fold_build2 (MINUS_EXPR, TREE_TYPE (dest),
>> +build_call_expr (fn, 1, src),
>> +build_int_cst (TREE_TYPE (dest), 1));
>> +  else
>> +iter = build_call_expr (fn, 1, src);
> Note tree-ssa-loop-niters.c always use unsigned_type_for (IV-type) as
> niters type.  Though unsigned type is unnecessary in this case, but
> better to follow existing behavior?

Done.
>
>> +  max = int_cst_value (TYPE_MAX_VALUE (TREE_TYPE (dest)));
> As richi suggested, max should be the number of bits in type of IV.
>
>> +
>> +  niter->assumptions = boolean_false_node;
> Redundant.

Not sure I understand. If I remove this, I am getting an ICE
(segmentation fault). What is the expectation here?

>> +  niter->control.base = NULL_TREE;
>> +  niter->control.step = NULL_TREE;
>> +  niter->control.no_overflow = false;
>> +  niter->niter = iter;
>> +  niter->assumptions = boolean_true_node;
>> +  niter->may_be_zero = boolean_false_node;
>> +  niter->max = max;
>> +  niter->bound = NULL_TREE;
>> +  niter->cmp = ERROR_MARK;
>> +  return true;
>> +}
>> +
>> +
> Apologies if these are nitpickings.
Thanks for the review. I am happy to make the changes needed to get it
to how it should be :)

Thanks,
Kugan

>
> Thanks,
> bin
From f45179c777d846731d2d899a142c45a36ab35fd1 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah 
Date: Thu, 10 May 2018 21:41:53 +1000
Subject: [PATCH] popcount

Change-Id: I10c1f990e5b9c9900cf7c93678df924f0463b72e
---
 gcc/ipa-fnsummary.c   |   4 +
 gcc/testsuite/gcc.dg/tree-ssa/popcount.c  |  41 
 gcc/testsuite/gcc.dg/tree-ssa/popcount2.c |  29 ++
 gcc/testsuite/gcc.dg/tree-ssa/popcount3.c |  28 ++
 gcc/tree-scalar-evolution

Re: [RFC][PR64946] "abs" vectorization fails for char/short types

2018-05-31 Thread Kugan Vivekanandarajah
Hi Richard,

This is the revised patch based on the review and the discussion in
https://gcc.gnu.org/ml/gcc/2018-05/msg00179.html.

In summary:
- I skipped  (element_precision (type) < element_precision (TREE_TYPE
(@0))) in the match.pd pattern as this would prevent transformation
for the case in PR.
that is, what I am interested in is something like:
  char t = (char) ABS_EXPR <(int) x>
and I want to generate
char t = (char) ABSU_EXPR <x>

- I also haven't added all the necessary match.pd changes for
ABSU_EXPR. I have a patch for that but will submit separately based on
this review.

- I also tried to add ABSU_EXPR support in the places as necessary by
grepping for ABS_EXPR.

- I also had to add special casing in the vectorizer for ABSU_EXPR as its
result is unsigned type.

Is this OK? Patch bootstraps, and the regression test is ongoing.

Thanks,
Kugan


On 18 May 2018 at 12:36, Kugan Vivekanandarajah
 wrote:
> Hi Richard,
>
> Thanks for the review. I am revising the patch based on Andrew's comments too.
>
> On 17 May 2018 at 20:36, Richard Biener  wrote:
>> On Thu, May 17, 2018 at 4:56 AM Andrew Pinski  wrote:
>>
>>> On Wed, May 16, 2018 at 7:14 PM, Kugan Vivekanandarajah
>>>  wrote:
>>> > As mentioned in the PR, I am trying to add ABSU_EXPR to fix this
>>> > issue. In the attached patch, in fold_cond_expr_with_comparison I am
>>> > generating ABSU_EXPR for these cases. As I understand, absu_expr is
>>> > well defined in RTL. So, the issue is generating absu_expr  and
>>> > transferring to RTL in the correct way. I am not sure I am not doing
>>> > all that is needed. I will clean up and add more test-cases based on
>>> > the feedback.
>>
>>
>>> diff --git a/gcc/optabs-tree.c b/gcc/optabs-tree.c
>>> index 71e172c..2b812e5 100644
>>> --- a/gcc/optabs-tree.c
>>> +++ b/gcc/optabs-tree.c
>>> @@ -235,6 +235,7 @@ optab_for_tree_code (enum tree_code code, const_tree
>> type,
>>> return trapv ? negv_optab : neg_optab;
>>
>>>   case ABS_EXPR:
>>> +case ABSU_EXPR:
>>> return trapv ? absv_optab : abs_optab;
>>
>>
>>> This part is not correct, it should something like this:
>>
>>>   case ABS_EXPR:
>>> return trapv ? absv_optab : abs_optab;
>>> +case ABSU_EXPR:
>>> +   return abs_optab ;
>>
>>> Because ABSU is not undefined at the TYPE_MAX.
>>
>> Also
>>
>> /* Unsigned abs is simply the operand.  Testing here means we don't
>>   risk generating incorrect code below.  */
>> -  if (TYPE_UNSIGNED (type))
>> +  if (TYPE_UNSIGNED (type)
>> + && (code != ABSU_EXPR))
>>  return op0;
>>
>> is wrong.  ABSU of an unsigned number is still just that number.
>>
>> The change to fold_cond_expr_with_comparison looks odd to me
>> (premature optimization).  It should be done separately - it seems
>> you are doing
>
> FE seems to be using this to generate ABS_EXPR from
> c_fully_fold_internal to fold_build3_loc and so on. I changed this to
> generate ABSU_EXPR for the case in the testcase. So the question
> should be, in what cases do we need ABS_EXPR and in what cases do we
> need ABSU_EXPR. It is not very clear to me.
>
>
>>
>> (simplify (abs (convert @0)) (convert (absu @0)))
>>
>> here.
>>
>> You touch one other place in fold-const.c but there seem to be many
>> more that need ABSU_EXPR handling (you touched the one needed
>> for correctness) - esp. you should at least handle constant folding
>> in const_unop and the nonnegative predicate.
>
> OK.
>>
>> @@ -3167,6 +3167,9 @@ verify_expr (tree *tp, int *walk_subtrees, void *data
>> ATTRIBUTE_UNUSED)
>> CHECK_OP (0, "invalid operand to unary operator");
>> break;
>>
>> +case ABSU_EXPR:
>> +  break;
>> +
>>   case REALPART_EXPR:
>>   case IMAGPART_EXPR:
>>
>> verify_expr is no more.  Did you test this recently against trunk?
>
> This patch is against slightly older trunk. I will rebase it.
>
>>
>> @@ -3937,6 +3940,9 @@ verify_gimple_assign_unary (gassign *stmt)
>>   case PAREN_EXPR:
>>   case CONJ_EXPR:
>> break;
>> +case ABSU_EXPR:
>> +  /* FIXME.  */
>> +  return false;
>>
>> no - please not!  Please add verification here - ABSU should be only
>> called on INTEGRAL, vector or complex INTEGRAL types and the
>> type of the LHS should be always the unsigned variant of the
>

Re: [RFC][PR82479] missing popcount builtin detection

2018-05-30 Thread Kugan Vivekanandarajah
Hi Bin,

Thanks for the review. Please find the revised patch based on the
review comments.

Thanks,
Kugan

On 17 May 2018 at 19:56, Bin.Cheng  wrote:
> On Thu, May 17, 2018 at 2:39 AM, Kugan Vivekanandarajah
>  wrote:
>> Hi Richard,
>>
>> On 6 March 2018 at 02:24, Richard Biener  wrote:
>>> On Thu, Feb 8, 2018 at 1:41 AM, Kugan Vivekanandarajah
>>>  wrote:
>>>> Hi Richard,
>>>>
>>>> On 1 February 2018 at 23:21, Richard Biener  
>>>> wrote:
>>>>> On Thu, Feb 1, 2018 at 5:07 AM, Kugan Vivekanandarajah
>>>>>  wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> On 31 January 2018 at 21:39, Richard Biener  
>>>>>> wrote:
>>>>>>> On Wed, Jan 31, 2018 at 11:28 AM, Kugan Vivekanandarajah
>>>>>>>  wrote:
>>>>>>>> Hi Richard,
>>>>>>>>
>>>>>>>> Thanks for the review.
>>>>>>>> On 25 January 2018 at 20:04, Richard Biener 
>>>>>>>>  wrote:
>>>>>>>>> On Wed, Jan 24, 2018 at 10:56 PM, Kugan Vivekanandarajah
>>>>>>>>>  wrote:
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> Here is a patch for popcount builtin detection similar to LLVM. I
>>>>>>>>>> would like to queue this for review for next stage 1.
>>>>>>>>>>
>>>>>>>>>> 1. This is done part of loop-distribution and effective for -O3 and 
>>>>>>>>>> above.
>>>>>>>>>> 2. This does not distribute loop to detect popcount (like
>>>>>>>>>> memcpy/memmove). I dont think that happens in practice. Please 
>>>>>>>>>> correct
>>>>>>>>>> me if I am wrong.
>>>>>>>>>
>>>>>>>>> But then it has no business inside loop distribution but instead is
>>>>>>>>> doing final value
>>>>>>>>> replacement, right?  You are pattern-matching the whole loop after 
>>>>>>>>> all.  I think
>>>>>>>>> final value replacement would already do the correct thing if you
>>>>>>>>> teached number of
>>>>>>>>> iteration analysis that niter for
>>>>>>>>>
>>>>>>>>>[local count: 955630224]:
>>>>>>>>>   # b_11 = PHI <b_5(5), b_8(6)>
>>>>>>>>>   _1 = b_11 + -1;
>>>>>>>>>   b_8 = _1 & b_11;
>>>>>>>>>   if (b_8 != 0)
>>>>>>>>> goto ; [89.00%]
>>>>>>>>>   else
>>>>>>>>> goto ; [11.00%]
>>>>>>>>>
>>>>>>>>>[local count: 850510900]:
>>>>>>>>>   goto ; [100.00%]
>>>>>>>>
>>>>>>>> I am looking into this approach. What should be the scalar evolution
>>>>>>>> for b_8 (i.e. b & (b -1) in a loop) should be? This is not clear to me
>>>>>>>> and can this be represented with the scev?
>>>>>>>
>>>>>>> No, it's not affine and thus cannot be represented.  You only need the
>>>>>>> scalar evolution of the counting IV which is already handled and
>>>>>>> the number of iteration analysis needs to handle the above IV - this
>>>>>>> is the missing part.
>>>>>> Thanks for the clarification. I am now matching this loop pattern in
>>>>>> number_of_iterations_exit when number_of_iterations_exit_assumptions
>>>>>> fails. If the pattern matches, I am inserting the __builtin_popcount in
>>>>>> the loop preheader and setting the loop niter with this. This will be
>>>>>> used by the final value replacement. Is this what you wanted?
>>>>>
>>>>> No, you shouldn't insert a popcount stmt but instead the niter
>>>>> GENERIC tree should be a CALL_EXPR to popcount with the
>>>>> appropriate argument.
>>>>
>>>> Thats what I tried earlier but ran into some ICEs. I wasn't sure if
>>>> niter in tree_niter_desc can take such.
>>>>
>>>> Attached patch now does this. Also had to add support for CALL_EXPR in
>

Re: [RFC][PR64946] "abs" vectorization fails for char/short types

2018-05-17 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review. I am revising the patch based on Andrew's comments too.

On 17 May 2018 at 20:36, Richard Biener <richard.guent...@gmail.com> wrote:
> On Thu, May 17, 2018 at 4:56 AM Andrew Pinski <pins...@gmail.com> wrote:
>
>> On Wed, May 16, 2018 at 7:14 PM, Kugan Vivekanandarajah
>> <kugan.vivekanandara...@linaro.org> wrote:
>> > As mentioned in the PR, I am trying to add ABSU_EXPR to fix this
>> > issue. In the attached patch, in fold_cond_expr_with_comparison I am
>> > generating ABSU_EXPR for these cases. As I understand, absu_expr is
>> > well defined in RTL. So, the issue is generating absu_expr  and
>> > transferring to RTL in the correct way. I am not sure I am not doing
>> > all that is needed. I will clean up and add more test-cases based on
>> > the feedback.
>
>
>> diff --git a/gcc/optabs-tree.c b/gcc/optabs-tree.c
>> index 71e172c..2b812e5 100644
>> --- a/gcc/optabs-tree.c
>> +++ b/gcc/optabs-tree.c
>> @@ -235,6 +235,7 @@ optab_for_tree_code (enum tree_code code, const_tree
> type,
>> return trapv ? negv_optab : neg_optab;
>
>>   case ABS_EXPR:
>> +case ABSU_EXPR:
>> return trapv ? absv_optab : abs_optab;
>
>
>> This part is not correct, it should something like this:
>
>>   case ABS_EXPR:
>> return trapv ? absv_optab : abs_optab;
>> +case ABSU_EXPR:
>> +   return abs_optab ;
>
>> Because ABSU is not undefined at the TYPE_MAX.
>
> Also
>
> /* Unsigned abs is simply the operand.  Testing here means we don't
>   risk generating incorrect code below.  */
> -  if (TYPE_UNSIGNED (type))
> +  if (TYPE_UNSIGNED (type)
> + && (code != ABSU_EXPR))
>  return op0;
>
> is wrong.  ABSU of an unsigned number is still just that number.
>
> The change to fold_cond_expr_with_comparison looks odd to me
> (premature optimization).  It should be done separately - it seems
> you are doing

FE seems to be using this to generate ABS_EXPR from
c_fully_fold_internal to fold_build3_loc and so on. I changed this to
generate ABSU_EXPR for the case in the testcase. So the question
should be, in what cases do we need ABS_EXPR and in what cases do we
need ABSU_EXPR. It is not very clear to me.
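
(A hedged summary of the answer that emerges later in this thread:
ABS_EXPR keeps the operand's type, while ABSU_EXPR takes a signed
integral operand and yields the unsigned variant of its type.  Roughly,
in C terms, my illustration:

  int f (int x) { return x < 0 ? -x : x; }  /* ABS_EXPR, may overflow */
  unsigned int g (int x)                    /* ABSU_EXPR, well defined */
  {
    return x < 0 ? -(unsigned int) x : (unsigned int) x;
  }

So ABSU_EXPR is wanted exactly where the result is consumed as
unsigned, e.g. when removing the promotion for char/short abs.)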


>
> (simplify (abs (convert @0)) (convert (absu @0)))
>
> here.
>
> You touch one other place in fold-const.c but there seem to be many
> more that need ABSU_EXPR handling (you touched the one needed
> for correctness) - esp. you should at least handle constant folding
> in const_unop and the nonnegative predicate.

OK.
>
> @@ -3167,6 +3167,9 @@ verify_expr (tree *tp, int *walk_subtrees, void *data
> ATTRIBUTE_UNUSED)
> CHECK_OP (0, "invalid operand to unary operator");
> break;
>
> +case ABSU_EXPR:
> +  break;
> +
>   case REALPART_EXPR:
>   case IMAGPART_EXPR:
>
> verify_expr is no more.  Did you test this recently against trunk?

This patch is against slightly older trunk. I will rebase it.

>
> @@ -3937,6 +3940,9 @@ verify_gimple_assign_unary (gassign *stmt)
>   case PAREN_EXPR:
>   case CONJ_EXPR:
> break;
> +case ABSU_EXPR:
> +  /* FIXME.  */
> +  return false;
>
> no - please not!  Please add verification here - ABSU should be only
> called on INTEGRAL, vector or complex INTEGRAL types and the
> type of the LHS should be always the unsigned variant of the
> argument type.

OK.
>
> if (is_gimple_val (cond_expr))
>   return cond_expr;
>
> -  if (TREE_CODE (cond_expr) == ABS_EXPR)
> +  if (TREE_CODE (cond_expr) == ABS_EXPR
> +  || TREE_CODE (cond_expr) == ABSU_EXPR)
>   {
> rhs1 = TREE_OPERAND (cond_expr, 1);
> STRIP_USELESS_TYPE_CONVERSION (rhs1);
>
> err, but the next line just builds an ABS_EXPR ...
>
> How did you identify spots that need adjustment?  I would expect that
> once folding generates ABSU_EXPR that you need to adjust frontends
> (C++ constexpr handling for example).  Also I miss adjustments
> to gimple-pretty-print.c and the GIMPLE FE parser.

I will add this.
>
> recursively grepping throughout the whole gcc/ tree doesn't reveal too many
> cases of ABS_EXPR so I think it's reasonable to audit all of them.
>
> I also miss some trivial absu simplifications in match.pd.  There are not
> a lot of abs cases but similar ones would be good to have initially.

I will add them in the next version.

Thanks,
Kugan

>
> Thanks for tackling this!
> Richard.
>
>> Thanks,
>> Andrew
>
>> >
>> > Thanks,
>> > Kugan
>> >
>> &

[RFC][PR64946] "abs" vectorization fails for char/short types

2018-05-16 Thread Kugan Vivekanandarajah
As mentioned in the PR, I am trying to add ABSU_EXPR to fix this
issue. In the attached patch, in fold_cond_expr_with_comparison I am
generating ABSU_EXPR for these cases. As I understand, absu_expr is
well defined in RTL. So, the issue is generating absu_expr  and
transferring to RTL in the correct way. I am not sure I am not doing
all that is needed. I will clean up and add more test-cases based on
the feedback.

Thanks,
Kugan


gcc/ChangeLog:

2018-05-13  Kugan Vivekanandarajah  <kugan.vivekanandara...@linaro.org>

* expr.c (expand_expr_real_2): Handle ABSU_EXPR.
* fold-const.c (fold_cond_expr_with_comparison): Generate ABSU_EXPR
(fold_unary_loc): Handle ABSU_EXPR.
* optabs-tree.c (optab_for_tree_code): Likewise.
* tree-cfg.c (verify_expr): Likewise.
(verify_gimple_assign_unary):  Likewise.
* tree-if-conv.c (fold_build_cond_expr):  Likewise.
* tree-inline.c (estimate_operator_cost):  Likewise.
* tree-pretty-print.c (dump_generic_node):  Likewise.
* tree.def (ABSU_EXPR): New.

gcc/testsuite/ChangeLog:

2018-05-13  Kugan Vivekanandarajah  <kugan.vivekanandara...@linaro.org>

* gcc.dg/absu.c: New test.
diff --git a/gcc/expr.c b/gcc/expr.c
index 5e3d9a5..67f8dd1 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9063,6 +9063,7 @@ expand_expr_real_2 (sepops ops, rtx target, machine_mode 
tmode,
   return REDUCE_BIT_FIELD (temp);
 
 case ABS_EXPR:
+case ABSU_EXPR:
   op0 = expand_expr (treeop0, subtarget,
 VOIDmode, EXPAND_NORMAL);
   if (modifier == EXPAND_STACK_PARM)
@@ -9074,7 +9075,8 @@ expand_expr_real_2 (sepops ops, rtx target, machine_mode 
tmode,
 
   /* Unsigned abs is simply the operand.  Testing here means we don't
 risk generating incorrect code below.  */
-  if (TYPE_UNSIGNED (type))
+  if (TYPE_UNSIGNED (type)
+ && (code != ABSU_EXPR))
return op0;
 
   return expand_abs (mode, op0, target, unsignedp,
diff --git a/gcc/fold-const.c b/gcc/fold-const.c
index 3a99b66..6e80178 100644
--- a/gcc/fold-const.c
+++ b/gcc/fold-const.c
@@ -5324,8 +5324,17 @@ fold_cond_expr_with_comparison (location_t loc, tree 
type,
   case GT_EXPR:
if (TYPE_UNSIGNED (TREE_TYPE (arg1)))
  break;
-   tem = fold_build1_loc (loc, ABS_EXPR, TREE_TYPE (arg1), arg1);
-   return fold_convert_loc (loc, type, tem);
+   if (TREE_CODE (arg1) == NOP_EXPR)
+ {
+   arg1 = TREE_OPERAND (arg1, 0);
+   tem = fold_build1_loc (loc, ABSU_EXPR, unsigned_type_for 
(arg1_type), arg1);
+   return fold_convert_loc (loc, type, tem);
+ }
+   else
+ {
+   tem = fold_build1_loc (loc, ABS_EXPR, TREE_TYPE (arg1), arg1);
+   return fold_convert_loc (loc, type, tem);
+ }
   case UNLE_EXPR:
   case UNLT_EXPR:
if (flag_trapping_math)
@@ -7698,7 +7707,8 @@ fold_unary_loc (location_t loc, enum tree_code code, tree 
type, tree op0)
   if (arg0)
 {
   if (CONVERT_EXPR_CODE_P (code)
- || code == FLOAT_EXPR || code == ABS_EXPR || code == NEGATE_EXPR)
+ || code == FLOAT_EXPR || code == ABS_EXPR
+ || code == ABSU_EXPR || code == NEGATE_EXPR)
{
  /* Don't use STRIP_NOPS, because signedness of argument type
 matters.  */
diff --git a/gcc/optabs-tree.c b/gcc/optabs-tree.c
index 71e172c..2b812e5 100644
--- a/gcc/optabs-tree.c
+++ b/gcc/optabs-tree.c
@@ -235,6 +235,7 @@ optab_for_tree_code (enum tree_code code, const_tree type,
   return trapv ? negv_optab : neg_optab;
 
 case ABS_EXPR:
+case ABSU_EXPR:
   return trapv ? absv_optab : abs_optab;
 
 default:
diff --git a/gcc/testsuite/gcc.dg/absu.c b/gcc/testsuite/gcc.dg/absu.c
index e69de29..43e651b 100644
--- a/gcc/testsuite/gcc.dg/absu.c
+++ b/gcc/testsuite/gcc.dg/absu.c
@@ -0,0 +1,39 @@
+
+/* { dg-do run  } */
+/* { dg-options "-O0" } */
+
+#include <limits.h>
+#define ABS(x) (((x) >= 0) ? (x) : -(x))
+
+#define DEF_TEST(TYPE) \
+void foo_##TYPE (signed TYPE x, unsigned TYPE y){  \
+TYPE t = ABS (x);  \
+if (t != y)\
+   __builtin_abort (); \
+}  \
+
+DEF_TEST (char);
+DEF_TEST (short);
+DEF_TEST (int);
+DEF_TEST (long);
+void main ()
+{
+  foo_char (SCHAR_MIN + 1, SCHAR_MAX);
+  foo_char (0, 0);
+  foo_char (SCHAR_MAX, SCHAR_MAX);
+
+  foo_int (-1, 1);
+  foo_int (0, 0);
+  foo_int (INT_MAX, INT_MAX);
+  foo_int (INT_MIN + 1, INT_MAX);
+
+  foo_short (-1, 1);
+  foo_short (0, 0);
+  foo_short (SHRT_MAX, SHRT_MAX);
+  foo_short (SHRT_MIN + 1, SHRT_MAX);
+
+  foo_long (-1, 1);
+  foo_long (0, 0);
+  foo_long (LONG_MAX, LONG_MAX);
+  foo_long (LONG_MIN + 1, LONG_MAX);
+}
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index 9485f73..59a115c 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -3167,6 +3167,9 @@ verify_expr (tree *t

Re: [RFC][PR82479] missing popcount builtin detection

2018-05-16 Thread Kugan Vivekanandarajah
Hi Richard,

On 6 March 2018 at 02:24, Richard Biener <richard.guent...@gmail.com> wrote:
> On Thu, Feb 8, 2018 at 1:41 AM, Kugan Vivekanandarajah
> <kugan.vivekanandara...@linaro.org> wrote:
>> Hi Richard,
>>
>> On 1 February 2018 at 23:21, Richard Biener <richard.guent...@gmail.com> 
>> wrote:
>>> On Thu, Feb 1, 2018 at 5:07 AM, Kugan Vivekanandarajah
>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>> Hi Richard,
>>>>
>>>> On 31 January 2018 at 21:39, Richard Biener <richard.guent...@gmail.com> 
>>>> wrote:
>>>>> On Wed, Jan 31, 2018 at 11:28 AM, Kugan Vivekanandarajah
>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> Thanks for the review.
>>>>>> On 25 January 2018 at 20:04, Richard Biener <richard.guent...@gmail.com> 
>>>>>> wrote:
>>>>>>> On Wed, Jan 24, 2018 at 10:56 PM, Kugan Vivekanandarajah
>>>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> Here is a patch for popcount builtin detection similar to LLVM. I
>>>>>>>> would like to queue this for review for next stage 1.
>>>>>>>>
>>>>>>>> 1. This is done part of loop-distribution and effective for -O3 and 
>>>>>>>> above.
>>>>>>>> 2. This does not distribute loop to detect popcount (like
>>>>>>>> memcpy/memmove). I dont think that happens in practice. Please correct
>>>>>>>> me if I am wrong.
>>>>>>>
>>>>>>> But then it has no business inside loop distribution but instead is
>>>>>>> doing final value
>>>>>>> replacement, right?  You are pattern-matching the whole loop after all. 
>>>>>>>  I think
>>>>>>> final value replacement would already do the correct thing if you
>>>>>>> teached number of
>>>>>>> iteration analysis that niter for
>>>>>>>
>>>>>>>[local count: 955630224]:
>>>>>>>   # b_11 = PHI <b_5(5), b_8(6)>
>>>>>>>   _1 = b_11 + -1;
>>>>>>>   b_8 = _1 & b_11;
>>>>>>>   if (b_8 != 0)
>>>>>>> goto ; [89.00%]
>>>>>>>   else
>>>>>>> goto ; [11.00%]
>>>>>>>
>>>>>>>[local count: 850510900]:
>>>>>>>   goto ; [100.00%]
>>>>>>
>>>>>> I am looking into this approach. What should be the scalar evolution
>>>>>> for b_8 (i.e. b & (b -1) in a loop) should be? This is not clear to me
>>>>>> and can this be represented with the scev?
>>>>>
>>>>> No, it's not affine and thus cannot be represented.  You only need the
>>>>> scalar evolution of the counting IV which is already handled and
>>>>> the number of iteration analysis needs to handle the above IV - this
>>>>> is the missing part.
>>>> Thanks for the clarification. I am now matching this loop pattern in
>>>> number_of_iterations_exit when number_of_iterations_exit_assumptions
>>>> fails. If the pattern matches, I am inserting the __builtin_popcount in
>>>> the loop preheader and setting the loop niter with this. This will be
>>>> used by the final value replacement. Is this what you wanted?
>>>
>>> No, you shouldn't insert a popcount stmt but instead the niter
>>> GENERIC tree should be a CALL_EXPR to popcount with the
>>> appropriate argument.
>>
>> Thats what I tried earlier but ran into some ICEs. I wasn't sure if
>> niter in tree_niter_desc can take such.
>>
>> Attached patch now does this. Also had to add support for CALL_EXPR in
>> few places to handle niter with CALL_EXPR. Does this look OK?
>
> Overall this looks ok - the patch includes changes in places that I don't 
> think
> need changes such as chrec_convert_1 or extract_ops_from_tree.
> The expression_expensive_p change should be more specific than making
> all calls inexpensive as well.

Changed it.

>
> The verify_ssa change looks bogus, you do
>
> +  dest = gimple_phi_result (count_phi);
> +  tree var = make_ssa_name (TREE_TYPE (dest), NULL);
> +  tree fn = builtin_decl_implicit (BU

Re: [PR63185][RFC] Improve DSE with branches

2018-05-15 Thread Kugan Vivekanandarajah
Hi Richard,

On 15 May 2018 at 19:20, Richard Biener <rguent...@suse.de> wrote:
> On Tue, 15 May 2018, Richard Biener wrote:
>
>> On Mon, 14 May 2018, Kugan Vivekanandarajah wrote:
>>
>> > Hi,
>> >
>> > Attached patch handles PR63185 when we reach a PHI with temp != NULL.
>> > We could see the PHI, and if there aren't any uses of the PHI that are
>> > interesting, we could ignore it?
>> >
>> > Bootstrapped and regression tested on x86_64-linux-gnu.
>> > Is this OK?
>>
>> No, as Jeff said we can't do it this way.
>>
>> If we end up with multiple VDEFs in the walk of defvar immediate
>> uses we know we are dealing with a CFG fork.  We can't really
>> ignore any of the paths but we have to
>>
>>  a) find the merge point (and the associated VDEF)
>>  b) verify for each chain of VDEFs with associated VUSEs
>> up to that merge VDEF that we have no uses of the to-be-classified
>> store and collect (partial) kills
>>  c) intersect kill info and continue walking from the merge point
>>
>> in b) there's the optional possibility to find sinking opportunities
>> in case we have kills on some paths but uses on others.  This is why
>> DSE should be really merged with (store) sinking.
>>
>> So if we want to enhance DSEs handling of branches then we need
>> to refactor the simple dse_classify_store function.  Let me take
>> an attempt at this today.
>
> First (baby) step is the following - it arranges to collect the
> defs we need to continue walking from and implements trivial
> reduction by stopping at (full) kills.  This allows us to handle
> the new testcase (which was already handled in the very late DSE
> pass with the help of sinking the store).

Thanks for the explanation and thanks for working on this.

>
> I took the opportunity to kill the use_stmt parameter of
> dse_classify_store as the only user is only looking for whether
> the kills were all clobbers which I added a new parameter for.
>
> I didn't adjust the byte-tracking case fully (I'm not fully understanding
> the code in the case of a use and I'm not sure whether it's worth
> doing the def reduction with byte-tracking).

Since byte tracking tracks the killed subset of bytes in the
original def, in your patch can we calculate the bytes killed by each
defs[i] and then find the intersection for the iteration? If we don't
have a complete kill, we can use this to set the live bytes and iterate
till all the live_bytes are cleared, like it was done earlier.
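
(A sketch of the partial-kill situation I mean; my example, not from
the patch:

  void g (void);
  char buf[8];

  void f (int n)
  {
    __builtin_memset (buf, 0, 8);    /* candidate store */
    if (n)
      __builtin_memset (buf, 1, 4);  /* kills bytes 0-3 on this path */
    else
      __builtin_memset (buf, 2, 6);  /* kills bytes 0-5 on this path */
    g ();
  }

Only the intersection of the per-path kill sets, bytes 0-3, may be
cleared from the live_bytes of the first memset.)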

Thanks,
Kugan

>
> Your testcase can be handled by reducing the PHI and the call def
> by seeing that the only use of a candidate def is another def
> we have already processed.  Not sure if worth special-casing though,
> I'd rather have a go at "recursing".  That will be the next
> patch.
>
> Bootstrap & regtest running on x86_64-unknown-linux-gnu.
>
> Richard.
>
> 2018-05-15  Richard Biener  <rguent...@suse.de>
>
> * tree-ssa-dse.c (dse_classify_store): Remove use_stmt parameter,
> add by_clobber_p one.  Change algorithm to collect all defs
> representing uses we need to walk and try reducing them to
> a single one before failing.
> (dse_dom_walker::dse_optimize_stmt): Adjust.
>
> * gcc.dg/tree-ssa/ssa-dse-31.c: New testcase.
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c 
> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
> new file mode 100644
> index 000..9ea9268fe1d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O -fdump-tree-dse1-details" } */
> +
> +void g();
> +void f(int n, char *p)
> +{
> +  *p = 0;
> +  if (n)
> +{
> +  *p = 1;
> +  g();
> +}
> +  *p = 2;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Deleted dead store" 1 "dse1" } } */
> diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
> index 9220fea7f2e..126592a1d13 100644
> --- a/gcc/tree-ssa-dse.c
> +++ b/gcc/tree-ssa-dse.c
> @@ -516,18 +516,21 @@ live_bytes_read (ao_ref use_ref, ao_ref *ref, sbitmap 
> live)
>  }
>
>  /* A helper of dse_optimize_stmt.
> -   Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate
> -   statement *USE_STMT that may prove STMT to be dead.
> -   Return TRUE if the above conditions are met, otherwise FALSE.  */
> +   Given a GIMPLE_ASSIGN in STMT that writes to REF, classify it
> +   according to downstream uses and defs.  Sets *BY_CLOBBER_P to true
> +   if only clobber statements influenced the classification result.
> +   Returns the classif

[PR63185][RFC] Improve DSE with branches

2018-05-13 Thread Kugan Vivekanandarajah
Hi,

Attached patch handles PR63185 when we reach a PHI with temp != NULL.
We could see the PHI, and if there aren't any uses of the PHI that are
interesting, we could ignore it?

Bootstrapped and regression tested on x86_64-linux-gnu.
Is this OK?

Thanks,
Kugan


gcc/ChangeLog:

2018-05-14  Kugan Vivekanandarajah  <kug...@linaro.org>

* tree-ssa-dse.c (phi_dosent_define_nor_use_p): New.
(dse_classify_store): Use phi_dosent_define_nor_use_p.

gcc/testsuite/ChangeLog:

2018-05-14  Kugan Vivekanandarajah  <kug...@linaro.org>

* gcc.dg/tree-ssa/ssa-dse-33.c: New test.
From a69caa24d9c1914b7617a937e84c3b612ffe6d9b Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Wed, 9 May 2018 16:26:16 +1000
Subject: [PATCH] PR63185

Change-Id: I9d307884add10d5b5ff07aa31dd86cb83b2388ec
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-33.c | 13 +
 gcc/tree-ssa-dse.c | 30 +-
 2 files changed, 42 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-33.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-33.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-33.c
new file mode 100644
index 000..46cb7d1
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-33.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-dse-details" } */
+
+void g();
+void f(int n)
+{
+char *p = malloc(1024);
+memset (p, 8, 1024);
+if (n)
+  g();
+}
+
+/* { dg-final { scan-tree-dump-times "Deleted dead calls" 1 "dse1"} } */
diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
index 9220fea..e7a4637 100644
--- a/gcc/tree-ssa-dse.c
+++ b/gcc/tree-ssa-dse.c
@@ -515,6 +515,30 @@ live_bytes_read (ao_ref use_ref, ao_ref *ref, sbitmap live)
   return true;
 }
 
+/* Return true if there isn't any VDEF or VUSE by following the PHI.  */
+
+static bool
+phi_dosent_define_nor_use_p (ao_ref *ref, gphi *phi)
+{
+  gimple *phi_use;
+  imm_use_iterator ui;
+  tree def = PHI_RESULT (phi);
+  bool ok = true;
+
+  FOR_EACH_IMM_USE_STMT (phi_use, ui, def)
+{
+  if (ref_maybe_used_by_stmt_p (phi_use, ref)
+	  || gimple_vdef (phi_use)
+	  || gimple_code (phi_use) == GIMPLE_PHI)
+	{
+	  ok = false;
+	  BREAK_FROM_IMM_USE_STMT (ui);
+	}
+}
+
+  return ok;
+}
+
 /* A helper of dse_optimize_stmt.
Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate
statement *USE_STMT that may prove STMT to be dead.
@@ -570,6 +594,9 @@ dse_classify_store (ao_ref *ref, gimple *stmt, gimple **use_stmt,
 	  else if (gimple_code (use_stmt) == GIMPLE_PHI)
 	{
 	  if (temp
+		  && phi_dosent_define_nor_use_p (ref, as_a <gphi *> (use_stmt)))
+		;
+	  else if (temp
 		  /* Make sure we are not in a loop latch block.  */
 		  || gimple_bb (stmt) == gimple_bb (use_stmt)
 		  || dominated_by_p (CDI_DOMINATORS,
@@ -585,7 +612,8 @@ dse_classify_store (ao_ref *ref, gimple *stmt, gimple **use_stmt,
 	  /* Do not consider the PHI as use if it dominates the
 	 stmt defining the virtual operand we are processing,
 		 we have processed it already in this case.  */
-	  if (gimple_bb (defvar_def) != gimple_bb (use_stmt)
+	  if (!temp
+		  && gimple_bb (defvar_def) != gimple_bb (use_stmt)
 		  && !dominated_by_p (CDI_DOMINATORS,
   gimple_bb (defvar_def),
   gimple_bb (use_stmt)))
-- 
2.7.4



Re: [RFC] Improve tree DSE

2018-05-13 Thread Kugan Vivekanandarajah
Hi Richard,

>> Given the simple testcases you add I wonder if you want a cheaper
>> implementation,
>> namely check that when reaching a loop PHI the only aliasing stmt in
>> its use-chain
>> is the use_stmt you reached the PHI from.  That would avoid this and the 
>> tests
>> for the store being redundant and simplify the patch considerably.

Tried implementing the above in the attached patch.  Bootstrapped on
x86_64-linux-gnu. Full testing is ongoing.

Thanks,
Kugan

gcc/ChangeLog:

2018-05-14  Kugan Vivekanandarajah  <kug...@linaro.org>

* tree-ssa-dse.c (phi_aliases_stmt_only): New.
(dse_classify_store): Use phi_aliases_stmt_only.

gcc/testsuite/ChangeLog:

2018-05-14  Kugan Vivekanandarajah  <kug...@linaro.org>

* gcc.dg/tree-ssa/ssa-dse-31.c: New test.
* gcc.dg/tree-ssa/ssa-dse-32.c: New test.
From 102b1dd676446055fb881daa1fee4e96b6fe676d Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Wed, 9 May 2018 08:57:23 +1000
Subject: [PATCH] improve dse Change-Id:
 If23529a3ede8230b26de8d60c1e0c5141be8edb7

---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c | 16 +++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c | 23 +
 gcc/tree-ssa-dse.c | 33 +++---
 3 files changed, 69 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
new file mode 100644
index 000..e4d71b2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
@@ -0,0 +1,16 @@
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-dse-details" } */
+#define SIZE 4
+
+int main ()
+{
+  static float a[SIZE];
+  int i;
+  for (i = 0; i < SIZE; i++)
+   __builtin_memset ((void *) a, 0, sizeof(float)*3);
+   __builtin_memset ((void *) a, 0, sizeof(float)*SIZE);
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Deleted dead calls" 1 "dse1"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c
new file mode 100644
index 000..3d8fd5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c
@@ -0,0 +1,23 @@
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-dse-details" } */
+#define SIZE 4
+
+void s4 (float *restrict a)
+{
+  (void) __builtin_memset ((void *) a, 0, sizeof(float)*SIZE);
+}
+
+
+int main ()
+{
+  int i;
+  float a[10];
+  printf("Start\n");
+  for (i = 0; i < SIZE; i++)
+s4 (a);
+  printf("Done\n");
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Deleted dead calls" 1 "dse1"} } */
diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
index 9220fea..6522a94 100644
--- a/gcc/tree-ssa-dse.c
+++ b/gcc/tree-ssa-dse.c
@@ -515,6 +515,28 @@ live_bytes_read (ao_ref use_ref, ao_ref *ref, sbitmap live)
   return true;
 }
 
+/* Return true if PHI stmt aliases only STMT1. */
+
+static bool
+phi_aliases_stmt_only (gphi *phi, gimple *stmt1)
+{
+  gimple *phi_use;
+  imm_use_iterator ui2;
+  tree def = PHI_RESULT (phi);
+  bool ok = true;
+
+  FOR_EACH_IMM_USE_STMT (phi_use, ui2, def)
+{
+  if (phi_use != stmt1)
+	{
+	  ok = false;
+	  BREAK_FROM_IMM_USE_STMT (ui2);
+	}
+}
+
+  return ok;
+}
+
 /* A helper of dse_optimize_stmt.
Given a GIMPLE_ASSIGN in STMT that writes to REF, find a candidate
statement *USE_STMT that may prove STMT to be dead.
@@ -571,9 +593,14 @@ dse_classify_store (ao_ref *ref, gimple *stmt, gimple **use_stmt,
 	{
 	  if (temp
 		  /* Make sure we are not in a loop latch block.  */
-		  || gimple_bb (stmt) == gimple_bb (use_stmt)
-		  || dominated_by_p (CDI_DOMINATORS,
- gimple_bb (stmt), gimple_bb (use_stmt))
+		  || ((gimple_bb (stmt) == gimple_bb (use_stmt)
+		   || dominated_by_p (CDI_DOMINATORS,
+	  gimple_bb (stmt), gimple_bb (use_stmt)))
+		  /* When reaching a loop PHI, it is OK if the only
+		     aliasing stmt in its use-chain is the stmt from
+		     which we reached the PHI.  */
+		  && !phi_aliases_stmt_only (as_a <gphi *> (use_stmt),
+		 stmt))
 		  /* We can look through PHIs to regions post-dominating
 		 the DSE candidate stmt.  */
 		  || !dominated_by_p (CDI_POST_DOMINATORS,
-- 
2.7.4



Re: [RFC] Improve tree DSE

2018-05-01 Thread Kugan Vivekanandarajah
Hi Jeff,

Thanks for the review.

On 2 May 2018 at 01:43, Jeff Law <l...@redhat.com> wrote:
> On 04/09/2018 06:52 PM, Kugan Vivekanandarajah wrote:
>> I would like to queue this patch for stage1 review.
>>
>> In DSE, while in dse_classify_store, as soon as we see a PHI use
>> statement that is part of the loop, we are immediately giving up.
>>
>> As far as I understand, this can be improved. Attached patch is trying
>> to walk the uses of the PHI statement (by recursively calling
>> dse_classify_store) and then making sure the obtained store is indeed
>> redundant.
>>
>> This is partly as reported in one of the testcase from PR44612. But
>> this PR is about other issues that is not handled in this patch.
>>
>> Bootstrapped and regression tested on aarch64-linux-gnu with no new 
>> regressions.
>>
>> Is this OK for next stage1?
>>
>> Thanks,
>> Kugan
>>
>>
>> gcc/ChangeLog:
>>
>> 2018-04-10  Kugan Vivekanandarajah  <kug...@linaro.org>
>>
>> * tree-ssa-dse.c (dse_classify_store): Handle recursive PHI.
>> (dse_dom_walker::dse_optimize_stmt): Update call dse_classify_store.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2018-04-10  Kugan Vivekanandarajah  <kug...@linaro.org>
>>
>> * gcc.dg/tree-ssa/ssa-dse-31.c: New test.
>> * gcc.dg/tree-ssa/ssa-dse-32.c: New test.
>>
>>
>> 0001-improve-dse.patch
>>
>>
>> From 5751eaff3d1c263e8631d5a07e43fecaaa0e9d26 Mon Sep 17 00:00:00 2001
>> From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
>> Date: Tue, 10 Apr 2018 09:49:10 +1000
>> Subject: [PATCH] improve dse
>>
>> ---
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c | 16 ++
>>  gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c | 23 ++
>>  gcc/tree-ssa-dse.c | 51 
>> --
>>  3 files changed, 81 insertions(+), 9 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c
>>
>
>> diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
>> index 9220fea..3513fda 100644
>> --- a/gcc/tree-ssa-dse.c
>> +++ b/gcc/tree-ssa-dse.c
>> @@ -521,11 +521,11 @@ live_bytes_read (ao_ref use_ref, ao_ref *ref, sbitmap 
>> live)
>> Return TRUE if the above conditions are met, otherwise FALSE.  */
>>
>>  static dse_store_status
>> -dse_classify_store (ao_ref *ref, gimple *stmt, gimple **use_stmt,
>> - bool byte_tracking_enabled, sbitmap live_bytes)
>> +dse_classify_store (ao_ref *ref, gimple *stmt_outer, gimple *stmt,
>> + gimple **use_stmt, bool byte_tracking_enabled,
>> + sbitmap live_bytes, unsigned cnt = 0)
>>  {
>>gimple *temp;
>> -  unsigned cnt = 0;
>>
>>*use_stmt = NULL;
>>
>> @@ -556,9 +556,11 @@ dse_classify_store (ao_ref *ref, gimple *stmt, gimple 
>> **use_stmt,
>>   {
>> cnt++;
>>
>> +   if (use_stmt == stmt_outer)
>> + continue;
> So is this really safe?  This seems to catch the case where the
> recursive call stumbles onto the same statement we're already
> processing.  ie, we've followed a loop backedge.
>
> ISTM that further analysis here  is needed -- don't you have to make
> sure that USE_STMT does not read from REF?  It could be a memmove call
> for example.
I think you are right. This has to be handled.

>
> I'm also struggling a little bit to see much value in handling this
> case.  In the included testcases we've got a memset in a loop where the
> args do not vary across the loop iterations and there are no reads from
> the memory location within the loop. How realistic is that?

I was looking into another case from an application but that was not
handled partly due to limitations of alias analysis and thought that
this could be handled. If you think that this is not going to happen
often in practice, I agree that this is not worth the trouble.

>
>
> If you're looking to improve DSE, the cases in 63185, 64380 and 79958
> may be interesting.

Thanks for the pointers. I will have a look at them.

Thanks,
Kugan

>


[RFC] Improve tree DSE

2018-04-09 Thread Kugan Vivekanandarajah
I would like to queue this patch for stage1 review.

In DSE, while in dse_classify_store, as soon as we see a PHI use
statement that is part of the loop, we are immediately giving up.

As far as I understand, this can be improved. Attached patch is trying
to walk the uses of the PHI statement (by recursively calling
dse_classify_store) and then making sure the obtained store is indeed
redundant.

This is partly as reported in one of the testcase from PR44612. But
this PR is about other issues that is not handled in this patch.

Bootstrapped and regression tested on aarch64-linux-gnu with no new regressions.

Is this OK for next stage1?

Thanks,
Kugan


gcc/ChangeLog:

2018-04-10  Kugan Vivekanandarajah  <kug...@linaro.org>

* tree-ssa-dse.c (dse_classify_store): Handle recursive PHI.
(dse_dom_walker::dse_optimize_stmt): Update call dse_classify_store.

gcc/testsuite/ChangeLog:

2018-04-10  Kugan Vivekanandarajah  <kug...@linaro.org>

* gcc.dg/tree-ssa/ssa-dse-31.c: New test.
* gcc.dg/tree-ssa/ssa-dse-32.c: New test.
From 5751eaff3d1c263e8631d5a07e43fecaaa0e9d26 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Tue, 10 Apr 2018 09:49:10 +1000
Subject: [PATCH] improve dse

---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c | 16 ++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c | 23 ++
 gcc/tree-ssa-dse.c | 51 --
 3 files changed, 81 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
new file mode 100644
index 000..e4d71b2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-31.c
@@ -0,0 +1,16 @@
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-dse-details" } */
+#define SIZE 4
+
+int main ()
+{
+  static float a[SIZE];
+  int i;
+  for (i = 0; i < SIZE; i++)
+   __builtin_memset ((void *) a, 0, sizeof(float)*3);
+   __builtin_memset ((void *) a, 0, sizeof(float)*SIZE);
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Deleted dead calls" 1 "dse1"} } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c
new file mode 100644
index 000..3d8fd5f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dse-32.c
@@ -0,0 +1,23 @@
+
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-dse-details" } */
+#define SIZE 4
+
+void s4 (float *restrict a)
+{
+  (void) __builtin_memset ((void *) a, 0, sizeof(float)*SIZE);
+}
+
+
+int main ()
+{
+  int i;
+  float a[10];
+  printf("Start\n");
+  for (i = 0; i < SIZE; i++)
+s4 (a);
+  printf("Done\n");
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Deleted dead calls" 1 "dse1"} } */
diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
index 9220fea..3513fda 100644
--- a/gcc/tree-ssa-dse.c
+++ b/gcc/tree-ssa-dse.c
@@ -521,11 +521,11 @@ live_bytes_read (ao_ref use_ref, ao_ref *ref, sbitmap live)
Return TRUE if the above conditions are met, otherwise FALSE.  */
 
 static dse_store_status
-dse_classify_store (ao_ref *ref, gimple *stmt, gimple **use_stmt,
-		bool byte_tracking_enabled, sbitmap live_bytes)
+dse_classify_store (ao_ref *ref, gimple *stmt_outer, gimple *stmt,
+		gimple **use_stmt, bool byte_tracking_enabled,
+		sbitmap live_bytes, unsigned cnt = 0)
 {
   gimple *temp;
-  unsigned cnt = 0;
 
   *use_stmt = NULL;
 
@@ -556,9 +556,11 @@ dse_classify_store (ao_ref *ref, gimple *stmt, gimple **use_stmt,
 	{
 	  cnt++;
 
+	  if (use_stmt == stmt_outer)
+	continue;
 	  /* If we ever reach our DSE candidate stmt again fail.  We
 	 cannot handle dead stores in loops.  */
-	  if (use_stmt == stmt)
+	  else if (use_stmt == stmt)
 	{
 	  fail = true;
 	  BREAK_FROM_IMM_USE_STMT (ui);
@@ -572,8 +574,6 @@ dse_classify_store (ao_ref *ref, gimple *stmt, gimple **use_stmt,
 	  if (temp
 		  /* Make sure we are not in a loop latch block.  */
 		  || gimple_bb (stmt) == gimple_bb (use_stmt)
-		  || dominated_by_p (CDI_DOMINATORS,
- gimple_bb (stmt), gimple_bb (use_stmt))
 		  /* We can look through PHIs to regions post-dominating
 		 the DSE candidate stmt.  */
 		  || !dominated_by_p (CDI_POST_DOMINATORS,
@@ -582,8 +582,41 @@ dse_classify_store (ao_ref *ref, gimple *stmt, gimple **use_stmt,
 		  fail = true;
 		  BREAK_FROM_IMM_USE_STMT (ui);
 		}
+	  else if (dominated_by_p (CDI_DOMINATORS,
+   gimple_bb (stmt), gimple_bb (use_stmt)))
+		{
+		  gphi *phi = as_a <gphi *> (use_stmt);
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (PHI_RESULT (phi));
+		  enum dse_store_status status = DSE_STORE_LIVE;
+		  ao_ref use_ref;
+		  gimple *inner_use_stmt;
+
+		  /* If stmt dominates PHI stmt, fol

Re: [RFC][PR82479] missing popcount builtin detection

2018-03-08 Thread Kugan Vivekanandarajah
Hi Richard and Bin,

Thanks for your comments. I will revise the patch and post it as soon
as stage-1 opens.

Kugan

On 7 March 2018 at 22:25, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Wed, Mar 7, 2018 at 8:26 AM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>> On Tue, Mar 6, 2018 at 5:20 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Mon, Mar 5, 2018 at 3:24 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Thu, Feb 8, 2018 at 1:41 AM, Kugan Vivekanandarajah
>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>> Hi Richard,
>>>>>
>>>>> On 1 February 2018 at 23:21, Richard Biener <richard.guent...@gmail.com> 
>>>>> wrote:
>>>>>> On Thu, Feb 1, 2018 at 5:07 AM, Kugan Vivekanandarajah
>>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>>> Hi Richard,
>>>>>>>
>>>>>>> On 31 January 2018 at 21:39, Richard Biener 
>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>> On Wed, Jan 31, 2018 at 11:28 AM, Kugan Vivekanandarajah
>>>>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>>>>> Hi Richard,
>>>>>>>>>
>>>>>>>>> Thanks for the review.
>>>>>>>>> On 25 January 2018 at 20:04, Richard Biener 
>>>>>>>>> <richard.guent...@gmail.com> wrote:
>>>>>>>>>> On Wed, Jan 24, 2018 at 10:56 PM, Kugan Vivekanandarajah
>>>>>>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> Here is a patch for popcount builtin detection similar to LLVM. I
>>>>>>>>>>> would like to queue this for review for next stage 1.
>>>>>>>>>>>
>>>>>>>>>>> 1. This is done part of loop-distribution and effective for -O3 and 
>>>>>>>>>>> above.
>>>>>>>>>>> 2. This does not distribute loop to detect popcount (like
>>>>>>>>>>> memcpy/memmove). I dont think that happens in practice. Please 
>>>>>>>>>>> correct
>>>>>>>>>>> me if I am wrong.
>>>>>>>>>>
>>>>>>>>>> But then it has no business inside loop distribution but instead is
>>>>>>>>>> doing final value
>>>>>>>>>> replacement, right?  You are pattern-matching the whole loop after 
>>>>>>>>>> all.  I think
>>>>>>>>>> final value replacement would already do the correct thing if you
>>>>>>>>>> teached number of
>>>>>>>>>> iteration analysis that niter for
>>>>>>>>>>
>>>>>>>>>>[local count: 955630224]:
>>>>>>>>>>   # b_11 = PHI <b_5(5), b_8(6)>
>>>>>>>>>>   _1 = b_11 + -1;
>>>>>>>>>>   b_8 = _1 & b_11;
>>>>>>>>>>   if (b_8 != 0)
>>>>>>>>>> goto ; [89.00%]
>>>>>>>>>>   else
>>>>>>>>>> goto ; [11.00%]
>>>>>>>>>>
>>>>>>>>>>[local count: 850510900]:
>>>>>>>>>>   goto ; [100.00%]
>>>>>>>>>
>>>>>>>>> I am looking into this approach. What should be the scalar evolution
>>>>>>>>> for b_8 (i.e. b & (b -1) in a loop) should be? This is not clear to me
>>>>>>>>> and can this be represented with the scev?
>>>>>>>>
>>>>>>>> No, it's not affine and thus cannot be represented.  You only need the
>>>>>>>> scalar evolution of the counting IV which is already handled and
>>>>>>>> the number of iteration analysis needs to handle the above IV - this
>>>>>>>> is the missing part.
>>>>>>> Thanks for the clarification. I am now matching this loop pattern in
>>>>>>> number_of_iterations_exit when number_of_iterations_exit_assumptions
>>>>>>> fails. If the pattern matc

Re: [AARCH64] Disable pc relative literal load irrespective of TARGET_FIX_ERR_A53_84341

2018-03-06 Thread Kugan Vivekanandarajah
Hi James,

This patch has to be backported to the gcc-7 branch, as the build error for
521.wrf with LTO is there too (for the same reason). This patch has
been on trunk for some time now. So, is it OK to backport this
patch to the gcc-7 branch?


Thanks,
Kugan

On 30 August 2017 at 15:19, Kugan Vivekanandarajah
<kugan.vivekanandara...@linaro.org> wrote:
> Hi James,
>
> On 29 August 2017 at 21:31, James Greenhalgh <james.greenha...@arm.com> wrote:
>> On Tue, Jun 27, 2017 at 11:20:02AM +1000, Kugan Vivekanandarajah wrote:
>>> https://gcc.gnu.org/ml/gcc-patches/2016-03/msg00614.html  added this
>>> workaround to get kernel building with when TARGET_FIX_ERR_A53_843419
>>> is enabled.
>>>
>>> This was added to support building kernel loadable modules. In kernel,
>>> when CONFIG_ARM64_ERRATUM_843419 is selected, the relocation needed
>>> for ADRP/LDR (R_AARCH64_ADR_PREL_PG_HI21 and
>>> R_AARCH64_ADR_PREL_PG_HI21_NC are removed from the kernel to avoid
>>> loading objects with possibly offending sequence). Thus, it could only
>>> support pc relative literal loads.
>>>
>>> However, the following patch was posted to kernel to add
>>> -mpc-relative-literal-loads
>>> http://www.spinics.net/lists/arm-kernel/msg476149.html
>>>
>>> -mpc-relative-literal-loads is unconditionally added to the kernel
>>> build as can be seen from:
>>> https://github.com/torvalds/linux/blob/master/arch/arm64/Makefile
>>>
>>> Therefore this patch removes the hunk so that applications like
>>> SPECcpu2017's 521/621.wrf can be built (with LTO in this case) without
>>> -mno-pc-relative-literal-loads
>>>
>>> Bootstrapped and regression tested on aarch64-linux-gnu with no new 
>>> regressions.
>>>
>>> Is this OK for trunk?
>>
>> Hi Kugan,
>>
>> I'm struggling a little to convince myself that this is correct. I think
>> the argument is that this was a workaround for one very particular issue
>> with the linux kernel module loader, which hard faults by refusing to
>> handle these relocations when in a workaround mode for Erratum 843419.
>
> Yes.
>
>> Most of these relocations won't occur because the kernel builds with
>> -mcmodel=large, but literals would always use a small model style
>> adrp/load, unless we were using -mpc-relative-literal-loads . So, this
>> workaround enabled -mpc-relative-literal-loads always when we were in
>> a workaround mode, thus allowing the kernel loader to continue.
>>
>> The argument for removing it then, is that with the kernel now always using
>> -mpc-relative-literal-loads there is no reason for this workaround to stay
>> in place. The linkers which we will use will apply the workaround if needed.
>
> Yes.
>
>> Testcases and a detailed problem report of the build failures you had seen in
>> 521.wrf would have helped me get closer to understanding this, and made
>> review substantially easier.
>
> Sorry for not being clear with this. Unfortunately 521.wrf  was
> showing this error with LTO so I could not reproduce with a small
> enough test case.
>
>> Am I on the right track?
>>
>> If so, this is OK for trunk. If not, please can you expand on what is going
>> on in this patch.
>
> Thanks for the review.
> Kugan
>
>>
>> Thanks,
>> James
>>
>>
>>>
>>> Thanks,
>>> Kugan
>>>
>>> gcc/testsuite/ChangeLog:
>>>
>>> 2017-06-27  Kugan Vivekanandarajah  <kug...@linaro.org>
>>>
>>> * gcc.target/aarch64/pr63304_1.c: Remove -mno-fix-cortex-a53-843419.
>>>
>>> gcc/ChangeLog:
>>>
>>> 2017-06-27  Kugan Vivekanandarajah  <kug...@linaro.org>
>>>
>>> * config/aarch64/aarch64.c (aarch64_override_options_after_change_1):
>>> Disable pc relative literal load irrespective of 
>>> TARGET_FIX_ERR_A53_84341
>>> for default.
>>
>>


Re: [RFC] Tree loop unroller pass

2018-02-19 Thread Kugan Vivekanandarajah
Hi Richard,

On 16 February 2018 at 22:56, Richard Biener <richard.guent...@gmail.com> wrote:
> On Thu, Feb 15, 2018 at 11:30 PM, Kugan Vivekanandarajah
> <kugan.vivekanandara...@linaro.org> wrote:
>> Hi Wilko,
>>
>> Thanks for your comments.
>>
>> On 14 February 2018 at 00:05, Wilco Dijkstra <wilco.dijks...@arm.com> wrote:
>>> Hi Kugan,
>>>
>>>> Based on the previous discussions, I tried to implement a tree loop
>>>> unroller for partial unrolling. I would like to queue this RFC patches
>>>> for next stage1 review.
>>>
>>> This is a great plan - GCC urgently requires a good unroller!
>
> How so?
>
>>>> * Cost-model for selecting the loop uses the same params used
>>>> elsewhere in related optimizations. I was told that keeping this same
>>>> would allow better tuning for all the optimizations.
>>>
>>> I'd advise against using the existing params as is. Unrolling by 8x by 
>>> default is
>>> way too aggressive and counterproductive. It was perhaps OK for in-order 
>>> cores
>>> 20 years ago, but not today. The goal of unrolling is to create more ILP in 
>>> small
>>> loops, not to generate huge blocks of repeated code which definitely won't 
>>> fit in
>>> micro-op caches and loop buffers...
>>>
>> OK, I will create separate params. It is possible that I misunderstood
>> it in the first place.
>
> To generate more ILP for modern out-of-order processors you need to be
> able to do followup transforms that remove dependences.  So rather than
> inventing magic params we should look at those transforms and key
> unrolling on them.  Like we do in predictive commoning or other passes
> that end up performing unrolling as part of their transform.
>
> Our measurements on x86 concluded that unrolling isn't worth it, in fact
> it very often hurts.  That was of course with saner params than the defaults
> of the RTL unroller.

My preliminary benchmarking with x86 using default params shows no
overall gain. Some gains and some regressions. I didn't play with the
parameters to see if it improves.

But for AArch64 - Falkor (with follow up tag collision avoidance for
prefetching), we did see gains (again we could do better here):

SPECint_base2006  1.37%
SPECint_base2006  -0.73%
SPECspeed2017_int_base -0.1%
SPECspeed2017_fp_base 0.89%
SPECrate2017_fp_base 1.72%

We also noticed that sometimes the gains for passes like prefetch
loop array come mainly from unrolling rather than the software
prefetches.

>
> Often you even have to fight with followup passes doing stuff that ends up
> inreasing register pressure too much so we end up spilling.
If we can have an approximate register pressure model that can be used
while deciding the unrolling factor, it might help to some extent. I saw
Bin posting some patches for register pressure calculation. Do you
think using that here will be helpful?

In general, I agree that cost model can be more accurate but getting
the right information within acceptable computation cost is the trick.
Do you have any preference on the cost model if we decide to have a
separate loop unroller pass? I.e., what information from the loop should
we use other than the usual parameters we have?

>
>>
>>> Also we need to enable this by default, at least with -O3, maybe even for 
>>> small
>>> (or rather tiny) loops in -O2 like LLVM does.
>> It is enabled for -O3 and above now.
>
> So _please_ first get testcases we know unrolling will be beneficial on
> and _also_ have a thorough description _why_.

I will try to analyse the benchmarks whose performance is improving
and create test cases.

>
>>>
>>>> * I have also implemented an option to limit loops based on memory
>>>> streams. i.e., some micro-architectures where limiting the resulting
>>>> memory streams is preferred and used  to limit unrolling factor.
>>>
>>> I'm not convinced this is needed once you tune the parameters for unrolling.
>>> If you have say 4 read streams you must have > 10 instructions already so
>>> you may want to unroll this 2x in -O3, but definitely not 8x. So I see the 
>>> streams
>>> issue as a problem caused by too aggressive unroll settings. I think if you
>>> address that first, you're unlikely going to have an issue with too many 
>>> streams.
>>>
>>
>> I will experiment with some microbenchmarks. I still think that it
>> will be useful for some micro-architectures. That's why it is not
>> enabled by default. If a back-end thinks that it is useful, they can
>> enable limiting unroll factor based on memory str

Re: [RFC] Tree loop unroller pass

2018-02-15 Thread Kugan Vivekanandarajah
Hi Wilko,

Thanks for your comments.

On 14 February 2018 at 00:05, Wilco Dijkstra <wilco.dijks...@arm.com> wrote:
> Hi Kugan,
>
>> Based on the previous discussions, I tried to implement a tree loop
>> unroller for partial unrolling. I would like to queue these RFC patches
>> for next stage1 review.
>
> This is a great plan - GCC urgently requires a good unroller!
>
>> * Cost-model for selecting the loop uses the same params used
>> elsewhere in related optimizations. I was told that keeping this the same
>> would allow better tuning for all the optimizations.
>
> I'd advise against using the existing params as is. Unrolling by 8x by 
> default is
> way too aggressive and counterproductive. It was perhaps OK for in-order cores
> 20 years ago, but not today. The goal of unrolling is to create more ILP in 
> small
> loops, not to generate huge blocks of repeated code which definitely won't 
> fit in
> micro-op caches and loop buffers...
>
OK, I will create separate params. It is possible that I misunderstood
it in the first place.


> Also we need to enable this by default, at least with -O3, maybe even for 
> small
> (or rather tiny) loops in -O2 like LLVM does.
It is enabled for -O3 and above now.

>
>> * I have also implemented an option to limit loops based on memory
>> streams, i.e., for some micro-architectures where limiting the resulting
>> memory streams is preferred, and it is used to limit the unrolling factor.
>
> I'm not convinced this is needed once you tune the parameters for unrolling.
> If you have say 4 read streams you must have > 10 instructions already so
> you may want to unroll this 2x in -O3, but definitely not 8x. So I see the 
> streams
> issue as a problem caused by too aggressive unroll settings. I think if you
> address that first, you're unlikely to have an issue with too many
> streams.
>

I will experiment with some microbenchmarks. I still think that it
will be useful for some micro-architectures. That's why it is not
enabled by default. If a back-end thinks that it is useful, it can
enable limiting the unroll factor based on memory streams.
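
For reference, the intended use in the unroller is roughly the
following clamp. This is only a sketch: targetm.hw_max_mem_read_streams
comes from patch 1, while count_mem_read_streams and unroll_factor are
hypothetical names used for illustration.

  /* Clamp the unroll factor so the unrolled body does not create more
     read streams than the h/w prefetcher can track.  */
  if (targetm.hw_max_mem_read_streams)
    {
      int max_streams = targetm.hw_max_mem_read_streams ();
      int streams = count_mem_read_streams (loop);
      if (max_streams > 0 && streams > 0)
        unroll_factor = MIN (unroll_factor,
                             MAX (1, max_streams / streams));
    }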

>> * I expect that some cost-model changes might be needed
>> to handle (or provide the ability to handle) various loop preferences of
>> the micro-architectures. I am sending this patch for review early to
>> get feedback on this.
>
> Yes it should be feasible to have settings based on backend preference
> and optimization level (so O3/Ofast will unroll more than O2).
>
>> * The position of the pass in passes.def can also be changed, for
>> example, unrolling before SLP.
>
> As long as it runs before IVOpt so we get base+immediate addressing modes.
That's what I am doing now.

Thanks,
Kugan

>
> Wilco


Re: [RFC][AARCH64] Machine reorg pass for aarch64/Falkor to handle prefetcher tag collision

2018-02-15 Thread Kugan Vivekanandarajah
Hi,

On 14 February 2018 at 09:47, Kugan Vivekanandarajah
<kugan.vivekanandara...@linaro.org> wrote:
> Hi Kyrill,
>
> On 13 February 2018 at 20:47, Kyrill  Tkachov
> <kyrylo.tkac...@foss.arm.com> wrote:
>> Hi Kugan,
>>
>> On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
>>>
>>> Implements a machine reorg pass for aarch64/Falkor to handle
>>> prefetcher tag collision. This is strictly not part of the loop
>>> unroller, but for Falkor, unrolling can make the h/w prefetcher perform
>>> badly if there are too many tag collisions, based on the discussions in
>>> https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.
>>>
>>
>> Could you expand a bit more on what transformation exactly this pass does?
>
> This is similar to what LLVM does in https://reviews.llvm.org/D35366.
>
> The Falkor hardware prefetcher works well when the signatures of the
> prefetches (or tags as computed in the patch, similar to LLVM) are
> different for different memory streams. If different memory streams
> have the same signature, it can result in bad performance. This machine
> reorg pass tries to change the signature of memory loads by replacing
> the base register with a free register.
>
>> From my understanding the loads that use the same base
>> register and offset and have the same destination register
>> are considered part of the same stream by the hardware prefetcher, so for
>> example:
>> ldr x0, [x1, 16] (load1)
>> ... (set x1 to something else)
>> ldr x0, [x1, 16] (load2)
>>
>> will cause the prefetcher to think that both loads are part of the same
>> stream,
>> so this pass tries to rewrite the sequence into:
>> ldr x0, [x1, 16]
>> ... (set x1 to something else)
>> mov tmp, x1
>> ldr x0, [tmp, 16]
>>
>> Where the tag/signature is the combination of destination x0, base x1 and
>> offset 16.
>> Is this a fair description?
>
> This is precisely what is happening.
>
>>
>> I've got some comments on the patch itself
>>
>>> gcc/ChangeLog:
>>>
>>> 2018-02-12  Kugan Vivekanandarajah <kug...@linaro.org>
>>>
>>> * config/aarch64/aarch64.c (iv_p): New.
>>> (strided_load_p): Likewise.
>>> (make_tag): Likewise.
>>> (get_load_info): Likewise.
>>> (aarch64_reorg): Likewise.
>>> (TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.
>>
>>
>> New functions need function comments describing the arguments at least.
>> Functions like make_tag, get_load_info etc can get tricky to maintain
>> without
>> some documentation on what they are supposed to accept and return.
>
> I will add the comments.
>
>>
>> I think the pass should be enabled at certain optimisation levels, say -O2?
>> I don't think it would be desirable at -Os since it creates extra moves that
>> increase code size.
>
> Ok, I will change this.
>
>>
>> That being said, I would recommend you implement this as an aarch64-specific
>> pass,
>> in a similar way to cortex-a57-fma-steering.c. That way you can register it
>> in
>> aarch64-passes.def and have flexibility as to when exactly the pass gets to
>> run
>> (i.e. you wouldn't be limited by when machine_reorg gets run).
>>
>> Also, I suggest you don't use the "if (aarch64_tune != falkor) return;" way
>> of
>> gating this pass. Do it in a similar way to the FMA steering pass that is,
>> define a new flag in aarch64-tuning-flags.def and use it in the tune_flags
>> field
>> of the falkor tuning struct.
>
> Ok, I will revise the patch.

Here is the revised patch.

Thanks,
Kugan

gcc/ChangeLog:

2018-02-15  Kugan Vivekanandarajah  <kug...@linaro.org>

* config.gcc: Add falkor-tag-collision-avoidance.o to extra_objs for
aarch64-*-*.
* config/aarch64/aarch64-protos.h
(make_pass_tag_collision_avoidance): Declare.
* config/aarch64/aarch64-passes.def: Insert tag collision avoidance pass.
* config/aarch64/aarch64-tuning-flags.def
(AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION): Define.
* config/aarch64/aarch64.c (qdf24xx_tunings): Add
AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION.
* config/aarch64/falkor-tag-collision-avoidance.c: New file.
* config/aarch64/t-aarch64: Add falkor-tag-collision-avoidance.o.


>
>
> Thanks,
> Kugan
>
>>
>> Hope this helps,
>> Kyrill
diff --git a/gcc/config.gcc b/gcc/config.gcc
index eca156a..c3f3e1a 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -303,7 +303,7 @@ aarch64*-*-*)
extra_headers="arm_fp16.h arm_neon.h arm_acle.h"
c

Re: [RFC][AARCH64] Machine reorg pass for aarch64/Falkor to handle prefetcher tag collision

2018-02-13 Thread Kugan Vivekanandarajah
Hi Kyrill,

On 13 February 2018 at 20:47, Kyrill  Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
> Hi Kugan,
>
> On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
>>
>> Implements a machine reorg pass for aarch64/Falkor to handle
>> prefetcher tag collision. This is strictly not part of the loop
>> unroller, but for Falkor, unrolling can make the h/w prefetcher perform
>> badly if there are too many tag collisions, based on the discussions in
>> https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.
>>
>
> Could you expand a bit more on what transformation exactly this pass does?

This is similar to what LLVM does in https://reviews.llvm.org/D35366.

The Falkor hardware prefetcher works well when the signatures of the
prefetches (or tags as computed in the patch, similar to LLVM) are
different for different memory streams. If different memory streams
have the same signature, it can result in bad performance. This machine
reorg pass tries to change the signature of memory loads by replacing
the base register with a free register.

> From my understanding the loads that use the same base
> register and offset and have the same destination register
> are considered part of the same stream by the hardware prefetcher, so for
> example:
> ldr x0, [x1, 16] (load1)
> ... (set x1 to something else)
> ldr x0, [x1, 16] (load2)
>
> will cause the prefetcher to think that both loads are part of the same
> stream,
> so this pass tries to rewrite the sequence into:
> ldr x0, [x1, 16]
> ... (set x1 to something else)
> mov tmp, x1
> ldr x0, [tmp, 16]
>
> Where the tag/signature is the combination of destination x0, base x1 and
> offset 16.
> Is this a fair description?

This is precisely what is happening.
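
For reference, make_tag in patch 4 packs exactly those three fields
into one signature word, so rewriting the base register through a free
register perturbs bits 4-7 of the tag:

  /* As in patch 4: destination, base and offset packed into one
     signature word.  */
  unsigned tag = (dest & 0xf) | ((base & 0xf) << 4) | ((offset & 0x3f) << 8);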

>
> I've got some comments on the patch itself
>
>> gcc/ChangeLog:
>>
>> 2018-02-12  Kugan Vivekanandarajah <kug...@linaro.org>
>>
>> * config/aarch64/aarch64.c (iv_p): New.
>> (strided_load_p): Likewise.
>> (make_tag): Likewise.
>> (get_load_info): Likewise.
>> (aarch64_reorg): Likewise.
>> (TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.
>
>
> New functions need function comments describing the arguments at least.
> Functions like make_tag, get_load_info etc can get tricky to maintain
> without
> some documentation on what they are supposed to accept and return.

I will add the comments.

>
> I think the pass should be enabled at certain optimisation levels, say -O2?
> I don't think it would be desirable at -Os since it creates extra moves that
> increase code size.

Ok, I will change this.

>
> That being said, I would recommend you implement this as an aarch64-specific
> pass,
> in a similar way to cortex-a57-fma-steering.c. That way you can register it
> in
> aarch64-passes.def and have flexibility as to when exactly the pass gets to
> run
> (i.e. you wouldn't be limited by when machine_reorg gets run).
>
> Also, I suggest you don't use the "if (aarch64_tune != falkor) return;" way
> of
> gating this pass. Do it in a similar way to the FMA steering pass that is,
> define a new flag in aarch64-tuning-flags.def and use it in the tune_flags
> field
> of the falkor tuning struct.

Ok, I will revise the patch.


Thanks,
Kugan

>
> Hope this helps,
> Kyrill


Re: [RFC] Adds a target hook

2018-02-13 Thread Kugan Vivekanandarajah
Hi Kyrill,

Thanks for the review.

On 13 February 2018 at 20:58, Kyrill  Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
> Hi Kugan,
>
> On 12/02/18 23:53, Kugan Vivekanandarajah wrote:
>>
>> Adds a target hook TARGET_HW_MAX_MEM_READ_STREAMS. The loop unroller,
>> if the hook is defined, will try to limit the unrolling factor based on it.
>>
>
> Could you elaborate a bit on this, in particular how is this different
> from the PARAM_SIMULTANEOUS_PREFETCHES param that describes
> "the number of prefetches that can run at the same time".
> The descriptions seem very similar to me...

You are right that they are similar. I wanted to keep it separate
because not all micro-architectures may prefer limiting the unroll
factor this way. If we keep it separate, we will have the option to
disable it without affecting the rest.

> Incidentally, since this is expected to always be an integer, maybe
> make it into a param so it is consistent with the other prefetch-related
> tuning numbers?

Ok, I will change it into a param in the next iteration.

Thanks,
Kugan
>
> Thanks,
> Kyrill
>
>
>>
>> gcc/ChangeLog:
>>
>> 2018-02-12  Kugan Vivekanandarajah <kug...@linaro.org>
>>
>> * doc/tm.texi.in (TARGET_HW_MAX_MEM_READ_STREAMS): Document.
>> * doc/tm.texi: Regenerate.
>> * target.def (hw_max_mem_read_streams): New target hook.
>
>


[RFC][AARCH64] Machine reorg pass for aarch64/Falkor to handle prefetcher tag collision

2018-02-12 Thread Kugan Vivekanandarajah
Implements a machine reorg pass for aarch64/Falkor to handle
prefetcher tag collision. This is strictly not part of the loop
unroller, but for Falkor, unrolling can make the h/w prefetcher perform
badly if there are too many tag collisions, based on the discussions in
https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.

gcc/ChangeLog:

2018-02-12  Kugan Vivekanandarajah  <kug...@linaro.org>

* config/aarch64/aarch64.c (iv_p): New.
(strided_load_p): Likewise.
(make_tag): Likewise.
(get_load_info): Likewise.
(aarch64_reorg): Likewise.
(TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.
From 0cd4f5acb2117c739ba81bb4b8b71af499107812 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Mon, 12 Feb 2018 10:44:53 +1100
Subject: [PATCH 4/4] reorg-for-tag-collision

Change-Id: Ic6e42d54268c9112ec1c25de577ca92c1808eeff
---
 gcc/config/aarch64/aarch64.c | 353 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 353 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 1ce2a0c..48e7c54 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -71,6 +71,7 @@
 #include "selftest.h"
 #include "selftest-rtl.h"
 #include "rtx-vector-builder.h"
+#include "cfgrtl.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -17203,6 +17204,355 @@ aarch64_select_early_remat_modes (sbitmap modes)
 }
 }
 
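+/* Return true if REG is used as an induction variable in LOOP: it has
+   an update definition inside the loop (an auto-inc/dec address or a
+   binary operation) and another definition whose block dominates the
+   loop header (its initial value).  */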
+static bool
+iv_p (rtx reg, struct loop *loop)
+{
+  df_ref adef;
+  unsigned regno = REGNO (reg);
+  bool def_in_loop = false;
+  bool def_out_loop = false;
+
+  if (GET_MODE_CLASS (GET_MODE (reg)) != MODE_INT)
+return false;
+
+  for (adef = DF_REG_DEF_CHAIN (regno); adef; adef = DF_REF_NEXT_REG (adef))
+{
+  if (!DF_REF_INSN_INFO (adef)
+	  || !NONDEBUG_INSN_P (DF_REF_INSN (adef)))
+	continue;
+
+  basic_block bb = DF_REF_BB (adef);
+  if (dominated_by_p (CDI_DOMINATORS, bb, loop->header)
+	  && bb->loop_father == loop)
+	{
+	  rtx_insn *insn = DF_REF_INSN (adef);
+	  recog_memoized (insn);
+	  rtx pat = PATTERN (insn);
+	  if (GET_CODE (pat) != SET)
+	continue;
+	  rtx x = SET_SRC (pat);
+	  if (GET_CODE (x) == ZERO_EXTRACT
+	  || GET_CODE (x) == ZERO_EXTEND
+	  || GET_CODE (x) == SIGN_EXTEND)
+	x = XEXP (x, 0);
+	  if (MEM_P (x))
+	continue;
+	  if (GET_CODE (x) == POST_INC
+	  || GET_CODE (x) == POST_DEC
+	  || GET_CODE (x) == PRE_INC
+	  || GET_CODE (x) == PRE_DEC)
+	def_in_loop = true;
+	  else if (BINARY_P (x))
+	def_in_loop = true;
+	}
+  if (dominated_by_p (CDI_DOMINATORS, loop->header, bb))
+	def_out_loop = true;
+  if (def_in_loop && def_out_loop)
+	return true;
+}
+  return false;
+}
+
+/* Return true if X is a strided load in LOOP.  If so, set *PRE_POST,
+   *BASE and *OFFSET from it.  */
+
+static bool
+strided_load_p (rtx x,
+		struct loop *loop,
+		bool *pre_post,
+		rtx *base,
+		rtx *offset)
+{
+  /* The loaded value may be extended; get the source.  */
+  if (GET_CODE (x) == ZERO_EXTRACT
+  || GET_CODE (x) == ZERO_EXTEND
+  || GET_CODE (x) == SIGN_EXTEND)
+x = XEXP (x, 0);
+
+  /* If it is not a MEM_P, it is not loaded from memory.  */
+  if (!MEM_P (x))
+return false;
+
+  /* Get the address operand of the MEM.  */
+  x = XEXP (x, 0);
+
+  /* If it is a post/pre increment or decrement, look through it.  */
+  if (GET_CODE (x) == POST_INC
+  || GET_CODE (x) == POST_DEC
+  || GET_CODE (x) == PRE_INC
+  || GET_CODE (x) == PRE_DEC)
+{
+  x = XEXP (x, 0);
+  *pre_post = true;
+}
+
+  /* Get the base and offset depending on the type.  */
+  if (REG_P (x)
+  || UNARY_P (x))
+{
+  if (!REG_P (x))
+	x = XEXP (x, 0);
+  if (REG_P (x)
+	  && iv_p (x, loop))
+	{
+	  *base = x;
+	  return true;
+	}
+}
+  else if (BINARY_P (x))
+{
+  rtx reg1, reg2;
+  reg1 = XEXP (x, 0);
+
+  if (REG_P (reg1)
+	  && REGNO (reg1) == SP_REGNUM)
+	return false;
+  reg2 = XEXP (x, 1);
+
+  if (REG_P (reg1)
+	  && iv_p (reg1, loop))
+	{
+
+	  *base = reg1;
+	  *offset = reg2;
+	  return true;
+	}
+
+  if (REG_P (reg1)
+	  && REG_P (reg2)
+	  && iv_p (reg2, loop))
+	{
+	  *base = reg1;
+	  *offset = reg2;
+	  return true;
+	}
+}
+  return false;
+}
+
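+/* Pack the DEST, BASE and OFFSET register/offset numbers of a load
+   into the h/w prefetcher training tag (signature) used to detect
+   collisions.  */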
+static unsigned
+make_tag (unsigned dest, unsigned base, unsigned offset)
+{
+  return (dest & 0xf)
+    | ((base & 0xf) << 4)
+    | ((offset & 0x3f) << 8);
+}
+
+
+/* Return true if INSN is a strided load in LOOP.  If so, set *PRE_POST,
+   *BASE, *DEST and *OFFSET from it.  */
+
+static bool
+get_load_info (rtx_insn *insn,
+	   struct loop *loop,
+	   bool *pre_post,
+	   rtx *base,
+	   rtx *dest,
+	   rtx *offset)
+{
+  subrtx_var_iterator::array_type array;
+  if (!INSN_P (insn) || recog_memoized (insn) < 0)
+return false;
+  rtx pat = PATTERN (insn);
+  switch (GET_CODE (pat))
+{
+case PARALLEL:
+	{
+	  for (int j = 0; j < XVECLEN (pat, 0); ++j)
+	  

[RFC][AARCH64] Implements target hook

2018-02-12 Thread Kugan Vivekanandarajah
Implements target hook TARGET_HW_MAX_MEM_READ_STREAMS for aarch64

gcc/ChangeLog:

2018-02-12  Kugan Vivekanandarajah  <kug...@linaro.org>

* config/aarch64/aarch64-protos.h (struct cpu_prefetch_tune): Add
  new entry hw_prefetchers_avail.
* config/aarch64/aarch64.c (aarch64_hw_max_mem_read_streams):
  Implement new target hook.
(TARGET_HW_MAX_MEM_READ_STREAMS): Likewise.
From 3529cf5b85d7282b1829d53652f03d0945359ad6 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Mon, 12 Feb 2018 10:44:26 +1100
Subject: [PATCH 3/4] add-prefetchers-availabl

Change-Id: I68af62d7be56255574a9c3f636b2d338f918b4e1
---
 gcc/config/aarch64/aarch64-protos.h |  1 +
 gcc/config/aarch64/aarch64.c        | 26 ++++++++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 2d705d2..2e3b2a1 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -231,6 +231,7 @@ struct cpu_prefetch_tune
   const int l1_cache_line_size;
   const int l2_cache_size;
   const int default_opt_level;
+  const int hw_prefetchers_avail;
 };
 
 struct tune_params
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2e70f3a..1ce2a0c 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -547,7 +547,8 @@ static const cpu_prefetch_tune generic_prefetch_tune =
   -1,			/* l1_cache_size  */
   -1,			/* l1_cache_line_size  */
   -1,			/* l2_cache_size  */
-  -1			/* default_opt_level  */
+  -1,			/* default_opt_level  */
+  -1			/* default hw_prefetchers_avail */
 };
 
 static const cpu_prefetch_tune exynosm1_prefetch_tune =
@@ -556,7 +557,8 @@ static const cpu_prefetch_tune exynosm1_prefetch_tune =
   -1,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
   -1,			/* l2_cache_size  */
-  -1			/* default_opt_level  */
+  -1,			/* default_opt_level  */
+  -1			/* default hw_prefetchers_avail */
 };
 
 static const cpu_prefetch_tune qdf24xx_prefetch_tune =
@@ -565,7 +567,8 @@ static const cpu_prefetch_tune qdf24xx_prefetch_tune =
   32,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
   1024,			/* l2_cache_size  */
-  -1			/* default_opt_level  */
+  -1,			/* default_opt_level  */
+  7			/* hw_prefetchers_avail */
 };
 
 static const cpu_prefetch_tune thunderxt88_prefetch_tune =
@@ -574,7 +577,8 @@ static const cpu_prefetch_tune thunderxt88_prefetch_tune =
   32,			/* l1_cache_size  */
   128,			/* l1_cache_line_size  */
   16*1024,		/* l2_cache_size  */
-  3			/* default_opt_level  */
+  3,			/* default_opt_level  */
+  -1			/* default hw_prefetchers_avail */
 };
 
 static const cpu_prefetch_tune thunderx_prefetch_tune =
@@ -583,7 +587,8 @@ static const cpu_prefetch_tune thunderx_prefetch_tune =
   32,			/* l1_cache_size  */
   128,			/* l1_cache_line_size  */
   -1,			/* l2_cache_size  */
-  -1			/* default_opt_level  */
+  -1,			/* default_opt_level  */
+  -1			/* default hw_prefetchers_avail */
 };
 
 static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
@@ -592,7 +597,8 @@ static const cpu_prefetch_tune thunderx2t99_prefetch_tune =
   32,			/* l1_cache_size  */
   64,			/* l1_cache_line_size  */
   256,			/* l2_cache_size  */
-  -1			/* default_opt_level  */
+  -1,			/* default_opt_level  */
+  -1			/* default hw_prefetchers_avail */
 };
 
 static const struct tune_params generic_tunings =
@@ -17143,6 +17149,11 @@ aarch64_sched_can_speculate_insn (rtx_insn *insn)
 	return true;
 }
 }
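+
+/* Implement TARGET_HW_MAX_MEM_READ_STREAMS.  */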
+static int
+aarch64_hw_max_mem_read_streams ()
+{
+  return aarch64_tune_params.prefetch->hw_prefetchers_avail;
+}
 
 /* Implement TARGET_COMPUTE_PRESSURE_CLASSES.  */
 
@@ -17661,6 +17672,9 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_SELECT_EARLY_REMAT_MODES
 #define TARGET_SELECT_EARLY_REMAT_MODES aarch64_select_early_remat_modes
 
+#undef TARGET_HW_MAX_MEM_READ_STREAMS
+#define TARGET_HW_MAX_MEM_READ_STREAMS aarch64_hw_max_mem_read_streams
+
 #if CHECKING_P
 #undef TARGET_RUN_TARGET_SELFTESTS
 #define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests
-- 
2.7.4



[RFC] Tree Loop Unroller Pass

2018-02-12 Thread Kugan Vivekanandarajah
Implements tree loop unroller using the infrastructure provided.

gcc/ChangeLog:

2018-02-12  Kugan Vivekanandarajah  <kug...@linaro.org>

* Makefile.in (OBJS): Add tree-ssa-loop-unroll.o.
* common.opt (ftree-loop-unroll): New option.
* passes.def: Add pass_tree_loop_uroll
* timevar.def (TV_TREE_LOOP_UNROLL): Add.
* tree-pass.h (make_pass_tree_loop_unroll): Declare.
* tree-ssa-loop-unroll.c: New file.
From 71baaf8393dd79a98b4c0216e56d87083caf0177 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Mon, 12 Feb 2018 10:44:00 +1100
Subject: [PATCH 2/4] tree-loop-unroller

Change-Id: I58c25b5f2e796d4166af3ea4e50a0f4a3078b6c2
---
 gcc/Makefile.in            |   1 +
 gcc/common.opt             |   4 ++++
 gcc/passes.def             |   1 +
 gcc/timevar.def            |   1 +
 gcc/tree-pass.h            |   1 +
 gcc/tree-ssa-loop-unroll.c | 268 +++++++++++++++++++++++++++++++++++++++
 6 files changed, 276 insertions(+)
 create mode 100644 gcc/tree-ssa-loop-unroll.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 374bf3e..de3c146 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1536,6 +1536,7 @@ OBJS = \
 	tree-ssa-loop-im.o \
 	tree-ssa-loop-ivcanon.o \
 	tree-ssa-loop-ivopts.o \
+	tree-ssa-loop-unroll.o \
 	tree-ssa-loop-manip.o \
 	tree-ssa-loop-niter.o \
 	tree-ssa-loop-prefetch.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index b20a9aa..ea47b8c 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1770,6 +1770,10 @@ fivopts
 Common Report Var(flag_ivopts) Init(1) Optimization
 Optimize induction variables on trees.
 
+ftree-loop-unroll
+Common Report Var(flag_tree_loop_unroll) Init(1) Optimization
+Perform loop unrolling in gimple.
+
 fjump-tables
 Common Var(flag_jump_tables) Init(1) Optimization
 Use jump tables for sufficiently large switch statements.
diff --git a/gcc/passes.def b/gcc/passes.def
index 9802f08..57f7cc2 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -302,6 +302,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_predcom);
 	  NEXT_PASS (pass_complete_unroll);
 	  NEXT_PASS (pass_slp_vectorize);
+	  NEXT_PASS (pass_tree_loop_unroll);
 	  NEXT_PASS (pass_loop_prefetch);
 	  /* Run IVOPTs after the last pass that uses data-reference analysis
 	 as that doesn't handle TARGET_MEM_REFs.  */
diff --git a/gcc/timevar.def b/gcc/timevar.def
index 91221ae..a6bb847 100644
--- a/gcc/timevar.def
+++ b/gcc/timevar.def
@@ -202,6 +202,7 @@ DEFTIMEVAR (TV_TREE_LOOP_DISTRIBUTION, "tree loop distribution")
 DEFTIMEVAR (TV_CHECK_DATA_DEPS   , "tree check data dependences")
 DEFTIMEVAR (TV_TREE_PREFETCH	 , "tree prefetching")
 DEFTIMEVAR (TV_TREE_LOOP_IVOPTS	 , "tree iv optimization")
+DEFTIMEVAR (TV_TREE_LOOP_UNROLL , "tree loop unroll")
 DEFTIMEVAR (TV_PREDCOM		 , "predictive commoning")
 DEFTIMEVAR (TV_TREE_CH		 , "tree copy headers")
 DEFTIMEVAR (TV_TREE_SSA_UNCPROP	 , "tree SSA uncprop")
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 93a6a99..2c0740f 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -388,6 +388,7 @@ extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_tree_loop_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ch_vect (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop-unroll.c b/gcc/tree-ssa-loop-unroll.c
new file mode 100644
index 000..04cf092
--- /dev/null
+++ b/gcc/tree-ssa-loop-unroll.c
@@ -0,0 +1,268 @@
+
+/* Tree Loop Unroller.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "hash-set.h"
+#include "machmode.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "target.h"
+#include "gimp

[RFC] Adds a target hook

2018-02-12 Thread Kugan Vivekanandarajah
Adds a target hook TARGET_HW_MAX_MEM_READ_STREAMS. The loop unroller,
if the hook is defined, will try to limit the unrolling factor based on it.


gcc/ChangeLog:

2018-02-12  Kugan Vivekanandarajah  <kug...@linaro.org>

* doc/tm.texi.in (TARGET_HW_MAX_MEM_READ_STREAMS): Document.
* doc/tm.texi: Regenerate.
* target.def (hw_max_mem_read_streams): New target hook.
From 95287a11980ff64ee473406d832d75f96204c6e9 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Mon, 12 Feb 2018 10:42:29 +1100
Subject: [PATCH 1/4] add-target-hook

Change-Id: I1789769c27786babc6a071d12049c72d7afed00e
---
 gcc/doc/tm.texi    | 6 ++++++
 gcc/doc/tm.texi.in | 2 ++
 gcc/target.def     | 9 +++++++++
 3 files changed, 17 insertions(+)

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 7f02b0d..08f4e2a 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11718,6 +11718,12 @@ is required only when the target has special constraints like maximum
 number of memory accesses.
 @end deftypefn
 
+@deftypefn {Target Hook} signed TARGET_HW_MAX_MEM_READ_STREAMS (void)
+This target hook returns the maximum number of memory read streams
+ that the hardware prefers.  The tree loop unroller will use this when
+ deciding the unroll factor.
+@end deftypefn
+
 @defmac POWI_MAX_MULTS
 If defined, this macro is interpreted as a signed integer C expression
 that specifies the maximum number of floating point multiplications
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 90c24be..e222372 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7927,6 +7927,8 @@ build_type_attribute_variant (@var{mdecl},
 
 @hook TARGET_LOOP_UNROLL_ADJUST
 
+@hook TARGET_HW_MAX_MEM_READ_STREAMS
+
 @defmac POWI_MAX_MULTS
 If defined, this macro is interpreted as a signed integer C expression
 that specifies the maximum number of floating point multiplications
diff --git a/gcc/target.def b/gcc/target.def
index aeb41df..29295ae 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2751,6 +2751,15 @@ number of memory accesses.",
  unsigned, (unsigned nunroll, struct loop *loop),
  NULL)
 
+/* Maximum number of memory read streams preferred by the h/w; used by
+   the tree loop unroller when deciding the unroll factor.  */
+DEFHOOK
+(hw_max_mem_read_streams,
+ "This target hook returns the maximum number of memory read streams\n\
+ that hw prefers.  Tree loop unroller will use this while deciding\n\
+ unroll factor.",
+ signed, (void),
+ NULL)
+
 /* True if X is a legitimate MODE-mode immediate operand.  */
 DEFHOOK
 (legitimate_constant_p,
-- 
2.7.4



[RFC] Tree loop unroller pass

2018-02-12 Thread Kugan Vivekanandarajah
Hi All,

Based on the previous discussions, I tried to implement a tree loop
unroller for partial unrolling. I would like to queue these RFC patches
for next stage1 review.

In summary:

* Cost-model for selecting the loop uses the same params used
elsewhere in related optimizations. I was told that keeping this the same
would allow better tuning for all the optimizations.

* I have also implemented an option to limit loops based on memory
streams, i.e., for some micro-architectures where limiting the resulting
memory streams is preferred, and it is used to limit the unrolling factor.

* I have tested this on variants of aarch64 and the results are
promising. I am in the process of running benchmarks on x86. I will
update the results later.

* I expect that some cost-model changes might be needed
to handle (or provide the ability to handle) various loop preferences of
the micro-architectures. I am sending this patch for review early to
get feedback on this.

* The position of the pass in passes.def can also be changed, for
example, unrolling before SLP.

* I have bootstrapped and regression tested on aarch64-linux-gnu.
There are no execution errors or ICEs. There are some testsuite
differences as expected. A few of them need further evaluation, and I am
doing that now.

Patches are organized as:

Patch1: Adds a target hook TARGET_HW_MAX_MEM_READ_STREAMS. The loop
unroller, if the hook is defined, will try to limit the unrolling factor
based on it.

Patch2: Implements tree loop unroller using the infrastructure
provided. The pass itself is very simple.

Patch3: Implements target hook TARGET_HW_MAX_MEM_READ_STREAMS for aarch64.

Patch4: Implements a machine reorg pass for aarch64/Falkor to handle
prefetcher tag collision. This is strictly not part of the loop
unroller, but for Falkor, unrolling can make the h/w prefetcher perform
badly if there are too many tag collisions, based on the discussions in
https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.

Thanks,
Kugan


Re: [RFC][PR82479] missing popcount builtin detection

2018-02-07 Thread Kugan Vivekanandarajah
Hi Richard,

On 1 February 2018 at 23:21, Richard Biener <richard.guent...@gmail.com> wrote:
> On Thu, Feb 1, 2018 at 5:07 AM, Kugan Vivekanandarajah
> <kugan.vivekanandara...@linaro.org> wrote:
>> Hi Richard,
>>
>> On 31 January 2018 at 21:39, Richard Biener <richard.guent...@gmail.com> 
>> wrote:
>>> On Wed, Jan 31, 2018 at 11:28 AM, Kugan Vivekanandarajah
>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>> Hi Richard,
>>>>
>>>> Thanks for the review.
>>>> On 25 January 2018 at 20:04, Richard Biener <richard.guent...@gmail.com> 
>>>> wrote:
>>>>> On Wed, Jan 24, 2018 at 10:56 PM, Kugan Vivekanandarajah
>>>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> Here is a patch for popcount builtin detection similar to LLVM. I
>>>>>> would like to queue this for review for next stage 1.
>>>>>>
>>>>>> 1. This is done as part of loop distribution and is effective for -O3 and
>>>>>> above.
>>>>>> 2. This does not distribute the loop to detect popcount (like
>>>>>> memcpy/memmove). I don't think that happens in practice. Please correct
>>>>>> me if I am wrong.
>>>>>
>>>>> But then it has no business inside loop distribution but instead is
>>>>> doing final value
>>>>> replacement, right?  You are pattern-matching the whole loop after all.  
>>>>> I think
>>>>> final value replacement would already do the correct thing if you
>>>>> taught number of
>>>>> iteration analysis that niter for
>>>>>
>>>>>[local count: 955630224]:
>>>>>   # b_11 = PHI <b_5(5), b_8(6)>
>>>>>   _1 = b_11 + -1;
>>>>>   b_8 = _1 & b_11;
>>>>>   if (b_8 != 0)
>>>>> goto ; [89.00%]
>>>>>   else
>>>>> goto ; [11.00%]
>>>>>
>>>>>[local count: 850510900]:
>>>>>   goto ; [100.00%]
>>>>
>>>> I am looking into this approach. What should the scalar evolution
>>>> for b_8 (i.e. b & (b - 1) in a loop) be? This is not clear to me,
>>>> and can this be represented with the scev?
>>>
>>> No, it's not affine and thus cannot be represented.  You only need the
>>> scalar evolution of the counting IV which is already handled and
>>> the number of iteration analysis needs to handle the above IV - this
>>> is the missing part.
>> Thanks for the clarification. I am now matching this loop pattern in
>> number_of_iterations_exit when number_of_iterations_exit_assumptions
>> fails. If the pattern matches, I am inserting the __builtin_popcount in
>> the loop preheader and setting the loop niter with this. This will be
>> used by the final value replacement. Is this what you wanted?
>
> No, you shouldn't insert a popcount stmt but instead the niter
> GENERIC tree should be a CALL_EXPR to popcount with the
> appropriate argument.

That's what I tried earlier but ran into some ICEs; I wasn't sure if
niter in tree_niter_desc could take such an expression.

The attached patch now does this. I also had to add support for CALL_EXPR
in a few places to handle a niter containing a CALL_EXPR. Does this look OK?
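
Concretely, the niter expression is now built along these lines (a
sketch of the approach, not the exact code from the attached patch;
BUILT_IN_POPCOUNTL would be needed for long arguments):

  tree fn = builtin_decl_implicit (BUILT_IN_POPCOUNT);
  niter->niter = build_call_expr (fn, 1, src);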

Thanks,
Kugan


gcc/ChangeLog:

2018-02-08  Kugan Vivekanandarajah  <kug...@linaro.org>

* gimple-expr.c (extract_ops_from_tree): Handle CALL_EXPR.
* tree-chrec.c (chrec_convert_1): Likewise.
* tree-scalar-evolution.c (expression_expensive_p): Likewise.
* tree-ssa-loop-ivopts.c (contains_abnormal_ssa_name_p): Likewise.
* tree-ssa-loop-niter.c (check_popcount_pattern): New.
(number_of_iterations_exit): Record niter for popcount pattern.
* tree-ssa.c (verify_ssa): Check stmt to be non NULL.

gcc/testsuite/ChangeLog:

2018-02-08  Kugan Vivekanandarajah  <kug...@linaro.org>

* gcc.dg/tree-ssa/popcount.c: New test.


>
>> In general, there is also a condition gating the loop like
>>
>> if (b_4 != 0)
>>   goto loop;
>> else
>>   end:
>>
>> This of course will not be removed by final value replacement. Since
>> popcount (0) is defined, this is redundant and could be removed, but
>> it is not.
>
> Yeah, that's probably sth for another pass though.  I suppose the
> end: case just uses zero in which case you'll have a PHI and you
> can optimize
>
>   if (b != 0)
> return popcount (b);
>   return 0;
>
> in phiopt.
>
> Richard.
>
>> Thanks,

Re: [RFC][PR82479] missing popcount builtin detection

2018-01-31 Thread Kugan Vivekanandarajah
Hi Richard,

On 31 January 2018 at 21:39, Richard Biener <richard.guent...@gmail.com> wrote:
> On Wed, Jan 31, 2018 at 11:28 AM, Kugan Vivekanandarajah
> <kugan.vivekanandara...@linaro.org> wrote:
>> Hi Richard,
>>
>> Thanks for the review.
>> On 25 January 2018 at 20:04, Richard Biener <richard.guent...@gmail.com> 
>> wrote:
>>> On Wed, Jan 24, 2018 at 10:56 PM, Kugan Vivekanandarajah
>>> <kugan.vivekanandara...@linaro.org> wrote:
>>>> Hi All,
>>>>
>>>> Here is a patch for popcount builtin detection similar to LLVM. I
>>>> would like to queue this for review for next stage 1.
>>>>
>>>> 1. This is done as part of loop distribution and is effective for -O3 and above.
>>>> 2. This does not distribute the loop to detect popcount (like
>>>> memcpy/memmove). I don't think that happens in practice. Please correct
>>>> me if I am wrong.
>>>
>>> But then it has no business inside loop distribution but instead is
>>> doing final value
>>> replacement, right?  You are pattern-matching the whole loop after all.  I 
>>> think
>>> final value replacement would already do the correct thing if you
>>> teached number of
>>> iteration analysis that niter for
>>>
>>>[local count: 955630224]:
>>>   # b_11 = PHI <b_5(5), b_8(6)>
>>>   _1 = b_11 + -1;
>>>   b_8 = _1 & b_11;
>>>   if (b_8 != 0)
>>> goto ; [89.00%]
>>>   else
>>> goto ; [11.00%]
>>>
>>>[local count: 850510900]:
>>>   goto ; [100.00%]
>>
>> I am looking into this approach. What should the scalar evolution
>> for b_8 (i.e. b & (b - 1) in a loop) be? This is not clear to me,
>> and can this be represented with the scev?
>
> No, it's not affine and thus cannot be represented.  You only need the
> scalar evolution of the counting IV which is already handled and
> the number of iteration analysis needs to handle the above IV - this
> is the missing part.
Thanks for the clarification. I am now matching this loop pattern in
number_of_iterations_exit when number_of_iterations_exit_assumptions
fails. If the pattern matches, I am inserting the __builtin_popcount in
the loop preheader and setting the loop niter with this. This will be
used by the final value replacement. Is this what you wanted?

In general, there is also a condition gating the loop like

if (b_4 != 0)
  goto loop;
else
  end:

This of course will not be removed by final value replacement. Since
popcount (0) is defined, this is redundant and could be removed, but
it is not.

Thanks,
Kugan

>
> Richard.
>
>> Thanks,
>> Kugan
>>>
>>> is related to popcount (b_5).
>>>
>>> Richard.
>>>
>>>> Bootstrapped and regression tested on aarch64-linux-gnu with no new 
>>>> regressions.
>>>>
>>>> Thanks,
>>>> Kugan
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>> 2018-01-25  Kugan Vivekanandarajah  <kug...@linaro.org>
>>>>
>>>> PR middle-end/82479
>>>> * tree-loop-distribution.c (handle_popcount): New.
>>>> (pass_loop_distribution::execute): Use handle_popcount.
>>>>
>>>> gcc/testsuite/ChangeLog:
>>>>
>>>> 2018-01-25  Kugan Vivekanandarajah  <kug...@linaro.org>
>>>>
>>>> PR middle-end/82479
>>>> * gcc.dg/tree-ssa/popcount.c: New test.


Re: [RFC][PR82479] missing popcount builtin detection

2018-01-31 Thread Kugan Vivekanandarajah
Hi Richard,

Thanks for the review.
On 25 January 2018 at 20:04, Richard Biener <richard.guent...@gmail.com> wrote:
> On Wed, Jan 24, 2018 at 10:56 PM, Kugan Vivekanandarajah
> <kugan.vivekanandara...@linaro.org> wrote:
>> Hi All,
>>
>> Here is a patch for popcount builtin detection similar to LLVM. I
>> would like to queue this for review for next stage 1.
>>
>> 1. This is done as part of loop distribution and is effective for -O3 and above.
>> 2. This does not distribute the loop to detect popcount (like
>> memcpy/memmove). I don't think that happens in practice. Please correct
>> me if I am wrong.
>
> But then it has no business inside loop distribution but instead is
> doing final value
> replacement, right?  You are pattern-matching the whole loop after all.  I 
> think
> final value replacement would already do the correct thing if you
> teached number of
> iteration analysis that niter for
>
>[local count: 955630224]:
>   # b_11 = PHI <b_5(5), b_8(6)>
>   _1 = b_11 + -1;
>   b_8 = _1 & b_11;
>   if (b_8 != 0)
> goto ; [89.00%]
>   else
> goto ; [11.00%]
>
>[local count: 850510900]:
>   goto ; [100.00%]

I am looking into this approach. What should the scalar evolution
for b_8 (i.e. b & (b - 1) in a loop) be? This is not clear to me,
and can this be represented with the scev?

Thanks,
Kugan
>
> is related to popcount (b_5).
>
> Richard.
>
>> Bootstrapped and regression tested on aarch64-linux-gnu with no new 
>> regressions.
>>
>> Thanks,
>> Kugan
>>
>> gcc/ChangeLog:
>>
>> 2018-01-25  Kugan Vivekanandarajah  <kug...@linaro.org>
>>
>> PR middle-end/82479
>> * tree-loop-distribution.c (handle_popcount): New.
>> (pass_loop_distribution::execute): Use handle_popcount.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2018-01-25  Kugan Vivekanandarajah  <kug...@linaro.org>
>>
>> PR middle-end/82479
>> * gcc.dg/tree-ssa/popcount.c: New test.


[RFC][PR82479] missing popcount builtin detection

2018-01-24 Thread Kugan Vivekanandarajah
Hi All,

Here is a patch for popcount builtin detection similar to LLVM. I
would like to queue this for review for next stage 1.

1. This is done as part of loop distribution and is effective for -O3 and above.
2. This does not distribute the loop to detect popcount (like
memcpy/memmove). I don't think that happens in practice. Please correct
me if I am wrong.

Bootstrapped and regression tested on aarch64-linux-gnu with no new regressions.

Thanks,
Kugan

gcc/ChangeLog:

2018-01-25  Kugan Vivekanandarajah  <kug...@linaro.org>

PR middle-end/82479
* tree-loop-distribution.c (handle_popcount): New.
(pass_loop_distribution::execute): Use handle_popcount.

gcc/testsuite/ChangeLog:

2018-01-25  Kugan Vivekanandarajah  <kug...@linaro.org>

PR middle-end/82479
* gcc.dg/tree-ssa/popcount.c: New test.
From 9fa09af4b7013c6207e59a4920c82f089bfe45c2 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandara...@linaro.org>
Date: Wed, 24 Jan 2018 08:50:08 +1100
Subject: [PATCH] popcount builtin detection

Change-Id: Ic6e175f9cc9a69bd417936a4845c2c046fd446b4

Change-Id: I680eb107445660c60a5d38f5d7300ab1a3243bf5

Change-Id: Ia9f0df89e05520091dc7797195098118768c7ac2
---
 gcc/testsuite/gcc.dg/tree-ssa/popcount.c |  41 ++++++++++
 gcc/tree-loop-distribution.c             | 145 +++++++++++++++++++++++++++
 2 files changed, 186 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/popcount.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/popcount.c b/gcc/testsuite/gcc.dg/tree-ssa/popcount.c
new file mode 100644
index 000..86a66cb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/popcount.c
@@ -0,0 +1,41 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-optimized" } */
+
+extern int foo (int);
+
+int PopCount (long b) {
+int c = 0;
+b++;
+
+while (b) {
+	b &= b - 1;
+	c++;
+}
+return c;
+}
+int PopCount2 (long b) {
+int c = 0;
+
+while (b) {
+	b &= b - 1;
+	c++;
+}
+foo (c);
+return foo (c);
+}
+
+void PopCount3 (long b1) {
+
+for (long i = 0; i < b1; ++i)
+  {
+	long b = i;
+	int c = 0;
+	while (b) {
+	b &= b - 1;
+	c++;
+	}
+	foo (c);
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "__builtin_popcount" 3 "optimized" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index a3d76e4..1060700 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1585,6 +1585,148 @@ classify_builtin_ldst (loop_p loop, struct graph *rdg, partition *partition,
   return;
 }
 
+/* See if the loop is a popcount implementation of the form
+
+int c = 0;
+while (b) {
+	b = b & (b - 1);
+	c++;
+}
+
+If so, convert this into c = __builtin_popcount (b) and
+return true if we did, false otherwise.  */
+
+
+static bool
+handle_popcount (loop_p loop)
+{
+  tree lhs, rhs;
+  tree dest, src;
+  gimple *and_minus_one;
+  int count = 0;
+  gphi *count_phi;
+  gimple *fn_call;
+  gimple *use_stmt;
+  use_operand_p use_p;
+  imm_use_iterator iter;
+  gimple_stmt_iterator gsi;
+
+  /* Check loop terminating branch is like
+ if (b != 0).  */
+  gimple *stmt = last_stmt (loop->header);
+  if (!stmt
+  || gimple_code (stmt) != GIMPLE_COND
+  || !zerop (gimple_cond_rhs (stmt)))
+return false;
+
+  /* Check that "b = b & (b - 1)" is computed.  */
+  lhs = gimple_cond_lhs (stmt);
+  gimple *and_stmt = SSA_NAME_DEF_STMT (lhs);
+  if (gimple_assign_rhs_code (and_stmt) != BIT_AND_EXPR)
+return false;
+  lhs = gimple_assign_rhs1 (and_stmt);
+  rhs = gimple_assign_rhs2 (and_stmt);
+  if (TREE_CODE (lhs) == SSA_NAME
+  && (and_minus_one = SSA_NAME_DEF_STMT (lhs))
+  && is_gimple_assign (and_minus_one)
+  && (gimple_assign_rhs_code (and_minus_one) == PLUS_EXPR)
+  && integer_minus_onep (gimple_assign_rhs2 (and_minus_one)))
+  lhs = rhs;
+  else if (TREE_CODE (rhs) == SSA_NAME
+  && (and_minus_one = SSA_NAME_DEF_STMT (rhs))
+  && is_gimple_assign (and_minus_one)
+  && (gimple_assign_rhs_code (and_minus_one) == PLUS_EXPR)
+  && integer_minus_onep (gimple_assign_rhs2 (and_minus_one)))
+  ;
+  else
+return false;
+  if ((gimple_assign_rhs1 (and_stmt) != gimple_assign_rhs1 (and_minus_one))
+  && (gimple_assign_rhs2 (and_stmt) != gimple_assign_rhs1 (and_minus_one)))
+return false;
+
+  /* Check the recurrence.  */
+  gimple *phi = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (and_minus_one));
+  gimple *src_phi = SSA_NAME_DEF_STMT (lhs);
+  if (gimple_code (phi) != GIMPLE_PHI
+  || gimple_code (src_phi) != GIMPLE_PHI)
+return false;
+
+  /* Check the loop closed SSA definition for just the variable c defined in
+ loop.  */
+  src = gimple_phi_arg_def (src_phi, loop_preheader_edge (loop)->dest_idx);
+  basic_block bb = single_exit (loop)->dest;
+  for (gphi_iterator g

Re: [PATCH][arm] XFAIL advsimd-intrinsics/vld1x2.c

2018-01-15 Thread Kugan Vivekanandarajah
Hi Kyrill,

Sorry for the breakage and thanks for fixing the testcase.

Thanks,
Kugan

On 12 January 2018 at 02:33, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com>
wrote:

> Hi all,
>
> This recently added test fails on arm. We haven't implemented these
> intrinsics for arm
> (any volunteers?) so for now let's XFAIL these on that target.
> Also, the float64 versions of these intrinsics are not supposed to be
> available on arm
> so this patch slightly adjusts the test to not include them for aarch32.
> In any case the entire test is XFAILed on arm, so this doesn't have any
> noticeable
> effect.
>
> The same number of tests (PASS) still occur on aarch64 but now they appear
> as XFAIL
> rather than FAIL on arm.
>
> Ok for trunk? (from an aarch64 perspective).
>
> Thanks,
> Kyrill
>
> 2018-01-11  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
> * gcc.target/aarch64/advsimd-intrinsics/vld1x2.c: Make float64
> tests specific to aarch64.  XFAIL test on arm.
>

