Re: [PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache

2023-02-16 Thread Richard Biener via Gcc-patches
On Thu, Feb 16, 2023 at 3:34 PM Andrew MacLeod  wrote:
>
>
> On 2/16/23 02:55, Richard Biener wrote:
> > On Wed, Feb 15, 2023 at 6:07 PM Andrew MacLeod via Gcc-patches
> >  wrote:
> >> This patch implements the suggestion that we have an alternative
> >> ssa-cache which does not zero memory, and instead uses a bitmap to track
> >> whether a value is currently set or not.  It roughly mimics what
> >> path_range_query was doing internally.
> >>
> >> For sparsely used cases, especially in large programs, this is more
> >> efficient.  I changed path_range_query to use this, and removed its old
> >> bitmap (and a hack or two around PHI calculations), and also utilized
> >> this in the assume_query class.
> >>
> >> Performance-wise, the patch doesn't affect VRP (since that still uses
> >> the original version).  Switching to the lazy version caused a slowdown
> >> of 2.5% across VRP.
> >>
> >> There was a noticeable improvement elsewhere: across 230 GCC source
> >> files, threading ran over 12% faster!  Overall compilation improved by
> >> 0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
> >> hurt.
> >>
> >> Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk,
> >> or do you want to wait for the next release?
> > I see
> >
> > @@ -365,16 +335,8 @@ path_range_query::compute_ranges_in_phis (basic_block bb)
> >
> > Value_Range r (TREE_TYPE (name));
> > if (range_defined_in_block (r, name, bb))
> > -   {
> > - unsigned v = SSA_NAME_VERSION (name);
> > - set_cache (r, name);
> > - bitmap_set_bit (phi_set, v);
> > - // Pretend we don't have a cache entry for this name until
> > - // we're done with all PHIs.
> > - bitmap_clear_bit (m_has_cache_entry, v);
> > -   }
> > +   m_cache.set_global_range (name, r);
> >   }
> > -  bitmap_ior_into (m_has_cache_entry, phi_set);
> >   }
> >
> >   // Return TRUE if relations may be invalidated after crossing edge E.
> >
> > which I think is not correct - if we have
> >
> >   # _1 = PHI <..., _2>
> >   # _2 = PHI <..., _1>
> >
> > then their effects are supposed to be executed in parallel, that is,
> > both PHI argument _2 and _1 are supposed to see the "old" version.
> > The previous code tried to make sure the range of the new _1 doesn't
> > get seen when processing the argument _1 in the definition of _2.
> >
> > The new version drops this, possibly resulting in wrong-code.
>
> This is dropped because it is actually handled properly in
> range_defined_in_block now (which I think Aldy was describing).
>
> It didn't make sense to me why it was handled here like this, so I traced
> through the call chain to find out if it was still actually needed and
> discussed it with Aldy.  I think it was mostly a leftover wart.

Ah, thanks for checking.

> >
> > While I think it's appropriate to sort out compile-time issues like this
> > during stage4, at least the above makes me think it should be deferred
> > to the next stage1.
>
> I am happy to defer it since it's a marginal increase anyway.

Sure - thus OK for stage1.

Thanks,
Richard.

>
> Andrew
>
>


Re: [PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache

2023-02-16 Thread Andrew MacLeod via Gcc-patches



On 2/16/23 02:55, Richard Biener wrote:

On Wed, Feb 15, 2023 at 6:07 PM Andrew MacLeod via Gcc-patches
 wrote:

This patch implements the suggestion that we have an alternative
ssa-cache which does not zero memory, and instead uses a bitmap to track
whether a value is currently set or not.  It roughly mimics what
path_range_query was doing internally.

For sparsely used cases, especially in large programs, this is more
efficient.  I changed path_range_query to use this, and removed its old
bitmap (and a hack or two around PHI calculations), and also utilized
this in the assume_query class.

Performance-wise, the patch doesn't affect VRP (since that still uses
the original version).  Switching to the lazy version caused a slowdown
of 2.5% across VRP.

There was a noticeable improvement elsewhere: across 230 GCC source
files, threading ran over 12% faster!  Overall compilation improved by
0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
hurt.

Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk,
or do you want to wait for the next release?

I see

@@ -365,16 +335,8 @@ path_range_query::compute_ranges_in_phis (basic_block bb)

Value_Range r (TREE_TYPE (name));
if (range_defined_in_block (r, name, bb))
-   {
- unsigned v = SSA_NAME_VERSION (name);
- set_cache (r, name);
- bitmap_set_bit (phi_set, v);
- // Pretend we don't have a cache entry for this name until
- // we're done with all PHIs.
- bitmap_clear_bit (m_has_cache_entry, v);
-   }
+   m_cache.set_global_range (name, r);
  }
-  bitmap_ior_into (m_has_cache_entry, phi_set);
  }

  // Return TRUE if relations may be invalidated after crossing edge E.

which I think is not correct - if we have

  # _1 = PHI <..., _2>
  # _2 = PHI <..., _1>

then their effects are supposed to be executed in parallel, that is,
both PHI argument _2 and _1 are supposed to see the "old" version.
The previous code tried to make sure the range of the new _1 doesn't
get seen when processing the argument _1 in the definition of _2.

The new version drops this, possibly resulting in wrong-code.


This is dropped because it is actually handled properly in
range_defined_in_block now (which I think Aldy was describing).


It didn't make sense to me why it was handled here like this, so I traced
through the call chain to find out if it was still actually needed and 
discussed it with Aldy.  I think it was mostly a leftover wart.




While I think it's appropriate to sort out compile-time issues like this
during stage4, at least the above makes me think it should be deferred
to the next stage1.


I am happy to defer it since it's a marginal increase anyway.

Andrew




Re: [PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache

2023-02-16 Thread Aldy Hernandez via Gcc-patches




On 2/16/23 08:55, Richard Biener wrote:

On Wed, Feb 15, 2023 at 6:07 PM Andrew MacLeod via Gcc-patches
 wrote:


This patch implements the suggestion that we have an alternative
ssa-cache which does not zero memory, and instead uses a bitmap to track
whether a value is currently set or not.  It roughly mimics what
path_range_query was doing internally.

For sparsely used cases, especially in large programs, this is more
efficient.  I changed path_range_query to use this, and removed its old
bitmap (and a hack or two around PHI calculations), and also utilized
this in the assume_query class.

Performance-wise, the patch doesn't affect VRP (since that still uses
the original version).  Switching to the lazy version caused a slowdown
of 2.5% across VRP.

There was a noticeable improvement elsewhere: across 230 GCC source
files, threading ran over 12% faster!  Overall compilation improved by
0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
hurt.

Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk,
or do you want to wait for the next release?


I see

@@ -365,16 +335,8 @@ path_range_query::compute_ranges_in_phis (basic_block bb)

Value_Range r (TREE_TYPE (name));
if (range_defined_in_block (r, name, bb))
-   {
- unsigned v = SSA_NAME_VERSION (name);
- set_cache (r, name);
- bitmap_set_bit (phi_set, v);
- // Pretend we don't have a cache entry for this name until
- // we're done with all PHIs.
- bitmap_clear_bit (m_has_cache_entry, v);
-   }
+   m_cache.set_global_range (name, r);
  }
-  bitmap_ior_into (m_has_cache_entry, phi_set);
  }

  // Return TRUE if relations may be invalidated after crossing edge E.

which I think is not correct - if we have

  # _1 = PHI <..., _2>
  # _2 = PHI <..., _1>

then their effects are supposed to be executed in parallel, that is,
both PHI argument _2 and _1 are supposed to see the "old" version.
The previous code tried to make sure the range of the new _1 doesn't
get seen when processing the argument _1 in the definition of _2.


Yes, the effects should appear in parallel, but ssa_range_in_phi(), which
is the only thing range_defined_in_block does for PHIs, is guaranteed
not to do any additional cache lookups.  The comment there should be
adjusted to make this clear:


// Since PHIs are calculated in parallel at the beginning of the
// block, we must be careful to never save anything to the cache here.
// It is the caller's responsibility to adjust the cache.  Also,
// calculating the PHI's range must not trigger additional lookups.

We should instead say:

"we must be careful to never set or access the cache here"...

This was the original intent, but a subtle access to the cache crept in 
here:


  // Try to fold the phi exclusively with global or cached values.
  // This will get things like PHI <5(99), 6(88)>.  We do this by
  // calling range_of_expr with no context.
  unsigned nargs = gimple_phi_num_args (phi);
  Value_Range arg_range (TREE_TYPE (name));
  r.set_undefined ();
  for (size_t i = 0; i < nargs; ++i)
{
  tree arg = gimple_phi_arg_def (phi, i);
  if (range_of_expr (arg_range, arg, /*stmt=*/NULL))

This range_of_expr call will indeed access the cache incorrectly, but 
Andrew fixed that here:


@@ -264,7 +236,7 @@ path_range_query::ssa_range_in_phi (vrange &r, gphi *phi)

   for (size_t i = 0; i < nargs; ++i)
{
  tree arg = gimple_phi_arg_def (phi, i);
- if (range_of_expr (arg_range, arg, /*stmt=*/NULL))
+ if (m_ranger.range_of_expr (arg_range, arg, /*stmt=*/NULL))
r.union_ (arg_range);
  else
{

...thus ensuring that function never uses the cache.  All the lookups 
are done with the global ranger at either the path entry or globally as 
above (with stmt=NULL).


I believe the switch from range_of_expr to m_ranger.range_of_expr is 
safe, as the original code was added to handle silly things like PHI 
<5(99), 6(88)>, which shouldn't need path-aware ranges.


As you've found out, the update to the cache in this case was not 
obvious at all.  Perhaps it should also be commented:


"It is safe to set the cache here, as range_defined_in_block for PHIs 
(ssa_range_in_phi) is guaranteed not to do any cache lookups."




The new version drops this, possibly resulting in wrong-code.

While I think it's appropriate to sort out compile-time issues like this
during stage4, at least the above makes me think it should be deferred
to the next stage1.


I defer to the release managers as to whether this is safe in light of 
my explanation above :).


Aldy



Re: [PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache

2023-02-15 Thread Richard Biener via Gcc-patches
On Wed, Feb 15, 2023 at 6:07 PM Andrew MacLeod via Gcc-patches
 wrote:
>
> This patch implements the suggestion that we have an alternative
> ssa-cache which does not zero memory, and instead uses a bitmap to track
> whether a value is currently set or not.  It roughly mimics what
> path_range_query was doing internally.
>
> For sparsely used cases, especially in large programs, this is more
> efficient.  I changed path_range_query to use this, and removed its old
> bitmap (and a hack or two around PHI calculations), and also utilized
> this in the assume_query class.
>
> Performance-wise, the patch doesn't affect VRP (since that still uses
> the original version).  Switching to the lazy version caused a slowdown
> of 2.5% across VRP.
>
> There was a noticeable improvement elsewhere: across 230 GCC source
> files, threading ran over 12% faster!  Overall compilation improved by
> 0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
> hurt.
>
> Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk,
> or do you want to wait for the next release?

I see

@@ -365,16 +335,8 @@ path_range_query::compute_ranges_in_phis (basic_block bb)

   Value_Range r (TREE_TYPE (name));
   if (range_defined_in_block (r, name, bb))
-   {
- unsigned v = SSA_NAME_VERSION (name);
- set_cache (r, name);
- bitmap_set_bit (phi_set, v);
- // Pretend we don't have a cache entry for this name until
- // we're done with all PHIs.
- bitmap_clear_bit (m_has_cache_entry, v);
-   }
+   m_cache.set_global_range (name, r);
 }
-  bitmap_ior_into (m_has_cache_entry, phi_set);
 }

 // Return TRUE if relations may be invalidated after crossing edge E.

which I think is not correct - if we have

 # _1 = PHI <..., _2>
 # _2 = PHI <..., _1>

then their effects are supposed to be executed in parallel, that is,
both PHI argument _2 and _1 are supposed to see the "old" version.
The previous code tried to make sure the range of the new _1 doesn't
get seen when processing the argument _1 in the definition of _2.

The new version drops this, possibly resulting in wrong-code.

While I think it's appropriate to sort out compile-time issues like this
during stage4, at least the above makes me think it should be deferred
to the next stage1.

Richard.

>
> Andrew


[PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache

2023-02-15 Thread Andrew MacLeod via Gcc-patches
This patch implements the suggestion that we have an alternative 
ssa-cache which does not zero memory, and instead uses a bitmap to track 
whether a value is currently set or not.  It roughly mimics what 
path_range_query was doing internally.


For sparsely used cases, especially in large programs, this is more
efficient.  I changed path_range_query to use this, and removed its old
bitmap (and a hack or two around PHI calculations), and also utilized
this in the assume_query class.


Performance-wise, the patch doesn't affect VRP (since that still uses
the original version).  Switching to the lazy version caused a slowdown 
of 2.5% across VRP.


There was a noticeable improvement elsewhere: across 230 GCC source
files, threading ran over 12% faster!  Overall compilation improved by
0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
hurt.


Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk,
or do you want to wait for the next release?


Andrew
From a4736b402d95b184659846ba308ce51f708472d1 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Wed, 8 Feb 2023 17:43:45 -0500
Subject: [PATCH 1/2] Create a lazy ssa_cache

Sparsely used ssa name caches can benefit from using a bitmap to
determine if a name already has an entry.  Utilize it in the path query
and remove its private bitmap for tracking the same info.
Also use it in the "assume" query class.

	* gimple-range-cache.cc (ssa_global_cache::clear_global_range): Do
	not clear the vector on an out of range query.
	(ssa_lazy_cache::set_global_range): New.
	* gimple-range-cache.h (class ssa_lazy_cache): New.
	(ssa_lazy_cache::ssa_lazy_cache): New.
	(ssa_lazy_cache::~ssa_lazy_cache): New.
	(ssa_lazy_cache::get_global_range): New.
	(ssa_lazy_cache::clear_global_range): New.
	(ssa_lazy_cache::clear): New.
	(ssa_lazy_cache::dump): New.
	* gimple-range-path.cc (path_range_query::path_range_query): Do
	not allocate a ssa_global_cache object nor has_cache bitmap.
	(path_range_query::~path_range_query): Do not free objects.
	(path_range_query::clear_cache): Remove.
	(path_range_query::get_cache): Adjust.
	(path_range_query::set_cache): Remove.
	(path_range_query::dump): Don't call through a pointer.
	(path_range_query::internal_range_of_expr): Set cache directly.
	(path_range_query::reset_path): Clear cache directly.
	(path_range_query::ssa_range_in_phi): Fold with globals only.
	(path_range_query::compute_ranges_in_phis): Simply set range.
	(path_range_query::compute_ranges_in_block): Call cache directly.
	* gimple-range-path.h (class path_range_query): Replace bitmap
	and cache pointer with lazy cache object.
	* gimple-range.h (class assume_query): Use ssa_lazy_cache.
---
 gcc/gimple-range-cache.cc | 24 --
 gcc/gimple-range-cache.h  | 33 +++-
 gcc/gimple-range-path.cc  | 66 +--
 gcc/gimple-range-path.h   |  7 +
 gcc/gimple-range.h|  2 +-
 5 files changed, 70 insertions(+), 62 deletions(-)

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index 546262c4794..9bfbdb2c9b3 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -525,14 +525,14 @@ ssa_global_cache::set_global_range (tree name, const vrange &r)
   return m != NULL;
 }
 
-// Set the range for NAME to R in the glonbal cache.
+// Set the range for NAME to R in the global cache.
 
 void
 ssa_global_cache::clear_global_range (tree name)
 {
   unsigned v = SSA_NAME_VERSION (name);
   if (v >= m_tab.length ())
-m_tab.safe_grow_cleared (num_ssa_names + 1);
+return;
   m_tab[v] = NULL;
 }
 
@@ -579,6 +579,26 @@ ssa_global_cache::dump (FILE *f)
 fputc ('\n', f);
 }
 
+
+// Set range of NAME to R in a lazy cache.  Return FALSE if it did not already
+// have a range.
+
+bool
+ssa_lazy_cache::set_global_range (tree name, const vrange &r)
+{
+  unsigned v = SSA_NAME_VERSION (name);
+  if (!bitmap_set_bit (active_p, v))
+{
+  // There is already an entry, simply set it.
+  gcc_checking_assert (v < m_tab.length ());
+  return ssa_global_cache::set_global_range (name, r);
+}
+  if (v >= m_tab.length ())
+m_tab.safe_grow (num_ssa_names + 1);
+  m_tab[v] = m_range_allocator->clone (r);
+  return false;
+}
+
 // --
 
 
diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h
index 4ff435dc5c1..f1799b45738 100644
--- a/gcc/gimple-range-cache.h
+++ b/gcc/gimple-range-cache.h
@@ -62,11 +62,42 @@ public:
   void clear_global_range (tree name);
   void clear ();
   void dump (FILE *f = stderr);
-private:
+protected:
   vec m_tab;
   vrange_allocator *m_range_allocator;
 };
 
+// This is the same as global cache, except it maintains an active bitmap
+// rather than depending on a zero'd out vector of pointers.  This is better
+// for sparsely/lightly used caches.
+// It could be made a fully derived class, but at this point there doesn't seem
+// to be a