Re: [PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache
On Thu, Feb 16, 2023 at 3:34 PM Andrew MacLeod wrote:
>
> On 2/16/23 02:55, Richard Biener wrote:
> > On Wed, Feb 15, 2023 at 6:07 PM Andrew MacLeod via Gcc-patches
> > wrote:
> >> This patch implements the suggestion that we have an alternative
> >> ssa-cache which does not zero memory, and instead uses a bitmap to track
> >> whether a value is currently set or not.  It roughly mimics what
> >> path_range_query was doing internally.
> >>
> >> For sparsely used cases, especially in large programs, this is more
> >> efficient.  I changed path_range_query to use this, removed its old
> >> bitmap (and a hack or two around PHI calculations), and also utilized
> >> this in the assume_query class.
> >>
> >> Performance wise, the patch doesn't affect VRP (since that still uses
> >> the original version).  Switching to the lazy version caused a slowdown
> >> of 2.5% across VRP.
> >>
> >> There was a noticeable improvement elsewhere: across 230 GCC source
> >> files, threading ran over 12% faster!  Overall compilation improved by
> >> 0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
> >> hurt.
> >>
> >> Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk?
> >> Or do you want to wait for the next release...
> > I see
> >
> > @@ -365,16 +335,8 @@ path_range_query::compute_ranges_in_phis (basic_block bb)
> >
> >        Value_Range r (TREE_TYPE (name));
> >        if (range_defined_in_block (r, name, bb))
> > -        {
> > -          unsigned v = SSA_NAME_VERSION (name);
> > -          set_cache (r, name);
> > -          bitmap_set_bit (phi_set, v);
> > -          // Pretend we don't have a cache entry for this name until
> > -          // we're done with all PHIs.
> > -          bitmap_clear_bit (m_has_cache_entry, v);
> > -        }
> > +        m_cache.set_global_range (name, r);
> >      }
> > -  bitmap_ior_into (m_has_cache_entry, phi_set);
> >  }
> >
> >  // Return TRUE if relations may be invalidated after crossing edge E.
> >
> > which I think is not correct - if we have
> >
> >    # _1 = PHI <..., _2>
> >    # _2 = PHI <..., _1>
> >
> > then their effects are supposed to be executed in parallel, that is,
> > both PHI argument _2 and _1 are supposed to see the "old" version.
> > The previous code tried to make sure the range of the new _1 doesn't
> > get seen when processing the argument _1 in the definition of _2.
> >
> > The new version drops this, possibly resulting in wrong-code.
>
> This is dropped because it is actually handled properly in
> range_defined_in_block now (which I think Aldy was describing).
>
> It didn't make sense to me why it was handled here like this, so I traced
> through the call chain to find out if it was still actually needed and
> discussed it with Aldy.  I think it was mostly a leftover wart.

Ah, thanks for checking.

> > While I think it's appropriate to sort out compile-time issues like this
> > during stage4 at least the above makes me think it should be deferred
> > to next stage1.
>
> I am happy to defer it since it's a marginal increase anyway.

Sure - thus OK for stage1.

Thanks,
Richard.

> Andrew
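[Editor's note] For readers following along, the trade-off under discussion (tracking entry validity with a bitmap instead of relying on a zero-initialized vector of pointers) can be sketched outside of GCC roughly as below. This is a minimal illustration only; the class and member names are invented and this is not GCC's actual ssa_lazy_cache API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal sketch (not GCC code) of a "lazy" cache: a value vector whose
// slots are only meaningful when the matching bit is set in a side bitmap.
// Entries are keyed by a dense integer id (like SSA_NAME_VERSION).
class lazy_cache
{
public:
  // Store VALUE for ID.  Returns true if an entry already existed,
  // mirroring the "did it already have a range" result in the patch.
  bool set (unsigned id, int value)
  {
    bool existed = test (id);
    if (!existed)
      set_bit (id);
    if (id >= m_tab.size ())
      m_tab.resize (id + 1);   // grow as needed; validity comes from the
    m_tab[id] = value;         // bitmap, never from zeroed storage.
    return existed;
  }

  // Fetch ID's value; returns false when no entry is active.
  bool get (unsigned id, int &value) const
  {
    if (!test (id))
      return false;
    value = m_tab[id];
    return true;
  }

  // Deactivate ID by flipping its bit; the stale vector slot is ignored.
  void clear (unsigned id)
  {
    if (id / 64 < m_active.size ())
      m_active[id / 64] &= ~(uint64_t (1) << (id % 64));
  }

private:
  bool test (unsigned id) const
  {
    return id / 64 < m_active.size ()
	   && (m_active[id / 64] >> (id % 64)) & 1;
  }
  void set_bit (unsigned id)
  {
    if (id / 64 >= m_active.size ())
      m_active.resize (id / 64 + 1, 0);
    m_active[id / 64] |= uint64_t (1) << (id % 64);
  }

  std::vector<int> m_tab;         // values; stale slots may hold anything
  std::vector<uint64_t> m_active; // one bit per id: is the slot valid?
};
```

As in the patch, set() reports whether an entry was already present, and clear() only flips the bitmap bit, so a sparsely used cache never pays to zero (or re-zero) the whole value table.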
Re: [PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache
On 2/16/23 02:55, Richard Biener wrote:
> On Wed, Feb 15, 2023 at 6:07 PM Andrew MacLeod via Gcc-patches wrote:
>> This patch implements the suggestion that we have an alternative
>> ssa-cache which does not zero memory, and instead uses a bitmap to track
>> whether a value is currently set or not.  It roughly mimics what
>> path_range_query was doing internally.
>>
>> For sparsely used cases, especially in large programs, this is more
>> efficient.  I changed path_range_query to use this, removed its old
>> bitmap (and a hack or two around PHI calculations), and also utilized
>> this in the assume_query class.
>>
>> Performance wise, the patch doesn't affect VRP (since that still uses
>> the original version).  Switching to the lazy version caused a slowdown
>> of 2.5% across VRP.
>>
>> There was a noticeable improvement elsewhere: across 230 GCC source
>> files, threading ran over 12% faster!  Overall compilation improved by
>> 0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
>> hurt.
>>
>> Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk?
>> Or do you want to wait for the next release...
> I see
>
> @@ -365,16 +335,8 @@ path_range_query::compute_ranges_in_phis (basic_block bb)
>
>        Value_Range r (TREE_TYPE (name));
>        if (range_defined_in_block (r, name, bb))
> -        {
> -          unsigned v = SSA_NAME_VERSION (name);
> -          set_cache (r, name);
> -          bitmap_set_bit (phi_set, v);
> -          // Pretend we don't have a cache entry for this name until
> -          // we're done with all PHIs.
> -          bitmap_clear_bit (m_has_cache_entry, v);
> -        }
> +        m_cache.set_global_range (name, r);
>      }
> -  bitmap_ior_into (m_has_cache_entry, phi_set);
>  }
>
>  // Return TRUE if relations may be invalidated after crossing edge E.
>
> which I think is not correct - if we have
>
>    # _1 = PHI <..., _2>
>    # _2 = PHI <..., _1>
>
> then their effects are supposed to be executed in parallel, that is,
> both PHI argument _2 and _1 are supposed to see the "old" version.
> The previous code tried to make sure the range of the new _1 doesn't
> get seen when processing the argument _1 in the definition of _2.
>
> The new version drops this, possibly resulting in wrong-code.

This is dropped because it is actually handled properly in
range_defined_in_block now (which I think Aldy was describing).

It didn't make sense to me why it was handled here like this, so I traced
through the call chain to find out if it was still actually needed and
discussed it with Aldy.  I think it was mostly a leftover wart.

> While I think it's appropriate to sort out compile-time issues like this
> during stage4 at least the above makes me think it should be deferred
> to next stage1.

I am happy to defer it since it's a marginal increase anyway.

Andrew
Re: [PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache
On 2/16/23 08:55, Richard Biener wrote:
> On Wed, Feb 15, 2023 at 6:07 PM Andrew MacLeod via Gcc-patches wrote:
>> This patch implements the suggestion that we have an alternative
>> ssa-cache which does not zero memory, and instead uses a bitmap to track
>> whether a value is currently set or not.  It roughly mimics what
>> path_range_query was doing internally.
>>
>> For sparsely used cases, especially in large programs, this is more
>> efficient.  I changed path_range_query to use this, removed its old
>> bitmap (and a hack or two around PHI calculations), and also utilized
>> this in the assume_query class.
>>
>> Performance wise, the patch doesn't affect VRP (since that still uses
>> the original version).  Switching to the lazy version caused a slowdown
>> of 2.5% across VRP.
>>
>> There was a noticeable improvement elsewhere: across 230 GCC source
>> files, threading ran over 12% faster!  Overall compilation improved by
>> 0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
>> hurt.
>>
>> Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk?
>> Or do you want to wait for the next release...
> I see
>
> @@ -365,16 +335,8 @@ path_range_query::compute_ranges_in_phis (basic_block bb)
>
>        Value_Range r (TREE_TYPE (name));
>        if (range_defined_in_block (r, name, bb))
> -        {
> -          unsigned v = SSA_NAME_VERSION (name);
> -          set_cache (r, name);
> -          bitmap_set_bit (phi_set, v);
> -          // Pretend we don't have a cache entry for this name until
> -          // we're done with all PHIs.
> -          bitmap_clear_bit (m_has_cache_entry, v);
> -        }
> +        m_cache.set_global_range (name, r);
>      }
> -  bitmap_ior_into (m_has_cache_entry, phi_set);
>  }
>
>  // Return TRUE if relations may be invalidated after crossing edge E.
>
> which I think is not correct - if we have
>
>    # _1 = PHI <..., _2>
>    # _2 = PHI <..., _1>
>
> then their effects are supposed to be executed in parallel, that is,
> both PHI argument _2 and _1 are supposed to see the "old" version.
> The previous code tried to make sure the range of the new _1 doesn't
> get seen when processing the argument _1 in the definition of _2.
Yes, the effects should appear in parallel, but ssa_range_in_phi(), which
is the only thing range_defined_in_block does for PHIs, is guaranteed not
to do any additional cache lookups.  The comment there should be adjusted
to make this clear:

  // Since PHIs are calculated in parallel at the beginning of the
  // block, we must be careful to never save anything to the cache here.
  // It is the caller's responsibility to adjust the cache.  Also,
  // calculating the PHI's range must not trigger additional lookups.

We should instead say: "we must be careful to never set or access the
cache here"...

This was the original intent, but a subtle access to the cache crept in
here:

  // Try to fold the phi exclusively with global or cached values.
  // This will get things like PHI <5(99), 6(88)>.  We do this by
  // calling range_of_expr with no context.
  unsigned nargs = gimple_phi_num_args (phi);
  Value_Range arg_range (TREE_TYPE (name));
  r.set_undefined ();
  for (size_t i = 0; i < nargs; ++i)
    {
      tree arg = gimple_phi_arg_def (phi, i);
      if (range_of_expr (arg_range, arg, /*stmt=*/NULL))

This range_of_expr call will indeed access the cache incorrectly, but
Andrew fixed that here:

@@ -264,7 +236,7 @@ path_range_query::ssa_range_in_phi (vrange &r, gphi *phi)
   for (size_t i = 0; i < nargs; ++i)
     {
       tree arg = gimple_phi_arg_def (phi, i);
-      if (range_of_expr (arg_range, arg, /*stmt=*/NULL))
+      if (m_ranger.range_of_expr (arg_range, arg, /*stmt=*/NULL))
	r.union_ (arg_range);
       else
	{

...thus ensuring that function never uses the cache.  All the lookups are
done with the global ranger, either at the path entry or globally as above
(with stmt=NULL).

I believe the switch from range_of_expr to m_ranger.range_of_expr is safe,
as the original code was added to handle silly things like
PHI <5(99), 6(88)>, which shouldn't need path-aware ranges.

As you've found out, the update to the cache in this case was not obvious
at all.

Perhaps it should also be commented: "It is safe to set the cache here, as
range_defined_in_block for PHIs (ssa_range_in_phi) is guaranteed not to do
any cache lookups."

> The new version drops this, possibly resulting in wrong-code.
>
> While I think it's appropriate to sort out compile-time issues like this
> during stage4 at least the above makes me think it should be deferred
> to next stage1.

I defer to the release managers as to whether this is safe in light of my
explanation above :).

Aldy
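[Editor's note] The parallel-evaluation requirement Richard and Aldy are discussing can be shown with a toy model (plain C++, not GCC internals; every name below is invented for illustration): all PHIs in a block must be evaluated against the values live on entry to the block, and the results committed only afterwards.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model of parallel PHI semantics.  With
//   _1 = PHI <..., _2>
//   _2 = PHI <..., _1>
// each PHI argument must observe the value from *before* the block, so we
// evaluate every PHI against a snapshot and only then commit the results.
using env = std::map<std::string, int>;

// PHIS maps each defined name to the argument name it copies from.
env
eval_phis_parallel (const env &before,
		    const std::vector<std::pair<std::string,
						std::string>> &phis)
{
  env after = before;
  for (const auto &p : phis)
    after[p.first] = before.at (p.second);  // read only the OLD values
  return after;
}
```

Starting from _1 = 10, _2 = 20, parallel evaluation of the swap yields _1 = 20 and _2 = 10; a naive sequential walk would let the second PHI observe the freshly computed _1 and produce _2 = 20 instead. That is the wrong-code hazard the old m_has_cache_entry juggling guarded against, and which ssa_range_in_phi now avoids by never consulting the cache.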
Re: [PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache
On Wed, Feb 15, 2023 at 6:07 PM Andrew MacLeod via Gcc-patches wrote:
>
> This patch implements the suggestion that we have an alternative
> ssa-cache which does not zero memory, and instead uses a bitmap to track
> whether a value is currently set or not.  It roughly mimics what
> path_range_query was doing internally.
>
> For sparsely used cases, especially in large programs, this is more
> efficient.  I changed path_range_query to use this, removed its old
> bitmap (and a hack or two around PHI calculations), and also utilized
> this in the assume_query class.
>
> Performance wise, the patch doesn't affect VRP (since that still uses
> the original version).  Switching to the lazy version caused a slowdown
> of 2.5% across VRP.
>
> There was a noticeable improvement elsewhere: across 230 GCC source
> files, threading ran over 12% faster!  Overall compilation improved by
> 0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
> hurt.
>
> Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk?
> Or do you want to wait for the next release...

I see

@@ -365,16 +335,8 @@ path_range_query::compute_ranges_in_phis (basic_block bb)

       Value_Range r (TREE_TYPE (name));
       if (range_defined_in_block (r, name, bb))
-        {
-          unsigned v = SSA_NAME_VERSION (name);
-          set_cache (r, name);
-          bitmap_set_bit (phi_set, v);
-          // Pretend we don't have a cache entry for this name until
-          // we're done with all PHIs.
-          bitmap_clear_bit (m_has_cache_entry, v);
-        }
+        m_cache.set_global_range (name, r);
     }
-  bitmap_ior_into (m_has_cache_entry, phi_set);
 }

 // Return TRUE if relations may be invalidated after crossing edge E.

which I think is not correct - if we have

  # _1 = PHI <..., _2>
  # _2 = PHI <..., _1>

then their effects are supposed to be executed in parallel, that is,
both PHI argument _2 and _1 are supposed to see the "old" version.
The previous code tried to make sure the range of the new _1 doesn't
get seen when processing the argument _1 in the definition of _2.

The new version drops this, possibly resulting in wrong-code.

While I think it's appropriate to sort out compile-time issues like this
during stage4 at least the above makes me think it should be deferred
to next stage1.

Richard.

> Andrew
[PATCH] PR tree-optimization/108697 - Create a lazy ssa_cache
This patch implements the suggestion that we have an alternative
ssa-cache which does not zero memory, and instead uses a bitmap to track
whether a value is currently set or not.  It roughly mimics what
path_range_query was doing internally.

For sparsely used cases, especially in large programs, this is more
efficient.  I changed path_range_query to use this, removed its old
bitmap (and a hack or two around PHI calculations), and also utilized
this in the assume_query class.

Performance wise, the patch doesn't affect VRP (since that still uses
the original version).  Switching to the lazy version caused a slowdown
of 2.5% across VRP.

There was a noticeable improvement elsewhere: across 230 GCC source
files, threading ran over 12% faster!  Overall compilation improved by
0.3%.  Not sure it makes much difference in compiler.i, but it shouldn't
hurt.

Bootstraps on x86_64-pc-linux-gnu with no regressions.  OK for trunk?
Or do you want to wait for the next release...

Andrew

From a4736b402d95b184659846ba308ce51f708472d1 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod
Date: Wed, 8 Feb 2023 17:43:45 -0500
Subject: [PATCH 1/2] Create a lazy ssa_cache

Sparsely used ssa name caches can benefit from using a bitmap to
determine if a name already has an entry.  Utilize it in the path query
and remove its private bitmap for tracking the same info.  Also use it
in the "assume" query class.

	* gimple-range-cache.cc (ssa_global_cache::clear_global_range): Do
	not clear the vector on an out of range query.
	(ssa_lazy_cache::set_global_range): New.
	* gimple-range-cache.h (class ssa_lazy_cache): New.
	(ssa_lazy_cache::ssa_lazy_cache): New.
	(ssa_lazy_cache::~ssa_lazy_cache): New.
	(ssa_lazy_cache::get_global_range): New.
	(ssa_lazy_cache::clear_global_range): New.
	(ssa_lazy_cache::clear): New.
	(ssa_lazy_cache::dump): New.
	* gimple-range-path.cc (path_range_query::path_range_query): Do not
	allocate a ssa_global_cache object nor has_cache bitmap.
	(path_range_query::~path_range_query): Do not free objects.
	(path_range_query::clear_cache): Remove.
	(path_range_query::get_cache): Adjust.
	(path_range_query::set_cache): Remove.
	(path_range_query::dump): Don't call through a pointer.
	(path_range_query::internal_range_of_expr): Set cache directly.
	(path_range_query::reset_path): Clear cache directly.
	(path_range_query::ssa_range_in_phi): Fold with globals only.
	(path_range_query::compute_ranges_in_phis): Simply set range.
	(path_range_query::compute_ranges_in_block): Call cache directly.
	* gimple-range-path.h (class path_range_query): Replace bitmap and
	cache pointer with lazy cache object.
	* gimple-range.h (class assume_query): Use ssa_lazy_cache.
---
 gcc/gimple-range-cache.cc | 24 --
 gcc/gimple-range-cache.h  | 33 +++-
 gcc/gimple-range-path.cc  | 66 +--
 gcc/gimple-range-path.h   |  7 +
 gcc/gimple-range.h        |  2 +-
 5 files changed, 70 insertions(+), 62 deletions(-)

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index 546262c4794..9bfbdb2c9b3 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -525,14 +525,14 @@ ssa_global_cache::set_global_range (tree name, const vrange &r)
   return m != NULL;
 }

-// Set the range for NAME to R in the glonbal cache.
+// Set the range for NAME to R in the global cache.

 void
 ssa_global_cache::clear_global_range (tree name)
 {
   unsigned v = SSA_NAME_VERSION (name);
   if (v >= m_tab.length ())
-    m_tab.safe_grow_cleared (num_ssa_names + 1);
+    return;
   m_tab[v] = NULL;
 }

@@ -579,6 +579,26 @@ ssa_global_cache::dump (FILE *f)
   fputc ('\n', f);
 }

+
+// Set range of NAME to R in a lazy cache.  Return FALSE if it did not already
+// have a range.
+
+bool
+ssa_lazy_cache::set_global_range (tree name, const vrange &r)
+{
+  unsigned v = SSA_NAME_VERSION (name);
+  if (!bitmap_set_bit (active_p, v))
+    {
+      // There is already an entry, simply set it.
+      gcc_checking_assert (v < m_tab.length ());
+      return ssa_global_cache::set_global_range (name, r);
+    }
+  if (v >= m_tab.length ())
+    m_tab.safe_grow (num_ssa_names + 1);
+  m_tab[v] = m_range_allocator->clone (r);
+  return false;
+}
+
 // --

diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h
index 4ff435dc5c1..f1799b45738 100644
--- a/gcc/gimple-range-cache.h
+++ b/gcc/gimple-range-cache.h
@@ -62,11 +62,42 @@ public:
   void clear_global_range (tree name);
   void clear ();
   void dump (FILE *f = stderr);
-private:
+protected:
   vec<vrange *> m_tab;
   vrange_allocator *m_range_allocator;
 };

+// This is the same as global cache, except it maintains an active bitmap
+// rather than depending on a zero'd out vector of pointers.  This is better
+// for sparsely/lightly used caches.
+// It could be made a fully derived class, but at this point there doesnt seem
+// to be a