On 5/15/23 13:08, Richard Biener wrote:
On Mon, May 15, 2023 at 12:35 PM Aldy Hernandez <al...@redhat.com> wrote:
<tldr>
We can now have int_range<N, RESIZABLE=false> for automatically
resizable ranges. int_range_max is now int_range<3, true>
for a 69X reduction in size from current trunk, and 6.9X reduction from
GCC12.  This incurs a 5% performance penalty for VRP that is more than
covered by our recent >13% improvements.
</tldr>
int_range_max is the temporary range object we use in the ranger for
integers. With the conversion to wide_int, this structure bloated up
significantly because wide_ints are huge (80 bytes apiece) and are
about 10 times as big as a plain tree. Since the temporary object
requires 255 sub-ranges, that's 255 * 80 * 2, plus the control word.
This means the structure grew from 4112 bytes to 40912 bytes.
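As a quick sanity check on those numbers, the arithmetic works out as
below (a throwaway snippet; the 80-byte wide_int comes from the figures
above, and the 112 bytes of control data is inferred from the
difference of the totals, not measured):

#include <cstdio>

int
main ()
{
  const unsigned wide_int_size = 80;  // per the discussion above
  const unsigned nranges = 255;       // sub-ranges in the old int_range_max
  const unsigned control = 112;       // inferred: 40912 - 255 * 2 * 80
  // Each sub-range stores a [lo, hi] pair of wide_ints.
  printf ("%u\n", nranges * 2 * wide_int_size + control);  // prints 40912
}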
This patch adds the ability to resize ranges as needed, defaulting to
no resizing, while int_range_max now defaults to 3 sub-ranges (instead
of 255) and grows to 255 when the range being calculated does not fit.
For example:
int_range<1> foo; // 1 sub-range with no resizing.
int_range<5> foo; // 5 sub-ranges with no resizing.
int_range<5, true> foo; // 5 sub-ranges with resizing.
I ran some tests and found that 3 sub-ranges cover 99% of cases, so
I've set the int_range_max default to that:
typedef int_range<3, /*RESIZABLE=*/true> int_range_max;
We don't bother growing incrementally, since the default covers most
cases and we have a hard limit of 255.  This hard limit could be reduced
to 128, since my tests never saw a range needing more than 124, but we
could do that as a follow-up if needed.
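To make the mechanics concrete, here is a minimal sketch of the
resizing idea (illustrative only: sketch_int_range, maybe_resize, and
the bound stand-in are mine, not the actual irange internals):

#include <algorithm>

typedef long long bound;  // stand-in for the wide_int bounds of a sub-range

template<unsigned N, bool RESIZABLE = false>
class sketch_int_range
{
public:
  sketch_int_range () : m_base (m_inline), m_max (N), m_num (0) { }
  ~sketch_int_range ()
  {
    if (m_base != m_inline)
      delete[] m_base;
  }

  // Make room for NEEDED sub-ranges, each a [lo, hi] pair of bounds.
  // Growth is not incremental: one resize jumps straight to the 255
  // hard limit, since the inline default already covers ~99% of cases.
  void
  maybe_resize (unsigned needed)
  {
    if (needed <= m_max || !RESIZABLE)
      return;  // a non-resizable range would have to fail hard here
    const unsigned hard_limit = 255;
    bound *heap = new bound[2 * hard_limit];
    std::copy (m_base, m_base + 2 * m_num, heap);
    if (m_base != m_inline)
      delete[] m_base;
    m_base = heap;
    m_max = hard_limit;
  }

private:
  bound m_inline[2 * N];  // inline storage for the common case
  bound *m_base;          // points at m_inline until the first resize
  unsigned m_max;         // current capacity, in sub-ranges
  unsigned m_num;         // sub-ranges currently in use
};

With this shape, int_range<3, true> pays for only three inline
sub-ranges up front and touches the heap only in the rare cases that
overflow them.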
With 3 sub-ranges, int_range_max is now 592 bytes versus 40912 for
trunk, and versus 4112 bytes for GCC12! The penalty is 5.04% for VRP
and 3.02% for threading, with no noticeable change in overall
compilation (0.27%). This is more than covered by our 13.26%
improvements for the legacy removal + wide_int conversion.
Thanks for doing this.
I think this approach is a good alternative, while providing us with
flexibility going forward.  For example, we could try defaulting to
8 sub-ranges for a noticeable improvement in VRP.  We could also use
large sub-ranges for switch analysis to avoid resizing.
Another approach I tried was always resizing. With this, we could
drop the whole int_range<N> nonsense, and have irange just hold a
resizable range. This simplified things, but incurred a 7% penalty on
ipa_cp. This was hard to pinpoint, and I'm not entirely convinced
this wasn't some artifact of valgrind. However, until we're sure,
let's avoid massive changes, especially since IPA changes are coming
up.
For the curious, a particular hot spot for IPA in this area was:
ipcp_vr_lattice::meet_with_1 (const value_range *other_vr)
{
  ...
  value_range save (m_vr);
  m_vr.union_ (*other_vr);
  return m_vr != save;
}
The problem isn't the resizing (since we do that at most once) but the
fact that for some functions with lots of callers we end up with a huge
range that gets copied and compared for every meet operation.  Maybe
the IPA algorithm could be adjusted somehow?
Well, the above just wants to know whether the union_ operation changed
the range. I suppose that would be an interesting (and easy to compute?)
secondary output of union_ and it seems it already computes that (but
maybe not correctly?). So I suggest to change the above to
union_ returns a value specifically for that, which Andrew uses for
cache optimization. For that matter, your suggestion was my first
approach, but I quickly found out we were being overly pessimistic in
some cases, and I was too lazy to figure out why.
  bool res;
  if (flag_checking)
    {
      value_range save (m_vr);
      res = m_vr.union_ (*other_vr);
      gcc_assert (res == (m_vr != save));
    }
  else
    res = m_vr.union_ (*other_vr);
  return res;
With your suggested sanity check I chased the problem to a minor
inconsistency when unioning nonzero masks. The issue wasn't a bug, just
a pessimization. I'm attaching a patch that corrects the oversight
(well, not oversight, everything was more expensive with trees)... It
yields a 6.89% improvement to the ipa-cp pass!!! Thanks.
I'll push it if it passes tests.
BTW, without the annoying IPA-cp performance regression, this paves the
way for nuking int_range<N> in favor of just irange, and having
everything resize as needed.  I'll wait for Andrew to chime in when he
returns from PTO, since we may want to keep int_range<N> around, as it
does provide flexibility (at the expense of fugly looking declarations).
Aldy
From 6a7354d3494665d46f8cbfc71c58f784c02142ff Mon Sep 17 00:00:00 2001
From: Aldy Hernandez <al...@redhat.com>
Date: Mon, 15 May 2023 15:10:11 +0200
Subject: [PATCH] Only return changed=true in union_nonzero when appropriate.
irange::union_ was being overly pessimistic in its return value.  It
was returning TRUE (changed) even when the visible nonzero mask was
possibly the same.
The reason is that the nonzero mask is not kept entirely up to date.
We avoid recomputing it when a new range is set (from a set,
intersect, union, etc.), because calculating a mask from a range is
measurably expensive.  However, irange::get_nonzero_bits() will always
return the correct mask, because it calculates the mask inherent in
the range on the fly and bitwise-ANDs it with the saved mask.  This
was an optimization because last release it was a big penalty to keep
the mask up to date.  That may no longer be the case with the
conversion to wide_ints; we should investigate.
Just to be clear, the result from get_nonzero_bits() is always correct
as seen by the user, but the wide_int in the irange does not contain
all the information, since part of the nonzero bits can be determined
by the range itself, on the fly.
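To illustrate what "determined by the range itself" means, here is one
way the range-inherent mask can be computed on demand (a C++20 sketch
over uint64_t; GCC's actual code works on wide_ints and this helper is
hypothetical): every bit above the highest bit where the two bounds
differ is fixed, and everything at or below it may vary.

#include <bit>
#include <cstdint>

// Possible nonzero bits for an unsigned range [lo, hi] with lo <= hi.
uint64_t
nonzero_bits_from_range (uint64_t lo, uint64_t hi)
{
  uint64_t diff = lo ^ hi;
  if (diff)
    {
      // Saturate every bit from the highest differing bit down.
      int width = 64 - std::countl_zero (diff);
      diff = width >= 64 ? ~uint64_t{0} : (uint64_t{1} << width) - 1;
    }
  return hi | diff;  // e.g. [0, 5] -> 0x7, [4, 4] -> 0x4
}

// The mask a caller sees would then combine both sources of
// information, per the description above:
//   visible_mask = nonzero_bits_from_range (lo, hi) & saved_mask;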
The fix here is to return TRUE only when the result the user sees
(through get_nonzero_bits()) actually changed when unioning bits: if
the two saved masks differ but the masks implied by the ranges make
the visible results identical, nothing observable changed.  This
allows ipcp_vr_lattice::meet_with_1 to avoid unnecessary copies when
determining whether a range changed.
This patch yields a 6.89% improvement to the ipa_cp pass.  I'm
including the IPA changes in this patch, as it's a testcase of sorts for
the change.
gcc/ChangeLog:
	* ipa-cp.cc (ipcp_vr_lattice::meet_with_1): Avoid unnecessary
	range copying.
	* value-range.cc (irange::union_nonzero_bits): Return TRUE only
	when range changed.
---
gcc/ipa-cp.cc | 13 ++++++++++---
gcc/value-range.cc | 5 +++--
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/gcc/ipa-cp.cc b/gcc/ipa-cp.cc
index 1f5e0e13872..8cd0fa2cae7 100644
--- a/gcc/ipa-cp.cc
+++ b/gcc/ipa-cp.cc
@@ -1040,9 +1040,16 @@ ipcp_vr_lattice::meet_with_1 (const value_range *other_vr)
   if (other_vr->varying_p ())
     return set_to_bottom ();
 
-  value_range save (m_vr);
-  m_vr.union_ (*other_vr);
-  return m_vr != save;
+  bool res;
+  if (flag_checking)
+    {
+      value_range save (m_vr);
+      res = m_vr.union_ (*other_vr);
+      gcc_assert (res == (m_vr != save));
+    }
+  else
+    res = m_vr.union_ (*other_vr);
+  return res;
 }
 
 /* Return true if value range information in the lattice is yet unknown.  */
diff --git a/gcc/value-range.cc b/gcc/value-range.cc
index def9299dc0e..a341cece86d 100644
--- a/gcc/value-range.cc
+++ b/gcc/value-range.cc
@@ -1859,12 +1859,13 @@ irange::union_nonzero_bits (const irange &r)
   bool changed = false;
   if (m_nonzero_mask != r.m_nonzero_mask)
     {
-      m_nonzero_mask = get_nonzero_bits () | r.get_nonzero_bits ();
+      wide_int save = get_nonzero_bits ();
+      m_nonzero_mask = save | r.get_nonzero_bits ();
       // No need to call set_range_from_nonzero_bits, because we'll
       // never narrow the range.  Besides, it would cause endless
       // recursion because of the union_ in
       // set_range_from_nonzero_bits.
-      changed = true;
+      changed = m_nonzero_mask != save;
     }
   normalize_kind ();
   if (flag_checking)
--
2.40.0