On 05/23/2017 08:11 AM, Jakub Jelinek wrote:
On Tue, May 23, 2017 at 06:48:15AM -0400, Aldy Hernandez wrote:

[ughh, one more time, but CCing the list.]

Sorry, for the delayed response. I was fighting with Firefox + LTO to gather some data :).

I'm worried a lot here about compile time memory usage increase, especially
with EVRP and IPA-VRP and even more so with LTO.
The changes mean that for every SSA_NAME of any kind we now need 8 more
bytes, and for every SSA_NAME that has range info (typically both range info
and nonzero_bits; in the old implementation the two were tied together for a
good reason, many ranges also imply some non-zero bits and from non-zero
bits one can in many cases derive a range) much more than that (through
the nonzero_bits_def you get all the overhead of trailing_wide_ints -
the 3 fields it has, just save on the actual 2 lengths and 2 HWI sets,
but then you add 3x 8 bytes and 6x size of the whole wide_int which is
from what I understood not really meant as the memory storage of wide ints
in data structures, but as something not memory efficient you can work
quickly on (so ideal for automatic variables in functions that work with
those).  So it is a 8 byte growth for SSA_NAMEs without ranges and
8+3*8+6*32-2*4-2*8*x growth for SSA_NAMEs with ranges if x is the number
of HWIs needed to represent the integers of that type (so most commonly
x=1).  For x=1 8+3*8+6*32-2*4-2*8*x is 200 bytes growth.
With say 10000000 SSA_NAMEs, 5000000 of them with ranges, that will be
already a 1GB difference, dunno how many SSA_NAMEs are there e.g. in firefox
LTO build.
Can the nonzero_bits stuff be folded back into irange (and have code to
maintain nonzero_bits in addition to ranges as before (from setting a range
compute or update nonzero_bits and vice versa)?  Can you use
trailing_wide_int for the irange storage of the values?  Can you allocate
only as many HWIs as you really need, rather than always 6?
Again, it can be different between a class you use for accessing the
information and manipulating it and for actual underlying storage on
SSA_NAMEs.

Before I fire off some stats, a few things. Yes, we can do better. Yes, we can use trailing wide ints or another idiom. Yes, we could join nonzero bits with the range info (if absolutely necessary, because as I'll show below, the extra word doesn't blow up memory as much as advertised).

Thanks to Martin Liska and Markus I was able to build a three-month old firefox branch with GCC7 LTO. I built firefox with -O2 -flto=9, and then looked at the biggest partition at the end of VRP1 (that is, warn_array_bound_p == true).

The short version is that I don't see anywhere close to 10 million SSA_NAMEs. Of the SSA_NAMEs we do get, 25% are pointers and irrelevant. 22% of the non-pointer SSA_NAMEs actually have range information, and of this range info 32% is actually useless because it spans the entire domain. So, unless I'm misunderstanding something, even in the worst case scenario (in firefox + LTO anyways), I don't see such a big blow up of memory.

Now on to the actual numbers. Remember, this is only for one of the partitions during the ltrans stage of LTO, running VRP1, since this seems to be the granular level of a compilation process. If you're trying to compile 20 partitions at the same time, get more memory :).

The biggest number of SSA_NAMEs I saw was actually 472,225. Of these, 357,032 were non-pointers, so could conceivably have range information. In reality, 77,398 had range information, so 16% of all pointer and non-pointer SSA_NAMEs actually have range information.

Now here's the interesting bit... in analyzing those 77,398 ranges, I noticed that a great many of them were useless. They consisted of the range for the entire domain even with no NON_ZERO_BITS set. The new implementation will obviously not have this limitation :-). Taking this into account, I see a total of 52,718 non-useless ranges out of 472,225. That's about 11%.

In the compilation of this large partition, G.stats.total_allocated is 2,192,677,688, so about 2 gigs of memory. Of which the total non-useless range info is 1,686,976 for vanilla GCC7. This looks like less than 2 megs. I calculated the range info memory consumption with (sizeof (range_info_def) + trailing_wide_ints <3>::extra_size (precision)).

Assuming a pathological case of 200 extra bytes in each of the 52,718 non-useless ranges, we'd have 10 megabytes of extra data. And that's in a compilation that already consumes 2 gigs of memory. Isn't that like less than half of a percent? We could even account for that extra pointer in SSA_NAME and add less than 4 megs to that number (472,225 * 8 bytes).

Again, do not take this as an excuse for the careless memory layout of my patch (after all, it was meant as a proof of concept). I will re-arrange things and optimize things a bit. However, I also don't think that adding a word or two here is going to make a difference, since even at the pathological case, we're only increasing memory by less than a percent.

This begs the question, is firefox with LTO representative of a large enough sample? Am I wrong in analyzing SSA_NAME usage at the end of VRP1? Are there programs out there with 10 million SSA_NAMEs, for which a larger range info would be a non-trivial increase of the compilation process's memory?

I will post a less horrible version, but in the meantime I wanted to show some numbers.

Thanks to everyone for their input.
Aldy

p.s. I did measure the number of entries in the internal VR_VALUE table for the above case. It had 103,577 non-empty entries. Again, this doesn't look like a significant increase in memory, even for this internal VRP representation.

Reply via email to