On Thu, Jan 1, 2026 at 6:02 PM Lewis Hyatt <[email protected]> wrote:
>
> Hello-
>
> I did not quite complete this series in time for stage 1, but I thought it
> might still be worth sending now, given there has been an uptick in PRs about
> this topic lately.
>
> I started with a simple goal of streaming `#pragma GCC diagnostic'
> information for LTO, so that diagnostic suppression could work in the LTO
> front end. The main difficulty is that the linemap structure that LTO
> creates while streaming in the data is not globally ordered; there is no
> definite relation between the numerical ordering of two location_t values in
> its linemap and the order in which source lines were originally
> processed. There is an ordering local to each function, but this is not
> enough to handle general diagnostic pragmas, since for any given source
> location, it must be unambiguous which diagnostic pragmas were in force
> at that specific point.
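>
> As a purely illustrative example (not taken from the patches), diagnostic
> pragma suppression is inherently positional: whether a warning for a
> location inside g() below is suppressed depends on which pragmas were in
> force at that point in the original source, which in turn requires knowing
> how the streamed-in locations are ordered relative to the pragma:
>
>     /* Which warnings are suppressed depends purely on source position.  */
>     int f (int c) { int x; if (c > 0) x = c; return x; }  /* may get -Wmaybe-uninitialized at -O2 */
>
>     #pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
>
>     int g (int c) { int x; if (c > 0) x = c; return x; }  /* same code, but the pragma is in force here */
>
> Under -flto, such late warnings are issued by the LTO front end from the
> streamed locations, so it needs the same positional information in order
> to honor the pragma.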
>
> The solution I went with was to stream out all of the relevant linemap data
> structures into a new LTO section, so that the linemap reconstructed on the
> other side could reflect the global ordering. With that done, diagnostic
> pragmas work automatically without needing any LTO-specific logic.
>
> Testing was done on the following platforms, where I did bootstrap + regtest
> of the indicated languages, both for a normal bootstrap and for one with
> --with-build-config=bootstrap-lto.
>
> x86_64-linux-gnu: all languages
>
> ppc64le-redhat-linux (cfarm135): all languages
>
> aarch64-redhat-linux (cfarm185): c,c++,fortran
>
> sparcv9-sun-solaris2.11 (cfarm216): c,c++,fortran,objc
>
> powerpc64-linux-gnu (cfarm121): all languages *
>     *I was not able to get bootstrap-lto to work here. compare-lto fails on
>     all object files; it seems that objcopy and strip both silently decline to
>     strip the LTO options section here? But I ran with bootstrap-lto-lean,
>     which skips the compare step, and the regtest at least was OK.
>
> x86_64-apple-darwin24 - c,c++ *
>    *For this platform, I had to use BOOT_CFLAGS+=-g0 in order to bootstrap
>    with LTO, or else dsymutil was trying to use 200 GB of RAM; not sure if
>    that's something about my system or a known issue. (It was not related to
>    these patches.) It also seemed that compare-lto does not work here either,
>    so I did bootstrap-lto-lean.
>
> I figured it would be of interest to see how the change to the streaming
> format affects the object file sizes. I don't think it is very significant
> either way. It is possible for the object files to be either smaller or
> larger than before, depending on the nature of the locations being streamed
> out, but on balance they tend to be a little smaller. The previous format
> streamed the file name, line number, and column number for a location each
> time it was output (avoiding duplication of the file name when possible),
> while the new format streams an integer index into the table of linemaps
> and an integer offset to compute the location_t (plus, separately, the
> linemap table itself.)  The new format is preferable when the same location
> is streamed multiple times; the old format is preferable when locations
> appear mostly once and in roughly chronological order, since it then
> requires fewer bits to store small deltas in the 4-bit uleb format being
> streamed. I tried to recapture some of that benefit in the new format by
> streaming the location indices as deltas from the prior one when possible.
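>
> To make the difference concrete, here is a rough sketch of the two
> encodings in plain C++. The helper names here are made up for illustration
> and are not GCC's actual streamer API; the real patch works in terms of the
> LTO streamer and the libcpp line-map structures.
>
>     // Illustrative sketch only; not GCC code.
>     #include <cstdint>
>     #include <string>
>     #include <vector>
>
>     struct sketch_stream
>     {
>       std::vector<int64_t> ints;          // stand-in for [s/u]leb output
>       std::vector<std::string> strings;   // stand-in for streamed strings
>       void put (int64_t v) { ints.push_back (v); }
>       void put_string (const std::string &s) { strings.push_back (s); }
>     };
>
>     /* Old format: each location is written as (file, line, column), with
>        the file name deduplicated against the previously written one.  */
>     void
>     write_location_old (sketch_stream &s, const std::string &file,
>                         unsigned line, unsigned col, std::string &prev_file)
>     {
>       if (file != prev_file)
>         {
>           s.put_string (file);
>           prev_file = file;
>         }
>       s.put (line);
>       s.put (col);
>     }
>
>     /* New format: each location is written as a delta to the index of its
>        ordinary map plus an offset within that map; the maps themselves are
>        streamed once, into a separate linemap section.  */
>     void
>     write_location_new (sketch_stream &s, unsigned map_index,
>                         unsigned offset, unsigned &prev_map_index)
>     {
>       s.put ((int64_t) map_index - (int64_t) prev_map_index);
>       prev_map_index = map_index;
>       s.put (offset);
>     }
>
> The per-location cost in the new format is just two small integers, with
> the file/line bookkeeping amortized into the separately streamed linemap
> table, and the deltas stay small as long as locations are emitted in
> roughly increasing map order.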
>
> Here are some real-world examples that I tried:
>
> 0) GCC LTO bootstrap on x86-64
>     - Size of stage 3 object files under gcc/ after bootstrap-lto build of all
>       languages.
>     - Total size of *.o before this patch: 2773040k
>     - Total size of *.o after this patch:  2745372k (-1.00%)
>
> 1) Python (C mostly)
>     - Built all of the Python 3.13 objects with -fno-fat-lto-objects.
>         - (The Python build doesn't support -fno-fat-lto-objects, but
>           it gets as far as creating all of the object files.)
>     - Total size of *.o before this patch: 93560k
>     - Total size of *.o after this patch:  91476k  (-2.23%)
>
> 2) Boost (C++)
>     - Built boost 1.90.0 with b2 args:
>         - --build-type=complete --layout=versioned link=static lto=on
>     - Total size of *.o before this patch: 643372k
>     - Total size of *.o after this patch:  631716k  (-1.81%)
>
> 3) Quantum Espresso (Fortran)
>     - Built commit 797f00f1d3f390f642411209b167af6668f3cb83 with -flto.
>     - Total size of *.o before this patch: 89972k
>     - Total size of *.o after this patch:  89236k (-0.82%)
>
> Regarding the temporary files written out by WPA for the LTRANS phase, all
> of the linemap sections are written just once into their own file, so the
> space usage is comparable to that needed for the regular object files. In
> general, the space saved vs the old streaming format could be a little more
> for these than for the object files, because the same location is often
> streamed into multiple partitions. For example, here are the sizes of all
> the LTRANS files produced when compiling cc1plus with -flto:
>
>     - Total size of ltrans*.o before this patch: 736988k
>     - Total size of ltrans*.o after this patch:  714700k (-3.02%)
>         - The new size comprises 710616k for the 512 partitions,
>           plus 4084k for the linemaps object.
>
> Another performance-related concern would be the number of line maps and
> location_t's used by the LTO front end. This was the subject of PR65536
> (described more in the commit message for [2/5]). To measure this, I looked
> at the output of -fmem-report from the WPA stage of building cc1plus from
> LTO-enabled object files. (This is the stage that reads in all of the files
> at once, before optionally producing LTRANS partitions, so it's the stage
> that uses the largest number of locations.)
>
> Here is the relevant portion of it, first for the case -flto-partition=none.
> In this mode, there is no WPA + LTRANS; rather, all the functions are read
> in and processed as one compilation.
>
> Old way:
> --------
> Number of ordinary maps used:         1386k
> Ordinary map used size:                 43M
> Number of ordinary maps allocated:    4096k
> Ordinary maps allocated size:          128M
> Total allocated maps size:             128M
> Total used maps size:                   43M
> Ad-hoc table size:                    1280M
> Ad-hoc table entries used:              16M
> optimized_ranges:                      540k
> unoptimized_ranges:                      0
> max location: 2912904036992
>
> (Note: for that last row "max location", I have temporarily modified
> -fmem-report to add this because it is relevant to PR65536.)
>
> New way:
> --------
> Number of ordinary maps used:          139k
> Ordinary map used size:               4473k
> Number of ordinary maps allocated:     256k
> Ordinary maps allocated size:         8192k
> Total allocated maps size:            8192k
> Total used maps size:                 4473k
> Ad-hoc table size:                     640M
> Ad-hoc table entries used:              14M
> optimized_ranges:                      540k
> unoptimized_ranges:                      0
> max location: 9030944385
>
>
> So this much is satisfactory. There are about 10X fewer maps created with
> the new approach. The max location_t is reduced even more, by a factor of
> 300X or so. The reason this got so much smaller is that in the new
> approach, I also included an optimization for the linemap that could
> have been done at any time, even without this change (and, if we don't go
> with my patches, it should probably still be done as a one-line
> change). The linemap has a configurable number of bits reserved inside each
> location_t to store range information. With 64-bit
> location_t, the default number of range bits is set in toplev.cc to 7
> bits. The LTO front end does not use range information, so it could change
> this to 0 with no change in behavior other than having 128X more location_t
> space available. In the new approach to the linemap, I do use 0 range bits
> for the maps. [As an aside, if it were desired to include range information
> in the locations, this could be done as a future enhancement; it is
> conceptually very straightforward, but it would meaningfully increase the
> size of the streamed location data and also increase the number of ad-hoc
> locations.]
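>
> For reference, the standalone one-line version of that range-bits tweak
> would presumably look something like the following, placed early in the
> LTO front end's startup (a sketch only; the exact placement would need
> checking):
>
>     /* LTO does not use range information, so reserve no range bits inside
>        each location_t; with the default of 7 bits this frees up 128X more
>        location_t space.  */
>     line_table->default_range_bits = 0;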
>
> I feel the above results are a good outcome for the new approach and are
> also sufficient to close out PR65536. There is (at least) one downside that
> should be mentioned, however. Here is the same data, but using the default
> LTO partitioning rather than -flto-partition=none. In this mode, the WPA
> phase just
> reads the decls section and not the function bodies; then it partitions
> everything into a number of LTRANS files that are subsequently processed
> separately.
>
> Old way:
> --------
> Number of ordinary maps used:           96k
> Ordinary map used size:               3100k
> Number of ordinary maps allocated:     256k
> Ordinary maps allocated size:         8192k
> Total allocated maps size:            8192k
> Total used maps size:                 3100k
> Ad-hoc table size:                    5120k
> Ad-hoc table entries used:             100k
> optimized_ranges:                        0
> unoptimized_ranges:                      0
> max location: 261878903168
>
> New way:
> --------
> Number of ordinary maps used:          139k
> Ordinary map used size:               4473k
> Number of ordinary maps allocated:     256k
> Ordinary maps allocated size:         8192k
> Total allocated maps size:            8192k
> Total used maps size:                 4473k
> Ad-hoc table size:                    5120k
> Ad-hoc table entries used:              80k
> optimized_ranges:                        0
> unoptimized_ranges:                      0
> max location: 9030944385
>
> So what you can see here is that with the old approach, there are fewer maps
> used when doing just WPA compared to the full compilation, but with the new
> approach, the number of maps and locations used is the same in both cases.
> This is because with the old approach, most locations (other than those read
> from the decls section) are only created when the function bodies are read,
> and those are not read during WPA; with the new approach, it is necessary to
> read all the maps from the linemap section during WPA (although the function
> bodies are still not read, as before). As a result, when doing just WPA, the
> new approach creates around 50% more maps than the old way. I don't think
> this is a concern from the perspective of PR65536, because what really
> matters there is the number of location_t values used, and that is still
> significantly smaller than before. It does mean there is potentially more
> memory used with the new approach during WPA. In this particular case, the
> memory usage was not actually increased at all, because the space for the
> maps is somewhat overallocated. I don't expect this to be a deal-breaker
> for the new approach; the memory used by linemaps is a small fraction of
> the total.
>
> [1/5] libcpp: Preparation for LTO linemap changes
> [2/5] lto: Overhaul approach to location streaming [PR65536]
> [3/5] diagnostics: Preparation for LTO diagnostic pragma support
> [4/5] testsuite: Add dg-lto-additional-options directive
> [5/5] lto: Support #pragma GCC diagnostic [PR80922] [PR106823] [PR107936]
>
> [2/5] implements the new streaming format and is the largest one; [1/5],
> [3/5], and [4/5] are small preparatory patches, and [5/5], which actually
> implements diagnostic pragma streaming, is a small extension of [2/5].
>
> Thanks in advance for taking a look at it; I hope this looks like a useful
> direction, whether for now or for stage 1.

Thanks for doing this - it seems like an overall improvement to me.  I do think
it's too late for GCC 16 and should wait for next stage 1 though.

Richard.

>
> -Lewis
