On Thu, Jan 1, 2026 at 6:02 PM Lewis Hyatt <[email protected]> wrote:
>
> Hello-
>
> I did not quite complete this series in time for stage 1, but I thought it
> might still be worth sending now, given there has been an uptick in PRs about
> this topic lately.
>
> I started with a simple goal of streaming `#pragma GCC diagnostic'
> information for LTO, so that diagnostic suppression could work in the LTO
> front end. The main difficulty is that the linemap structure that LTO
> creates while streaming in the data is not globally ordered; there is no
> definite relation between the numerical ordering of two location_t values in
> its linemap and the order in which source lines were originally
> processed. There is an ordering local to each function, but this is not
> enough to handle general diagnostic pragmas, since for a given source
> location, it needs to be unambiguous which diagnostic pragmas were in force
> at that point specifically.
>
> The solution I went with was to stream out all of the relevant linemap data
> structures into a new LTO section, so that the linemap reconstructed on the
> other side could reflect the global ordering. With that done, diagnostic
> pragmas work automatically without needing any LTO-specific logic.
>
> Testing was done on the following platforms, where I did bootstrap + regtest
> of the indicated languages, both for a normal bootstrap and for one with
> --with-build-config=bootstrap-lto.
>
> x86_64-linux-gnu: all languages
>
> ppc64le-redhat-linux (cfarm135): all languages
>
> aarch64-redhat-linux (cfarm185): c,c++,fortran
>
> sparcv9-sun-solaris2.11 (cfarm216): c,c++,fortran,objc
>
> powerpc64-linux-gnu (cfarm121): all languages *
>   *I was not able to get bootstrap-lto to work here. compare-lto fails on
>   all object files; it seems that objcopy and strip both silently decline to
>   strip the LTO options section here? But I ran with bootstrap-lto-lean,
>   which skips the compare step, and the regtest at least was OK.
>
> x86_64-apple-darwin24 - c,c++ *
>   *For this platform, I had to use BOOT_CFLAGS+=-g0 in order to bootstrap
>   with LTO, or else dsymutil was trying to use 200 GB of RAM; not sure if
>   that's something about my system or a known issue. (It was not related to
>   these patches.) It also seemed that compare-lto does not work here either,
>   so I did bootstrap-lto-lean.
>
> I figured it would be of interest how the change to the streaming format
> affects the object file sizes. I don't think it is very significant either
> way. It is possible for the object files to be either smaller or larger than
> before, depending on the nature of the locations being streamed out, but on
> balance they tend to be a little smaller. The previous format streamed the
> file name, line number, and column number for a location each time it was
> output (avoiding duplication of the file name when possible), while the new
> format streams an integer index into the table of linemaps and an integer
> offset to compute the location_t (plus, separately, the linemap table
> itself.) The new format is preferable in case the same location is streamed
> multiple times. The old format prefers when locations appear mostly once and
> when they appear in roughly chronological order, since it requires fewer
> bits to store small deltas in the 4-bit uleb format being streamed. I tried
> to recapture some of that benefit in the new format by streaming the
> location indices as deltas from the prior one when possible.
>
> Here are some real-world examples that I tried:
>
> 0) GCC LTO bootstrap on x86-64
>     - Size of stage 3 object files under gcc/ after bootstrap-lto build of all
>       languages.
>     - Total size of *.o before this patch: 2773040k
>     - Total size of *.o after this patch: 2745372k (-1.00%)
>
> 1) Python (C mostly)
>     - Built all of Python 3.13 objects with -fno-fat-lto-objects.
>         - (Python build doesn't support -fno-fat-lto-objects, but
>           it gets as far as creating all of the object files.)
>     - Total size of *.o before this patch: 93560k
>     - Total size of *.o after this patch: 91476k (-2.23%)
>
> 2) Boost (C++)
>     - Built boost 1.90.0 with b2 args:
>         - --build-type=complete --layout=versioned link=static lto=on
>     - Total size of *.o before this patch: 643372k
>     - Total size of *.o after this patch: 631716k (-1.81%)
>
> 3) Quantum Espresso (Fortran)
>     - Built commit 797f00f1d3f390f642411209b167af6668f3cb83 with -flto.
>     - Total size of *.o before this patch: 89972k
>     - Total size of *.o after this patch: 89236k (-0.82%)
>
> Regarding the temporary files written out by WPA for the LTRANS phase, all
> of the linemap sections are written just once into their own file, so the
> space usage is comparable to that needed for the regular object files. In
> general, the space saved vs the old streaming format could be a little more
> for these than for the object files, because the same location is often
> streamed into multiple partitions. For example, here are the sizes of all
> the LTRANS files produced when compiling cc1plus with -flto:
>
>     - Total size of ltrans*.o before this patch: 736988k
>     - Total size of ltrans*.o after this patch: 714700k (-3.02%).
>     - The new size is comprised of 710616k for the 512 partitions,
>       plus 4084k for the linemaps object.
>
> Another performance-related concern would be the number of line maps and
> location_t's used by the LTO front end. This was the subject of PR65536
> (described more in the commit message for [2/5]). To measure this, I looked
> at the output of -fmem-report from the WPA stage of building cc1plus from
> LTO-enabled object files. (This is the stage that reads in all of the files
> at once, before optionally producing LTRANS partitions, so it's the stage
> that uses the largest number of locations.)
>
> Here is the relevant portion of it, first for the case -flto-partition=none.
> In this mode, there is no WPA + LTRANS, rather all the functions are read in
> and processed as one compilation.
>
> Old way:
> --------
> Number of ordinary maps used:       1386k
> Ordinary map used size:             43M
> Number of ordinary maps allocated:  4096k
> Ordinary maps allocated size:       128M
> Total allocated maps size:          128M
> Total used maps size:               43M
> Ad-hoc table size:                  1280M
> Ad-hoc table entries used:          16M
> optimized_ranges:                   540k
> unoptimized_ranges:                 0
> max location:                       2912904036992
>
> (Note: for that last row "max location", I have temporarily modified
> -fmem-report to add this because it is relevant to PR65536.)
>
> New way:
> --------
> Number of ordinary maps used:       139k
> Ordinary map used size:             4473k
> Number of ordinary maps allocated:  256k
> Ordinary maps allocated size:       8192k
> Total allocated maps size:          8192k
> Total used maps size:               4473k
> Ad-hoc table size:                  640M
> Ad-hoc table entries used:          14M
> optimized_ranges:                   540k
> unoptimized_ranges:                 0
> max location:                       9030944385
>
>
> So this much is satisfactory. There are about 10X fewer maps created with
> the new approach. The max location_t is even more reduced, by a factor of
> 300X or so. The reason this got so much smaller is because in the new
> approach, I also put an optimization for the linemap that could actually
> have been done at any time even without this change (and, if we don't go
> with my patches, then it should probably just be done as a one-line
> change). The linemap has a configurable number of bits reserved inside each
> location_t to store range information for location ranges. With 64-bit
> location_t, the default number of bits for ranges is set in toplev.cc to 7
> bits. The LTO front end does not use range information, so it could change
> this to 0 with no change in behavior other than having 128X more location_t
> space available. In the new approach to the linemap, I do use 0 range bits
> for the maps.
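
The range-bit arithmetic in the quoted paragraph can be checked with a minimal Python sketch. It assumes only what the message states (7 range bits by default with 64-bit location_t, 0 in the new approach) and reuses the two "max location" figures reported for -flto-partition=none; the variable names are illustrative and are not GCC identifiers.

```python
# Illustrative arithmetic only; these names are not GCC identifiers.
DEFAULT_RANGE_BITS = 7  # default stated in the message for 64-bit location_t
LTO_RANGE_BITS = 0      # value the message says the new approach uses

# Every bit reserved for ranges halves the location_t space left for
# positions, so going from 7 bits to 0 frees a factor of 2**7.
space_gain = 1 << (DEFAULT_RANGE_BITS - LTO_RANGE_BITS)

# "max location" values quoted for -flto-partition=none.
old_max_location = 2912904036992
new_max_location = 9030944385
reduction = old_max_location / new_max_location

print(f"location_t space gained from range bits: {space_gain}x")
print(f"observed max location reduction: {reduction:.1f}x")
```

This gives 128x for the range bits alone and a bit over 322x for the observed maximum, in line with the "128X" and "300X or so" figures in the message; the remainder of the reduction would come from the smaller number of maps.
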
> [As an aside, if it were desired to include range information
> in the locations, this could be done as a future enhancement; it is
> conceptually very straightforward, it just does meaningfully increase the
> size of the streamed location data and it also increases the number of adhoc
> locations.]
>
> I feel the above results are a good outcome for the new approach and are
> sufficient also to close out PR65536. There is (at least) one downside that
> should be mentioned, however. Here is the same data, but using the default LTO
> partition rather than -flto-partition=none. In this mode, the WPA phase just
> reads the decls section and not the function bodies; then it partitions
> everything into a number of LTRANS files that are subsequently processed
> separately.
>
> Old way:
> --------
> Number of ordinary maps used:       96k
> Ordinary map used size:             3100k
> Number of ordinary maps allocated:  256k
> Ordinary maps allocated size:       8192k
> Total allocated maps size:          8192k
> Total used maps size:               3100k
> Ad-hoc table size:                  5120k
> Ad-hoc table entries used:          100k
> optimized_ranges:                   0
> unoptimized_ranges:                 0
> max location:                       261878903168
>
> New way:
> --------
> Number of ordinary maps used:       139k
> Ordinary map used size:             4473k
> Number of ordinary maps allocated:  256k
> Ordinary maps allocated size:       8192k
> Total allocated maps size:          8192k
> Total used maps size:               4473k
> Ad-hoc table size:                  5120k
> Ad-hoc table entries used:          80k
> optimized_ranges:                   0
> unoptimized_ranges:                 0
> max location:                       9030944385
>
> So what you can see here is that with the old approach, there are fewer maps
> used when doing just WPA, compared to when doing the full compilation; but
> with the new approach, the number of maps and locations used is the same in
> both cases.
> This is because with the old approach, most locations (other than those
> read from the decls section) are only created when the function bodies are
> read, and they are not read during WPA; with the new approach, it is
> necessary to read all the maps from the linemap section during WPA (although
> the function bodies are still not read, as before). As a result, when doing
> just WPA, the new approach creates around 50% more maps than the old way. I
> don't think this is a concern from the perspective of PR65536, because what
> really matters there is the number of location_t used, and that is still
> significantly smaller than before. It does mean there is potentially more
> memory used with the new approach during WPA. In this particular case, the
> memory usage was not actually increased at all, because the space for the
> maps is overallocated some. I don't expect this to be a deal-breaker for the
> new approach; the memory used by linemaps is a small fraction of the total.
>
> [1/5] libcpp: Preparation for LTO linemap changes
> [2/5] lto: Overhaul approach to location streaming [PR65536]
> [3/5] diagnostics: Preparation for LTO diagnostic pragma support
> [4/5] testsuite: Add dg-lto-additional-options directive
> [5/5] lto: Support #pragma GCC diagnostic [PR80922] [PR106823] [PR107936]
>
> [2/5] implements the new streaming format and is the largest one; [1/5],
> [3/5], and [4/5] are small preparatory patches, and [5/5], which actually
> implements diagnostic pragma streaming, is a small extension of [2/5].
>
> Thanks in advance for taking a look at it; I hope this looks like a useful
> direction, whether for now or for stage 1.

Thanks for doing this - it seems like an overall improvement to me.  I do
think it's too late for GCC 16 and should wait for next stage 1 though.

Richard.

>
> -Lewis
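
As a rough illustration of why the quoted cover letter streams location indices as deltas from the prior one, the Python sketch below counts how many 4-bit groups a value needs in a variable-length encoding. It assumes each 4-bit group carries 3 payload bits plus 1 continuation bit; the function names and the sample indices are hypothetical, and this is not the GCC streamer code.

```python
def nibbles_needed(value: int) -> int:
    """Count 4-bit groups for a non-negative value, assuming 3 payload bits
    and 1 continuation bit per group (an assumption for this sketch)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    count = 1
    value >>= 3
    while value:
        count += 1
        value >>= 3
    return count


def total_nibbles(values):
    """Total number of 4-bit groups needed to encode every value."""
    return sum(nibbles_needed(v) for v in values)


# Hypothetical, non-decreasing linemap indices in streaming order.
indices = [1000, 1001, 1003, 1003, 1010]

# The first index is stored as-is; each later one is stored as the
# difference from its predecessor.
deltas = [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]

absolute_cost = total_nibbles(indices)
delta_cost = total_nibbles(deltas)
print(absolute_cost, delta_cost)
```

With these sample values the absolute encoding needs 20 groups and the delta encoding 8. Real streams also contain indices that go backwards, which is presumably why the message says "when possible"; the sketch does not model that case.
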
