On Wed, 2015-10-14 at 11:00 +0200, Richard Biener wrote:
> On Tue, Oct 13, 2015 at 5:32 PM, David Malcolm <dmalc...@redhat.com> wrote:
> > On Thu, 2015-09-24 at 10:15 +0200, Richard Biener wrote:
> >> On Thu, Sep 24, 2015 at 2:25 AM, David Malcolm <dmalc...@redhat.com> wrote:
> >> > On Wed, 2015-09-23 at 15:36 +0200, Richard Biener wrote:
> >> >> On Wed, Sep 23, 2015 at 3:19 PM, Michael Matz <m...@suse.de> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > On Tue, 22 Sep 2015, David Malcolm wrote:
> >> >> >
> >> >> >> The drawback is that it could bloat the ad-hoc table.  Can the ad-hoc
> >> >> >> table ever get smaller, or does it only ever get inserted into?
> >> >> >
> >> >> > It only ever grows.
> >> >> >
> >> >> >> An idea I had is that we could stash short ranges directly into the 32
> >> >> >> bits of location_t, by offsetting the per-column-bits somewhat.
> >> >> >
> >> >> > It's certainly worth an experiment: let's say you restrict yourself to
> >> >> > tokens less than 8 characters, you need an additional 3 bits (using one
> >> >> > value, e.g. zero, as the escape value).  That leaves 20 bits for the line
> >> >> > numbers (for the normal 8-bit columns), which might be enough for most
> >> >> > single-file compilations.  For LTO compilation this often won't be
> >> >> > enough.
> >> >> >
> >> >> >> My plan is to investigate the impact these patches have on the time and
> >> >> >> memory consumption of the compiler,
> >> >> >
> >> >> > When you do so, make sure you're also measuring an LTO compilation with
> >> >> > debug info of something big (firefox).  I know that we already had issues
> >> >> > with the size of the linemap data in the past for these cases (probably
> >> >> > when we added columns).
> >> >>
> >> >> The issue we have with LTO is that the linemap gets populated in quite
> >> >> random order and thus we repeatedly switch files (we've mitigated this
> >> >> somewhat for GCC 5).  We also considered dropping column info (and would
> >> >> drop range info), as diagnostics are from optimizers only with LTO and we
> >> >> keep locations merely for debug info.
> >> >
> >> > Thanks.  Presumably the mitigation you're referring to is the
> >> > lto_location_cache class in lto-streamer-in.c?
> >> >
> >> > Am I right in thinking that, right now, the LTO code doesn't support
> >> > ad-hoc locations?  (presumably the block pointers only need to exist
> >> > during optimization, which happens after the serialization)
> >>
> >> LTO code does support ad-hoc locations, but they are "restored" only
> >> when reading function bodies and stmts (by means of COMBINE_LOCATION_DATA).
> >>
> >> > The obvious simplification would be, as you suggest, to not bother
> >> > storing range information with LTO, falling back to just the existing
> >> > representation.  Then there's no need to extend LTO to serialize ad-hoc
> >> > data; simply store the underlying locus into the bit stream.  I think
> >> > that this happens already: lto-streamer-out.c calls expand_location and
> >> > stores the result, so presumably any ad-hoc location_t values made by
> >> > the v2 patches would have dropped their range data there when I ran the
> >> > test suite.
> >>
> >> Yep.  We only preserve BLOCKs, so if you don't add extra code to
> >> preserve ranges they'll be "dropped".
> >>
> >> > If it's acceptable to not bother with ranges for LTO, one way to do the
> >> > "stashing short ranges into the location_t" idea might be for the
> >> > bits-per-range of location_t values to be a property of the line_table
> >> > (or possibly the line map), set up when the struct line_maps is created.
> >> > For non-LTO it could be some tuned value (maybe from a param?); for LTO
> >> > it could be zero, so that we have as many bits as before for line/column
> >> > data.
> >>
> >> That could be a possibility (likewise for column info?)
> >>
> >> Richard.
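To make the mechanism under discussion concrete, here is a much-simplified sketch of an ad-hoc location lookaside.  This is *not* GCC's implementation: the real table lives in the line_table, hash-conses entries, and is reached via COMBINE_LOCATION_DATA; the lowercase names here merely echo those, the top-bit encoding is an assumption for illustration, and the grow-only array illustrates Michael's point that the table only ever grows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t source_location;

/* One lookaside entry: the underlying locus plus the extra data
   (a BLOCK pointer in GCC; a range would be stored alongside).  */
struct adhoc_data
{
  source_location locus;
  void *block;
};

static struct adhoc_data *adhoc_table;
static size_t adhoc_len, adhoc_cap;

/* Hand back an encoded location whose top bit marks it as "ad-hoc";
   the remaining bits index the (grow-only) lookaside table.
   Unlike GCC, this sketch does not hash-cons duplicate entries.  */
static source_location
combine_location_data (source_location locus, void *block)
{
  if (adhoc_len == adhoc_cap)
    {
      adhoc_cap = adhoc_cap ? adhoc_cap * 2 : 16;
      adhoc_table = realloc (adhoc_table, adhoc_cap * sizeof *adhoc_table);
      assert (adhoc_table);
    }
  adhoc_table[adhoc_len].locus = locus;
  adhoc_table[adhoc_len].block = block;
  return 0x80000000u | (source_location) adhoc_len++;
}

/* Recover the underlying locus from an ad-hoc location.  */
static source_location
get_location_from_adhoc_loc (source_location loc)
{
  assert (loc & 0x80000000u);
  return adhoc_table[loc & 0x7fffffffu].locus;
}
```

Adding a range per token means one such entry per token unless something deduplicates or packs them, which is what the rest of the thread is about.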
> >>
> >> > Hope this sounds sane
> >> > Dave
> >
> > I did some crude benchmarking of the patchkit, using these scripts:
> >   https://github.com/davidmalcolm/gcc-benchmarking
> > (specifically, bb0222b455df8cefb53bfc1246eb0a8038256f30),
> > using the "big-code.c" and "kdecore.cc" files Michael posted as:
> >   https://gcc.gnu.org/ml/gcc-patches/2013-09/msg00062.html
> > and "influence.i", a preprocessed version of SPEC2006's 445.gobmk
> > engine/influence.c (as an example of a moderate-sized pure C source
> > file).
> >
> > This doesn't yet cover very large autogenerated C files, and the .cc
> > file is only being measured to see the effect on the ad-hoc table (and
> > tokenization).
> >
> > "control" was r227977.
> > "experiment" was the same revision with the v2 patchkit applied.
> >
> > Recall that this patchkit captures ranges for tokens as an extra field
> > within tokens within libcpp and the C FE, and adds ranges to the ad-hoc
> > location lookaside, storing them for all tree nodes within the C FE that
> > have a location_t, and passing them around within c_expr for all C
> > expressions (including those that don't have a location_t).
> >
> > Both control and experiment were built with
> >   --enable-checking=release \
> >   --disable-bootstrap \
> >   --disable-multilib \
> >   --enable-languages=c,ada,c++,fortran,go,java,lto,objc,obj-c++
> >
> > The script measures:
> >
> > (a) wallclock time for "xgcc -S", so it's measuring the driver, parsing,
> > optimization, etc., rather than attempting to directly measure parsing.
> > This is without -ftime-report, since Mikhail indicated it's sufficiently
> > expensive to skew timings in this post:
> >   https://gcc.gnu.org/ml/gcc/2015-07/msg00165.html
> >
> > (b) memory usage: by performing a separate build with -ftime-report and
> > extracting the "TOTAL" ggc value (actually 3 builds, but it's the same
> > each time).
> >
> > Is this a fair way to measure things?  It could be argued that by
> > measuring totals I'm hiding the extra parsing cost in the overall cost.
>
> Overall cost is what matters.  Time to build the libstdc++ PCHs would be
> interesting as well ;)  (and their size)
>
> One could have argued you should have used -fsyntax-only.
>
> > Full logs can be seen at:
> >   https://dmalcolm.fedorapeople.org/gcc/2015-09-25/bmark-v2.txt
> > (v2 of the patchkit)
> >
> > I also investigated a version of the patchkit with the token tracking
> > rewritten to build ad-hoc ranges for *every token*, without attempting
> > any kind of optimization (e.g. for short ranges).
> > A log of this can be seen at:
> >   https://dmalcolm.fedorapeople.org/gcc/2015-09-25/bmark-v2-plus-adhoc-ranges-for-tokens.txt
> > (v2 of the patchkit, with token tracking rewritten to build ad-hoc
> > ranges for *every token*).
> > The nice thing about this approach is that lots of token-related
> > diagnostics gain underlining of the relevant token "for free" simply
> > from the location_t, without having to individually patch them.  Without
> > any optimization, the memory consumed by this approach is clearly
> > larger.
> >
> > A summary comparing the two logs:
> >
> > Minimal wallclock time (s) over 10 iterations
> >                       Control -> v2                     Control -> v2+adhocloc+at+every+token
> > kdecore.cc  -g -O0    10.306548 -> 10.268712: 1.00x faster    10.247160 -> 10.444528: 1.02x slower
> > kdecore.cc  -g -O1    27.026285 -> 27.220654: 1.01x slower    27.280681 -> 27.622676: 1.01x slower
> > kdecore.cc  -g -O2    43.791668 -> 44.020270: 1.01x slower    43.904934 -> 44.248477: 1.01x slower
> > kdecore.cc  -g -O3    47.471836 -> 47.651101: 1.00x slower    47.645985 -> 48.005495: 1.01x slower
> > kdecore.cc  -g -Os    31.678652 -> 31.802829: 1.00x slower    31.741484 -> 32.033478: 1.01x slower
> > empty.c     -g -O0     0.012662 ->  0.011932: 1.06x faster     0.012888 ->  0.013143: 1.02x slower
> > empty.c     -g -O1     0.012685 ->  0.012558: 1.01x faster     0.013164 ->  0.012790: 1.03x faster
> > empty.c     -g -O2     0.012694 ->  0.012846: 1.01x slower     0.012912 ->  0.013175: 1.02x slower
> > empty.c     -g -O3     0.012654 ->  0.012699: 1.00x slower     0.012596 ->  0.012792: 1.02x slower
> > empty.c     -g -Os     0.013057 ->  0.012766: 1.02x faster     0.012691 ->  0.012885: 1.02x slower
> > big-code.c  -g -O0     3.292680 ->  3.325748: 1.01x slower     3.292948 ->  3.303049: 1.00x slower
> > big-code.c  -g -O1    15.701810 -> 15.765014: 1.00x slower    15.714116 -> 15.759254: 1.00x slower
> > big-code.c  -g -O2    22.575615 -> 22.620187: 1.00x slower    22.567406 -> 22.605435: 1.00x slower
> > big-code.c  -g -O3    52.423586 -> 52.590075: 1.00x slower    52.421460 -> 52.703835: 1.01x slower
> > big-code.c  -g -Os    21.153980 -> 21.253598: 1.00x slower    21.146266 -> 21.260138: 1.01x slower
> > influence.i -g -O0     0.148229 ->  0.149518: 1.01x slower     0.148672 ->  0.156262: 1.05x slower
> > influence.i -g -O1     0.387397 ->  0.389930: 1.01x slower     0.387734 ->  0.396655: 1.02x slower
> > influence.i -g -O2     0.587514 ->  0.589604: 1.00x slower     0.588064 ->  0.596510: 1.01x slower
> > influence.i -g -O3     1.273561 ->  1.280514: 1.01x slower     1.274599 ->  1.287596: 1.01x slower
> > influence.i -g -Os     0.526045 ->  0.527579: 1.00x slower     0.526827 ->  0.535635: 1.02x slower
> >
> > Maximal ggc memory (kb)
> >                       Control -> v2                          Control -> v2+adhocloc+at+every+token
> > kdecore.cc  -g -O0     650337.000 ->  654435.000: 1.0063x larger     650337.000 ->  711775.000: 1.0945x larger
> > kdecore.cc  -g -O1     931966.000 ->  940144.000: 1.0088x larger     931951.000 ->  989384.000: 1.0616x larger
> > kdecore.cc  -g -O2    1125325.000 -> 1133514.000: 1.0073x larger    1125318.000 -> 1182384.000: 1.0507x larger
> > kdecore.cc  -g -O3    1221408.000 -> 1229596.000: 1.0067x larger    1221410.000 -> 1278658.000: 1.0469x larger
> > kdecore.cc  -g -Os     867140.000 ->  871235.000: 1.0047x larger     867141.000 ->  928700.000: 1.0710x larger
> > empty.c     -g -O0       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > empty.c     -g -O1       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > empty.c     -g -O2       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > empty.c     -g -O3       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > empty.c     -g -Os       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > big-code.c  -g -O0     166584.000 ->  172731.000: 1.0369x larger     166584.000 ->  172726.000: 1.0369x larger
> > big-code.c  -g -O1     279793.000 ->  285940.000: 1.0220x larger     279793.000 ->  285935.000: 1.0220x larger
> > big-code.c  -g -O2     400058.000 ->  406194.000: 1.0153x larger     400058.000 ->  406189.000: 1.0153x larger
> > big-code.c  -g -O3     903648.000 ->  909750.000: 1.0068x larger     903906.000 ->  910001.000: 1.0067x larger
> > big-code.c  -g -Os     357060.000 ->  363010.000: 1.0167x larger     357060.000 ->  363005.000: 1.0166x larger
> > influence.i -g -O0       9273.000 ->    9719.000: 1.0481x larger       9273.000 ->   13303.000: 1.4346x larger
> > influence.i -g -O1      12968.000 ->   13414.000: 1.0344x larger      12968.000 ->   16998.000: 1.3108x larger
> > influence.i -g -O2      16386.000 ->   16768.000: 1.0233x larger      16386.000 ->   20352.000: 1.2420x larger
> > influence.i -g -O3      35508.000 ->   35763.000: 1.0072x larger      35508.000 ->   39346.000: 1.1081x larger
> > influence.i -g -Os      14287.000 ->   14669.000: 1.0267x larger      14287.000 ->   18253.000: 1.2776x larger
> >
> > Thoughts?
>
> The compile-time and memory-usage impact of the adhocloc-at-every-token
> patchkit is quite big.  Remember that gaining 1% in compile-time is hard,
> and a 20-40% memory increase for influence.i looks like too much.
>
> I also wonder why you see differences in memory-usage change for
> different -O levels.  I think we should have a pretty "static" line
> table after parsing?  Thus, rather than percentages, I'd like to see
> absolute changes (which I'd expect to be the same for all -O levels).
>
> Richard.
>
> > Dave
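As context for the packed-range results below, here is a hypothetical sketch of what packing a short range directly into the 32-bit value could look like.  The field widths (16/8/8) and function names are invented for illustration; the actual patches lay out the bits differently and still reserve the ad-hoc escape.  A range length of zero serves as the escape value, meaning "no packed range, consult the lookaside":

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t source_location;

/* Hypothetical layout: [ line:16 | column:8 | range_len:8 ].
   range_len == 0 is the escape value: "no packed range here,
   fall back to the ad-hoc lookaside table".  */
#define COLUMN_BITS 8
#define RANGE_BITS  8

static source_location
encode_packed_range (uint32_t line, uint32_t col, uint32_t len)
{
  /* Only ranges that start at the caret and span fewer than
     (1 << RANGE_BITS) columns fit; everything else must escape.  */
  assert (len < (1u << RANGE_BITS));
  return (line << (COLUMN_BITS + RANGE_BITS))
	 | (col << RANGE_BITS)
	 | len;
}

static uint32_t
packed_line (source_location loc)
{
  return loc >> (COLUMN_BITS + RANGE_BITS);
}

static uint32_t
packed_column (source_location loc)
{
  return (loc >> RANGE_BITS) & ((1u << COLUMN_BITS) - 1);
}

static uint32_t
packed_range_len (source_location loc)
{
  return loc & ((1u << RANGE_BITS) - 1);
}
```

The trade-off is visible in the layout: every bit given to range_len is taken from the line field, which is what Michael's 3-bit variant tries to soften, and what making the bits-per-range a property of the line_table (zero for LTO) would avoid entirely where ranges aren't wanted.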
I have a mostly-working implementation of the range-packing idea, where
short ranges that start at the caret location and finish within
(1<<RANGE_BITS) of the start are stored in low bits directly within the
32-bit source_location, the hope being to ameliorate the cost of packing
every token's range into the location_t.

https://dmalcolm.fedorapeople.org/gcc/2015-10-15/rich-locations/
(patch 0014 implements the range-compression idea).

It doesn't bootstrap yet (I'm working on that), but it runs well enough
to pass a lot of tests, and can be benchmarked.

The following shows the results posted above, plus the new v2+packed
+ranges as the final column, in a slightly different format that I hope
is easier to read.

Minimal wallclock time (s) over 10 iterations
(each experiment's % change after "v2" is relative to 10 iterations of
control interleaved with that experiment)

                    Control    v2                 v2+every+token     v2+packed+ranges
------------------  ---------  -----------------  -----------------  ------------------
kdecore.cc  -g -O0  10.3065    10.268712 (-0.4%)  10.444528 (+1.9%)  10.398405 (+1.3%)
kdecore.cc  -g -O1  27.0263    27.220654 (+0.7%)  27.622676 (+1.3%)  27.548755 (+1.9%)
kdecore.cc  -g -O2  43.7917    44.02027  (+0.5%)  44.248477 (+0.8%)  43.843728 (-0.6%)
kdecore.cc  -g -O3  47.4718    47.651101 (+0.4%)  48.005495 (+0.8%)  47.57282  (+0.1%)
kdecore.cc  -g -Os  31.6787    31.802829 (+0.4%)  32.033478 (+0.9%)  31.699171 (-0.2%)
empty.c     -g -O0  0.012662   0.011932  (-5.8%)  0.013143  (+2.0%)  0.013208  (+1.5%)
empty.c     -g -O1  0.012685   0.012558  (-1.0%)  0.01279   (-2.8%)  0.012424  (+0.4%)
empty.c     -g -O2  0.012694   0.012846  (+1.2%)  0.013175  (+2.0%)  0.013176  (+1.6%)
empty.c     -g -O3  0.012654   0.012699  (+0.4%)  0.012792  (+1.6%)  0.013198  (+0.5%)
empty.c     -g -Os  0.013057   0.012766  (-2.2%)  0.012885  (+1.5%)  0.01298   (-1.6%)
big-code.c  -g -O0  3.29268    3.325748  (+1.0%)  3.303049  (+0.3%)  3.350572  (+1.7%)
big-code.c  -g -O1  15.7018    15.765014 (+0.4%)  15.759254 (+0.3%)  15.45175  (-2.0%)
big-code.c  -g -O2  22.5756    22.620187 (+0.2%)  22.605435 (+0.2%)  22.343609 (-1.2%)
big-code.c  -g -O3  52.4236    52.590075 (+0.3%)  52.703835 (+0.5%)  51.9239   (-1.0%)
big-code.c  -g -Os  21.154     21.253598 (+0.5%)  21.260138 (+0.5%)  20.907407 (-1.3%)
influence.i -g -O0  0.148229   0.149518  (+0.9%)  0.156262  (+5.1%)  0.150652  (+1.0%)
influence.i -g -O1  0.387397   0.38993   (+0.7%)  0.396655  (+2.3%)  0.3916    (+0.8%)
influence.i -g -O2  0.587514   0.589604  (+0.4%)  0.59651   (+1.4%)  0.585223  (-0.6%)
influence.i -g -O3  1.27356    1.280514  (+0.5%)  1.287596  (+1.0%)  1.273018  (-0.1%)
influence.i -g -Os  0.526045   0.527579  (+0.3%)  0.535635  (+1.7%)  0.528192  (+0.2%)

Maximal ggc memory (kb)

                    Control    v2               v2+every+token    v2+packed+ranges
------------------  ---------  ---------------  ----------------  ------------------
kdecore.cc  -g -O0  650337     654435  (+0.6%)  711775  (+9.4%)   659372  (+1.4%)
kdecore.cc  -g -O1  931966     940144  (+0.9%)  989384  (+6.2%)   954779  (+2.4%)
kdecore.cc  -g -O2  1125325    1133514 (+0.7%)  1182384 (+5.1%)   1136459 (+1.0%)
kdecore.cc  -g -O3  1221408    1229596 (+0.7%)  1278658 (+4.7%)   1232688 (+0.9%)
kdecore.cc  -g -Os  867140     871235  (+0.5%)  928700  (+7.1%)   884304  (+2.0%)
empty.c     -g -O0  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
empty.c     -g -O1  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
empty.c     -g -O2  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
empty.c     -g -O3  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
empty.c     -g -Os  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
big-code.c  -g -O0  166584     172731  (+3.7%)  172726  (+3.7%)   176062  (+5.7%)
big-code.c  -g -O1  279793     285940  (+2.2%)  285935  (+2.2%)   281538  (+0.6%)
big-code.c  -g -O2  400058     406194  (+1.5%)  406189  (+1.5%)   395632  (-1.1%)
big-code.c  -g -O3  903648     909750  (+0.7%)  910001  (+0.7%)   902477  (-0.1%)
big-code.c  -g -Os  357060     363010  (+1.7%)  363005  (+1.7%)   353338  (-1.0%)
influence.i -g -O0  9273       9719    (+4.8%)  13303   (+43.5%)  9758    (+5.2%)
influence.i -g -O1  12968      13414   (+3.4%)  16998   (+31.1%)  13562   (+4.6%)
influence.i -g -O2  16386      16768   (+2.3%)  20352   (+24.2%)  16737   (+2.1%)
influence.i -g -O3  35508      35763   (+0.7%)  39346   (+10.8%)  35783   (+0.8%)
influence.i -g -Os  14287      14669   (+2.7%)  18253   (+27.8%)  14769   (+3.4%)

This fixes much of the bloat seen for influence.i when sending ranges
through for every token.

This was with 8 bits allocated for packed ranges (which is probably
excessive, but it makes debugging easier).

Interestingly, the memory usage sometimes goes down relative to control
at higher optimization levels.  I'm not sure why yet.

Dave