On Wed, 2015-10-14 at 11:00 +0200, Richard Biener wrote:
> On Tue, Oct 13, 2015 at 5:32 PM, David Malcolm <dmalc...@redhat.com> wrote:
> > On Thu, 2015-09-24 at 10:15 +0200, Richard Biener wrote:
> >> On Thu, Sep 24, 2015 at 2:25 AM, David Malcolm <dmalc...@redhat.com> wrote:
> >> > On Wed, 2015-09-23 at 15:36 +0200, Richard Biener wrote:
> >> >> On Wed, Sep 23, 2015 at 3:19 PM, Michael Matz <m...@suse.de> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > On Tue, 22 Sep 2015, David Malcolm wrote:
> >> >> >
> >> >> >> The drawback is that it could bloat the ad-hoc table.  Can the ad-hoc
> >> >> >> table ever get smaller, or does it only ever get inserted into?
> >> >> >
> >> >> > It only ever grows.
> >> >> >
> >> >> >> An idea I had is that we could stash short ranges directly into the 32
> >> >> >> bits of location_t, by offsetting the per-column-bits somewhat.
> >> >> >
> >> >> > It's certainly worth an experiment: let's say you restrict yourself to
> >> >> > tokens less than 8 characters, you need an additional 3 bits (using one
> >> >> > value, e.g. zero, as the escape value).  That leaves 20 bits for the line
> >> >> > numbers (for the normal 8-bit columns), which might be enough for most
> >> >> > single-file compilations.  For LTO compilation this often won't be
> >> >> > enough.
> >> >> >
> >> >> >> My plan is to investigate the impact these patches have on the time and
> >> >> >> memory consumption of the compiler,
> >> >> >
> >> >> > When you do so, make sure you're also measuring an LTO compilation with
> >> >> > debug info of something big (firefox).  I know that we already had issues
> >> >> > with the size of the linemap data in the past for these cases (probably
> >> >> > when we added columns).
> >> >>
> >> >> The issue we have with LTO is that the linemap gets populated in quite
> >> >> random order and thus we repeatedly switch files (we've mitigated this
> >> >> somewhat for GCC 5).  We also considered dropping column info (and would
> >> >> drop range info), as diagnostics are from optimizers only with LTO and we
> >> >> keep locations merely for debug info.
> >> >
> >> > Thanks.  Presumably the mitigation you're referring to is the
> >> > lto_location_cache class in lto-streamer-in.c?
> >> >
> >> > Am I right in thinking that, right now, the LTO code doesn't support
> >> > ad-hoc locations?  (presumably the block pointers only need to exist
> >> > during optimization, which happens after the serialization)
> >>
> >> LTO code does support ad-hoc locations, but they are "restored" only
> >> when reading function bodies and stmts (by means of COMBINE_LOCATION_DATA).
> >>
> >> > The obvious simplification would be, as you suggest, to not bother
> >> > storing range information with LTO, falling back to just the existing
> >> > representation.  Then there's no need to extend LTO to serialize ad-hoc
> >> > data; simply store the underlying locus into the bit stream.  I think
> >> > that this happens already: lto-streamer-out.c calls expand_location and
> >> > stores the result, so presumably any ad-hoc location_t values made by
> >> > the v2 patches would have dropped their range data there when I ran the
> >> > test suite.
> >>
> >> Yep.  We only preserve BLOCKs, so if you don't add extra code to
> >> preserve ranges they'll be "dropped".
> >>
> >> > If it's acceptable to not bother with ranges for LTO, one way to do the
> >> > "stashing short ranges into the location_t" idea might be for the
> >> > bits-per-range of location_t values to be a property of the line_table
> >> > (or possibly the line map), set up when the struct line_maps is created.
> >> > For non-LTO it could be some tuned value (maybe from a param?); for LTO
> >> > it could be zero, so that we have as many bits as before for line/column
> >> > data.
> >>
> >> That could be a possibility (likewise for column info?)
> >>
> >> Richard.
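To make the mechanism under discussion concrete, here is a much-simplified sketch of an ad-hoc location lookaside.  This is *not* GCC's implementation: the real table lives in the line_table, hash-conses entries, and is reached via COMBINE_LOCATION_DATA; the lowercase names here merely echo those, the top-bit encoding is an assumption for illustration, and the grow-only array illustrates Michael's point that the table only ever grows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t source_location;

/* One lookaside entry: the underlying locus plus the extra data
   (a BLOCK pointer in GCC; a range would be stored alongside).  */
struct adhoc_data
{
  source_location locus;
  void *block;
};

static struct adhoc_data *adhoc_table;
static size_t adhoc_len, adhoc_cap;

/* Hand back an encoded location whose top bit marks it as "ad-hoc";
   the remaining bits index the (grow-only) lookaside table.
   Unlike GCC, this sketch does not hash-cons duplicate entries.  */
static source_location
combine_location_data (source_location locus, void *block)
{
  if (adhoc_len == adhoc_cap)
    {
      adhoc_cap = adhoc_cap ? adhoc_cap * 2 : 16;
      adhoc_table = realloc (adhoc_table, adhoc_cap * sizeof *adhoc_table);
      assert (adhoc_table);
    }
  adhoc_table[adhoc_len].locus = locus;
  adhoc_table[adhoc_len].block = block;
  return 0x80000000u | (source_location) adhoc_len++;
}

/* Recover the underlying locus from an ad-hoc location.  */
static source_location
get_location_from_adhoc_loc (source_location loc)
{
  assert (loc & 0x80000000u);
  return adhoc_table[loc & 0x7fffffffu].locus;
}
```

Adding a range per token means one such entry per token unless something deduplicates or packs them, which is what the rest of the thread is about.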
> >>
> >> > Hope this sounds sane
> >> > Dave
> >
> > I did some crude benchmarking of the patchkit, using these scripts:
> >   https://github.com/davidmalcolm/gcc-benchmarking
> > (specifically, bb0222b455df8cefb53bfc1246eb0a8038256f30),
> > using the "big-code.c" and "kdecore.cc" files Michael posted as:
> >   https://gcc.gnu.org/ml/gcc-patches/2013-09/msg00062.html
> > and "influence.i", a preprocessed version of SPEC2006's 445.gobmk
> > engine/influence.c (as an example of a moderate-sized pure C source
> > file).
> >
> > This doesn't yet cover very large autogenerated C files, and the .cc
> > file is only being measured to see the effect on the ad-hoc table (and
> > tokenization).
> >
> > "control" was r227977.
> > "experiment" was the same revision with the v2 patchkit applied.
> >
> > Recall that this patchkit captures ranges for tokens as an extra field
> > within tokens within libcpp and the C FE, and adds ranges to the ad-hoc
> > location lookaside, storing them for all tree nodes within the C FE that
> > have a location_t, and passing them around within c_expr for all C
> > expressions (including those that don't have a location_t).
> >
> > Both control and experiment were built with
> >   --enable-checking=release \
> >   --disable-bootstrap \
> >   --disable-multilib \
> >   --enable-languages=c,ada,c++,fortran,go,java,lto,objc,obj-c++
> >
> > The script measures:
> >
> > (a) wallclock time for "xgcc -S", so it's measuring the driver, parsing,
> > optimization, etc., rather than attempting to directly measure parsing.
> > This is without -ftime-report, since Mikhail indicated it's sufficiently
> > expensive to skew timings in this post:
> >   https://gcc.gnu.org/ml/gcc/2015-07/msg00165.html
> >
> > (b) memory usage: by performing a separate build with -ftime-report and
> > extracting the "TOTAL" ggc value (actually 3 builds, but it's the same
> > each time).
> >
> > Is this a fair way to measure things?  It could be argued that by
> > measuring totals I'm hiding the extra parsing cost in the overall cost.
>
> Overall cost is what matters.  Time to build the libstdc++ PCHs would be
> interesting as well ;)  (and their size)
>
> One could have argued you should have used -fsyntax-only.
>
> > Full logs can be seen at:
> >   https://dmalcolm.fedorapeople.org/gcc/2015-09-25/bmark-v2.txt
> > (v2 of the patchkit)
> >
> > I also investigated a version of the patchkit with the token tracking
> > rewritten to build ad-hoc ranges for *every token*, without attempting
> > any kind of optimization (e.g. for short ranges).
> > A log of this can be seen at:
> >   https://dmalcolm.fedorapeople.org/gcc/2015-09-25/bmark-v2-plus-adhoc-ranges-for-tokens.txt
> > (v2 of the patchkit, with token tracking rewritten to build ad-hoc
> > ranges for *every token*).
> > The nice thing about this approach is that lots of token-related
> > diagnostics gain underlining of the relevant token "for free" simply
> > from the location_t, without having to individually patch them.  Without
> > any optimization, the memory consumed by this approach is clearly
> > larger.
> >
> > A summary comparing the two logs:
> >
> > Minimal wallclock time (s) over 10 iterations
> >                       Control -> v2                     Control -> v2+adhocloc+at+every+token
> > kdecore.cc  -g -O0    10.306548 -> 10.268712: 1.00x faster    10.247160 -> 10.444528: 1.02x slower
> > kdecore.cc  -g -O1    27.026285 -> 27.220654: 1.01x slower    27.280681 -> 27.622676: 1.01x slower
> > kdecore.cc  -g -O2    43.791668 -> 44.020270: 1.01x slower    43.904934 -> 44.248477: 1.01x slower
> > kdecore.cc  -g -O3    47.471836 -> 47.651101: 1.00x slower    47.645985 -> 48.005495: 1.01x slower
> > kdecore.cc  -g -Os    31.678652 -> 31.802829: 1.00x slower    31.741484 -> 32.033478: 1.01x slower
> > empty.c     -g -O0     0.012662 ->  0.011932: 1.06x faster     0.012888 ->  0.013143: 1.02x slower
> > empty.c     -g -O1     0.012685 ->  0.012558: 1.01x faster     0.013164 ->  0.012790: 1.03x faster
> > empty.c     -g -O2     0.012694 ->  0.012846: 1.01x slower     0.012912 ->  0.013175: 1.02x slower
> > empty.c     -g -O3     0.012654 ->  0.012699: 1.00x slower     0.012596 ->  0.012792: 1.02x slower
> > empty.c     -g -Os     0.013057 ->  0.012766: 1.02x faster     0.012691 ->  0.012885: 1.02x slower
> > big-code.c  -g -O0     3.292680 ->  3.325748: 1.01x slower     3.292948 ->  3.303049: 1.00x slower
> > big-code.c  -g -O1    15.701810 -> 15.765014: 1.00x slower    15.714116 -> 15.759254: 1.00x slower
> > big-code.c  -g -O2    22.575615 -> 22.620187: 1.00x slower    22.567406 -> 22.605435: 1.00x slower
> > big-code.c  -g -O3    52.423586 -> 52.590075: 1.00x slower    52.421460 -> 52.703835: 1.01x slower
> > big-code.c  -g -Os    21.153980 -> 21.253598: 1.00x slower    21.146266 -> 21.260138: 1.01x slower
> > influence.i -g -O0     0.148229 ->  0.149518: 1.01x slower     0.148672 ->  0.156262: 1.05x slower
> > influence.i -g -O1     0.387397 ->  0.389930: 1.01x slower     0.387734 ->  0.396655: 1.02x slower
> > influence.i -g -O2     0.587514 ->  0.589604: 1.00x slower     0.588064 ->  0.596510: 1.01x slower
> > influence.i -g -O3     1.273561 ->  1.280514: 1.01x slower     1.274599 ->  1.287596: 1.01x slower
> > influence.i -g -Os     0.526045 ->  0.527579: 1.00x slower     0.526827 ->  0.535635: 1.02x slower
> >
> > Maximal ggc memory (kb)
> >                       Control -> v2                          Control -> v2+adhocloc+at+every+token
> > kdecore.cc  -g -O0     650337.000 ->  654435.000: 1.0063x larger     650337.000 ->  711775.000: 1.0945x larger
> > kdecore.cc  -g -O1     931966.000 ->  940144.000: 1.0088x larger     931951.000 ->  989384.000: 1.0616x larger
> > kdecore.cc  -g -O2    1125325.000 -> 1133514.000: 1.0073x larger    1125318.000 -> 1182384.000: 1.0507x larger
> > kdecore.cc  -g -O3    1221408.000 -> 1229596.000: 1.0067x larger    1221410.000 -> 1278658.000: 1.0469x larger
> > kdecore.cc  -g -Os     867140.000 ->  871235.000: 1.0047x larger     867141.000 ->  928700.000: 1.0710x larger
> > empty.c     -g -O0       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > empty.c     -g -O1       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > empty.c     -g -O2       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > empty.c     -g -O3       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > empty.c     -g -Os       1189.000 ->    1192.000: 1.0025x larger       1189.000 ->    1193.000: 1.0034x larger
> > big-code.c  -g -O0     166584.000 ->  172731.000: 1.0369x larger     166584.000 ->  172726.000: 1.0369x larger
> > big-code.c  -g -O1     279793.000 ->  285940.000: 1.0220x larger     279793.000 ->  285935.000: 1.0220x larger
> > big-code.c  -g -O2     400058.000 ->  406194.000: 1.0153x larger     400058.000 ->  406189.000: 1.0153x larger
> > big-code.c  -g -O3     903648.000 ->  909750.000: 1.0068x larger     903906.000 ->  910001.000: 1.0067x larger
> > big-code.c  -g -Os     357060.000 ->  363010.000: 1.0167x larger     357060.000 ->  363005.000: 1.0166x larger
> > influence.i -g -O0       9273.000 ->    9719.000: 1.0481x larger       9273.000 ->   13303.000: 1.4346x larger
> > influence.i -g -O1      12968.000 ->   13414.000: 1.0344x larger      12968.000 ->   16998.000: 1.3108x larger
> > influence.i -g -O2      16386.000 ->   16768.000: 1.0233x larger      16386.000 ->   20352.000: 1.2420x larger
> > influence.i -g -O3      35508.000 ->   35763.000: 1.0072x larger      35508.000 ->   39346.000: 1.1081x larger
> > influence.i -g -Os      14287.000 ->   14669.000: 1.0267x larger      14287.000 ->   18253.000: 1.2776x larger
> >
> > Thoughts?
>
> The compile-time and memory-usage impact of the adhocloc-at-every-token
> patchkit is quite big.  Remember that gaining 1% in compile-time is hard,
> and a 20-40% memory increase for influence.i looks like too much.
>
> I also wonder why you see differences in memory-usage change for
> different -O levels.  I think we should have a pretty "static" line
> table after parsing?  Thus, rather than percentages, I'd like to see
> absolute changes (which I'd expect to be the same for all -O levels).
>
> Richard.
>
> > Dave
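As context for the packed-range results below, here is a hypothetical sketch of what packing a short range directly into the 32-bit value could look like.  The field widths (16/8/8) and function names are invented for illustration; the actual patches lay out the bits differently and still reserve the ad-hoc escape.  A range length of zero serves as the escape value, meaning "no packed range, consult the lookaside":

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t source_location;

/* Hypothetical layout: [ line:16 | column:8 | range_len:8 ].
   range_len == 0 is the escape value: "no packed range here,
   fall back to the ad-hoc lookaside table".  */
#define COLUMN_BITS 8
#define RANGE_BITS  8

static source_location
encode_packed_range (uint32_t line, uint32_t col, uint32_t len)
{
  /* Only ranges that start at the caret and span fewer than
     (1 << RANGE_BITS) columns fit; everything else must escape.  */
  assert (len < (1u << RANGE_BITS));
  return (line << (COLUMN_BITS + RANGE_BITS))
	 | (col << RANGE_BITS)
	 | len;
}

static uint32_t
packed_line (source_location loc)
{
  return loc >> (COLUMN_BITS + RANGE_BITS);
}

static uint32_t
packed_column (source_location loc)
{
  return (loc >> RANGE_BITS) & ((1u << COLUMN_BITS) - 1);
}

static uint32_t
packed_range_len (source_location loc)
{
  return loc & ((1u << RANGE_BITS) - 1);
}
```

The trade-off is visible in the layout: every bit given to range_len is taken from the line field, which is what Michael's 3-bit variant tries to soften, and what making the bits-per-range a property of the line_table (zero for LTO) would avoid entirely where ranges aren't wanted.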
I have a mostly-working implementation of the range-packing idea, where
short ranges that start at the caret location and finish within
(1<<RANGE_BITS) of the start are stored in low bits directly within the
32-bit source_location, the hope being to ameliorate the cost of packing
every token's range into the location_t.

https://dmalcolm.fedorapeople.org/gcc/2015-10-15/rich-locations/
(patch 0014 implements the range-compression idea).

It doesn't bootstrap yet (I'm working on that), but it runs well enough
to pass a lot of tests, and can be benchmarked.

The following shows the results posted above, plus the new v2+packed
+ranges as the final column, in a slightly different format that I hope
is easier to read.

Minimal wallclock time (s) over 10 iterations
(each experiment's % change after "v2" is relative to 10 iterations of
control interleaved with that experiment)

                    Control    v2                 v2+every+token     v2+packed+ranges
------------------  ---------  -----------------  -----------------  ------------------
kdecore.cc  -g -O0  10.3065    10.268712 (-0.4%)  10.444528 (+1.9%)  10.398405 (+1.3%)
kdecore.cc  -g -O1  27.0263    27.220654 (+0.7%)  27.622676 (+1.3%)  27.548755 (+1.9%)
kdecore.cc  -g -O2  43.7917    44.02027  (+0.5%)  44.248477 (+0.8%)  43.843728 (-0.6%)
kdecore.cc  -g -O3  47.4718    47.651101 (+0.4%)  48.005495 (+0.8%)  47.57282  (+0.1%)
kdecore.cc  -g -Os  31.6787    31.802829 (+0.4%)  32.033478 (+0.9%)  31.699171 (-0.2%)
empty.c     -g -O0  0.012662   0.011932  (-5.8%)  0.013143  (+2.0%)  0.013208  (+1.5%)
empty.c     -g -O1  0.012685   0.012558  (-1.0%)  0.01279   (-2.8%)  0.012424  (+0.4%)
empty.c     -g -O2  0.012694   0.012846  (+1.2%)  0.013175  (+2.0%)  0.013176  (+1.6%)
empty.c     -g -O3  0.012654   0.012699  (+0.4%)  0.012792  (+1.6%)  0.013198  (+0.5%)
empty.c     -g -Os  0.013057   0.012766  (-2.2%)  0.012885  (+1.5%)  0.01298   (-1.6%)
big-code.c  -g -O0  3.29268    3.325748  (+1.0%)  3.303049  (+0.3%)  3.350572  (+1.7%)
big-code.c  -g -O1  15.7018    15.765014 (+0.4%)  15.759254 (+0.3%)  15.45175  (-2.0%)
big-code.c  -g -O2  22.5756    22.620187 (+0.2%)  22.605435 (+0.2%)  22.343609 (-1.2%)
big-code.c  -g -O3  52.4236    52.590075 (+0.3%)  52.703835 (+0.5%)  51.9239   (-1.0%)
big-code.c  -g -Os  21.154     21.253598 (+0.5%)  21.260138 (+0.5%)  20.907407 (-1.3%)
influence.i -g -O0  0.148229   0.149518  (+0.9%)  0.156262  (+5.1%)  0.150652  (+1.0%)
influence.i -g -O1  0.387397   0.38993   (+0.7%)  0.396655  (+2.3%)  0.3916    (+0.8%)
influence.i -g -O2  0.587514   0.589604  (+0.4%)  0.59651   (+1.4%)  0.585223  (-0.6%)
influence.i -g -O3  1.27356    1.280514  (+0.5%)  1.287596  (+1.0%)  1.273018  (-0.1%)
influence.i -g -Os  0.526045   0.527579  (+0.3%)  0.535635  (+1.7%)  0.528192  (+0.2%)

Maximal ggc memory (kb)

                    Control    v2               v2+every+token    v2+packed+ranges
------------------  ---------  ---------------  ----------------  ------------------
kdecore.cc  -g -O0  650337     654435  (+0.6%)  711775  (+9.4%)   659372  (+1.4%)
kdecore.cc  -g -O1  931966     940144  (+0.9%)  989384  (+6.2%)   954779  (+2.4%)
kdecore.cc  -g -O2  1125325    1133514 (+0.7%)  1182384 (+5.1%)   1136459 (+1.0%)
kdecore.cc  -g -O3  1221408    1229596 (+0.7%)  1278658 (+4.7%)   1232688 (+0.9%)
kdecore.cc  -g -Os  867140     871235  (+0.5%)  928700  (+7.1%)   884304  (+2.0%)
empty.c     -g -O0  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
empty.c     -g -O1  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
empty.c     -g -O2  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
empty.c     -g -O3  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
empty.c     -g -Os  1189       1192    (+0.3%)  1193    (+0.3%)   1189    (+0.0%)
big-code.c  -g -O0  166584     172731  (+3.7%)  172726  (+3.7%)   176062  (+5.7%)
big-code.c  -g -O1  279793     285940  (+2.2%)  285935  (+2.2%)   281538  (+0.6%)
big-code.c  -g -O2  400058     406194  (+1.5%)  406189  (+1.5%)   395632  (-1.1%)
big-code.c  -g -O3  903648     909750  (+0.7%)  910001  (+0.7%)   902477  (-0.1%)
big-code.c  -g -Os  357060     363010  (+1.7%)  363005  (+1.7%)   353338  (-1.0%)
influence.i -g -O0  9273       9719    (+4.8%)  13303   (+43.5%)  9758    (+5.2%)
influence.i -g -O1  12968      13414   (+3.4%)  16998   (+31.1%)  13562   (+4.6%)
influence.i -g -O2  16386      16768   (+2.3%)  20352   (+24.2%)  16737   (+2.1%)
influence.i -g -O3  35508      35763   (+0.7%)  39346   (+10.8%)  35783   (+0.8%)
influence.i -g -Os  14287      14669   (+2.7%)  18253   (+27.8%)  14769   (+3.4%)

This fixes much of the bloat seen for influence.i when sending ranges
through for every token.

This was with 8 bits allocated for packed ranges (which is probably
excessive, but it makes debugging easier).

Interestingly, the memory usage sometimes goes down relative to control
at higher optimization levels.  I'm not sure why yet.

Dave