Re: ghc 8.0.2 vs 8.4.1 compilation time and performance

2018-03-22 Thread Evan Laforge
Ok, done, I created https://ghc.haskell.org/trac/ghc/ticket/14964

But first:

On Thu, Mar 22, 2018 at 3:26 AM, Simon Peyton Jones
 wrote:
> score   max mb  total mb  prdderive lily   perform
> ghc
> 6   72.26   3279.22   0.88   0.79~0.84  0.70~0.74  0.31~0.33
> 8.0.2
> 6   76.63   3419.20   0.58   1.45~1.59  1.05~1.07  0.33~0.36
> 8.4.1
>
> bloom   70.69   2456.14   0.89   1.32~1.36 0.15~0.16
> 8.0.2
> bloom   67.86   2589.97   0.62   1.94~1.99 0.20~0.22
> 8.4.1
>
> The bytes-allocated number has gone up a bit.  (Not too surprising… the
> compiler is doing more.)  But the productivity number is down sharply, and
> consistently so, which translates directly into longer compile times.
> Somehow, although residency is not increasing, GC time is greatly increased.

To be clear, this is the performance of generated code, not the
performance of the compiler itself.  The conclusion from compiler
performance was non-optimized is much faster (yay!) and optimized
slightly slower.  That's reasonable if it's doing more work, but
somehow the more work turns into lower GC productivity in the
generated code :(


> It’d be good to figure out what’s gone wrong here. Maybe a change in nursery
> size or something stupid like that?

I don't think so, this is my own code and the invocation is the same
across versions, hopefully only the compiler version has changed.  I
am tweaking nursery size though, here are the RTS flags: -N -A8m -T

I can try testing again with no RTS flags.

I realize this is a bit vague as it stands, I'll try to narrow things down.
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


RE: ghc 8.0.2 vs 8.4.1 compilation time and performance

2018-03-22 Thread Simon Peyton Jones via ghc-devs
Evan

That’s truly great information, thank you.  Could you start a GHC ticket for 
it?   Stuff gets lots in email, less so in tickets.

The big take-away I got from your message is this:

score   max mb  total mb  prdderive lily   performghc
6   72.26   3279.22   0.88   0.79~0.84  0.70~0.74  0.31~0.33  8.0.2
6   76.63   3419.20   0.58   1.45~1.59  1.05~1.07  0.33~0.36  8.4.1

bloom   70.69   2456.14   0.89   1.32~1.36 0.15~0.16  8.0.2
bloom   67.86   2589.97   0.62   1.94~1.99 0.20~0.22  8.4.1

The bytes-allocated number has gone up a bit.  (Not too surprising… the 
compiler is doing more.)  But the productivity number is down sharply, and 
consistently so, which translates directly into longer compile times.  Somehow, 
although residency is not increasing, GC time is greatly increased.

We have some compiler/perf regression tests that track bytes-allocated, but not 
that track productivity, which is why we have noticed.

It’d be good to figure out what’s gone wrong here. Maybe a change in nursery 
size or something stupid like that?

It’d be great if someone felt up to investigating a bit

Thanks for gathering this data.

Simon


From: Glasgow-haskell-users  On 
Behalf Of Evan Laforge
Sent: 21 March 2018 22:05
To: GHC users 
Subject: ghc 8.0.2 vs 8.4.1 compilation time and performance

I just upgraded from 8.0.2 to 8.4.1, and I took the opportunity to do a few
informal compile time and run time tests.  There's been a lot of talk about
compile time regressions, so maybe these will be of interest, informal as
they are.

I wound up skipping 8.2.1 due to https://ghc.haskell.org/trac/ghc/ticket/13604,
but I could still test with 8.2 perfectly well.  Just haven't done it yet.

In this context, RunTests is more code with no optimization (and -fhpc, if it
matters).  debug/seq and opt/seq are the same code but with no optimization and
-O respectively.  I found that -O2 hurt compile time but didn't improve run
time, but it's been a long time so I should run that experiment again.

compile times:

OS X, macbook pro:

RunTests  549.10s user 118.45s system 343% cpu 3:14.53 total8.0.2
RunTests  548.71s user 117.10s system 347% cpu 3:11.78 total8.0.2
RunTests  450.92s user 109.63s system 343% cpu 2:43.13 total8.4.1
RunTests  445.48s user 107.99s system 341% cpu 2:42.19 total8.4.1

debug/seq 284.47s user 55.95s system 345% cpu 1:38.58 total 8.0.2
debug/seq 283.33s user 55.27s system 343% cpu 1:38.53 total 8.0.2
debug/seq 220.92s user 50.21s system 337% cpu 1:20.32 total 8.4.1
debug/seq 218.39s user 49.20s system 345% cpu 1:17.47 total 8.4.1

opt/seq   732.63s user 70.86s system 338% cpu 3:57.30 total 8.0.2
opt/seq   735.21s user 71.48s system 327% cpu 4:06.31 total 8.0.2
opt/seq   785.12s user 65.42s system 327% cpu 4:19.84 total 8.4.1
opt/seq   765.52s user 64.01s system 321% cpu 4:18.29 total 8.4.1

Linux, PC:

RunTests781.31s user 58.21s system 363% cpu 3:50.70 total   8.0.2
RunTests613.11s user 49.84s system 357% cpu 3:05.52 total   8.4.1

debug/seq   429.44s user 31.34s system 362% cpu 2:07.03 total   8.0.2
debug/seq   329.67s user 23.86s system 352% cpu 1:40.38 total   8.4.1

opt/seq 1277.20s user 45.85s system 358% cpu 6:08.68 total  8.0.2
opt/seq 1339.73s user 39.87s system 341% cpu 6:43.50 total  8.4.1

So it looks like non-optimized compile times have gotten significantly better
since 8.0.2.  However, optimized has gotten a little worse, but not much.

The performance numbers are a bit more disappointing.  At first it appeared
that allocation went down in 8.4.1 while overall time is up significantly.
However, the 8.4.1 used newer dependencies, so to try to control for those, I
tested again after using a cabal freeze from the 8.4.1 test.  Of course I had
to remove the ghc distributed packages, like container and 'ghc' itself, but
the rest of the deps should be the same.  Those have the 'libs' suffix on
Linux.

From that, it looks like the improved memory in 8.4.1 was due to external
dependencies, and in fact 8.4.1 bumped memory usage up again.  Ow.

In the graphs, 'score' is just the input file.  'max mb' and 'total mb' and
'prd' come from the post-run GC report, specifically '* bytes maximum
residency', '* bytes allocated in the heap', and productivity fields.
'derive', 'lily' and 'perform' are just different kinds of processes.  They are
CPU time bracketing the specific action, after initialization, and the range is
min and max over 6 runs, so no fancy criterion-like analysis.  Each run is a
separate process, so they should be independent.

I was hoping for some gains due to the join points stuff, but it kind of looks
like things get worse