We need to do something about this, and I'd advocate for just not making stats 
fail with marge.

Generally I agree.   One point you don’t mention is that our perf tests (which 
CI forces us to look at assiduously) are often pretty weird cases.  So there is 
at least a danger that these more exotic cases will stand in the way of (say) a 
perf improvement in the typical case.

But “not making stats fail” is a bit crude.  Instead, how about:

  *   Always accept stat improvements.

  *   We already have per-benchmark windows.  If the stat falls outside the 
window, we fail.  You are effectively saying “widen all windows to infinity”.  
If something makes a stat 10 times worse, I think we *should* fail.  But 10% 
worse?  Maybe we should accept it and look later, as you suggest.  So I’d argue 
for widening the windows rather than disabling them completely (a minimal 
sketch of that policy follows this list).

  *   If we did that we’d need good instrumentation to spot steps and drift in 
perf, as you say.  An advantage is that since the perf instrumentation runs 
only on committed master patches, not on every CI run, it can cost more.  In 
particular, it could run a bunch of “typical” tests, including nofib and 
compiling Cabal or other libraries.
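
To make the window idea concrete, here is a minimal sketch (in Haskell, purely 
illustrative; this is not how GHC’s testsuite driver is actually written) of 
the policy in the second bullet: improvements are always accepted, small 
regressions inside a widened per-benchmark window are accepted but recorded, 
and only large excursions fail.  The names (MetricCheck, checkMetric, 
tolerance) are invented for the example.

-- Outcome of comparing a measured metric against its baseline.
data MetricCheck
  = Improvement Double    -- metric improved by this fraction (change <= 0)
  | WithinWindow Double   -- regression, but inside the tolerance window
  | OutsideWindow Double  -- regression beyond the window: fail CI
  deriving Show

-- | Compare a measured value against a baseline, with a per-benchmark
-- tolerance expressed as a fraction (e.g. 0.10 for a 10% window).
checkMetric :: Double -> Double -> Double -> MetricCheck
checkMetric tolerance baseline measured
  | change <= 0         = Improvement change    -- always accept improvements
  | change <= tolerance = WithinWindow change   -- accept, but record it
  | otherwise           = OutsideWindow change  -- block the merge
  where
    change = (measured - baseline) / baseline

-- With a 10% window, a 2% regression is accepted (and left for post-merge
-- instrumentation to pick up), while a 10x blow-up still fails:
-- >>> checkMetric 0.10 100 102
-- WithinWindow 2.0e-2
-- >>> checkMetric 0.10 100 1000
-- OutsideWindow 9.0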

The big danger is that by relieving patch authors of the need to worry about 
perf drift, that worry will end up in the lap of the GHC HQ team.  If it’s hard 
for the author of a single patch (with which she is intimately familiar) to 
work out why it’s making some test 2% worse, imagine how hard, and 
demotivating, it’d be for Ben to wonder why 50 patches (with which he is 
unfamiliar) are making some test 5% worse.

I’m not sure how to address this problem.   At least we should make it clear 
that patch authors are expected to engage *actively* in a conversation about 
why their patch is making something worse, even after it lands.

Simon

From: ghc-devs <ghc-devs-boun...@haskell.org> On Behalf Of Moritz Angermann
Sent: 17 March 2021 03:00
To: ghc-devs <ghc-devs@haskell.org>
Subject: On CI

Hi there!

Just a quick update on our CI situation. Ben, John, Davean and I discussed CI
yesterday: what we can do about it, as well as some notes on why we are
frustrated with it. This is an open invitation to anyone who earnestly wants
to work on CI. Please come forward and help! We'd be glad to have more people
involved!

First the good news: over the last few weeks we've seen that we *can* improve
CI performance quite substantially. The goal is now to have an MR go through
CI within at most 3 hours. There are some ideas on how to make this even
faster, especially on wide (high core count) machines; however, that will take
a bit more time.

Now to the more thorny issue: stat failures. We do not want GHC to regress,
and I believe everyone is on board with that mission. Yet we have just
witnessed a train of marge trials all fail due to a -2% regression in a few
tests, and thus we've been blocking stuff from getting into master for at
least another day. This is (in my opinion) not acceptable! We just had five
days of nothing working because master was broken and subsequently all CI
pipelines kept failing; we have effectively wasted a week. We can mitigate the
latter part by enforcing marge for all merges to master (and with faster
pipeline turnaround times this might be more palatable than with 9-12h
turnaround times -- when you need to get something done! ha!), but that won't
help us with issues where marge can't find a set of buildable MRs, because she
just keeps hitting a combination of MRs that somehow together increase or
decrease metrics.

We have three knobs to adjust:
- Make GHC build faster / make the testsuite run faster.
  There is some rather interesting work going on about parallelizing earlier
  during builds. We've also seen that we've wasted enormous amounts of time in
  the kernel during darwin builds, because of a bug in the test driver.
- Use faster hardware.
  We've seen that this alone can cut windows build times from 220min to 80min.
- Reduce the number of builds.
  We used to build two pipelines for each marge merge, and if either of the
  two (see below) failed, marge's merge would fail as well. So not only did we
  build twice as much as we needed to, we also doubled our chances of hitting
  bogus build failures.

We need to do something about this, and I'd advocate for simply not making
stat failures fail marge runs. Build errors, of course, should still fail,
but stat failures should not. We would then have a separate dashboard (Ben
has some old code lying around for this, which someone would need to pick up
and polish, ...) that tracks GHC's performance for each commit to master,
with easy access from the dashboard to the offending commit. We will also
need to consider the implications of synthetic micro-benchmarks, as opposed
to, say, building Cabal or other packages, which reflect more of the
real-world experience of users using GHC.
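
As a purely illustrative sketch (this is not the old dashboard code Ben has
lying around), the per-commit tracking could be as simple as scanning the
history of each metric on master and flagging commits where it steps up
relative to the recent average; the function name, window and threshold below
are invented for the example.

-- | Flag commits whose metric exceeds the mean of the previous @window@
-- commits by more than @threshold@ (a fraction, e.g. 0.05 for 5%).
flagSteps :: Int -> Double -> [(String, Double)] -> [(String, Double)]
flagSteps window threshold = go []
  where
    go _ [] = []
    go seen ((commit, value) : rest)
      | not (null recent) && (value - avg) / avg > threshold
                  = (commit, value) : go (seen ++ [value]) rest
      | otherwise = go (seen ++ [value]) rest
      where
        recent = drop (length seen - window) seen   -- last @window@ samples
        avg    = sum recent / fromIntegral (length recent)

-- A 6% jump after a run of stable measurements is reported together with
-- the commit that introduced it:
-- >>> flagSteps 5 0.04 [("c1",100),("c2",101),("c3",99),("c4",100),("c5",106)]
-- [("c5",106.0)]

Run over nofib or Cabal-compile metrics for every commit to master, something
like this would point straight at the offending commit, which is exactly the
easy access the dashboard should give.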

Going forward, I will try to provide a data-driven report on GHC's CI on a
bi-weekly or monthly basis (we will have to see what the cost of writing it
up is, and how useful it turns out to be). My sincere hope is that it will
help us better understand our CI situation, instead of just having some vague
complaints about it.

Cheers,
 Moritz