Hello again,

Thanks to everyone who pointed out spurious failures over the last few weeks. Here's the current state of affairs and some discussion on next steps.

*Dashboard*

I made a dashboard for tracking spurious failures:

https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2

I created this for three reasons:

1. Keep tabs on new occurrences of spurious failures
2. Understand which problems are causing the most issues
3. Measure the effectiveness of any intervention

The dashboard still needs development, but it can already be used to show that the number of "Cannot connect to Docker daemon" failures has been reduced.


*Characterizing and Fixing Failures*

I have preliminary results on a few failure types. For instance, I used the "docker" type of failure to bootstrap the dashboard. Along with "Killed with signal 9", it seems to indicate a problem with the CI runner itself.

To look more deeply into these runner-system failures, *I will need more access*. If you are responsible for some runners and you're comfortable giving me shell access, you can find my public SSH key at https://gitlab.haskell.org/-/snippets/5546. (It's posted as a snippet so at least you know the key comes from somebody who can access my GitLab account. Other secure means of communication are listed at https://keybase.io/chreekat.) If you do add it, please send me a message.

Besides runner problems, there are spurious failures that may have more to do with the CI code itself. These include a problem with environment variables and (probably) an issue with console buffering. Neither of these is being tracked on the dashboard yet, and many other problems have not been explored at all.
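
To give a flavor of what "characterizing" a failure means in practice, here's a rough sketch of the sort of pattern matching involved. The patterns below are illustrative guesses at the log text, not the exact queries behind the dashboard:

    import re

    # Illustrative patterns for the failure types mentioned above; the
    # real log messages (and the dashboard's queries) may differ.
    FAILURE_PATTERNS = {
        "docker": re.compile(r"Cannot connect to (the )?Docker daemon"),
        "killed_signal_9": re.compile(r"Killed with signal 9"),
        # The environment-variable and console-buffering failures still
        # need patterns once they are better characterized.
    }

    def classify(trace: str) -> list[str]:
        """Return the failure types whose pattern occurs in a job trace."""
        return [name for name, pattern in FAILURE_PATTERNS.items()
                if pattern.search(trace)]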


*Next Steps*

The theme for the next steps is finalizing the dashboard and characterizing more failures.

 * Track more failure types on the dashboard
 * Improve the process of backfilling failure data on the dashboard (see the sketch after this list)
 * Include more metadata (like project id!) on the dashboard so it's easier to zoom in on failures
 * Document the dashboard and the processes that populate it for posterity
 * Diagnose runner-system failures (if I get access to the runners)
 * Continue exploring other failure types
 * Fix failures omg!?
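
For the backfilling item above, the rough idea is to walk a project's failed jobs through the GitLab API, fetch each trace, and classify it. Here's a minimal sketch, assuming a personal access token; the real process needs paging, rate limiting, and somewhere to store the results:

    import requests

    GITLAB = "https://gitlab.haskell.org/api/v4"
    PROJECT = "ghc%2Fghc"       # URL-encoded project path (a numeric id also works)
    HEADERS = {"PRIVATE-TOKEN": "<token with read_api scope>"}  # placeholder

    def failed_jobs(page=1):
        """Fetch one page of failed jobs for the project."""
        r = requests.get(f"{GITLAB}/projects/{PROJECT}/jobs",
                         headers=HEADERS,
                         params={"scope[]": "failed", "per_page": 100, "page": page})
        r.raise_for_status()
        return r.json()

    def job_trace(job_id):
        """Fetch the raw log ("trace") of a single job."""
        r = requests.get(f"{GITLAB}/projects/{PROJECT}/jobs/{job_id}/trace",
                         headers=HEADERS)
        r.raise_for_status()
        return r.text

    # In practice each trace would go through the classifier and the result
    # would be recorded wherever the dashboard reads from; here we just print.
    for job in failed_jobs():
        if "Cannot connect to Docker daemon" in job_trace(job["id"]):
            print("docker failure:", job["web_url"])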

The list of next steps is currently heavy on finalizing the dashboard and light on fixing spurious failures. I know that might be frustrating. My justification is that CI is a complex hardware/software/human system under continuous operation, where most of the low-hanging fruit has already been plucked. It's time to get serious. :) My goal is to make spurious failures surprising rather than commonplace, and this is the best way I know to achieve that.

Thanks again for helping me with this goal. :)


-Bryan

P.S. If you're interested, I've been posting updates like this one on Discourse:

https://discourse.haskell.org/search?q=DevOps%20Weekly%20Log%20%23haskell-foundation%20order%3Alatest_topic


On 18/05/2022 13:25, Bryan wrote:
Hi all,

I'd like to get some data on weird CI failures. Before clicking "retry" on a spurious failure, please paste the url for your job into the spreadsheet you'll find linked at https://gitlab.haskell.org/ghc/ghc/-/issues/21591.

Sorry for the slight misdirection. I wanted the spreadsheet to be world-writable, which means I don't want its url floating around in too many places. Maybe you can bookmark it if CI is causing you too much trouble. :)

-Bryan
