Hello again,

Thanks to everyone who pointed out spurious failures over the last few weeks. Here's the current state of affairs and some discussion on next steps.

*Dashboard*

I made a dashboard for tracking spurious failures:

https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2

I created this for three reasons:

1. Keep tabs on new occurrences of spurious failures
2. Understand which problems are causing the most issues
3. Measure the effectiveness of any intervention

The dashboard still needs development, but it can already be used to show that the number of "Cannot connect to Docker daemon" failures has been reduced.


*Characterizing and Fixing Failures*

I have preliminary results on a few failure types. For instance, I used the "docker" type of failure to bootstrap the dashboard. Along with "Killed with signal 9", it seems to indicate a problem with the CI runner itself.

To look more deeply into these runner-system failures, *I will need more access*. If you are responsible for some runners and you're comfortable giving me shell access, you can find my public SSH key at https://gitlab.haskell.org/-/snippets/5546. (It's posted as a snippet so at least you know the key comes from somebody who can access my GitLab account. Other secure means of communication are listed at https://keybase.io/chreekat.) If you do add it, please send me a message.

Besides runner problems, there are spurious failures that may have more to do with the CI code itself. These include a problem with environment variables and (probably) an issue with console buffering. Neither of these is being tracked on the dashboard yet, and many other problems have not been explored at all.
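
To give a flavor of what "characterizing" a failure means in practice, here's a rough sketch of the sort of pattern matching involved. The patterns below are illustrative guesses at the log text, not the exact queries behind the dashboard:

    import re

    # Illustrative patterns for the failure types mentioned above; the
    # real log messages (and the dashboard's queries) may differ.
    FAILURE_PATTERNS = {
        "docker": re.compile(r"Cannot connect to (the )?Docker daemon"),
        "killed_signal_9": re.compile(r"Killed with signal 9"),
        # The environment-variable and console-buffering failures still
        # need patterns once they are better characterized.
    }

    def classify(trace: str) -> list[str]:
        """Return the failure types whose pattern occurs in a job trace."""
        return [name for name, pattern in FAILURE_PATTERNS.items()
                if pattern.search(trace)]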


*Next Steps*

The theme for the next steps is finalizing the dashboard and characterizing more failures.

 * Track more failure types on the dashboard
 * Improve the process of backfilling failure data on the dashboard (see the sketch after this list)
 * Include more metadata (like project id!) on the dashboard so it's easier to zoom in on failures
 * Document the dashboard and the processes that populate it for posterity
 * Diagnose runner-system failures (if I get access to the runners)
 * Continue exploring other failure types
 * Fix failures omg!?
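
For the backfilling item above, the rough idea is to walk a project's failed jobs through the GitLab API, fetch each trace, and classify it. Here's a minimal sketch, assuming a personal access token; the real process needs paging, rate limiting, and somewhere to store the results:

    import requests

    GITLAB = "https://gitlab.haskell.org/api/v4"
    PROJECT = "ghc%2Fghc"       # URL-encoded project path (a numeric id also works)
    HEADERS = {"PRIVATE-TOKEN": "<token with read_api scope>"}  # placeholder

    def failed_jobs(page=1):
        """Fetch one page of failed jobs for the project."""
        r = requests.get(f"{GITLAB}/projects/{PROJECT}/jobs",
                         headers=HEADERS,
                         params={"scope[]": "failed", "per_page": 100, "page": page})
        r.raise_for_status()
        return r.json()

    def job_trace(job_id):
        """Fetch the raw log ("trace") of a single job."""
        r = requests.get(f"{GITLAB}/projects/{PROJECT}/jobs/{job_id}/trace",
                         headers=HEADERS)
        r.raise_for_status()
        return r.text

    # In practice each trace would go through the classifier and the result
    # would be recorded wherever the dashboard reads from; here we just print.
    for job in failed_jobs():
        if "Cannot connect to Docker daemon" in job_trace(job["id"]):
            print("docker failure:", job["web_url"])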

The list of next steps is currently heavy on finalizing the dashboard and light on fixing spurious failures. I know that might be frustrating. My justification is that CI is a complex hardware/software/human system under continuous operation, where most of the low-hanging fruit has already been plucked. It's time to get serious. :) My goal is to make spurious failures surprising rather than commonplace, and this is the best way I know to achieve that.

Thanks again for helping me with this goal. :)


-Bryan

P.S. If you're interested, I've been posting updates like this one on Discourse:

https://discourse.haskell.org/search?q=DevOps%20Weekly%20Log%20%23haskell-foundation%20order%3Alatest_topic


On 18/05/2022 13:25, Bryan wrote:
Hi all,

I'd like to get some data on weird CI failures. Before clicking "retry" on a spurious failure, please paste the url for your job into the spreadsheet you'll find linked at https://gitlab.haskell.org/ghc/ghc/-/issues/21591.

Sorry for the slight misdirection. I wanted the spreadsheet to be world-writable, which means I don't want its url floating around in too many places. Maybe you can bookmark it if CI is causing you too much trouble. :)

-Bryan
