Re: Tracking intermittently failing CI jobs

2022-07-12 Thread Bryan Richter via ghc-devs

Hello again,

Thanks to everyone who pointed out spurious failures over the last few 
weeks. Here's the current state of affairs and some discussion on next 
steps.


*Dashboard*

I made a dashboard for tracking spurious failures:

https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2

I created this for three reasons:

1. Keep tabs on new occurrences of spurious failures
2. Understand which problems are causing the most issues
3. Measure the effectiveness of any intervention

The dashboard still needs development, but it can already be used to 
show that the number of "Cannot connect to Docker daemon" failures has 
been reduced.



*Characterizing and Fixing Failures*

I have preliminary results on a few failure types. For instance, I used 
the "docker" type of failure to bootstrap the dashboard. Along with 
"Killed with signal 9", it seems to indicate a problem with the CI 
runner itself.
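
As a rough illustration of what characterizing a failure can look like, here 
is a minimal sketch that labels a saved job log by searching for the two 
messages quoted above. The function name, the labels and the idea of 
working from a log file on disk are assumptions made for this example; 
it is not how the dashboard itself is populated.

```shell
# classify_log FILE
# Print a label for the first known failure message found in FILE.
# The labels (docker, signal_9, unknown) are hypothetical names.
classify_log() {
    if grep -Eq 'Cannot connect to (the )?Docker daemon' "$1"; then
        echo docker
    elif grep -q 'Killed with signal 9' "$1"; then
        echo signal_9
    else
        echo unknown
    fi
}
```

Calling `classify_log job.log` on a downloaded job trace would print one 
label per log, which is enough to count occurrences per failure type.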


To look more deeply into these types of runner-system failures, *I will 
need more access*. If you are responsible for some runners and you're 
comfortable giving me shell access, you can find my public ssh key at 
https://gitlab.haskell.org/-/snippets/5546. (Posted as a snippet so at 
least you know the key comes from somebody who can access my GitLab 
account. Other secure means of communication are listed at 
https://keybase.io/chreekat.) Send me a message if you do so.


Besides runner problems, there are spurious failures that may have more 
to do with the CI code itself. They include some problem with 
environment variables and (probably) some issue with console buffering. 
Neither of these is being tracked on the dashboard yet. Many other 
problems are yet to be explored at all.



*Next Steps*

The theme for the next steps is finalizing the dashboard and 
characterizing more failures.


 * Track more failure types on the dashboard
 * Improve the process of backfilling failure data on the dashboard
 * Include more metadata (like project id!) on the dashboard so it's
   easier to zoom in on failures
 * Document the dashboard and the processes that populate it for posterity
 * Diagnose runner-system failures (if accessible)
 * Continue exploring other failure types
 * Fix failures omg!?

The list of next steps is currently heavy on finalizing the dashboard 
and light on fixing spurious failures. I know that might be frustrating. 
My justification is that CI is a complex hardware/software/human system 
under continuous operation where most of the low-hanging fruit has already 
been plucked. It's time to get serious. :) My goal is to make spurious 
failures surprising rather than commonplace. This is the best way I know 
to achieve that.


Thanks again for helping me with this goal. :)


-Bryan

P.S. If you're interested, I've been posting updates like this one on 
Discourse:


https://discourse.haskell.org/search?q=DevOps%20Weekly%20Log%20%23haskell-foundation%20order%3Alatest_topic


On 18/05/2022 13:25, Bryan wrote:

> Hi all,
>
> I'd like to get some data on weird CI failures. Before clicking
> "retry" on a spurious failure, please paste the url for your job into
> the spreadsheet you'll find linked at
> https://gitlab.haskell.org/ghc/ghc/-/issues/21591.
>
> Sorry for the slight misdirection. I wanted the spreadsheet to be
> world-writable, which means I don't want its url floating around in
> too many places. Maybe you can bookmark it if CI is causing you too
> much trouble. :)
>
> -Bryan

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Hadrian problem

2022-07-12 Thread Simon Peyton Jones
I'm in a GHC tree, built with Hadrian, and I'm getting the problem shown
below, even though compilation has got way past compiling base.

Why is it looking in my .ghc/... directory? It should be looking in my
build tree.

Simon

bash$ ~/code/HEAD-1/_build/ghc-stage1 -c Foo.hs
Loaded package environment from
/home/simonpj/.ghc/x86_64-linux-9.5.20220628/environments/default
: cannot satisfy -package-id base-4.17.0.0
(use -v for more information)

bash$ cat ~/code/HEAD-1/_build/ghc-stage1
"/home/simonpj/code/HEAD-1/_build/stage0/bin/ghc" "-no-global-package-db"
"-package-db /home/simonpj/code/HEAD-1/_build/stage1/lib/package.conf.d"
"$@"


Re: Hadrian problem

2022-07-12 Thread Sylvain Henry

Hi Simon,

Matt should have fixed it with 
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8556


Sylvain


On 12/07/2022 14:24, Simon Peyton Jones wrote:
> I'm in a GHC tree, built with Hadrian, I'm getting this red problem.
> But compilation has got way past compiling base.
>
> why is it looking in my .ghc/... directory?   It should be looking in
> my build tree.
>
> Simon
>
> bash$ ~/code/HEAD-1/_build/ghc-stage1 -c Foo.hs
> Loaded package environment from
> /home/simonpj/.ghc/x86_64-linux-9.5.20220628/environments/default
> : cannot satisfy -package-id base-4.17.0.0
>     (use -v for more information)
>
> bash$ cat ~/code/HEAD-1/_build/ghc-stage1
> "/home/simonpj/code/HEAD-1/_build/stage0/bin/ghc" "-no-global-package-db"
> "-package-db /home/simonpj/code/HEAD-1/_build/stage1/lib/package.conf.d"
> "$@"






Re: Hadrian problem

2022-07-12 Thread Douglas Wilson
Hi Simon,

It seems that GHC is "helpfully" adding flags to your invocation; see
https://ghc.gitlab.haskell.org/ghc/doc/users_guide/packages.html#package-environments

You can see exactly what is being added by inspecting the environment file
/home/simonpj/.ghc/x86_64-linux-9.5.20220628/environments/default.

You can safely delete .ghc/x86_64-linux-9.5.20220628/environments/default,
although without knowing how it got there, it may reappear.

You can set `export GHC_ENVIRONMENT=-` or pass `-package-env -` on your
command line to disable the reading of environment files.
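
For concreteness, the steps above might look roughly like this in a shell
session. This is only a sketch: the environment file path is copied from
the "Loaded package environment from" message in your output, and the
variable name `env_file` is made up for the example.

```shell
# Path reported by GHC in the "Loaded package environment from" message.
env_file="$HOME/.ghc/x86_64-linux-9.5.20220628/environments/default"

# Show which flags the environment file contributes, if it exists.
if [ -f "$env_file" ]; then
    cat "$env_file"
fi

# Disable package environment files for every GHC started from this shell.
export GHC_ENVIRONMENT=-

# Alternatively, disable them for a single invocation only:
#   ~/code/HEAD-1/_build/ghc-stage1 -package-env - -c Foo.hs
```

The exported variable affects every GHC run from that shell, while the
`-package-env -` flag is the less invasive choice when only one command
misbehaves.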

Regards,
Douglas Wilson


On Tue, Jul 12, 2022 at 1:24 PM Simon Peyton Jones <
simon.peytonjo...@gmail.com> wrote:

> I'm in a GHC tree, built with Hadrian, I'm getting this red problem.  But
> compilation has got way past compiling base.
>
> why is it looking in my .ghc/... directory?   It should be looking in my
> build tree.
>
> Simon
>
> bash$ ~/code/HEAD-1/_build/ghc-stage1 -c Foo.hs
> Loaded package environment from
> /home/simonpj/.ghc/x86_64-linux-9.5.20220628/environments/default
> : cannot satisfy -package-id base-4.17.0.0
> (use -v for more information)
>
> bash$ cat ~/code/HEAD-1/_build/ghc-stage1
> "/home/simonpj/code/HEAD-1/_build/stage0/bin/ghc" "-no-global-package-db"
> "-package-db /home/simonpj/code/HEAD-1/_build/stage1/lib/package.conf.d"
> "$@"