Re: Acknowledgement of CI instability

2024-07-23 Thread Moritz Angermann
Hi,

As I’ve now arrived in Seoul and have my Macs back, I can bring two more
M1 Macs back online today. However, they are hooked up to a serviced
apartment’s internet connection, which might not be the fastest.

Best,
 Moritz

On Tue, 23 Jul 2024 at 11:36 PM, Sam Derbyshire 
wrote:

> Hi all,
>
> The GHC team would like to acknowledge some ongoing CI instabilities in
> the GHC project:
>
>   - regular failures of the i386 job, possibly related to the bump to
> debian 12
> <https://gitlab.haskell.org/ghc/ghc/-/commit/203830065b81fe29003c1640a354f11661ffc604>
> (although the job still failed occasionally before this),
>   - flakiness of the MultiLayerModulesDefsGhciReload test causing the
> fedora33-release job to fail,
>   - lack of availability of darwin runners, causing aarch64-darwin and
> x86_64-darwin jobs to time out.
>
> These issues are currently causing a string of Marge batch failures,
> holding up several MRs.
>
> We are currently short on resources for addressing problems with CI, so
> please bear with us while we sort the situation out. In the meantime, feel
> free to let us know of any other CI issues that are impacting your work on
> GHC.
>
> Best,
>
> Sam
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Acknowledgement of CI instability

2024-07-23 Thread Sam Derbyshire
Hi all,

The GHC team would like to acknowledge some ongoing CI instabilities in the
GHC project:

  - regular failures of the i386 job, possibly related to the bump to
debian 12
<https://gitlab.haskell.org/ghc/ghc/-/commit/203830065b81fe29003c1640a354f11661ffc604>
(although the job still failed occasionally before this),
  - flakiness of the MultiLayerModulesDefsGhciReload test causing the
fedora33-release job to fail,
  - lack of availability of darwin runners, causing aarch64-darwin and
x86_64-darwin jobs to time out.

These issues are currently causing a string of Marge batch failures,
holding up several MRs.

We are currently short on resources for addressing problems with CI, so
please bear with us while we sort the situation out. In the meantime, feel
free to let us know of any other CI issues that are impacting your work on
GHC.

Best,

Sam
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI stuck?

2024-07-12 Thread Sylvain Henry

Hi Simon,

There was an MR failing with the JS job [1,2]. I fixed it an hour ago,
so it should pass now.


Sylvain

[1] https://gitlab.haskell.org/ghc/ghc/-/merge_requests/13025#note_575683
[2] https://gitlab.haskell.org/ghc/ghc/-/merge_requests/12991#note_575777


On 12/07/2024 09:58, Simon Peyton Jones wrote:

Dear GHC devs

Is GHC's CI stuck in some way?  My !12928 has been scheduled by Marge 
over 10 times now, and each time the commit has failed. Ten seems...  
a lot.


Thanks

Simon

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


CI stuck?

2024-07-12 Thread Simon Peyton Jones
Dear GHC devs

Is GHC's CI stuck in some way?  My !12928 has been scheduled by Marge over
10 times now, and each time the commit has failed.  Ten seems...  a lot.

Thanks

Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


GitLab CI happenings in July

2023-07-03 Thread Bryan Richter via ghc-devs
Hello,

This is the first (and, perhaps, last[1]) monthly update on GitLab CI. This
month in particular deserves its own little email for the following reasons:

1. Some of the Darwin runners recently spontaneously self-upgraded,
introducing toolchain changes that broke CI. All the fixes are now on GHC's
master branch. This leaves us with two choices:

A) Re-enable the affected runners now. All current patches will have a 50%
chance of failing CI because the old problems are still present. Users (you)
can rebase your patches to avoid this problem.

B) Wait before re-enabling the runners. All in-flight MRs have a better
chance of getting green CI and/or organically getting rebased for other
reasons. However, Darwin capacity would remain at 50% for longer, slowing
down all pipelines.

My current plan is to wait one week before re-enabling the runners, but
ultimately it's not my call. Opinions welcome.

2. I will be on vacation from July 17 to July 28 (weeks 29 and 30). Please
tell Marge to be good while I am away.

3. GitLab was recently upgraded. Please do not be alarmed by any UI changes.

Enjoy!

-Bryan

[1]: Intuitively, I like the idea of giving monthly updates about GitLab
CI. But I don't know if it will be practical or valuable. I'll take a look
in a month to see if there's anything notable to write about again.
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI *sad face*

2023-06-30 Thread Bryan Richter via ghc-devs
Final update: both problems are solved!

Now we just need to wait for the wheels of time to do their magic. The
final patch is still waiting to get picked up and merged.

The queue for CI jobs is still a bit longer than usual right now, but I
think it's legitimate. There are simply more open MRs on GitLab than usual,
which is a good thing. (Darwin jobs aren't the source of the bottleneck.)

-Bryan

P.S. A quick shoutout to Marge for preventing two patches that merged
cleanly but created invalid results from making their way into the master
branch.

On Wed, 28 Jun 2023 at 09:20, Bryan Richter 
wrote:

> Nice!
>
> Other good news is that I lost track of all the Mac runners we actually
> have, and our current capacity is actually 3/6 rather than 1/4.
>
> On Wed, 28 Jun 2023 at 09:15, Rodrigo Mesquita <
> rodrigo.m.mesqu...@gmail.com> wrote:
>
>> The root of the second problem was !10723, which started failing on its
>> own pipeline after being rebased.
>> I’m pushing a fix.
>>
>> - Rodrigo
>>
>> On 28 Jun 2023, at 06:41, Bryan Richter via ghc-devs <
>> ghc-devs@haskell.org> wrote:
>>
>> Two things are negatively impacting GHC CI right now:
>>
>> Darwin runner capacity is down to one machine, since the other three are
>> paused. The problem and solution are known[1], but until the fix is
>> implemented in GHC, expect pipelines to get backed up. I will work on a
>> patch this morning
>>
>> [1]: https://gitlab.haskell.org/ghc/ghc/-/issues/23561
>>
>> The other problem is one I just noticed, and I don't have any good info
>> about it yet. The symptom is that Marge batch merges are failing reliably.
>> Three patches that do fine individually somehow cause a type error in the
>> hadrian-ghc-in-ghci job when combined[2]. The only clue is the error
>> itself, which complains of an out-of-scope data constructor
>> "ArchJavaScript" in the file compiler/GHC/Driver/Main.hs. A cursory look at
>> the individual patches doesn't shed any light. I just rebased all of them
>> to see if I can shake the error out of them that way. Any knowledge that
>> can be brought to bear would be appreciated
>>
>> [2]:
>> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10745#note_507418
>>
>> -Bryan
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI *sad face*

2023-06-28 Thread Bryan Richter via ghc-devs
Nice!

Other good news is that I lost track of all the Mac runners we actually
have, and our current capacity is actually 3/6 rather than 1/4.

On Wed, 28 Jun 2023 at 09:15, Rodrigo Mesquita 
wrote:

> The root of the second problem was !10723, which started failing on its
> own pipeline after being rebased.
> I’m pushing a fix.
>
> - Rodrigo
>
> On 28 Jun 2023, at 06:41, Bryan Richter via ghc-devs 
> wrote:
>
> Two things are negatively impacting GHC CI right now:
>
> Darwin runner capacity is down to one machine, since the other three are
> paused. The problem and solution are known[1], but until the fix is
> implemented in GHC, expect pipelines to get backed up. I will work on a
> patch this morning
>
> [1]: https://gitlab.haskell.org/ghc/ghc/-/issues/23561
>
> The other problem is one I just noticed, and I don't have any good info
> about it yet. The symptom is that Marge batch merges are failing reliably.
> Three patches that do fine individually somehow cause a type error in the
> hadrian-ghc-in-ghci job when combined[2]. The only clue is the error
> itself, which complains of an out-of-scope data constructor
> "ArchJavaScript" in the file compiler/GHC/Driver/Main.hs. A cursory look at
> the individual patches doesn't shed any light. I just rebased all of them
> to see if I can shake the error out of them that way. Any knowledge that
> can be brought to bear would be appreciated
>
> [2]: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10745#note_507418
>
> -Bryan
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI *sad face*

2023-06-28 Thread Rodrigo Mesquita
The root of the second problem was !10723, which started failing on its own 
pipeline after being rebased.
I’m pushing a fix.

- Rodrigo

> On 28 Jun 2023, at 06:41, Bryan Richter via ghc-devs  
> wrote:
> 
> Two things are negatively impacting GHC CI right now:
> 
> Darwin runner capacity is down to one machine, since the other three are 
> paused. The problem and solution are known[1], but until the fix is 
> implemented in GHC, expect pipelines to get backed up. I will work on a patch 
> this morning
> 
> [1]: https://gitlab.haskell.org/ghc/ghc/-/issues/23561
> 
> The other problem is one I just noticed, and I don't have any good info about 
> it yet. The symptom is that Marge batch merges are failing reliably. Three 
> patches that do fine individually somehow cause a type error in the 
> hadrian-ghc-in-ghci job when combined[2]. The only clue is the error itself, 
> which complains of an out-of-scope data constructor "ArchJavaScript" in the 
> file compiler/GHC/Driver/Main.hs. A cursory look at the individual patches 
> doesn't shed any light. I just rebased all of them to see if I can shake the 
> error out of them that way. Any knowledge that can be brought to bear would 
> be appreciated
> 
> [2]: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10745#note_507418
> 
> -Bryan

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


CI *sad face*

2023-06-27 Thread Bryan Richter via ghc-devs
Two things are negatively impacting GHC CI right now:

Darwin runner capacity is down to one machine, since the other three are
paused. The problem and solution are known[1], but until the fix is
implemented in GHC, expect pipelines to get backed up. I will work on a
patch this morning

[1]: https://gitlab.haskell.org/ghc/ghc/-/issues/23561

The other problem is one I just noticed, and I don't have any good info
about it yet. The symptom is that Marge batch merges are failing reliably.
Three patches that do fine individually somehow cause a type error in the
hadrian-ghc-in-ghci job when combined[2]. The only clue is the error
itself, which complains of an out-of-scope data constructor
"ArchJavaScript" in the file compiler/GHC/Driver/Main.hs. A cursory look at
the individual patches doesn't shed any light. I just rebased all of them
to see if I can shake the error out of them that way. Any knowledge that
can be brought to bear would be appreciated

[2]: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10745#note_507418

-Bryan
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI

2023-03-19 Thread Bryan Richter via ghc-devs
Hoo boy... now I've fixed an even *bigger* problem that is pretty
embarrassing.

https://gitlab.haskell.org/ghc/ghc-perf-import/-/commit/67238099e9c3478ba591080c6e582985c62b83c0

It's time to give this service a proper testsuite.

On Sun, 19 Mar 2023 at 14:42, Bryan Richter 
wrote:

> I did find that some jobs were being retried repeatedly. I have deployed a
> workaround to prevent this from continuing. The problem was related to
> T18623 behaving strangely, and I've opened
> https://gitlab.haskell.org/ghc/ghc/-/issues/23139 so GHC devs can have a
> look at it.
>
> On Sun, 19 Mar 2023 at 13:27, Bryan Richter 
> wrote:
>
>> I'm back at my computer and am investigating. I don't see the problem I
>> feared, but I do see some anomalies. I'll update once it's back to normal.
>>
>> On Sat, 18 Mar 2023 at 17:55, Bryan Richter 
>> wrote:
>>
>>> I'm away from my computer for the day, but yes there were some jobs that
>>> got stuck in a restart loop. See
>>> https://gitlab.haskell.org/ghc/ghc/-/issues/23094#note_487426 .
>>> Unfortunately I don't know if there are others, but I did fix the root
>>> cause of that particular loop.
>>>
>>> On Sat, 18 Mar 2023, 15.06 Sam Derbyshire, 
>>> wrote:
>>>
>>>> I think there's a problem with jobs restarting, on my renamer MR
>>>> <https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8686> there were
>>>> 5 full pipelines running at once. I had to cancel some of them, but also it
>>>> seems some got cancelled by some new CI pipelines restarting.
>>>>
>>>> On Sat, 18 Mar 2023 at 13:59, Simon Peyton Jones <
>>>> simon.peytonjo...@gmail.com> wrote:
>>>>
>>>>> All GHC CI pipelines seem stalled, sadly
>>>>>
>>>>> e.g.
>>>>> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10123/pipelines
>>>>>
>>>>> Can someone unglue it?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI

2023-03-19 Thread Bryan Richter via ghc-devs
I did find that some jobs were being retried repeatedly. I have deployed a
workaround to prevent this from continuing. The problem was related to
T18623 behaving strangely, and I've opened
https://gitlab.haskell.org/ghc/ghc/-/issues/23139 so GHC devs can have a
look at it.

On Sun, 19 Mar 2023 at 13:27, Bryan Richter 
wrote:

> I'm back at my computer and am investigating. I don't see the problem I
> feared, but I do see some anomalies. I'll update once it's back to normal.
>
> On Sat, 18 Mar 2023 at 17:55, Bryan Richter 
> wrote:
>
>> I'm away from my computer for the day, but yes there were some jobs that
>> got stuck in a restart loop. See
>> https://gitlab.haskell.org/ghc/ghc/-/issues/23094#note_487426 .
>> Unfortunately I don't know if there are others, but I did fix the root
>> cause of that particular loop.
>>
>> On Sat, 18 Mar 2023, 15.06 Sam Derbyshire, 
>> wrote:
>>
>>> I think there's a problem with jobs restarting, on my renamer MR
>>> <https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8686> there were 5
>>> full pipelines running at once. I had to cancel some of them, but also it
>>> seems some got cancelled by some new CI pipelines restarting.
>>>
>>> On Sat, 18 Mar 2023 at 13:59, Simon Peyton Jones <
>>> simon.peytonjo...@gmail.com> wrote:
>>>
>>>> All GHC CI pipelines seem stalled, sadly
>>>>
>>>> e.g.
>>>> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10123/pipelines
>>>>
>>>> Can someone unglue it?
>>>>
>>>> Thanks!
>>>>
>>>> Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI

2023-03-19 Thread Bryan Richter via ghc-devs
I'm back at my computer and am investigating. I don't see the problem I
feared, but I do see some anomalies. I'll update once it's back to normal.

On Sat, 18 Mar 2023 at 17:55, Bryan Richter 
wrote:

> I'm away from my computer for the day, but yes there were some jobs that
> got stuck in a restart loop. See
> https://gitlab.haskell.org/ghc/ghc/-/issues/23094#note_487426 .
> Unfortunately I don't know if there are others, but I did fix the root
> cause of that particular loop.
>
> On Sat, 18 Mar 2023, 15.06 Sam Derbyshire, 
> wrote:
>
>> I think there's a problem with jobs restarting, on my renamer MR
>> <https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8686> there were 5
>> full pipelines running at once. I had to cancel some of them, but also it
>> seems some got cancelled by some new CI pipelines restarting.
>>
>> On Sat, 18 Mar 2023 at 13:59, Simon Peyton Jones <
>> simon.peytonjo...@gmail.com> wrote:
>>
>>> All GHC CI pipelines seem stalled, sadly
>>>
>>> e.g. https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10123/pipelines
>>>
>>> Can someone unglue it?
>>>
>>> Thanks!
>>>
>>> Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI

2023-03-18 Thread Bryan Richter via ghc-devs
I'm away from my computer for the day, but yes there were some jobs that
got stuck in a restart loop. See
https://gitlab.haskell.org/ghc/ghc/-/issues/23094#note_487426 .
Unfortunately I don't know if there are others, but I did fix the root
cause of that particular loop.

On Sat, 18 Mar 2023, 15.06 Sam Derbyshire,  wrote:

> I think there's a problem with jobs restarting, on my renamer MR
> <https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8686> there were 5
> full pipelines running at once. I had to cancel some of them, but also it
> seems some got cancelled by some new CI pipelines restarting.
>
> On Sat, 18 Mar 2023 at 13:59, Simon Peyton Jones <
> simon.peytonjo...@gmail.com> wrote:
>
>> All GHC CI pipelines seem stalled, sadly
>>
>> e.g. https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10123/pipelines
>>
>> Can someone unglue it?
>>
>> Thanks!
>>
>> Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI

2023-03-18 Thread Sam Derbyshire
I think there's a problem with jobs restarting, on my renamer MR
<https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8686> there were 5
full pipelines running at once. I had to cancel some of them, but also it
seems some got cancelled by some new CI pipelines restarting.

On Sat, 18 Mar 2023 at 13:59, Simon Peyton Jones <
simon.peytonjo...@gmail.com> wrote:

> All GHC CI pipelines seem stalled, sadly
>
> e.g. https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10123/pipelines
>
> Can someone unglue it?
>
> Thanks!
>
> Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


CI

2023-03-18 Thread Simon Peyton Jones
All GHC CI pipelines seem stalled, sadly

e.g. https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10123/pipelines

Can someone unglue it?

Thanks!

Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Coordination on FreeBSD CI, default WinIO and Info Tables profiling work

2023-03-15 Thread Hécate

Hi everyone,

I have created topical aggregators of tickets that go beyond the rhythm
of releases (a.k.a. "epics") for the following topics:


* Info Tables Profiling: https://gitlab.haskell.org/groups/ghc/-/epics/3

* Setting WinIO "on" by default: 
https://gitlab.haskell.org/groups/ghc/-/epics/4


* FreeBSD CI revival: https://gitlab.haskell.org/groups/ghc/-/epics/5

These epics have no deadline and their purpose is to track the evolution 
of our workload for certain "big" tasks that go beyond a single ticket.


They are also useful as an (albeit imprecise) tool to help determine,
after the fact, the magnitude of a project and the effort it took. This
will certainly be helpful for future estimations.


And finally, their prime purpose is to enable more awareness of our
co-contributors' work, so that we all get a better sense of what it takes
to do certain things. :)


Please do feel free to create your own for projects that are not fit for 
a single milestone (or are not related to release milestones at all).



Cheers,
Hécate

--
Hécate ✨
: @TechnoEmpress
IRC: Hecate
WWW: https://glitchbra.in
RUN: BSD

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Overnight CI failures

2022-09-30 Thread Bryan Richter via ghc-devs
(Adding ghc-devs)

Are these fragile tests?

1. T14346 got a "bad file descriptor" on Darwin
2. linker_unload got some gold errors on Linux

Neither of these have been reported to me before, so I don't know much
about them. Nor have I looked deeply (or at all) at the tests themselves,
yet.

On Thu, Sep 29, 2022 at 3:37 PM Simon Peyton Jones <
simon.peytonjo...@gmail.com> wrote:

> Bryan
>
> These failed overnight
>
> On !8897
>
>- https://gitlab.haskell.org/ghc/ghc/-/jobs/1185519
>- https://gitlab.haskell.org/ghc/ghc/-/jobs/1185520
>
> I think it's extremely unlikely that this had anything to do with my patch.
>
> Simon
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Consistent CI failure in job nightly-i386-linux-deb9-validate

2022-09-29 Thread Cheng Shao
When hadrian builds the binary-dist job, invoking tar and xz is
already the last step and there'll be no other ongoing jobs. But I do
agree with reverting; this minor optimization I proposed has caused
more trouble than it's worth :/

On Thu, Sep 29, 2022 at 9:25 AM Bryan Richter  wrote:
>
> Matthew pointed out that the build system already parallelizes jobs, so it's 
> risky to force parallelization of any individual job. That means I should 
> just revert.
>
> On Wed, Sep 28, 2022 at 2:38 PM Cheng Shao  wrote:
>>
>> I believe we can either modify ci.sh to disable parallel compression
>> for i386, or modify .gitlab/gen_ci.hs and .gitlab/jobs.yaml to disable
>> XZ_OPT=-9 for i386.
>>
>> On Wed, Sep 28, 2022 at 1:21 PM Bryan Richter  
>> wrote:
>> >
>> > Aha: while i386-linux-deb9-validate sets no extra XZ options, 
>> > nightly-i386-linux-deb9-validate (the failing job) sets "XZ_OPT = 9".
>> >
>> > A revert would fix the problem, but presumably so would tweaking that 
>> > option. Does anyone have information that would lead to a better decision 
>> > here?
>> >
>> >
>> > On Wed, Sep 28, 2022 at 2:02 PM Cheng Shao  wrote:
>> >>
>> >> Sure, in which case pls revert it. Apologies for the impact, though
>> >> I'm still a bit curious, the i386 job did pass in the original MR.
>> >>
>> >> On Wed, Sep 28, 2022 at 1:00 PM Bryan Richter  
>> >> wrote:
>> >> >
>> >> > Yep, it seems to mostly be xz that is running out of memory. (All 
>> >> > recent builds that I sampled, but not all builds through all time.) 
>> >> > Thanks for pointing it out!
>> >> >
>> >> > I can revert the change.
>> >> >
>> >> > On Wed, Sep 28, 2022 at 11:46 AM Cheng Shao  wrote:
>> >> >>
>> >> >> Hi Bryan,
>> >> >>
>> >> >> This may be an unintended fallout of !8940. Would you try starting an
>> >> >> i386 pipeline with it reversed to see if it solves the issue, in which
>> >> >> case we should revert or fix it in master?
>> >> >>
>> >> >> On Wed, Sep 28, 2022 at 9:58 AM Bryan Richter via ghc-devs
>> >> >>  wrote:
>> >> >> >
>> >> >> > Hi all,
>> >> >> >
>> >> >> > For the past week or so, nightly-i386-linux-deb9-validate has been 
>> >> >> > failing consistently.
>> >> >> >
>> >> >> > They show up on the failure dashboard because the logs contain the 
>> >> >> > phrase "Cannot allocate memory".
>> >> >> >
>> >> >> > I haven't looked yet to see if they always fail in the same place, 
>> >> >> > but I'll do that soon. The first example I looked at, however, has 
>> >> >> > the line "xz: (stdin): Cannot allocate memory", so it's not GHC 
>> >> >> > (alone) causing the problem.
>> >> >> >
>> >> >> > As a consequence of showing up on the dashboard, the jobs get 
>> >> >> > restarted. Since they fail consistently, they keep getting 
>> >> >> > restarted. Since the jobs keep getting restarted, the pipelines stay 
>> >> >> > alive. When I checked just now, there were 8 nightly runs still 
>> >> >> > running. :) Thus I'm going to cancel the still-running 
>> >> >> > nightly-i386-linux-deb9-validate jobs and let the pipelines die in 
>> >> >> > peace. You can still find all examples of failed jobs on the 
>> >> >> > dashboard:
>> >> >> >
>> >> >> > https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2=now-90d=now=5m=cannot_allocate
>> >> >> >
>> >> >> > To prevent future problems, it would be good if someone could help 
>> >> >> > me look into this. Otherwise I'll just disable the job. :(
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Consistent CI failure in job nightly-i386-linux-deb9-validate

2022-09-29 Thread Bryan Richter via ghc-devs
Matthew pointed out that the build system already parallelizes jobs, so
it's risky to force parallelization of any individual job. That means I
should just revert.

On Wed, Sep 28, 2022 at 2:38 PM Cheng Shao  wrote:

> I believe we can either modify ci.sh to disable parallel compression
> for i386, or modify .gitlab/gen_ci.hs and .gitlab/jobs.yaml to disable
> XZ_OPT=-9 for i386.
>
> On Wed, Sep 28, 2022 at 1:21 PM Bryan Richter 
> wrote:
> >
> > Aha: while i386-linux-deb9-validate sets no extra XZ options,
> nightly-i386-linux-deb9-validate (the failing job) sets "XZ_OPT = 9".
> >
> > A revert would fix the problem, but presumably so would tweaking that
> option. Does anyone have information that would lead to a better decision
> here?
> >
> >
> > On Wed, Sep 28, 2022 at 2:02 PM Cheng Shao  wrote:
> >>
> >> Sure, in which case pls revert it. Apologies for the impact, though
> >> I'm still a bit curious, the i386 job did pass in the original MR.
> >>
> >> On Wed, Sep 28, 2022 at 1:00 PM Bryan Richter 
> wrote:
> >> >
> >> > Yep, it seems to mostly be xz that is running out of memory. (All
> recent builds that I sampled, but not all builds through all time.) Thanks
> for pointing it out!
> >> >
> >> > I can revert the change.
> >> >
> >> > On Wed, Sep 28, 2022 at 11:46 AM Cheng Shao 
> wrote:
> >> >>
> >> >> Hi Bryan,
> >> >>
> >> >> This may be an unintended fallout of !8940. Would you try starting an
> >> >> i386 pipeline with it reversed to see if it solves the issue, in
> which
> >> >> case we should revert or fix it in master?
> >> >>
> >> >> On Wed, Sep 28, 2022 at 9:58 AM Bryan Richter via ghc-devs
> >> >>  wrote:
> >> >> >
> >> >> > Hi all,
> >> >> >
> >> >> > For the past week or so, nightly-i386-linux-deb9-validate has been
> failing consistently.
> >> >> >
> >> >> > They show up on the failure dashboard because the logs contain the
> phrase "Cannot allocate memory".
> >> >> >
> >> >> > I haven't looked yet to see if they always fail in the same place,
> but I'll do that soon. The first example I looked at, however, has the line
> "xz: (stdin): Cannot allocate memory", so it's not GHC (alone) causing the
> problem.
> >> >> >
> >> >> > As a consequence of showing up on the dashboard, the jobs get
> restarted. Since they fail consistently, they keep getting restarted. Since
> the jobs keep getting restarted, the pipelines stay alive. When I checked
> just now, there were 8 nightly runs still running. :) Thus I'm going to
> cancel the still-running nightly-i386-linux-deb9-validate jobs and let the
> pipelines die in peace. You can still find all examples of failed jobs on
> the dashboard:
> >> >> >
> >> >> >
> https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2=now-90d=now=5m=cannot_allocate
> >> >> >
> >> >> > To prevent future problems, it would be good if someone could help
> me look into this. Otherwise I'll just disable the job. :(
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Consistent CI failure in job nightly-i386-linux-deb9-validate

2022-09-28 Thread Cheng Shao
I believe we can either modify ci.sh to disable parallel compression
for i386, or modify .gitlab/gen_ci.hs and .gitlab/jobs.yaml to disable
XZ_OPT=-9 for i386.
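
For illustration, the second option could look roughly like the sketch
below; the Job type and helper here are hypothetical and do not reflect
the actual code in .gitlab/gen_ci.hs.

-- Purely illustrative sketch; field and function names are made up.
import qualified Data.Map.Strict as M

data Arch = I386 | Amd64 deriving (Eq, Show)

data Job = Job
  { jobArch :: Arch
  , jobVars :: M.Map String String  -- environment handed to ci.sh
  } deriving Show

-- Nightly jobs normally request maximum xz compression; leave the default
-- on i386, where "xz -9" can run out of memory.
setNightlyXz :: Job -> Job
setNightlyXz job
  | jobArch job == I386 = job
  | otherwise           = job { jobVars = M.insert "XZ_OPT" "-9" (jobVars job) }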

On Wed, Sep 28, 2022 at 1:21 PM Bryan Richter  wrote:
>
> Aha: while i386-linux-deb9-validate sets no extra XZ options, 
> nightly-i386-linux-deb9-validate (the failing job) sets "XZ_OPT = 9".
>
> A revert would fix the problem, but presumably so would tweaking that option. 
> Does anyone have information that would lead to a better decision here?
>
>
> On Wed, Sep 28, 2022 at 2:02 PM Cheng Shao  wrote:
>>
>> Sure, in which case pls revert it. Apologies for the impact, though
>> I'm still a bit curious, the i386 job did pass in the original MR.
>>
>> On Wed, Sep 28, 2022 at 1:00 PM Bryan Richter  
>> wrote:
>> >
>> > Yep, it seems to mostly be xz that is running out of memory. (All recent 
>> > builds that I sampled, but not all builds through all time.) Thanks for 
>> > pointing it out!
>> >
>> > I can revert the change.
>> >
>> > On Wed, Sep 28, 2022 at 11:46 AM Cheng Shao  wrote:
>> >>
>> >> Hi Bryan,
>> >>
>> >> This may be an unintended fallout of !8940. Would you try starting an
>> >> i386 pipeline with it reversed to see if it solves the issue, in which
>> >> case we should revert or fix it in master?
>> >>
>> >> On Wed, Sep 28, 2022 at 9:58 AM Bryan Richter via ghc-devs
>> >>  wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > For the past week or so, nightly-i386-linux-deb9-validate has been 
>> >> > failing consistently.
>> >> >
>> >> > They show up on the failure dashboard because the logs contain the 
>> >> > phrase "Cannot allocate memory".
>> >> >
>> >> > I haven't looked yet to see if they always fail in the same place, but 
>> >> > I'll do that soon. The first example I looked at, however, has the line 
>> >> > "xz: (stdin): Cannot allocate memory", so it's not GHC (alone) causing 
>> >> > the problem.
>> >> >
>> >> > As a consequence of showing up on the dashboard, the jobs get 
>> >> > restarted. Since they fail consistently, they keep getting restarted. 
>> >> > Since the jobs keep getting restarted, the pipelines stay alive. When I 
>> >> > checked just now, there were 8 nightly runs still running. :) Thus I'm 
>> >> > going to cancel the still-running nightly-i386-linux-deb9-validate jobs 
>> >> > and let the pipelines die in peace. You can still find all examples of 
>> >> > failed jobs on the dashboard:
>> >> >
>> >> > https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2=now-90d=now=5m=cannot_allocate
>> >> >
>> >> > To prevent future problems, it would be good if someone could help me 
>> >> > look into this. Otherwise I'll just disable the job. :(
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Consistent CI failure in job nightly-i386-linux-deb9-validate

2022-09-28 Thread Bryan Richter via ghc-devs
Aha: while i386-linux-deb9-validate sets no extra XZ options,
*nightly*-i386-linux-deb9-validate
(the failing job) sets "XZ_OPT = 9".

A revert would fix the problem, but presumably so would tweaking that
option. Does anyone have information that would lead to a better decision
here?


On Wed, Sep 28, 2022 at 2:02 PM Cheng Shao  wrote:

> Sure, in which case pls revert it. Apologies for the impact, though
> I'm still a bit curious, the i386 job did pass in the original MR.
>
> On Wed, Sep 28, 2022 at 1:00 PM Bryan Richter 
> wrote:
> >
> > Yep, it seems to mostly be xz that is running out of memory. (All recent
> builds that I sampled, but not all builds through all time.) Thanks for
> pointing it out!
> >
> > I can revert the change.
> >
> > On Wed, Sep 28, 2022 at 11:46 AM Cheng Shao  wrote:
> >>
> >> Hi Bryan,
> >>
> >> This may be an unintended fallout of !8940. Would you try starting an
> >> i386 pipeline with it reversed to see if it solves the issue, in which
> >> case we should revert or fix it in master?
> >>
> >> On Wed, Sep 28, 2022 at 9:58 AM Bryan Richter via ghc-devs
> >>  wrote:
> >> >
> >> > Hi all,
> >> >
> >> > For the past week or so, nightly-i386-linux-deb9-validate has been
> failing consistently.
> >> >
> >> > They show up on the failure dashboard because the logs contain the
> phrase "Cannot allocate memory".
> >> >
> >> > I haven't looked yet to see if they always fail in the same place,
> but I'll do that soon. The first example I looked at, however, has the line
> "xz: (stdin): Cannot allocate memory", so it's not GHC (alone) causing the
> problem.
> >> >
> >> > As a consequence of showing up on the dashboard, the jobs get
> restarted. Since they fail consistently, they keep getting restarted. Since
> the jobs keep getting restarted, the pipelines stay alive. When I checked
> just now, there were 8 nightly runs still running. :) Thus I'm going to
> cancel the still-running nightly-i386-linux-deb9-validate jobs and let the
> pipelines die in peace. You can still find all examples of failed jobs on
> the dashboard:
> >> >
> >> >
> https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2=now-90d=now=5m=cannot_allocate
> >> >
> >> > To prevent future problems, it would be good if someone could help me
> look into this. Otherwise I'll just disable the job. :(
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Consistent CI failure in job nightly-i386-linux-deb9-validate

2022-09-28 Thread Cheng Shao
Sure, in which case pls revert it. Apologies for the impact, though
I'm still a bit curious, the i386 job did pass in the original MR.

On Wed, Sep 28, 2022 at 1:00 PM Bryan Richter  wrote:
>
> Yep, it seems to mostly be xz that is running out of memory. (All recent 
> builds that I sampled, but not all builds through all time.) Thanks for 
> pointing it out!
>
> I can revert the change.
>
> On Wed, Sep 28, 2022 at 11:46 AM Cheng Shao  wrote:
>>
>> Hi Bryan,
>>
>> This may be an unintended fallout of !8940. Would you try starting an
>> i386 pipeline with it reversed to see if it solves the issue, in which
>> case we should revert or fix it in master?
>>
>> On Wed, Sep 28, 2022 at 9:58 AM Bryan Richter via ghc-devs
>>  wrote:
>> >
>> > Hi all,
>> >
>> > For the past week or so, nightly-i386-linux-deb9-validate has been failing 
>> > consistently.
>> >
>> > They show up on the failure dashboard because the logs contain the phrase 
>> > "Cannot allocate memory".
>> >
>> > I haven't looked yet to see if they always fail in the same place, but 
>> > I'll do that soon. The first example I looked at, however, has the line 
>> > "xz: (stdin): Cannot allocate memory", so it's not GHC (alone) causing the 
>> > problem.
>> >
>> > As a consequence of showing up on the dashboard, the jobs get restarted. 
>> > Since they fail consistently, they keep getting restarted. Since the jobs 
>> > keep getting restarted, the pipelines stay alive. When I checked just now, 
>> > there were 8 nightly runs still running. :) Thus I'm going to cancel the 
>> > still-running nightly-i386-linux-deb9-validate jobs and let the pipelines 
>> > die in peace. You can still find all examples of failed jobs on the 
>> > dashboard:
>> >
>> > https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2=now-90d=now=5m=cannot_allocate
>> >
>> > To prevent future problems, it would be good if someone could help me look 
>> > into this. Otherwise I'll just disable the job. :(
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Consistent CI failure in job nightly-i386-linux-deb9-validate

2022-09-28 Thread Bryan Richter via ghc-devs
Yep, it seems to mostly be xz that is running out of memory. (All recent
builds that I sampled, but not all builds through all time.) Thanks for
pointing it out!

I can revert the change.

On Wed, Sep 28, 2022 at 11:46 AM Cheng Shao  wrote:

> Hi Bryan,
>
> This may be an unintended fallout of !8940. Would you try starting an
> i386 pipeline with it reversed to see if it solves the issue, in which
> case we should revert or fix it in master?
>
> On Wed, Sep 28, 2022 at 9:58 AM Bryan Richter via ghc-devs
>  wrote:
> >
> > Hi all,
> >
> > For the past week or so, nightly-i386-linux-deb9-validate has been
> failing consistently.
> >
> > They show up on the failure dashboard because the logs contain the
> phrase "Cannot allocate memory".
> >
> > I haven't looked yet to see if they always fail in the same place, but
> I'll do that soon. The first example I looked at, however, has the line
> "xz: (stdin): Cannot allocate memory", so it's not GHC (alone) causing the
> problem.
> >
> > As a consequence of showing up on the dashboard, the jobs get restarted.
> Since they fail consistently, they keep getting restarted. Since the jobs
> keep getting restarted, the pipelines stay alive. When I checked just now,
> there were 8 nightly runs still running. :) Thus I'm going to cancel the
> still-running nightly-i386-linux-deb9-validate jobs and let the pipelines
> die in peace. You can still find all examples of failed jobs on the
> dashboard:
> >
> >
> https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2=now-90d=now=5m=cannot_allocate
> >
> > To prevent future problems, it would be good if someone could help me
> look into this. Otherwise I'll just disable the job. :(
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Consistent CI failure in job nightly-i386-linux-deb9-validate

2022-09-28 Thread Cheng Shao
Hi Bryan,

This may be an unintended fallout of !8940. Would you try starting an
i386 pipeline with it reversed to see if it solves the issue, in which
case we should revert or fix it in master?

On Wed, Sep 28, 2022 at 9:58 AM Bryan Richter via ghc-devs
 wrote:
>
> Hi all,
>
> For the past week or so, nightly-i386-linux-deb9-validate has been failing 
> consistently.
>
> They show up on the failure dashboard because the logs contain the phrase 
> "Cannot allocate memory".
>
> I haven't looked yet to see if they always fail in the same place, but I'll 
> do that soon. The first example I looked at, however, has the line "xz: 
> (stdin): Cannot allocate memory", so it's not GHC (alone) causing the problem.
>
> As a consequence of showing up on the dashboard, the jobs get restarted. 
> Since they fail consistently, they keep getting restarted. Since the jobs 
> keep getting restarted, the pipelines stay alive. When I checked just now, 
> there were 8 nightly runs still running. :) Thus I'm going to cancel the 
> still-running nightly-i386-linux-deb9-validate jobs and let the pipelines die 
> in peace. You can still find all examples of failed jobs on the dashboard:
>
> https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2=now-90d=now=5m=cannot_allocate
>
> To prevent future problems, it would be good if someone could help me look 
> into this. Otherwise I'll just disable the job. :(
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Consistent CI failure in job nightly-i386-linux-deb9-validate

2022-09-28 Thread Bryan Richter via ghc-devs
Hi all,

For the past week or so, nightly-i386-linux-deb9-validate has been failing
consistently.

They show up on the failure dashboard because the logs contain the phrase
"Cannot allocate memory".

I haven't looked yet to see if they always fail in the same place, but I'll
do that soon. The first example I looked at, however, has the line "xz:
(stdin): Cannot allocate memory", so it's not GHC (alone) causing the
problem.

As a consequence of showing up on the dashboard, the jobs get restarted.
Since they fail consistently, they keep getting restarted. Since the jobs
keep getting restarted, the pipelines stay alive. When I checked just now,
there were 8 nightly runs still running. :) Thus I'm going to cancel the
still-running nightly-i386-linux-deb9-validate jobs and let the pipelines
die in peace. You can still find all examples of failed jobs on the
dashboard:

https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2=now-90d=now=5m=cannot_allocate

To prevent future problems, it would be good if someone could help me look
into this. Otherwise I'll just disable the job. :(
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


GHC compiler perf CI

2022-09-11 Thread Simon Peyton Jones
Dear devs

I used to use the build *x86_64-linux-deb10-int_native-validate* as the
place to look for compiler/bytes-allocated changes in perf/compiler.  But
now it doesn't show those results any more, only runtime/bytes-allocated in
perf/should_run.


   - Should we not run perf/compiler in every build?
   - Why has it gone from the build above?
   - Which build should I look at to see perf/compiler data?

Thanks

Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI failures

2022-09-05 Thread Ben Gamari
Ben Gamari  writes:

> Simon Peyton Jones  writes:
>
>> Matthew, Ben, Bryan
>>
>> CI is failing in in "lint-ci-config"..
>>
>> See https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8916
>> or https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847
>>
> I'll investigate.
>
I believe this should be fixed by !8943. Perhaps you could try rebasing
on top of this?

Cheers,

- Ben


___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI failures

2022-09-05 Thread Ben Gamari
Simon Peyton Jones  writes:

> Matthew, Ben, Bryan
>
> CI is failing in in "lint-ci-config"..
>
> See https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8916
> or https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847
>
I'll investigate.

Cheers,

- Ben


___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


CI failures

2022-09-05 Thread Simon Peyton Jones
Matthew, Ben, Bryan

CI is failing in in "lint-ci-config"..

See https://gitlab.haskell.org/ghc/ghc/-/merge_requests/8916
or https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847

What's up?

Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Tracking intermittently failing CI jobs

2022-07-12 Thread Bryan Richter via ghc-devs

Hello again,

Thanks to everyone who pointed out spurious failures over the last few 
weeks. Here's the current state of affairs and some discussion on next 
steps.


*Dashboard*

I made a dashboard for tracking spurious failures:

https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2

I created this for three reasons:

1. Keep tabs on new occurrences of spurious failures
2. Understand which problems are causing the most issues
3. Measure the effectiveness of any intervention

The dashboard still needs development, but it can already be used to 
show that the number of "Cannot connect to Docker daemon" failures has 
been reduced.


*Characterizing and Fixing Failures*

I have preliminary results on a few failure types. For instance, I used 
the "docker" type of failure to bootstrap the dashboard. Along with 
"Killed with signal 9", it seems to indicate a problem with the CI 
runner itself.


To look more deeply into these types of runner-system failures, *I will 
need more access*. If you are responsible for some runners and you're 
comfortable giving me shell access, you can find my public ssh key at 
https://gitlab.haskell.org/-/snippets/5546. (Posted as a snippet so at 
least you know the key comes from somebody who can access my GitLab 
account. Other secure means of communication are listed at 
https://keybase.io/chreekat.) Send me a message if you do so.


Besides runner problems, there are spurious failures that may have more 
to do with the CI code, itself. They include some problem with 
environment variables and (probably) some issue with console buffering. 
Neither of these are being tracked on the dashboard yet. Many other 
problems are yet to be explored at all.



*Next Steps*

The theme for the next steps is finalizing the dashboard and 
characterizing more failures.


 * Track more failure types on the dashboard
 * Improve the process of backfilling failure data on the dashboard
 * Include more metadata (like project id!) on the dashboard so it's
   easier to zoom on failures
 * Document the dashboard and the processes that populate it for posterity
 * Diagnose runner-system failures (if accessible)
 * Continue exploring other failure types
 * Fix failures omg!?

The list of next steps is currently heavy on finalizing the dashboard 
and light on fixing spurious failures. I know that might be frustrating. 
My justification is that CI is a complex hardware/software/human system 
under continuous operation where most of the low-hanging fruit have already 
been plucked. It's time to get serious. :) My goal is to make spurious 
failures surprising rather than commonplace. This is the best way I know 
to achieve that.


Thanks again for helping me with this goal. :)


-Bryan

P.S. If you're interested, I've been posting updates like this one on 
Discourse:


https://discourse.haskell.org/search?q=DevOps%20Weekly%20Log%20%23haskell-foundation%20order%3Alatest_topic


On 18/05/2022 13:25, Bryan wrote:

Hi all,

I'd like to get some data on weird CI failures. Before clicking 
"retry" on a spurious failure, please paste the url for your job into 
the spreadsheet you'll find linked at 
https://gitlab.haskell.org/ghc/ghc/-/issues/21591.


Sorry for the slight misdirection. I wanted the spreadsheet to be 
world-writable, which means I don't want its url floating around in 
too many places. Maybe you can bookmark it if CI is causing you too 
much trouble. :)


-Bryan

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Tracking intermittently failing CI jobs

2022-05-18 Thread Bryan
Hi all,

I'd like to get some data on weird CI failures. Before clicking "retry" on a 
spurious failure, please paste the url for your job into the spreadsheet you'll 
find linked at https://gitlab.haskell.org/ghc/ghc/-/issues/21591.

Sorry for the slight misdirection. I wanted the spreadsheet to be 
world-writable, which means I don't want its url floating around in too many 
places. Maybe you can bookmark it if CI is causing you too much trouble. :)

-Bryan
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Windows CI instability

2022-04-01 Thread Matthew Pickering
Hi all,

Currently the Windows CI is experiencing a high amount of
instability, so if your patch fails for this reason, don't worry.
We are attempting to fix it.

Cheers,

Matt
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


White space in CI

2022-02-01 Thread Simon Peyton Jones
 Devs

As you'll see from this pipeline record
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7105/pipelines

CI consistently fails once a single commit has trailing whitespace, even if
it is fixed in a subsequent commit

   - dce2054d <https://gitlab.haskell.org/ghc/ghc/-/commit/dce2054d44ea60bdde6409050284fbbcc227457a> introduced trailing whitespace
   - 6411223c <https://gitlab.haskell.org/ghc/ghc/-/commit/6411223cd3977c92d01b09b55a455d8d86adde1d> removed it again.
   - but all subsequent pipelines fail

This came as a big surprise.  It doesn't make sense to lint each individual
commit.  Let's just lint the final version!  (I will squash them in due
course, but I didn't want to lose my work-in-progress history.)

Simon
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI: Choice of base commit for perf comparisons

2021-12-22 Thread Joachim Breitner
Thanks! I like it when my feature suggestions are implemented even before I 
voice them ;-)

22.12.2021 14:13:24 Richard Eisenberg :

> It seems to be that this thought is in the air right now. This was done just 
> a few days ago: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7184
> 
> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7231 also looks relevant.
> 
> Richard
> 
>> On Dec 22, 2021, at 7:19 AM, Joachim Breitner  
>> wrote:
>> 
>> Hi,
>> 
>> the new (or “new”?) handling of perf numbers, where CI just magically
>> records and compares them, without us having to manually edit the
>> `all.T` files, is a big improvement, thanks!
>> 
>> However, I found the choice of the base commit to compare against
>> unhelpful. Assume master is at commit M, and I start a feature branch
>> and MR with commit A. CI runs, and tells me about a performance
>> regressions, and CI is red. I now fix the issue and push commit B to
>> the branch. CI runs, but it picks A to compare against, and now it is
>> red because of a seemingly unexpected performance improvement!
>> 
>> I would have expected that all CI runs for this MR to compare the
>> performance against the base branch on master, and to look for perf
>> change notices in all commit messages in between.
>> 
>> I see these advantages:
>> 
>> * The reported perf changes correspond to the changes shown on the MR
>>   page
>> * Green CI = the MR is ready (after squashing)
>> * CI will have numbers for the base commit more reliably
>>   (else, if I push commit C quickly after B, then the job for B might
>>   be cancelled and Ci will report changes of C against A instead of B,
>>   which is unexpected).
>> 
>> I have used this logic of reporting perf changes (or any other
>> “differential CI”) against the base branch in the Motoko project and it
>> was quite natural.
>> 
>> Would it be desirable and possible for us here, too?
>> 
>> 
>> (A possible rebuttal might be: we don’t push new commits to feature
>> branches, but always squash and rebase, as that’s what we have to do
>> before merging anyways. If that’s the case then ok, although I
>> generally lean to having chronological commits on feature branches and
>> a nice squashed commit on master.)
>> 
>> Cheers,
>> Joachim
>> 
>> 
>> -- 
>> Joachim Breitner
>> m...@joachim-breitner.de
>> http://www.joachim-breitner.de/
>> 
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI: Choice of base commit for perf comparisons

2021-12-22 Thread Richard Eisenberg
It seems to be that this thought is in the air right now. This was done just a 
few days ago: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7184

https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7231 also looks relevant.

Richard

> On Dec 22, 2021, at 7:19 AM, Joachim Breitner  
> wrote:
> 
> Hi,
> 
> the new (or “new”?) handling of perf numbers, where CI just magically
> records and compares them, without us having to manually edit the
> `all.T` files, is a big improvement, thanks!
> 
> However, I found the choice of the base commit to compare against
> unhelpful. Assume master is at commit M, and I start a feature branch
> and MR with commit A. CI runs, and tells me about a performance
> regressions, and CI is red. I now fix the issue and push commit B to
> the branch. CI runs, but it picks A to compare against, and now it is
> red because of a seemingly unexpected performance improvement!
> 
> I would have expected that all CI runs for this MR to compare the
> performance against the base branch on master, and to look for perf
> change notices in all commit messages in between.
> 
> I see these advantages:
> 
> * The reported perf changes correspond to the changes shown on the MR 
>   page
> * Green CI = the MR is ready (after squashing)
> * CI will have numbers for the base commit more reliably
>   (else, if I push commit C quickly after B, then the job for B might
>   be cancelled and Ci will report changes of C against A instead of B,
>   which is unexpected).
> 
> I have used this logic of reporting perf changes (or any other
> “differential CI”) against the base branch in the Motoko project and it
> was quite natural.
> 
> Would it be desirable and possible for us here, too?
> 
> 
> (A possible rebuttal might be: we don’t push new commits to feature
> branches, but always squash and rebase, as that’s what we have to do
> before merging anyways. If that’s the case then ok, although I
> generally lean to having chronological commits on feature branches and
> a nice squashed commit on master.)
> 
> Cheers,
> Joachim
> 
> 
> -- 
> Joachim Breitner
>  m...@joachim-breitner.de
>  http://www.joachim-breitner.de/
> 

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


CI: Choice of base commit for perf comparisons

2021-12-22 Thread Joachim Breitner
Hi,

the new (or “new”?) handling of perf numbers, where CI just magically
records and compares them, without us having to manually edit the
`all.T` files, is a big improvement, thanks!

However, I found the choice of the base commit to compare against
unhelpful. Assume master is at commit M, and I start a feature branch
and MR with commit A. CI runs, and tells me about a performance
regressions, and CI is red. I now fix the issue and push commit B to
the branch. CI runs, but it picks A to compare against, and now it is
red because of a seemingly unexpected performance improvement!

I would have expected that all CI runs for this MR to compare the
performance against the base branch on master, and to look for perf
change notices in all commit messages in between.

I see these advantages:

 * The reported perf changes correspond to the changes shown on the MR 
   page
 * Green CI = the MR is ready (after squashing)
 * CI will have numbers for the base commit more reliably
   (else, if I push commit C quickly after B, then the job for B might
   be cancelled and CI will report changes of C against A instead of B,
   which is unexpected).

I have used this logic of reporting perf changes (or any other
“differential CI”) against the base branch in the Motoko project and it
was quite natural.

Would it be desirable and possible for us here, too?


(A possible rebuttal might be: we don’t push new commits to feature
branches, but always squash and rebase, as that’s what we have to do
before merging anyways. If that’s the case then ok, although I
generally lean to having chronological commits on feature branches and
a nice squashed commit on master.)

Cheers,
Joachim


-- 
Joachim Breitner
  m...@joachim-breitner.de
  http://www.joachim-breitner.de/

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI build failures

2021-07-27 Thread Gergő Érdi
Thanks, this is all great news

On Tue, Jul 27, 2021, 21:56 Ben Gamari  wrote:

> ÉRDI Gergő  writes:
>
> > Hi,
> >
> > I'm seeing three build failures in CI:
> >
> Hi,
>
> > 1. On perf-nofib, it fails with:
> >
> Don't worry about this one for the moment. This job is marked as
> accepting of failure for a reason (hence the job state being an orange
> exclamation mark rather than a red X).
>
> > == make boot -j --jobserver-fds=3,4 --no-print-directory;
> >   in /builds/cactus/ghc/nofib/real/smallpt
> > 
> > /builds/cactus/ghc/ghc/bin/ghc  -M -dep-suffix "" -dep-makefile .depend
> > -osuf o -O2 -Wno-tabs -Rghc-timing -H32m -hisuf hi
> -packageunboxed-ref
> > -rtsopts smallpt.hs
> > : cannot satisfy -package unboxed-ref
> >  (use -v for more information)
> >
> > (e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743141#L1465)
> >
> > 2. On validate-x86_64-darwin, pretty much every test fails because of
> the
> > following extra stderr output:
> >
> > +
> > +:
> > +warning: Couldn't figure out C compiler information!
> > +     Make sure you're using GNU gcc, or clang
> >
> > (e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743129#L3655)
> >
> Yes, this will be fixed by !6162 once I get it passing CI.
>
> > 3. On validate-x86_64-linux-deb9-integer-simple, T11545 fails on memory
> > consumption:
> >
> > Unexpected stat failures:
> > perf/compiler/T11545.run  T11545 [stat decreased from
> x86_64-linux-deb9-integer-simple-validate baseline @
> > 5f3991c7cab8ccc9ab8daeebbfce57afbd9acc33] (normal)
> >
> This test appears to be quite sensitive to environment. I suspect we
> should further increase its acceptance window to avoid this sort of
> spurious failure.
>
> Cheers,
>
> Cheers,
>
> - Ben
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI build failures

2021-07-27 Thread Ben Gamari
ÉRDI Gergő  writes:

> Hi,
>
> I'm seeing three build failures in CI:
>
Hi,

> 1. On perf-nofib, it fails with:
>
Don't worry about this one for the moment. This job is marked as
accepting of failure for a reason (hence the job state being an orange
exclamation mark rather than a red X).

> == make boot -j --jobserver-fds=3,4 --no-print-directory;
>   in /builds/cactus/ghc/nofib/real/smallpt
> 
> /builds/cactus/ghc/ghc/bin/ghc  -M -dep-suffix "" -dep-makefile .depend 
> -osuf o -O2 -Wno-tabs -Rghc-timing -H32m -hisuf hi -packageunboxed-ref 
> -rtsopts smallpt.hs
> : cannot satisfy -package unboxed-ref
>  (use -v for more information)
>
> (e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743141#L1465)
>
> 2. On validate-x86_64-darwin, pretty much every test fails because of the 
> following extra stderr output:
>
> +
> +:
> +warning: Couldn't figure out C compiler information!
> + Make sure you're using GNU gcc, or clang
>
> (e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743129#L3655)
>
Yes, this will be fixed by !6162 once I get it passing CI.

> 3. On validate-x86_64-linux-deb9-integer-simple, T11545 fails on memory 
> consumption:
>
> Unexpected stat failures:
> perf/compiler/T11545.run  T11545 [stat decreased from 
> x86_64-linux-deb9-integer-simple-validate baseline @ 
> 5f3991c7cab8ccc9ab8daeebbfce57afbd9acc33] (normal)
>
This test appears to be quite sensitive to environment. I suspect we
should further increase its acceptance window to avoid this sort of
spurious failure.

Cheers,

Cheers,

- Ben


___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI build failures

2021-07-27 Thread Gergő Érdi
The other two are resilient to restarts.

On Tue, Jul 27, 2021, 18:49 Moritz Angermann 
wrote:

> You can safely ignore the x86_64-darwin failure. I can get you the juicy
> details over a beverage some time. It boils down to some odd behavior using
> rosetta2 on AArch64 Mac mini’s to build x86_64 GHCs. There is a fix
> somewhere from Ben, so it’s just a question of time until it’s properly
> fixed.
>
> The other two I’m afraid I have no idea. I’ll see to restart them. (You
> can’t ?)
>
> On Tue 27. Jul 2021 at 18:10, ÉRDI Gergő  wrote:
>
>> Hi,
>>
>> I'm seeing three build failures in CI:
>>
>> 1. On perf-nofib, it fails with:
>>
>> == make boot -j --jobserver-fds=3,4 --no-print-directory;
>>   in /builds/cactus/ghc/nofib/real/smallpt
>> 
>> /builds/cactus/ghc/ghc/bin/ghc  -M -dep-suffix "" -dep-makefile .depend
>> -osuf o -O2 -Wno-tabs -Rghc-timing -H32m -hisuf hi
>> -packageunboxed-ref
>> -rtsopts smallpt.hs
>> : cannot satisfy -package unboxed-ref
>>  (use -v for more information)
>>
>> (e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743141#L1465)
>>
>> 2. On validate-x86_64-darwin, pretty much every test fails because of the
>> following extra stderr output:
>>
>> +
>> +:
>> +warning: Couldn't figure out C compiler information!
>> + Make sure you're using GNU gcc, or clang
>>
>> (e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743129#L3655)
>>
>> 3. On validate-x86_64-linux-deb9-integer-simple, T11545 fails on memory
>> consumption:
>>
>> Unexpected stat failures:
>> perf/compiler/T11545.run  T11545 [stat decreased from
>> x86_64-linux-deb9-integer-simple-validate baseline @
>> 5f3991c7cab8ccc9ab8daeebbfce57afbd9acc33] (normal)
>>
>> This one is interesting because there is already a commit that is
>> supposed
>> to fix this:
>>
>> commit efaad7add092c88eab46e00a9f349d4675bbee06
>> Author: Matthew Pickering 
>> Date:   Wed Jul 21 10:03:42 2021 +0100
>>
>>  Stop ug_boring_info retaining a chain of old CoreExpr
>>
>>  [...]
>>
>>  -
>>  Metric Decrease:
>>  T11545
>>  -
>>
>> But still, it's failing.
>>
>> Can someone kick these build setups please?
>>
>> --
>>
>>.--= ULLA! =-.
>> \ http://gergo.erdi.hu   \
>>  `---= ge...@erdi.hu =---'
>> ___
>> ghc-devs mailing list
>> ghc-devs@haskell.org
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>>
> ___
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI build failures

2021-07-27 Thread Moritz Angermann
You can safely ignore the x86_64-darwin failure. I can get you the juicy
details over a beverage some time. It boils down to some odd behavior using
rosetta2 on AArch64 Mac mini’s to build x86_64 GHCs. There is a fix
somewhere from Ben, so it’s just a question of time until it’s properly
fixed.

The other two I'm afraid I have no idea about. I'll see about restarting
them. (You can't?)

On Tue 27. Jul 2021 at 18:10, ÉRDI Gergő  wrote:

> Hi,
>
> I'm seeing three build failures in CI:
>
> 1. On perf-nofib, it fails with:
>
> == make boot -j --jobserver-fds=3,4 --no-print-directory;
>   in /builds/cactus/ghc/nofib/real/smallpt
> 
> /builds/cactus/ghc/ghc/bin/ghc  -M -dep-suffix "" -dep-makefile .depend
> -osuf o -O2 -Wno-tabs -Rghc-timing -H32m -hisuf hi -packageunboxed-ref
> -rtsopts smallpt.hs
> : cannot satisfy -package unboxed-ref
>  (use -v for more information)
>
> (e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743141#L1465)
>
> 2. On validate-x86_64-darwin, pretty much every test fails because of the
> following extra stderr output:
>
> +
> +:
> +warning: Couldn't figure out C compiler information!
> + Make sure you're using GNU gcc, or clang
>
> (e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743129#L3655)
>
> 3. On validate-x86_64-linux-deb9-integer-simple, T11545 fails on memory
> consumption:
>
> Unexpected stat failures:
> perf/compiler/T11545.run  T11545 [stat decreased from
> x86_64-linux-deb9-integer-simple-validate baseline @
> 5f3991c7cab8ccc9ab8daeebbfce57afbd9acc33] (normal)
>
> This one is interesting because there is already a commit that is supposed
> to fix this:
>
> commit efaad7add092c88eab46e00a9f349d4675bbee06
> Author: Matthew Pickering 
> Date:   Wed Jul 21 10:03:42 2021 +0100
>
>  Stop ug_boring_info retaining a chain of old CoreExpr
>
>  [...]
>
>  -
>  Metric Decrease:
>  T11545
>  -
>
> But still, it's failing.
>
> Can someone kick these build setups please?
>
> --
>
>.--= ULLA! =-.
> \ http://gergo.erdi.hu   \
>  `---= ge...@erdi.hu =---'
> ___
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


CI build failures

2021-07-27 Thread ÉRDI Gergő

Hi,

I'm seeing three build failures in CI:

1. On perf-nofib, it fails with:

== make boot -j --jobserver-fds=3,4 --no-print-directory;
 in /builds/cactus/ghc/nofib/real/smallpt

/builds/cactus/ghc/ghc/bin/ghc  -M -dep-suffix "" -dep-makefile .depend 
-osuf o -O2 -Wno-tabs -Rghc-timing -H32m -hisuf hi -packageunboxed-ref 
-rtsopts smallpt.hs

: cannot satisfy -package unboxed-ref
(use -v for more information)

(e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743141#L1465)

2. On validate-x86_64-darwin, pretty much every test fails because of the 
following extra stderr output:


+
+:
+warning: Couldn't figure out C compiler information!
+ Make sure you're using GNU gcc, or clang

(e.g. https://gitlab.haskell.org/cactus/ghc/-/jobs/743129#L3655)

3. On validate-x86_64-linux-deb9-integer-simple, T11545 fails on memory 
consumption:


Unexpected stat failures:
   perf/compiler/T11545.run  T11545 [stat decreased from x86_64-linux-deb9-integer-simple-validate baseline @ 
5f3991c7cab8ccc9ab8daeebbfce57afbd9acc33] (normal)


This one is interesting because there is already a commit that is supposed 
to fix this:


commit efaad7add092c88eab46e00a9f349d4675bbee06
Author: Matthew Pickering 
Date:   Wed Jul 21 10:03:42 2021 +0100

Stop ug_boring_info retaining a chain of old CoreExpr

[...]

-
Metric Decrease:
T11545
-

But still, it's failing.

Can someone kick these build setups please?

--

  .--= ULLA! =-.
   \ http://gergo.erdi.hu   \
`---= ge...@erdi.hu =---'
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


FYI: Darwin CI currently broken for forks

2021-07-19 Thread Matthew Pickering
Hi all,

There is a configuration issue with the darwin builders which has
meant that for the last 6 days CI has been broken if you have pushed
from a fork because the majority of darwin builders are only
configured to work with branches pushed to the main project. These
failures manifest as timeout errors
(https://gitlab.haskell.org/blamario/ghc/-/jobs/733244).

Hopefully this can be resolved in the coming days.

Cheers,

Matt
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI Status Update

2021-06-17 Thread Ben Gamari
Ben Gamari  writes:

> Hi all,
>
Hi all,

At this point CI should be fully functional on the stable and master
branches again. However, do note that older base commits may refer to
Docker images that are sadly no longer available. Such cases can be
resolved by simply rebasing.

Cheers,

- Ben



___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


CI Status Update

2021-06-15 Thread Ben Gamari
Hi all,

As you may have realized, CI has been a bit of a disaster over the last
few days. It appears that this is just the most recent chapter in our
on-going troubles with Docker image storage, this time due to an outage of
our upstream storage service [1]. Davean and I have started to implement
a plan to migrate away from DreamObjects back to local storage.
Unfortunately to complete this migration we need DreamObjects to come
back online; I am currently waiting until this occurs. Further updates
will come as the situation develops.

Cheers,

- Ben

[1] https://www.dreamhoststatus.com/


___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


[CI] macOS builds

2021-06-05 Thread Moritz Angermann
Hi there!

You might have seen failed, stuck or pending darwin builds. The CI
builders we were generously donated have ~250GB of disk space (which should
be absolutely adequate for what we do), but macOS Big Sur does some odd
reservation of 200GB in /System/Volumes/Data; this is despite automatic
updates being disabled and Time Machine being disabled.

It used to happen only when the system was expecting an update to be
performed and the 200GB were freed after the update was done. After the
latest update to 11.4, however, it seems to have not freed that space. This
leaves the CI machine with ~50GB for the system + build tools + gitlab
checkouts and builds, and they frequently run out of space :-/

If someone knows how to prevent the system from doing stupid stuff like
this (my hunch is it's keeping a backup of the system pre-update, for
disaster recovery), please come forward; my google searches haven't
revealed anything useful yet.

I have filed a TSI with Apple (still had a few on my developer account),
but I don't expect them to come back to me before the end of June. Next
week is WWDC, and there will be a massive backlog of issues that queued up
leading up to, and during the WWDC.  I've also only had very marginal
success with them resolving issues that were not "you wrote this program
wrong".

If everything fails, maybe the solution is to attach some USB-C SSDs to the
Macs and have gitlab builds run exclusively on those disks. I'm a bit
concerned about performance, but we would have to see.

Any ideas are welcome; please also feel free to hit me up on
libera.chat #ghc or the Haskell Foundation Slack.

Cheers,
 Moritz
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Darwin CI Status

2021-05-20 Thread Matthew Pickering
Thanks Moritz for that update.

The latest is that currently darwin CI is disabled and the merge train
is unblocked (*choo choo*).

I am testing Moritz's patches to speed-up CI and will merge them in
shortly to get darwin coverage back.

Cheers,

Matt

On Wed, May 19, 2021 at 9:46 AM Moritz Angermann
 wrote:
>
> Matt has access to the M1 builder in my closet now. The darwin performance 
> issue
> is mainly there since BigSur, and (afaik) primarily due to the amount of 
> DYLD_LIBRARY_PATH's
> we pass to GHC invocations. The system linker spends the majority of the time 
> in the
> kernel stat'ing and getelements (or some similar directory) call for each and 
> every possible
> path.
>
> Switching to hadrian will cut down the time from ~5hs to ~2hs. At some point 
> we had make
> builds <90min by just killing all DYLD_LIBRARY_PATH logic we ever had, but 
> that broke
> bindists.
>
> The CI has time values attached and some summary at the end right now, which 
> highlights
> time spent in the system and in user mode. This is up to 80% sys, 20% user, 
> and went to
> something like 20% sys, 80% user after nuking all DYLD_LIBRARY_PATH's, with 
> hadrian it's
> closer to ~25% sys, 75% user.
>
> Of note, this is mostly due to time spent during the *test-suite*, not the 
> actual build. For the
> actual build make and hadrian are comparable, though I've seen hadrian to 
> oddly have a
> much higher variance in how long it takes to *build* ghc, whereas the make 
> build was more
> consistent.
>
> The test-suite quite notoriously calls GHC *a lot of times*, which makes any 
> linker issue due
> to DYLD_LIBRARY_PATH (and similar lookups) much worse.
>
> If we would finally split building and testing, we'd see this more clearly I 
> believe. Maybe this
> is motivation enough for someone to come forward to break build/test into two 
> CI steps?
>
> Cheers,
>  Moritz
>
> On Wed, May 19, 2021 at 4:14 PM Matthew Pickering 
>  wrote:
>>
>> Hi all,
>>
>> The darwin pipelines are gumming up the merge pipeline as they are
>> taking over 4 hours to complete on average.
>>
>> I am going to disable them -
>> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5785
>>
>> Please can someone give me access to one of the M1 builders so I can
>> debug why the tests are taking so long. Once I have fixed the issue
>> then I will enable the pipelines.
>>
>> Cheers,
>>
>> Matt
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: Darwin CI Status

2021-05-19 Thread Moritz Angermann
Matt has access to the M1 builder in my closet now. The darwin performance
issue has mainly been there since Big Sur, and is (afaik) primarily due to
the number of DYLD_LIBRARY_PATH entries we pass to GHC invocations. The
system linker spends the majority of its time in the kernel, stat'ing and
making getelements (or some similar directory) calls for each and every
possible path.

Switching to hadrian will cut down the time from ~5 hrs to ~2 hrs. At some
point we had make builds <90 min by just killing all DYLD_LIBRARY_PATH
logic we ever had, but that broke bindists.

The CI has time values attached and some summary at the end right now,
which highlights time spent in the system and in user mode. This was up to
80% sys, 20% user, and went to something like 20% sys, 80% user after
nuking all the DYLD_LIBRARY_PATH entries; with hadrian it's closer to
~25% sys, 75% user.

Of note, this is mostly due to time spent during the *test-suite*, not the
actual build. For the actual build, make and hadrian are comparable, though
I've seen hadrian oddly have a much higher variance in how long it takes to
*build* ghc, whereas the make build was more consistent.

The test-suite quite notoriously calls GHC *a lot of times*, which makes
any linker issue due
to DYLD_LIBRARY_PATH (and similar lookups) much worse.
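Back-of-the-envelope, the blow-up looks roughly like this (all three numbers
below are invented, just to show the shape of the problem):

  -- stat calls ~= directories on the search path
  --             * dylibs looked up per GHC invocation
  --             * GHC invocations made by the testsuite
  candidateStats :: Int -> Int -> Int -> Int
  candidateStats dirs dylibs invocations = dirs * dylibs * invocations

  -- candidateStats 40 30 7000 == 8400000 path probes, nearly all of them
  -- kernel time, which is what the sys% numbers above reflect.
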

If we finally split building and testing, we'd see this more clearly, I
believe. Maybe this is motivation enough for someone to come forward and
break build/test into two CI steps?

Cheers,
 Moritz

On Wed, May 19, 2021 at 4:14 PM Matthew Pickering <
matthewtpicker...@gmail.com> wrote:

> Hi all,
>
> The darwin pipelines are gumming up the merge pipeline as they are
> taking over 4 hours to complete on average.
>
> I am going to disable them -
> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5785
>
> Please can someone give me access to one of the M1 builders so I can
> debug why the tests are taking so long. Once I have fixed the issue
> then I will enable the pipelines.
>
> Cheers,
>
> Matt
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Darwin CI Status

2021-05-19 Thread Matthew Pickering
Hi all,

The darwin pipelines are gumming up the merge pipeline as they are
taking over 4 hours to complete on average.

I am going to disable them -
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5785

Please can someone give me access to one of the M1 builders so I can
debug why the tests are taking so long. Once I have fixed the issue
then I will enable the pipelines.

Cheers,

Matt
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: HLint in the GHC CI, an eight-months retrospective

2021-03-26 Thread Hécate

Hi Richard,

I am sorry, I have indeed forgotten one of the most important parts of my 
email. :)


The Hadrian rules are

lint:base
lint:compiler


You can invoke them as simply as:


./hadrian/build lint:base




You need to have a recent version of HLint in your PATH. If you use 
ghc.nix, this should be taken care of for you.


Hope it clarified things!

Cheers,
Hécate

On 25 March 2021 at 21:39:15, Richard Eisenberg  wrote:


Thanks for this update! Glad to know this effort is going well.

One quick question: suppose I am editing something in `base`. My 
understanding is that my edit will be linted. How can I run hlint locally 
so that I can easily respond to trouble before CI takes a crack? And where 
would I learn this information (that is, how to run hlint locally)?


Thanks!
Richard


On Mar 25, 2021, at 11:19 AM, Hécate  wrote:

Hello fellow devs,

this email is an activity report on the integration of the HLint[0] tool in 
the Continuous Integration (CI) pipelines.


On Jul. 5, 2020 I opened a discussion ticket[1] on the topic of code 
linting in the several components of the GHC code-base. It has served as a 
reference anchor for the Merge Requests (MR) that stemmed from it, and 
allowed us to refine our expectations and processes. If you are not 
acquainted with its content, I invite you to read the whole conversation.


Subsequently, several Hadrian lint rules have been integrated in the 
following months, in order to run HLint on targeted components of the GHC 
repository (the base library, the compiler code-base, etc).
Being satisfied with the state of the rules we applied to the code-base, 
such as removing extraneous pragmata and keywords, it was decided to 
integrate the base library linting rule in the CI. This was five months 
ago, in September[2], and I am happy to report that developer friction has 
been so far minimal.
In parallel to this work on the base library, I took care of cleaning-up 
the compiler, and harmonised the various micro coding styles that have 
emerged quite organically during the decades of development that are behind 
us (I never realised how many variations of the same ten lines of pragmata 
could coexist in the same folders).
Upon feedback from stakeholders of this sub-code base, the rules file was 
altered to better suit their development needs, such as not removing 
extraneous `do` keywords, as they are useful to introduce a block in which 
debug statements can be easily inserted.


Since today, the linting of the compiler code-base has been integrated in 
our CI pipelines, without further burdening our CI times.
Things seem to run smoothly, and I welcome comments and requests of any 
kind related to this area of our code quality process.


Regarding our future plans, there has been a discussion about integrating 
such a linting mechanism for our C code-base, in the RTS. Nothing is 
formally established yet, so I would be grateful if people who have 
experience and wisdom about it can chime in to contribute to the 
discussion: https://gitlab.haskell.org/ghc/ghc/-/issues/19437.


And I would like to say that I am overall very thankful for the involvement 
of the people who have been giving us feedback and have been reviewing the 
resulting MRs.


Have a very nice day,
Hécate

---
[0]: https://github.com/ndmitchell/hlint
[1]: https://gitlab.haskell.org/ghc/ghc/-/issues/18424
[2]: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4147

--
Hécate ✨
: @TechnoEmpress
IRC: Uniaika
WWW: https://glitchbra.in
RUN: BSD

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: HLint in the GHC CI, an eight-months retrospective

2021-03-25 Thread Richard Eisenberg
Thanks for this update! Glad to know this effort is going well.

One quick question: suppose I am editing something in `base`. My understanding 
is that my edit will be linted. How can I run hlint locally so that I can 
easily respond to trouble before CI takes a crack? And where would I learn this 
information (that is, how to run hlint locally)?

Thanks!
Richard

> On Mar 25, 2021, at 11:19 AM, Hécate  wrote:
> 
> Hello fellow devs,
> 
> this email is an activity report on the integration of the HLint[0] tool in 
> the Continuous Integration (CI) pipelines.
> 
> On Jul. 5, 2020 I opened a discussion ticket[1] on the topic of code linting 
> in the several components of the GHC code-base. It has served as a reference 
> anchor for the Merge Requests (MR) that stemmed from it, and allowed us to 
> refine our expectations and processes. If you are not acquainted with its 
> content, I invite you to read the whole conversation.
> 
> Subsequently, several Hadrian lint rules have been integrated in the 
> following months, in order to run HLint on targeted components of the GHC 
> repository (the base library, the compiler code-base, etc).
> Being satisfied with the state of the rules we applied to the code-base, such 
> as removing extraneous pragmata and keywords, it was decided to integrate the 
> base library linting rule in the CI. This was five months ago, in 
> September[2], and I am happy to report that developer friction has been so 
> far minimal.
> In parallel to this work on the base library, I took care of cleaning-up the 
> compiler, and harmonised the various micro coding styles that have emerged 
> quite organically during the decades of development that are behind us (I 
> never realised how many variations of the same ten lines of pragmata could 
> coexist in the same folders).
> Upon feedback from stakeholders of this sub-code base, the rules file was 
> altered to better suit their development needs, such as not removing 
> extraneous `do` keywords, as they are useful to introduce a block in which 
> debug statements can be easily inserted.
> 
> Since today, the linting of the compiler code-base has been integrated in our 
> CI pipelines, without further burdening our CI times.
> Things seem to run smoothly, and I welcome comments and requests of any kind 
> related to this area of our code quality process.
> 
> Regarding our future plans, there has been a discussion about integrating 
> such a linting mechanism for our C code-base, in the RTS. Nothing is formally 
> established yet, so I would be grateful if people who have experience and 
> wisdom about it can chime in to contribute to the discussion: 
> https://gitlab.haskell.org/ghc/ghc/-/issues/19437.
> 
> And I would like to say that I am overall very thankful for the involvement 
> of the people who have been giving us feedback and have been reviewing the 
> resulting MRs.
> 
> Have a very nice day,
> Hécate
> 
> ---
> [0]: https://github.com/ndmitchell/hlint
> [1]: https://gitlab.haskell.org/ghc/ghc/-/issues/18424
> [2]: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4147
> 
> -- 
> Hécate ✨
> : @TechnoEmpress
> IRC: Uniaika
> WWW: https://glitchbra.in
> RUN: BSD
> 
> ___
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


HLint in the GHC CI, an eight-months retrospective

2021-03-25 Thread Hécate

Hello fellow devs,

this email is an activity report on the integration of the HLint[0] tool 
in the Continuous Integration (CI) pipelines.


On Jul. 5, 2020 I opened a discussion ticket[1] on the topic of code 
linting in the several components of the GHC code-base. It has served as 
a reference anchor for the Merge Requests (MR) that stemmed from it, and 
allowed us to refine our expectations and processes. If you are not 
acquainted with its content, I invite you to read the whole conversation.


Subsequently, several Hadrian lint rules have been integrated in the 
following months, in order to run HLint on targeted components of the 
GHC repository (the base library, the compiler code-base, etc).
Being satisfied with the state of the rules we applied to the code-base, 
such as removing extraneous pragmata and keywords, it was decided to 
integrate the base library linting rule in the CI. This was five months 
ago, in September[2], and I am happy to report that developer friction 
has been so far minimal.
In parallel to this work on the base library, I took care of cleaning-up 
the compiler, and harmonised the various micro coding styles that have 
emerged quite organically during the decades of development that are 
behind us (I never realised how many variations of the same ten lines of 
pragmata could coexist in the same folders).
Upon feedback from stakeholders of this sub-code base, the rules file 
was altered to better suit their development needs, such as not removing 
extraneous `do` keywords, as they are useful to introduce a block in 
which debug statements can be easily inserted.


Since today, the linting of the compiler code-base has been integrated 
in our CI pipelines, without further burdening our CI times.
Things seem to run smoothly, and I welcome comments and requests of any 
kind related to this area of our code quality process.


Regarding our future plans, there has been a discussion about 
integrating such a linting mechanism for our C code-base, in the RTS. 
Nothing is formally established yet, so I would be grateful if people 
who have experience and wisdom about it can chime in to contribute to 
the discussion: https://gitlab.haskell.org/ghc/ghc/-/issues/19437.


And I would like to say that I am overall very thankful for the 
involvement of the people who have been giving us feedback and have been 
reviewing the resulting MRs.


Have a very nice day,
Hécate

---
[0]: https://github.com/ndmitchell/hlint
[1]: https://gitlab.haskell.org/ghc/ghc/-/issues/18424
[2]: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4147

--
Hécate ✨
: @TechnoEmpress
IRC: Uniaika
WWW: https://glitchbra.in
RUN: BSD

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-03-24 Thread Andreas Klebinger

> What about the case where the rebase *lessens* the improvement? That
is, you're expecting these 10 cases to improve, but after a rebase, only
1 improves. That's news! But a blanket "accept improvements" won't tell you.

I don't think that scenario currently triggers a CI failure. So this
wouldn't really change.

As I understand it the current logic is:

* Run tests
* Check if any cross the metric thresholds set in the test.
* If so check if that test is allowed to cross the threshold.

I believe we don't check that all benchmarks listed with an expected
in/decrease actually do so.
It would also be hard to do so reasonably without making it even harder
to push MRs through CI.
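For reference, my mental model of that check is roughly the following. This
is a simplification, not the actual testsuite driver (which lives in
testsuite/driver); the types and names are made up:

  data Change = Change
    { testName :: String
    , relDelta :: Double   -- (new - baseline) / baseline
    }

  -- A perf test fails CI only if it crosses its per-test window *and* the
  -- change is not listed under "Metric Increase:"/"Metric Decrease:" in the
  -- commit message.
  failsCI :: Double -> [String] -> Change -> Bool
  failsCI window accepted c =
    abs (relDelta c) > window && testName c `notElem` accepted

  -- failsCI 0.02 ["T11545"] (Change "T11545" (-0.05))  ==  False
  -- failsCI 0.02 []         (Change "T9233"    0.035)  ==  True
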

Andreas

On 24/03/2021 at 13:08, Richard Eisenberg wrote:

What about the case where the rebase *lessens* the improvement? That is, you're expecting 
these 10 cases to improve, but after a rebase, only 1 improves. That's news! But a 
blanket "accept improvements" won't tell you.

I'm not hard against this proposal, because I know precise tracking has its own 
costs. Just wanted to bring up another scenario that might be factored in.

Richard


On Mar 24, 2021, at 7:44 AM, Andreas Klebinger  wrote:

After the idea of letting marge accept unexpected perf improvements and
looking at https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4759
which failed because of a single test, for a single build flavour
crossing the
improvement threshold where CI fails after rebasing I wondered.

When would accepting an unexpected perf improvement ever backfire?

In practice I either have a patch that I expect to improve performance
for some things
so I want to accept whatever gains I get. Or I don't expect improvements
so it's *maybe*
worth failing CI for in case I optimized away some code I shouldn't or
something of that
sort.

How could this be actionable? Perhaps having a set of indicators for CI of
"Accept allocation decreases"
"Accept residency decreases"

Would be saner. I have personally *never* gotten value out of the
requirement
to list the individual tests that improve. Usually a whole lot of them do.
Some cross
the threshold so I add them. If I'm unlucky I have to rebase and a new
one might
make it across the threshold.

Being able to accept improvements (but not regressions) wholesale might be a
reasonable alternative.

Opinions?

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-03-24 Thread Moritz Angermann
Yes, this is exactly one of the issues that marge might run into as well:
the aggregate ends up performing differently from the individual ones. Now
we have marge to ensure that at least the aggregate builds together, which
is the whole point of these merge trains: not to end up in a situation
where two patches that are fine on their own produce a broken merged
state that doesn't build anymore.

Now we have marge to ensure every commit is buildable. Next we should run
regression tests on all commits on master (and that includes each and
every one that marge brings into master). Then we have visualisation that
tells us how performance metrics go up/down over time, and we can drill
down into commits if they yield interesting results either way.

Now let's say you had a commit that should have made GHC 50% faster across
the board, but somehow, after being aggregated with other patches, this
didn't happen anymore. We'd still expect this to somehow show up in each
of the individual commits on master, right?

On Wed, Mar 24, 2021 at 8:09 PM Richard Eisenberg  wrote:

> What about the case where the rebase *lessens* the improvement? That is,
> you're expecting these 10 cases to improve, but after a rebase, only 1
> improves. That's news! But a blanket "accept improvements" won't tell you.
>
> I'm not hard against this proposal, because I know precise tracking has
> its own costs. Just wanted to bring up another scenario that might be
> factored in.
>
> Richard
>
> > On Mar 24, 2021, at 7:44 AM, Andreas Klebinger 
> wrote:
> >
> > After the idea of letting marge accept unexpected perf improvements and
> > looking at https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4759
> > which failed because of a single test, for a single build flavour
> > crossing the
> > improvement threshold where CI fails after rebasing I wondered.
> >
> > When would accepting an unexpected perf improvement ever backfire?
> >
> > In practice I either have a patch that I expect to improve performance
> > for some things
> > so I want to accept whatever gains I get. Or I don't expect improvements
> > so it's *maybe*
> > worth failing CI for in case I optimized away some code I shouldn't or
> > something of that
> > sort.
> >
> > How could this be actionable? Perhaps having a set of indicators for CI of
> > "Accept allocation decreases"
> > "Accept residency decreases"
> >
> > Would be saner. I have personally *never* gotten value out of the
> > requirement
> > to list the individual tests that improve. Usually a whole lot of them do.
> > Some cross
> > the threshold so I add them. If I'm unlucky I have to rebase and a new
> > one might
> > make it across the threshold.
> >
> > Being able to accept improvements (but not regressions) wholesale might
> be a
> > reasonable alternative.
> >
> > Opinions?
> >
> > ___
> > ghc-devs mailing list
> > ghc-devs@haskell.org
> > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
> ___
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-03-24 Thread Richard Eisenberg
What about the case where the rebase *lessens* the improvement? That is, you're 
expecting these 10 cases to improve, but after a rebase, only 1 improves. 
That's news! But a blanket "accept improvements" won't tell you.

I'm not hard against this proposal, because I know precise tracking has its own 
costs. Just wanted to bring up another scenario that might be factored in.

Richard

> On Mar 24, 2021, at 7:44 AM, Andreas Klebinger  
> wrote:
> 
> After the idea of letting marge accept unexpected perf improvements and
> looking at https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4759
> which failed because of a single test, for a single build flavour
> crossing the
> improvement threshold where CI fails after rebasing I wondered.
> 
> When would accepting an unexpected perf improvement ever backfire?
> 
> In practice I either have a patch that I expect to improve performance
> for some things
> so I want to accept whatever gains I get. Or I don't expect improvements
> so it's *maybe*
> worth failing CI for in case I optimized away some code I shouldn't or
> something of that
> sort.
> 
> How could this be actionable? Perhaps having a set of indicators for CI of
> "Accept allocation decreases"
> "Accept residency decreases"
> 
> Would be saner. I have personally *never* gotten value out of the
> requirement
> to list the individual tests that improve. Usually a whole lot of them do.
> Some cross
> the threshold so I add them. If I'm unlucky I have to rebase and a new
> one might
> make it across the threshold.
> 
> Being able to accept improvements (but not regressions) wholesale might be a
> reasonable alternative.
> 
> Opinions?
> 
> ___
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-03-24 Thread Andreas Klebinger

After the idea of letting marge accept unexpected perf improvements, and
after looking at https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4759
(which failed after rebasing because a single test, for a single build
flavour, crossed the improvement threshold), I wondered:

When would accepting an unexpected perf improvement ever backfire?

In practice I either have a patch that I expect to improve performance
for some things, so I want to accept whatever gains I get. Or I don't
expect improvements, so it's *maybe* worth failing CI for in case I
optimized away some code I shouldn't have, or something of that sort.

How could this be actionable? Perhaps having a set of indicators for CI
such as "Accept allocation decreases" and "Accept residency decreases"
would be saner. I have personally *never* gotten value out of the
requirement to list the individual tests that improve. Usually a whole
lot of them do. Some cross the threshold so I add them. If I'm unlucky
I have to rebase and a new one might make it across the threshold.

Being able to accept improvements (but not regressions) wholesale might be a
reasonable alternative.

Opinions?
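
As a sketch of what I mean (simplified, and only for metrics where smaller
is better, like allocations; the names are made up):

  data Verdict = Pass | UnexpectedRegression Double
    deriving Show

  -- Keep the per-test window for regressions, but accept any improvement
  -- wholesale instead of requiring it to be listed in the commit message.
  judge :: Double    -- baseline value (base commit)
        -> Double    -- new value (this MR)
        -> Double    -- window, e.g. 0.02 for 2%
        -> Verdict
  judge baseline new window
    | change <= 0      = Pass                        -- improvement: always fine
    | change <= window = Pass                        -- within the usual window
    | otherwise        = UnexpectedRegression change
    where
      change = (new - baseline) / baseline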

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


RE: On CI

2021-03-18 Thread Ben Gamari
Simon Peyton Jones via ghc-devs  writes:

> > We need to do something about this, and I'd advocate for just not making 
> > stats fail with marge.
>
> Generally I agree. One point you don’t mention is that our perf tests
> (which CI forces us to look at assiduously) are often pretty weird
> cases. So there is at least a danger that these more exotic cases will
> stand in the way of (say) a perf improvement in the typical case.
>
> But “not making stats fail” is a bit crude.   Instead how about
>
To be clear, the proposal isn't to accept stats failures for merge request
validation jobs. I believe Moritz was merely suggesting that we accept
such failures in marge-bot validations (that is, the pre-merge
validation done on batches of merge requests).

In my opinion this is reasonable since we know that all of the MRs in
the batch do not individually regress. While it's possible that
interactions between two or more MRs result in a qualitative change in
performance, it seems quite unlikely. What is far *more* likely (and
what we see regularly) is that the cumulative effect of a batch of
improving patches pushes the batch's overall stat change out of the
acceptance threshold. This is quite annoying as it dooms the entire
batch.
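
To illustrate the accumulation effect with made-up numbers: relative changes
compose multiplicatively, so two MRs that each improve a metric by well under
a 2% window can still push the batch past it.

  batchChange :: [Double] -> Double
  batchChange deltas = product (map (1 +) deltas) - 1

  -- batchChange [-0.015, -0.012]  ~=  -0.0268
  -- i.e. a 2.7% decrease: each MR is comfortably inside a +/-2% window,
  -- the batch is not.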

For this reason, I think we should at the very least accept stat
improvements during Marge validations (as you suggest). I agree that we
probably want a batch to fail if two patches accumulate to form a
regression, even if the two passed CI individually.

>   * We already have per-benchmark windows. If the stat falls outside
>   the window, we fail. You are effectively saying “widen all windows
>   to infinity”. If something makes a stat 10 times worse, I think we
>   *should* fail. But 10% worse? Maybe we should accept and look later
>   as you suggest. So I’d argue for widening the windows rather than
>   disabling them completely.
>
Yes, I agree.
>
>   * If we did that we’d need good instrumentation to spot steps and
>   drift in perf, as you say. An advantage is that since the perf
>   instrumentation runs only on committed master patches, not on every
>   CI, it can cost more. In particular , it could run a bunch of
>   “typical” tests, including nofib and compiling Cabal or other
>   libraries.
>
We already have the beginnings of such instrumentation.

> The big danger is that by relieving patch authors from worrying about
> perf drift, it’ll end up in the lap of the GHC HQ team. If it’s hard
> for the author of a single patch (with which she is intimately
> familiar) to work out why it’s making some test 2% worse, imagine how
> hard, and demotivating, it’d be for Ben to wonder why 50 patches (with
> which he is unfamiliar) are making some test 5% worse.
>
Yes, I absolutely agree with this. I would very much like to avoid
having to do this sort of post-hoc investigation any more than
necessary.

Cheers,

- Ben


___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-03-18 Thread Ben Gamari
Karel Gardas  writes:

> On 3/17/21 4:16 PM, Andreas Klebinger wrote:
>> Now that isn't really an issue anyway I think. The question is rather is
>> 2% a large enough regression to worry about? 5%? 10%?
>
> 5-10% is still around system noise even on lightly loaded workstation.
> Not sure if CI is not run on some shared cloud resources where it may be
> even higher.
>
I think when we say "performance" we should be clear about what we are
referring to. Currently, GHC does not measure instructions/cycles/time.
We only measure allocations and residency. These are significantly more
deterministic than time measurements, even on cloud hardware.

I do think that eventually we should start to measure a broader spectrum
of metrics, but this is something that can be done on dedicated hardware
as a separate CI job.

> I've done a simple experiment of pinning ghc compiling ghc-cabal and I've
> been able to "speed" it up by 5-10% on W-2265.
>
Do note that once we switch to Hadrian ghc-cabal will vanish entirely
(since Hadrian implements its functionality directly).

> Also following this CI/performance regs discussion I'm not entirely sure
> if  this is not just a witch-hunt hurting/beating mostly most active GHC
> developers. Another idea may be to give up on CI doing perf reg testing
> at all and invest saved resources into proper investigation of
> GHC/Haskell programs performance. Not sure, if this would not be more
> beneficial longer term.
>
I don't think this would be beneficial. It's much easier to prevent a
regression from getting into the tree than it is to find and
characterise it after it has been merged.

> Just one random number thrown to the ring. Linux's perf claims that
> nearly every second L3 cache access on the example above ends with cache
> miss. Is it a good number or bad number? See stats below (perf stat -d
> on ghc with +RTS -T -s -RTS').
>
It is very hard to tell; it sounds bad but it is not easy to know why or
whether it is possible to improve. This is one of the reasons why I have
been trying to improve sharing within GHC recently; reducing residency should
improve cache locality.

Nevertheless, the difficulty interpreting architectural events is why I
generally only use `perf` for differential measurements.

Cheers,

- Ben



___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-03-18 Thread John Ericson
My guess is most of the "noise" is not run time, but the compiled code 
changing in hard to predict ways.


https://gitlab.haskell.org/ghc/ghc/-/merge_requests/1776/diffs for 
example was a very small PR that took *months* of on-off work to get the 
metrics tests passing. In the end, binding `is_boot` twice helped a bit, 
and dumb luck helped a little bit more. No matter how you analyze that, 
that's a lot of pain for what's manifestly a performance-irrelevant MR 
--- no one is writing 10,000 default methods or whatever it would take to 
make this micro-optimizing worth it!


Perhaps this is an extreme example, but my rough sense is that it's not 
an isolated outlier.


John

On 3/18/21 1:39 PM, davean wrote:
I left the wiggle room for things like longer wall time causing more 
time events in the IO Manager/RTS which can be a thermal/HW issue.

They're small and indirect though

-davean

On Thu, Mar 18, 2021 at 1:37 PM Sebastian Graf wrote:


To be clear: All performance tests that run as part of CI measure
allocations only. No wall clock time.
Those measurements are (mostly) deterministic and reproducible
between compiles of the same worktree and not impacted by thermal
issues/hardware at all.

On Thu, 18 Mar 2021 at 18:09, davean wrote:

That really shouldn't be near system noise for a well
constructed performance test. You might be seeing things like
thermal issues, etc though - good benchmarking is a serious
subject.
Also we're not talking wall clock tests, we're talking
specific metrics. The machines do tend to be bare metal, but
many of these are entirely CPU performance independent, memory
timing independent, etc. Well not quite but that's a longer
discussion.

The investigation of Haskell code performance is a very good
thing to do BTW, but you'd still want to avoid regressions in
the improvements you made. How well we can do that and the
cost of it is the primary issue here.

-davean


On Wed, Mar 17, 2021 at 6:22 PM Karel Gardas wrote:

On 3/17/21 4:16 PM, Andreas Klebinger wrote:
> Now that isn't really an issue anyway I think. The
question is rather is
> 2% a large enough regression to worry about? 5%? 10%?

5-10% is still around system noise even on lightly loaded
workstation.
Not sure if CI is not run on some shared cloud resources
where it may be
even higher.

I've done a simple experiment of pinning ghc compiling ghc-cabal and
I've been able to "speed" it up by 5-10% on W-2265.

Also following this CI/performance regs discussion I'm not entirely sure
if this is not just a witch-hunt hurting/beating mostly most active GHC
developers. Another idea may be to give up on CI doing perf reg testing
at all and invest saved resources into proper investigation of
GHC/Haskell programs performance. Not sure, if this would not be more
beneficial longer term.

Just one random number thrown to the ring. Linux's perf claims that
nearly every second L3 cache access on the example above ends with cache
miss. Is it a good number or bad number? See stats below (perf stat -d
on ghc with +RTS -T -s -RTS').

Good luck to anybody working on that!

Karel


Linking utils/ghc-cabal/dist/build/tmp/ghc-cabal ...
  61,020,836,136 bytes allocated in the heap
   5,229,185,608 bytes copied during GC
     301,742,768 bytes maximum residency (19 sample(s))
       3,533,000 bytes maximum slop
             840 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2012 colls,     0 par    5.725s   5.731s     0.0028s    0.1267s
  Gen  1        19 colls,     0 par    1.695s   1.696s     0.0893s    0.2636s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   27.849s  ( 32.163s elapsed)
  GC      time    7.419s  (  7.427s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time   35.269s  ( 39.601s elapsed)

  Alloc rate    2,191,122,004 bytes per MUT second

Re: On CI

2021-03-18 Thread davean
I left the wiggle room for things like longer wall time causing more time
events in the IO Manager/RTS which can be a thermal/HW issue.
They're small and indirect though

-davean

On Thu, Mar 18, 2021 at 1:37 PM Sebastian Graf  wrote:

> To be clear: All performance tests that run as part of CI measure
> allocations only. No wall clock time.
> Those measurements are (mostly) deterministic and reproducible between
> compiles of the same worktree and not impacted by thermal issues/hardware
> at all.
>
> On Thu, 18 Mar 2021 at 18:09, davean wrote:
>
>> That really shouldn't be near system noise for a well constructed
>> performance test. You might be seeing things like thermal issues, etc
>> though - good benchmarking is a serious subject.
>> Also we're not talking wall clock tests, we're talking specific metrics.
>> The machines do tend to be bare metal, but many of these are entirely CPU
>> performance independent, memory timing independent, etc. Well not quite but
>> that's a longer discussion.
>>
>> The investigation of Haskell code performance is a very good thing to do
>> BTW, but you'd still want to avoid regressions in the improvements you
>> made. How well we can do that and the cost of it is the primary issue here.
>>
>> -davean
>>
>>
>> On Wed, Mar 17, 2021 at 6:22 PM Karel Gardas 
>> wrote:
>>
>>> On 3/17/21 4:16 PM, Andreas Klebinger wrote:
>>> > Now that isn't really an issue anyway I think. The question is rather
>>> is
>>> > 2% a large enough regression to worry about? 5%? 10%?
>>>
>>> 5-10% is still around system noise even on lightly loaded workstation.
>>> Not sure if CI is not run on some shared cloud resources where it may be
>>> even higher.
>>>
>>> I've done a simple experiment of pinning ghc compiling ghc-cabal and I've
>>> been able to "speed" it up by 5-10% on W-2265.
>>>
>>> Also following this CI/performance regs discussion I'm not entirely sure
>>> if  this is not just a witch-hunt hurting/beating mostly most active GHC
>>> developers. Another idea may be to give up on CI doing perf reg testing
>>> at all and invest saved resources into proper investigation of
>>> GHC/Haskell programs performance. Not sure, if this would not be more
>>> beneficial longer term.
>>>
>>> Just one random number thrown to the ring. Linux's perf claims that
>>> nearly every second L3 cache access on the example above ends with cache
>>> miss. Is it a good number or bad number? See stats below (perf stat -d
>>> on ghc with +RTS -T -s -RTS').
>>>
>>> Good luck to anybody working on that!
>>>
>>> Karel
>>>
>>>
>>> Linking utils/ghc-cabal/dist/build/tmp/ghc-cabal ...
>>>   61,020,836,136 bytes allocated in the heap
>>>5,229,185,608 bytes copied during GC
>>>  301,742,768 bytes maximum residency (19 sample(s))
>>>3,533,000 bytes maximum slop
>>>  840 MiB total memory in use (0 MB lost due to fragmentation)
>>>
>>>  Tot time (elapsed)  Avg pause  Max
>>> pause
>>>   Gen  0  2012 colls, 0 par5.725s   5.731s 0.0028s
>>> 0.1267s
>>>   Gen  119 colls, 0 par1.695s   1.696s 0.0893s
>>> 0.2636s
>>>
>>>   TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
>>>
>>>   SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
>>>
>>>   INITtime0.000s  (  0.000s elapsed)
>>>   MUT time   27.849s  ( 32.163s elapsed)
>>>   GC  time7.419s  (  7.427s elapsed)
>>>   EXITtime0.000s  (  0.010s elapsed)
>>>   Total   time   35.269s  ( 39.601s elapsed)
>>>
>>>   Alloc rate2,191,122,004 bytes per MUT second
>>>
>>>   Productivity  79.0% of total user, 81.2% of total elapsed
>>>
>>>
>>>  Performance counter stats for
>>> '/export/home/karel/sfw/ghc-8.10.3/bin/ghc -H32m -O -Wall -optc-Wall -O0
>>> -hide-all-packages -package ghc-prim -package base -package binary
>>> -package array -package transformers -package time -package containers
>>> -package bytestring -package deepseq -package process -package pretty
>>> -package directory -package filepath -package template-haskell -package
>>> unix --make utils/ghc-cabal/Main.hs -o
>>> utils/ghc-cabal/dist/build/tmp/ghc-cabal -no-user-package-db -Wall
>>> -fno-warn-unused-imports

Re: On CI

2021-03-18 Thread Sebastian Graf
To be clear: All performance tests that run as part of CI measure
allocations only. No wall clock time.
Those measurements are (mostly) deterministic and reproducible between
compiles of the same worktree and not impacted by thermal issues/hardware
at all.
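
As a small standalone illustration (this is not the testsuite driver, just an
example of reading the same kind of counter the perf tests rely on): bytes
allocated can be read via GHC.Stats and, for a given build of a given
program, it barely changes from run to run, unlike wall-clock time.

  -- Compile with: ghc -O Alloc.hs -rtsopts
  -- Run with:     ./Alloc +RTS -T
  import GHC.Stats (getRTSStats, getRTSStatsEnabled, RTSStats(..))
  import Control.Monad (unless)

  main :: IO ()
  main = do
    enabled <- getRTSStatsEnabled
    unless enabled (error "run with +RTS -T to enable RTS stats")
    print (length (show (product [1 .. 5000 :: Integer])))  -- do some allocation
    stats <- getRTSStats
    putStrLn ("allocated_bytes = " ++ show (allocated_bytes stats))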

On Thu, 18 Mar 2021 at 18:09, davean wrote:

> That really shouldn't be near system noise for a well constructed
> performance test. You might be seeing things like thermal issues, etc
> though - good benchmarking is a serious subject.
> Also we're not talking wall clock tests, we're talking specific metrics.
> The machines do tend to be bare metal, but many of these are entirely CPU
> performance independent, memory timing independent, etc. Well not quite but
> that's a longer discussion.
>
> The investigation of Haskell code performance is a very good thing to do
> BTW, but you'd still want to avoid regressions in the improvements you
> made. How well we can do that and the cost of it is the primary issue here.
>
> -davean
>
>
> On Wed, Mar 17, 2021 at 6:22 PM Karel Gardas 
> wrote:
>
>> On 3/17/21 4:16 PM, Andreas Klebinger wrote:
>> > Now that isn't really an issue anyway I think. The question is rather is
>> > 2% a large enough regression to worry about? 5%? 10%?
>>
>> 5-10% is still around system noise even on lightly loaded workstation.
>> Not sure if CI is not run on some shared cloud resources where it may be
>> even higher.
>>
>> I've done a simple experiment of pinning ghc compiling ghc-cabal and I've
>> been able to "speed" it up by 5-10% on W-2265.
>>
>> Also following this CI/performance regs discussion I'm not entirely sure
>> if  this is not just a witch-hunt hurting/beating mostly most active GHC
>> developers. Another idea may be to give up on CI doing perf reg testing
>> at all and invest saved resources into proper investigation of
>> GHC/Haskell programs performance. Not sure, if this would not be more
>> beneficial longer term.
>>
>> Just one random number thrown to the ring. Linux's perf claims that
>> nearly every second L3 cache access on the example above ends with cache
>> miss. Is it a good number or bad number? See stats below (perf stat -d
>> on ghc with +RTS -T -s -RTS').
>>
>> Good luck to anybody working on that!
>>
>> Karel
>>
>>
>> Linking utils/ghc-cabal/dist/build/tmp/ghc-cabal ...
>>   61,020,836,136 bytes allocated in the heap
>>5,229,185,608 bytes copied during GC
>>  301,742,768 bytes maximum residency (19 sample(s))
>>3,533,000 bytes maximum slop
>>  840 MiB total memory in use (0 MB lost due to fragmentation)
>>
>>  Tot time (elapsed)  Avg pause  Max
>> pause
>>   Gen  0  2012 colls, 0 par5.725s   5.731s 0.0028s
>> 0.1267s
>>   Gen  119 colls, 0 par1.695s   1.696s 0.0893s
>> 0.2636s
>>
>>   TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
>>
>>   SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
>>
>>   INITtime0.000s  (  0.000s elapsed)
>>   MUT time   27.849s  ( 32.163s elapsed)
>>   GC  time7.419s  (  7.427s elapsed)
>>   EXITtime0.000s  (  0.010s elapsed)
>>   Total   time   35.269s  ( 39.601s elapsed)
>>
>>   Alloc rate2,191,122,004 bytes per MUT second
>>
>>   Productivity  79.0% of total user, 81.2% of total elapsed
>>
>>
>>  Performance counter stats for
>> '/export/home/karel/sfw/ghc-8.10.3/bin/ghc -H32m -O -Wall -optc-Wall -O0
>> -hide-all-packages -package ghc-prim -package base -package binary
>> -package array -package transformers -package time -package containers
>> -package bytestring -package deepseq -package process -package pretty
>> -package directory -package filepath -package template-haskell -package
>> unix --make utils/ghc-cabal/Main.hs -o
>> utils/ghc-cabal/dist/build/tmp/ghc-cabal -no-user-package-db -Wall
>> -fno-warn-unused-imports -fno-warn-warnings-deprecations
>> -DCABAL_VERSION=3,4,0,0 -DBOOTSTRAPPING -odir bootstrapping -hidir
>> bootstrapping libraries/Cabal/Cabal/Distribution/Fields/Lexer.hs
>> -ilibraries/Cabal/Cabal -ilibraries/binary/src -ilibraries/filepath
>> -ilibraries/hpc -ilibraries/mtl -ilibraries/text/src
>> libraries/text/cbits/cbits.c -Ilibraries/text/include
>> -ilibraries/parsec/src +RTS -T -s -RTS':
>>
>>  39,632.99 msec task-clock#0.999 CPUs
>> utilized
>> 17,191  context-switches  #0.434 K/sec
>>
>>  0

Re: On CI

2021-03-18 Thread davean
That really shouldn't be near system noise for a well-constructed
performance test. You might be seeing things like thermal issues, etc.,
though - good benchmarking is a serious subject.
Also, we're not talking about wall-clock tests; we're talking about specific
metrics. The machines do tend to be bare metal, but many of these metrics
are entirely independent of CPU performance, memory timing, etc. Well, not
quite, but that's a longer discussion.

Investigating Haskell code performance is a very good thing to do, BTW, but
you'd still want to avoid regressing the improvements you make. How well we
can do that, and at what cost, is the primary issue here.

-davean


On Wed, Mar 17, 2021 at 6:22 PM Karel Gardas 
wrote:

> On 3/17/21 4:16 PM, Andreas Klebinger wrote:
> > Now that isn't really an issue anyway I think. The question is rather is
> > 2% a large enough regression to worry about? 5%? 10%?
>
> 5-10% is still around system noise even on lightly loaded workstation.
> Not sure if CI is not run on some shared cloud resources where it may be
> even higher.
>
> I've done simple experiment of pining ghc compiling ghc-cabal and I've
> been able to "speed" it up by 5-10% on W-2265.
>
> Also following this CI/performance regs discussion I'm not entirely sure
> if  this is not just a witch-hunt hurting/beating mostly most active GHC
> developers. Another idea may be to give up on CI doing perf reg testing
> at all and invest saved resources into proper investigation of
> GHC/Haskell programs performance. Not sure, if this would not be more
> beneficial longer term.
>
> Just one random number thrown to the ring. Linux's perf claims that
> nearly every second L3 cache access on the example above ends with cache
> miss. Is it a good number or bad number? See stats below (perf stat -d
> on ghc with +RTS -T -s -RTS').
>
> Good luck to anybody working on that!
>
> Karel
>
>
> Linking utils/ghc-cabal/dist/build/tmp/ghc-cabal ...
>   61,020,836,136 bytes allocated in the heap
>5,229,185,608 bytes copied during GC
>  301,742,768 bytes maximum residency (19 sample(s))
>3,533,000 bytes maximum slop
>  840 MiB total memory in use (0 MB lost due to fragmentation)
>
>  Tot time (elapsed)  Avg pause  Max
> pause
>   Gen  0  2012 colls, 0 par5.725s   5.731s 0.0028s
> 0.1267s
>   Gen  119 colls, 0 par1.695s   1.696s 0.0893s
> 0.2636s
>
>   TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
>
>   SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
>
>   INITtime0.000s  (  0.000s elapsed)
>   MUT time   27.849s  ( 32.163s elapsed)
>   GC  time7.419s  (  7.427s elapsed)
>   EXITtime0.000s  (  0.010s elapsed)
>   Total   time   35.269s  ( 39.601s elapsed)
>
>   Alloc rate2,191,122,004 bytes per MUT second
>
>   Productivity  79.0% of total user, 81.2% of total elapsed
>
>
>  Performance counter stats for
> '/export/home/karel/sfw/ghc-8.10.3/bin/ghc -H32m -O -Wall -optc-Wall -O0
> -hide-all-packages -package ghc-prim -package base -package binary
> -package array -package transformers -package time -package containers
> -package bytestring -package deepseq -package process -package pretty
> -package directory -package filepath -package template-haskell -package
> unix --make utils/ghc-cabal/Main.hs -o
> utils/ghc-cabal/dist/build/tmp/ghc-cabal -no-user-package-db -Wall
> -fno-warn-unused-imports -fno-warn-warnings-deprecations
> -DCABAL_VERSION=3,4,0,0 -DBOOTSTRAPPING -odir bootstrapping -hidir
> bootstrapping libraries/Cabal/Cabal/Distribution/Fields/Lexer.hs
> -ilibraries/Cabal/Cabal -ilibraries/binary/src -ilibraries/filepath
> -ilibraries/hpc -ilibraries/mtl -ilibraries/text/src
> libraries/text/cbits/cbits.c -Ilibraries/text/include
> -ilibraries/parsec/src +RTS -T -s -RTS':
>
>  39,632.99 msec task-clock#0.999 CPUs
> utilized
> 17,191  context-switches  #0.434 K/sec
>
>  0  cpu-migrations#0.000 K/sec
>
>899,930  page-faults   #0.023 M/sec
>
>177,636,979,975  cycles#4.482 GHz
>   (87.54%)
>181,945,795,221  instructions  #1.02  insn per
> cycle   (87.59%)
> 34,033,574,511  branches  #  858.718 M/sec
>   (87.42%)
>  1,664,969,299  branch-misses #4.89% of all
> branches  (87.48%)
> 41,522,737,426  L1-dcache-loads   # 1047.681 M/sec
>   (87.53%)
>  2,675,319,939  L1-dcache-load-misses #6.44% of 

Re: On CI

2021-03-17 Thread Karel Gardas
On 3/17/21 4:16 PM, Andreas Klebinger wrote:
> Now that isn't really an issue anyway I think. The question is rather is
> 2% a large enough regression to worry about? 5%? 10%?

5-10% is still around system noise, even on a lightly loaded workstation.
I'm not sure whether CI runs on shared cloud resources, where the noise may
be even higher.

I've done a simple experiment of pinning ghc while compiling ghc-cabal, and
I've been able to "speed" it up by 5-10% on a W-2265.

Also, following this CI/performance-regression discussion, I'm not entirely
sure this isn't just a witch-hunt that mostly hurts the most active GHC
developers. Another idea may be to give up on CI doing perf regression
testing at all and to invest the saved resources into proper investigation
of the performance of GHC and of Haskell programs. I'm not sure whether that
wouldn't be more beneficial in the longer term.

Just one random number thrown into the ring: Linux's perf claims that
nearly every second L3 cache access in the example above ends in a cache
miss. Is that a good number or a bad one? See the stats below ('perf stat -d'
on ghc with '+RTS -T -s -RTS').

Good luck to anybody working on that!

Karel


Linking utils/ghc-cabal/dist/build/tmp/ghc-cabal ...
  61,020,836,136 bytes allocated in the heap
   5,229,185,608 bytes copied during GC
 301,742,768 bytes maximum residency (19 sample(s))
   3,533,000 bytes maximum slop
 840 MiB total memory in use (0 MB lost due to fragmentation)

 Tot time (elapsed)  Avg pause  Max
pause
  Gen  0  2012 colls, 0 par5.725s   5.731s 0.0028s
0.1267s
  Gen  119 colls, 0 par1.695s   1.696s 0.0893s
0.2636s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INITtime0.000s  (  0.000s elapsed)
  MUT time   27.849s  ( 32.163s elapsed)
  GC  time7.419s  (  7.427s elapsed)
  EXITtime0.000s  (  0.010s elapsed)
  Total   time   35.269s  ( 39.601s elapsed)

  Alloc rate2,191,122,004 bytes per MUT second

  Productivity  79.0% of total user, 81.2% of total elapsed


 Performance counter stats for
'/export/home/karel/sfw/ghc-8.10.3/bin/ghc -H32m -O -Wall -optc-Wall -O0
-hide-all-packages -package ghc-prim -package base -package binary
-package array -package transformers -package time -package containers
-package bytestring -package deepseq -package process -package pretty
-package directory -package filepath -package template-haskell -package
unix --make utils/ghc-cabal/Main.hs -o
utils/ghc-cabal/dist/build/tmp/ghc-cabal -no-user-package-db -Wall
-fno-warn-unused-imports -fno-warn-warnings-deprecations
-DCABAL_VERSION=3,4,0,0 -DBOOTSTRAPPING -odir bootstrapping -hidir
bootstrapping libraries/Cabal/Cabal/Distribution/Fields/Lexer.hs
-ilibraries/Cabal/Cabal -ilibraries/binary/src -ilibraries/filepath
-ilibraries/hpc -ilibraries/mtl -ilibraries/text/src
libraries/text/cbits/cbits.c -Ilibraries/text/include
-ilibraries/parsec/src +RTS -T -s -RTS':

 39,632.99 msec task-clock#0.999 CPUs
utilized
17,191  context-switches  #0.434 K/sec

 0  cpu-migrations#0.000 K/sec

   899,930  page-faults   #0.023 M/sec

   177,636,979,975  cycles#4.482 GHz
  (87.54%)
   181,945,795,221  instructions  #1.02  insn per
cycle   (87.59%)
34,033,574,511  branches  #  858.718 M/sec
  (87.42%)
 1,664,969,299  branch-misses #4.89% of all
branches  (87.48%)
41,522,737,426  L1-dcache-loads   # 1047.681 M/sec
  (87.53%)
 2,675,319,939  L1-dcache-load-misses #6.44% of all
L1-dcache hits(87.48%)
   372,370,395  LLC-loads #9.395 M/sec
  (87.49%)
   173,614,140  LLC-load-misses   #   46.62% of all
LL-cache hits (87.46%)

  39.663103602 seconds time elapsed

  38.288158000 seconds user
   1.358263000 seconds sys


Re: On CI

2021-03-17 Thread Merijn Verstraaten
On 17 Mar 2021, at 16:16, Andreas Klebinger  wrote:
> 
> While I fully agree with this. We should *always* want to know if a small 
> syntetic benchmark regresses by a lot.
> Or in other words we don't want CI to accept such a regression for us ever, 
> but the developer of a patch should need to explicitly ok it.
> 
> Otherwise we just slow down a lot of seldom-used code paths by a lot.
> 
> Now that isn't really an issue anyway I think. The question is rather is 2% a 
> large enough regression to worry about? 5%? 10%?

You probably want a sliding window anyway. Having N 1.8% regressions in a row
can still slow things down a lot, while a 3% regression after a 5% improvement
is probably fine.
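
A rough sketch of that sliding-window idea (the window size, the drift budget
and the function name are all invented):

    -- Per-commit relative changes of one metric, e.g. +0.018 for a 1.8%
    -- regression, -0.05 for a 5% improvement.  A batch is rejected if any
    -- window of k consecutive commits compounds to more than the allowed
    -- drift, even when every individual commit stays under the per-commit
    -- threshold.
    exceedsDrift :: Int -> Double -> [Double] -> Bool
    exceedsDrift k allowed changes = any tooMuch (windows k changes)
      where
        tooMuch w = product (map (1 +) w) > 1 + allowed
        windows n xs
          | length xs < n = []
          | otherwise     = take n xs : windows n (drop 1 xs)

    -- Five 1.8% regressions in a row blow a 5% drift budget:
    --   exceedsDrift 5 0.05 (replicate 5 0.018)  ==  True
    -- A 3% regression right after a 5% improvement does not:
    --   exceedsDrift 2 0.05 [-0.05, 0.03]        ==  False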

- Merijn




Re: On CI

2021-03-17 Thread Andreas Klebinger

> I'd be quite happy to accept a 25% regression on T9872c if it yielded
a 1% improvement on compiling Cabal. T9872 is very very very strange!
(Maybe if *all* the T9872 tests regressed, I'd be more worried.)

While I fully agree with this, we should *always* want to know if a
small synthetic benchmark regresses by a lot.
Or, in other words, we never want CI to accept such a regression for us;
the developer of a patch should need to explicitly OK it.

Otherwise we just slow down a lot of seldom-used code paths by a lot.

Now, that isn't really an issue anyway, I think. The question is rather:
is 2% a large enough regression to worry about? 5%? 10%?

Cheers,
Andreas

On 17/03/2021 at 14:39, Richard Eisenberg wrote:




On Mar 17, 2021, at 6:18 AM, Moritz Angermann
 wrote:

But what do we expect of patch authors? Right now if five people
write patches to GHC, and each of them eventually manage to get their
MRs green, after a long review, they finally see it assigned to
marge, and then it starts failing? Their patch on its own was fine,
but their aggregate with other people's code leads to regressions? So
we now expect all patch authors together to try to figure out what
happened? Figuring out why something regressed is hard enough, and we
only have a very few people who are actually capable of debugging
this. Thus I believe it would end up with Ben, Andreas, Matthiew,
Simon, ... or someone else from GHC HQ anyway to figure out why it
regressed, be it in the Review Stage, or dissecting a marge
aggregate, or on master.


I have previously posted against the idea of allowing Marge to accept
regressions... but the paragraph above is sadly convincing. Maybe
Simon is right about opening up the windows to, say, be 100% (which
would catch a 10x regression) instead of infinite, but I'm now
convinced that Marge should be very generous in allowing regressions
-- provided we also have some way of monitoring drift over time.

Separately, I've been concerned for some time about the peculiarity of
our perf tests. For example, I'd be quite happy to accept a 25%
regression on T9872c if it yielded a 1% improvement on compiling
Cabal. T9872 is very very very strange! (Maybe if *all* the T9872
tests regressed, I'd be more worried.) I would be very happy to learn
that some more general, representative tests are included in our
examinations.

Richard



Re: On CI

2021-03-17 Thread John Ericson
Yes, I think the counterpoint of "automating what Ben does", so people 
besides Ben can do it, is very important. In this case, I think a good 
thing we could do is asynchronously build more of master post-merge, 
such as using the perf stats to automatically bisect anything that is 
fishy, including within marge bot roll-ups, which wouldn't be built by 
the regular workflow anyway.
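
As a sketch of what that post-merge automation could look for, here is a toy
scan over per-commit metrics (the commit ids and numbers are made up; a real
version would sit on top of the perf database and drive an actual bisection):

    type Commit = String

    -- Given the history of one metric on master (oldest first), report the
    -- first commit at which the metric jumped by more than the given
    -- fraction relative to its predecessor.
    firstJump :: Double -> [(Commit, Double)] -> Maybe (Commit, Double, Double)
    firstJump threshold history =
      case [ (c2, m1, m2)
           | ((_, m1), (c2, m2)) <- zip history (drop 1 history)
           , m2 > m1 * (1 + threshold) ] of
        []      -> Nothing
        (x : _) -> Just x

    -- firstJump 0.02 [("a1", 100), ("b2", 101), ("c3", 110), ("d4", 111)]
    --   == Just ("c3", 101.0, 110.0)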


I also agree with Sebastian that the overfit, overly synthetic nature of 
our current tests, plus the sketchy way we have ignored drift, makes the 
current approach worth abandoning in any event. The fact that the gold 
standard must include tests of larger, "real world" code, which 
unfortunately takes longer to build, is, I think, another point in favour 
of this asynchronous approach: we trade MR latency for stat latency, but 
we better utilize our build machines and get better stats, and when a 
human has to fix something a few days later, they have a much better 
foundation to start their investigation.


Finally, I agree with SPJ that, for fairness and sustainability's sake, 
the people investigating issues after the fact should ideally be the MR 
authors, and definitely, definitely not Ben. But I hope that better 
stats, nice-looking graphs, and maybe a system to automatically ping MR 
authors will make perf debugging much more accessible, enabling that 
goal.


John

On 3/17/21 9:47 AM, Sebastian Graf wrote:
Re: Performance drift: I opened 
https://gitlab.haskell.org/ghc/ghc/-/issues/17658 
 a while ago with 
an idea of how to measure drift a bit better.
It's basically an automatically checked version of "Ben stares at 
performance reports every two weeks and sees that T9872 has regressed 
by 10% since 9.0"


Maybe we can have Marge check for drift and each individual MR for 
incremental perf regressions?


Sebastian

On Wed, 17 Mar 2021 at 14:40, Richard Eisenberg  wrote:





On Mar 17, 2021, at 6:18 AM, Moritz Angermann
 wrote:

But what do we expect of patch authors? Right now if five people
write patches to GHC, and each of them eventually manage to get
their MRs green, after a long review, they finally see it
assigned to marge, and then it starts failing? Their patch on its
own was fine, but their aggregate with other people's code leads
to regressions? So we now expect all patch authors together to
try to figure out what happened? Figuring out why something
regressed is hard enough, and we only have a very few people who
are actually capable of debugging this. Thus I believe it would
end up with Ben, Andreas, Matthiew, Simon, ... or someone else
from GHC HQ anyway to figure out why it regressed, be it in the
Review Stage, or dissecting a marge aggregate, or on master.


I have previously posted against the idea of allowing Marge to
accept regressions... but the paragraph above is sadly convincing.
Maybe Simon is right about opening up the windows to, say, be 100%
(which would catch a 10x regression) instead of infinite, but I'm
now convinced that Marge should be very generous in allowing
regressions -- provided we also have some way of monitoring drift
over time.

Separately, I've been concerned for some time about the
peculiarity of our perf tests. For example, I'd be quite happy to
accept a 25% regression on T9872c if it yielded a 1% improvement
on compiling Cabal. T9872 is very very very strange! (Maybe if
*all* the T9872 tests regressed, I'd be more worried.) I would be
very happy to learn that some more general, representative tests
are included in our examinations.

Richard


Re: On CI

2021-03-17 Thread Sebastian Graf
Re: Performance drift: I opened
https://gitlab.haskell.org/ghc/ghc/-/issues/17658 a while ago with an idea
of how to measure drift a bit better.
It's basically an automatically checked version of "Ben stares at
performance reports every two weeks and sees that T9872 has regressed by
10% since 9.0"

Maybe we can have Marge check for drift and each individual MR for
incremental perf regressions?

Sebastian

On Wed, 17 Mar 2021 at 14:40, Richard Eisenberg  wrote:

>
>
> On Mar 17, 2021, at 6:18 AM, Moritz Angermann 
> wrote:
>
> But what do we expect of patch authors? Right now if five people write
> patches to GHC, and each of them eventually manage to get their MRs green,
> after a long review, they finally see it assigned to marge, and then it
> starts failing? Their patch on its own was fine, but their aggregate with
> other people's code leads to regressions? So we now expect all patch
> authors together to try to figure out what happened? Figuring out why
> something regressed is hard enough, and we only have a very few people who
> are actually capable of debugging this. Thus I believe it would end up with
> Ben, Andreas, Matthiew, Simon, ... or someone else from GHC HQ anyway to
> figure out why it regressed, be it in the Review Stage, or dissecting a
> marge aggregate, or on master.
>
>
> I have previously posted against the idea of allowing Marge to accept
> regressions... but the paragraph above is sadly convincing. Maybe Simon is
> right about opening up the windows to, say, be 100% (which would catch a
> 10x regression) instead of infinite, but I'm now convinced that Marge
> should be very generous in allowing regressions -- provided we also have
> some way of monitoring drift over time.
>
> Separately, I've been concerned for some time about the peculiarity of our
> perf tests. For example, I'd be quite happy to accept a 25% regression on
> T9872c if it yielded a 1% improvement on compiling Cabal. T9872 is very
> very very strange! (Maybe if *all* the T9872 tests regressed, I'd be more
> worried.) I would be very happy to learn that some more general,
> representative tests are included in our examinations.
>
> Richard


Re: On CI

2021-03-17 Thread Richard Eisenberg


> On Mar 17, 2021, at 6:18 AM, Moritz Angermann  
> wrote:
> 
> But what do we expect of patch authors? Right now if five people write 
> patches to GHC, and each of them eventually manage to get their MRs green, 
> after a long review, they finally see it assigned to marge, and then it 
> starts failing? Their patch on its own was fine, but their aggregate with 
> other people's code leads to regressions? So we now expect all patch authors 
> together to try to figure out what happened? Figuring out why something 
> regressed is hard enough, and we only have a very few people who are actually 
> capable of debugging this. Thus I believe it would end up with Ben, Andreas, 
> Matthiew, Simon, ... or someone else from GHC HQ anyway to figure out why it 
> regressed, be it in the Review Stage, or dissecting a marge aggregate, or on 
> master.

I have previously posted against the idea of allowing Marge to accept 
regressions... but the paragraph above is sadly convincing. Maybe Simon is 
right about opening up the windows to, say, be 100% (which would catch a 10x 
regression) instead of infinite, but I'm now convinced that Marge should be 
very generous in allowing regressions -- provided we also have some way of 
monitoring drift over time.

Separately, I've been concerned for some time about the peculiarity of our perf 
tests. For example, I'd be quite happy to accept a 25% regression on T9872c if 
it yielded a 1% improvement on compiling Cabal. T9872 is very very very 
strange! (Maybe if *all* the T9872 tests regressed, I'd be more worried.) I 
would be very happy to learn that some more general, representative tests are 
included in our examinations.

Richard


Re: On CI

2021-03-17 Thread Moritz Angermann
I am not advocating dropping perf tests during merge requests; I just want
them not to be fatal for marge batches. Yes, this means that a bunch of
unrelated merge requests could each be fine with respect to the perf checks
per merge request, while the aggregate fails perf, and then the next MR
against the merged aggregate will start failing. Even that is a pretty
bad situation, imo.

I honestly don't have a good answer; I just see marge work on batches, over
and over and over again, just to fail. Eventually marge should figure out a
subset of the merges that fits into the perf window, but that might be after
10 tries, so after up to ~30+ hours, which means there won't be any merge
requests landing in GHC for 30 hours. I find that rather unacceptable.

I think we need better visualisation of perf regressions that happen on
master. Ben has some WIP for this, and I think John said there might be
some way to add a nice (maybe Reflex) UI to it.  If we can see regressions
on master easily, and go from "oh, at this point in time GHC got worse" to
"this is the commit", we might be able to figure it out.

But what do we expect of patch authors? Right now if five people write
patches to GHC, and each of them eventually manage to get their MRs green,
after a long review, they finally see it assigned to marge, and then it
starts failing? Their patch on its own was fine, but their aggregate with
other people's code leads to regressions? So we now expect all patch
authors together to try to figure out what happened? Figuring out why
something regressed is hard enough, and we only have a very few people who
are actually capable of debugging this. Thus I believe it would end up with
Ben, Andreas, Matthiew, Simon, ... or someone else from GHC HQ anyway to
figure out why it regressed, be it in the Review Stage, or dissecting a
marge aggregate, or on master.

Thus I believe in most cases we'd have to look at the regressions anyway,
and right now we just make working on GHC a rather depressing job in a
roundabout way. Increasing the barrier to entry by also requiring everyone
to have absolutely stellar perf-regression-debugging skills is quite a
challenge.

There is also the question of whether our synthetic benchmarks actually
measure real-world performance. Do the micro-benchmarks translate into the
same regressions in, say, building aeson, vector, or Cabal? The latter is
what most practitioners care about more than the micro-benchmarks.

Again, I'm absolutely not in favour of GHC regressing, it's slow enough as
it is. I just think CI should be assisting us and not holding development
back.

Cheers,
 Moritz

On Wed, Mar 17, 2021 at 5:54 PM Spiwack, Arnaud 
wrote:

> Ah, so it was really two identical pipelines (one for the branch where
> Margebot batches commits, and one for the MR that Margebot creates before
> merging). That's indeed a non-trivial amount of purely wasted
> computer-hours.
>
> Taking a step back, I am inclined to agree with the proposal of not
> checking stat regressions in Margebot. My high-level opinion on this is
> that perf tests don't actually test the right thing. Namely, they don't
> prevent performance drift over time (if a given test is allowed to degrade
> by 2% every commit, it can take a 100% performance hit in just 35 commits).
> While it is important to measure performance, and to avoid too egregious
> performance degradation in a given commit, it's usually performance over
> time which matters. I don't really know how to apply it to collaborative
> development, and help maintain healthy performance. But flagging
> performance regressions in MRs, while not making them block batched merges
> sounds like a reasonable compromise.
>
>
> On Wed, Mar 17, 2021 at 9:34 AM Moritz Angermann <
> moritz.angerm...@gmail.com> wrote:
>
>> *why* is a very good question. The MR fixing it is here:
>> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5275
>>
>> On Wed, Mar 17, 2021 at 4:26 PM Spiwack, Arnaud 
>> wrote:
>>
>>> Then I have a question: why are there two pipelines running on each
>>> merge batch?
>>>
>>> On Wed, Mar 17, 2021 at 9:22 AM Moritz Angermann <
>>> moritz.angerm...@gmail.com> wrote:
>>>
>>>> No it wasn't. It was about the stat failures described in the next
>>>> paragraph. I could have been more clear about that. My apologies!
>>>>
>>>> On Wed, Mar 17, 2021 at 4:14 PM Spiwack, Arnaud <
>>>> arnaud.spiw...@tweag.io> wrote:
>>>>
>>>>>
>>>>> and if either of both (see below) failed, marge's merge would fail as
>>>>>> well.
>>>>>>
>>>>>
>>>>> Re: “see below” is this referring to a missing part of your email?
>>>>>
>>>>


Re: On CI

2021-03-17 Thread Spiwack, Arnaud
Ah, so it was really two identical pipelines (one for the branch where
Margebot batches commits, and one for the MR that Margebot creates before
merging). That's indeed a non-trivial amount of purely wasted
computer-hours.

Taking a step back, I am inclined to agree with the proposal of not
checking stat regressions in Margebot. My high-level opinion on this is
that perf tests don't actually test the right thing. Namely, they don't
prevent performance drift over time (if a given test is allowed to degrade
by 2% every commit, it can take a 100% performance hit in just 35 commits).
While it is important to measure performance, and to avoid too egregious a
performance degradation in any given commit, it's usually performance over
time which matters. I don't really know how to apply this to collaborative
development and help maintain healthy performance. But flagging
performance regressions in MRs, while not making them block batched merges,
sounds like a reasonable compromise.
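
(For reference, the compounding arithmetic behind those numbers: 1.02^35 is
approximately 2.0, so thirty-five successive 2% regressions do indeed roughly
double the metric.)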


On Wed, Mar 17, 2021 at 9:34 AM Moritz Angermann 
wrote:

> *why* is a very good question. The MR fixing it is here:
> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5275
>
> On Wed, Mar 17, 2021 at 4:26 PM Spiwack, Arnaud 
> wrote:
>
>> Then I have a question: why are there two pipelines running on each merge
>> batch?
>>
>> On Wed, Mar 17, 2021 at 9:22 AM Moritz Angermann <
>> moritz.angerm...@gmail.com> wrote:
>>
>>> No it wasn't. It was about the stat failures described in the next
>>> paragraph. I could have been more clear about that. My apologies!
>>>
>>> On Wed, Mar 17, 2021 at 4:14 PM Spiwack, Arnaud 
>>> wrote:
>>>

 and if either of both (see below) failed, marge's merge would fail as
> well.
>

 Re: “see below” is this referring to a missing part of your email?

>>>


RE: On CI

2021-03-17 Thread Simon Peyton Jones via ghc-devs
We need to do something about this, and I'd advocate for just not making stats 
fail with marge.

Generally I agree.   One point you don’t mention is that our perf tests (which 
CI forces us to look at assiduously) are often pretty weird cases.  So there is 
at least a danger that these more exotic cases will stand in the way of (say) a 
perf improvement in the typical case.

But “not making stats fail” is a bit crude.   Instead, how about:

  *   Always accept stat improvements.

  *   We already have per-benchmark windows.  If the stat falls outside the
window, we fail.  You are effectively saying “widen all windows to infinity”.
If something makes a stat 10 times worse, I think we *should* fail.  But 10%
worse?  Maybe we should accept and look later, as you suggest.  So I’d argue
for widening the windows rather than disabling them completely (see the
sketch after this list).

  *   If we did that we’d need good instrumentation to spot steps and drift in
perf, as you say.  An advantage is that since the perf instrumentation runs
only on committed master patches, not on every CI run, it can cost more.  In
particular, it could run a bunch of “typical” tests, including nofib and
compiling Cabal or other libraries.
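
To make that concrete, here is a minimal sketch of such an asymmetric check
(the verdict type, the function and the thresholds are invented for
illustration; this is not the testsuite's actual acceptance logic):

    data Verdict = Accept | AcceptWithNote | Fail
      deriving (Eq, Show)

    -- 'warnAt' is roughly today's tight per-test window (e.g. 0.02 for 2%);
    -- 'failAt' is the widened window (e.g. 9.0, i.e. a 10x blow-up).
    judge :: Double -> Double -> Double -> Double -> Verdict
    judge warnAt failAt baseline measured
      | measured <= baseline = Accept          -- improvements always pass
      | relative <= warnAt   = Accept          -- within the ordinary window
      | relative <= failAt   = AcceptWithNote  -- land it, flag it on the dashboard
      | otherwise            = Fail            -- a 10x blow-up still fails
      where
        relative = measured / baseline - 1

    -- judge 0.02 9.0 1.0e9 1.1e9   ==  AcceptWithNote   (10% worse)
    -- judge 0.02 9.0 1.0e9 1.2e10  ==  Fail             (12x worse)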

The big danger is that by relieving patch authors from worrying about perf 
drift, it’ll end up in the lap of the GHC HQ team.  If it’s hard for the author 
of a single patch (with which she is intimately familiar) to work out why it’s 
making some test 2% worse, imagine how hard, and demotivating, it’d be for Ben 
to wonder why 50 patches (with which he is unfamiliar) are making some test 5% 
worse.

I’m not sure how to address this problem.   At least we should make it clear 
that patch authors are expected to engage *actively* in a conversation about 
why their patch is making something worse, even after it lands.

Simon

From: ghc-devs  On Behalf Of Moritz Angermann
Sent: 17 March 2021 03:00
To: ghc-devs 
Subject: On CI

Hi there!

Just a quick update on our CI situation. Ben, John, Davean and I have been
discussion on CI yesterday, and what we can do about it, as well as some
minor notes on why we are frustrated with it. This is an open invitation to 
anyone who in earnest wants to work on CI. Please come forward and help!
We'd be glad to have more people involved!

First the good news, over the last few weeks we've seen we *can* improve
CI performance quite substantially. And the goal is now to have MR go through
CI within at most 3hs.  There are some ideas on how to make this even faster,
especially on wide (high core count) machines; however that will take a bit more
time.

Now to the more thorny issue: Stat failures.  We do not want GHC to regress,
and I believe everyone is on board with that mission.  Yet we have just 
witnessed a train of marge trials all fail due to a -2% regression in a few 
tests. Thus we've been blocking getting stuff into master for at least another 
day. This is (in my opinion) not acceptable! We just had five days of nothing 
working because master was broken and subsequently all CI pipelines kept 
failing. We have thus effectively wasted a week. While we can mitigate the 
latter part by enforcing marge for all merges to master (and with faster 
pipeline turnaround times this might be more palatable than with 9-12h 
turnaround times -- when you need to get something done! ha!), but that won't 
help us with issues where marge can't find a set of buildable MRs, because she 
just keeps hitting a combination of MRs that somehow together increase or 
decrease metrics.

We have three knobs to adjust:
- Make GHC build faster / make the testsuite run faster.
  There is some rather interesting work going on about parallelizing (earlier)
  during builds. We've also seen that we've wasted enormous amounts of
  time during darwin builds in the kernel, because of a bug in the testdriver.
- Use faster hardware.
  We've seen that just this can cut windows build times from 220min to 80min.
- Reduce the amount of builds.
  We used to build two pipelines for each marge merge, and if either of both
  (see below) failed, marge's merge would fail as well. So not only did we build
  twice as much as we needed, we also increased our chances to hit bogous
  build failures by 2.

We need to do something about this, and I'd advocate for just not making stats 
fail with marge. Build errors of course, but stat failures, no. And then have a 
separate dashboard (and Ben has some old code lying around for this, which 
someone would need to pick up and polish, ...), that tracks GHC's Performance 
for each commit to master, with easy access from the dashboard to the offending 
commit. We will also need to consider the implications of synthetic micro 
benchmarks, as opposed to say building Cabal or other packages, that reflect 
more real-world experience of users using GHC.

I will try to provide a data driven report on GHC's CI on a bi-weekly or month 
(we will have to see what the costs for writing it up

Re: On CI

2021-03-17 Thread Moritz Angermann
*why* is a very good question. The MR fixing it is here:
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/5275

On Wed, Mar 17, 2021 at 4:26 PM Spiwack, Arnaud 
wrote:

> Then I have a question: why are there two pipelines running on each merge
> batch?
>
> On Wed, Mar 17, 2021 at 9:22 AM Moritz Angermann <
> moritz.angerm...@gmail.com> wrote:
>
>> No it wasn't. It was about the stat failures described in the next
>> paragraph. I could have been more clear about that. My apologies!
>>
>> On Wed, Mar 17, 2021 at 4:14 PM Spiwack, Arnaud 
>> wrote:
>>
>>>
>>> and if either of both (see below) failed, marge's merge would fail as
 well.

>>>
>>> Re: “see below” is this referring to a missing part of your email?
>>>
>>


Re: On CI

2021-03-17 Thread Spiwack, Arnaud
Then I have a question: why are there two pipelines running on each merge
batch?

On Wed, Mar 17, 2021 at 9:22 AM Moritz Angermann 
wrote:

> No it wasn't. It was about the stat failures described in the next
> paragraph. I could have been more clear about that. My apologies!
>
> On Wed, Mar 17, 2021 at 4:14 PM Spiwack, Arnaud 
> wrote:
>
>>
>> and if either of both (see below) failed, marge's merge would fail as
>>> well.
>>>
>>
>> Re: “see below” is this referring to a missing part of your email?
>>
>


Re: On CI

2021-03-17 Thread Moritz Angermann
No it wasn't. It was about the stat failures described in the next
paragraph. I could have been more clear about that. My apologies!

On Wed, Mar 17, 2021 at 4:14 PM Spiwack, Arnaud 
wrote:

>
> and if either of both (see below) failed, marge's merge would fail as well.
>>
>
> Re: “see below” is this referring to a missing part of your email?
>


Re: On CI

2021-03-17 Thread Spiwack, Arnaud
> and if either of both (see below) failed, marge's merge would fail as well.
>

Re: “see below” is this referring to a missing part of your email?


On CI

2021-03-16 Thread Moritz Angermann
Hi there!

Just a quick update on our CI situation. Ben, John, Davean and I had a
discussion on CI yesterday: what we can do about it, as well as some
minor notes on why we are frustrated with it. This is an open invitation to
anyone who in earnest wants to work on CI. Please come forward and help!
We'd be glad to have more people involved!

First the good news: over the last few weeks we've seen we *can* improve
CI performance quite substantially, and the goal is now to have an MR go
through CI within at most 3hs.  There are some ideas on how to make this
even faster, especially on wide (high core count) machines; however, that
will take a bit more time.

Now to the more thorny issue: stat failures.  We do not want GHC to regress,
and I believe everyone is on board with that mission.  Yet we have just
witnessed a train of marge trials all failing due to a -2% regression in a
few tests, and thus we've been blocking getting stuff into master for at
least another day. This is (in my opinion) not acceptable! We just had five
days of nothing working because master was broken and subsequently all CI
pipelines kept failing; we have thus effectively wasted a week. We can
mitigate the latter part by enforcing marge for all merges to master
(and with faster pipeline turnaround times this might be more palatable
than with 9-12h turnaround times -- when you need to get something done!
ha!), but that won't help us with issues where marge can't find a set of
buildable MRs, because she just keeps hitting a combination of MRs that
somehow together increase or decrease metrics.

We have three knobs to adjust:
- Make GHC build faster / make the testsuite run faster.
  There is some rather interesting work going on about parallelizing
(earlier)
  during builds. We've also seen that we've wasted enormous amounts of
  time during darwin builds in the kernel, because of a bug in the
testdriver.
- Use faster hardware.
  We've seen that just this can cut windows build times from 220min to
80min.
- Reduce the amount of builds.
  We used to build two pipelines for each marge merge, and if either of both
  (see below) failed, marge's merge would fail as well. So not only did we
build
  twice as much as we needed, we also doubled our chances of hitting bogus
  build failures.

We need to do something about this, and I'd advocate for just not making
stats fail with marge. Build errors, of course, should fail; stat failures,
no. We should then have a separate dashboard (Ben has some old code lying
around for this, which someone would need to pick up and polish, ...) that
tracks GHC's performance for each commit to master, with easy access from
the dashboard to the offending commit. We will also need to consider the
implications of synthetic micro-benchmarks, as opposed to, say, building
Cabal or other packages, which reflect more of the real-world experience of
users of GHC.

Going forward, I will try to provide a data-driven report on GHC's CI on a
bi-weekly or monthly basis (we will have to see what the cost of writing it
up is, and how useful it turns out to be). My sincere hope is that it will
help us better understand our CI situation, instead of just having some
vague complaints about it.

Cheers,
 Moritz


Re: On CI

2021-02-22 Thread John Ericson
I agree one should be able to get most of the testing value from stage1. 
And the tooling team at IOHK has done some work in 
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3652 to allow a 
stage 1 compiler to be tested. That's a very important first step!


But TH and GHCi require either iserv (the external interpreter) or a 
compiler whose own ABI matches the ABI of the code it produces (for the 
internal interpreter), and ideally we should test both. I think doing a 
--freeze1 stage2 build *in addition* to the stage1 build would work in the 
majority of cases, and that would allow us to incrementally build and 
test both. Remember that iserv uses the ghc library and needs to be 
ABI-compatible with the stage1 compiler that is using it, so it is less of 
a panacea than it might seem for ABI changes, as opposed to mere cross 
compilation.


I opened https://github.com/ghc-proposals/ghc-proposals/issues/162 for 
an ABI-agnostic interpreter that would give stage1 alone a third way to do 
GHCi and TH, unconditionally. This would also allow TH to be used safely 
in GHC itself, but for the purposes of this discussion, it's nice 
to make testing more reliable without the --freeze1 stage 2 gamble.


Bottom line is, yes, building stage 2 from a freshly-built stage 1 will 
invalidate any cache, and so we should avoid that.


John

On 2/22/21 8:42 AM, Spiwack, Arnaud wrote:
Let me know if I'm talking nonsense, but I believe that we are 
building both stages for each architecture and flavour. Do we need to 
build two stages everywhere? What stops us from building a single 
stage? And if anything, what can we change to get into a situation 
where we can?


Quite better than reusing build incrementally, is not building at all.

On Mon, Feb 22, 2021 at 10:09 AM Simon Peyton Jones via ghc-devs
 wrote:


Incremental CI can cut multiple hours to < mere minutes,
especially with the test suite being embarrassingly parallel.
There simply no way optimizations to the compiler independent from
sharing a cache between CI runs can get anywhere close to that
return on investment.

I rather agree with this.  I don’t think there is much low-hanging
fruit on compile times, aside from coercion-zapping which we are
working on anyway.  If we got a 10% reduction in compile time we’d
be over the moon, but our users would barely notice.

To get truly substantial improvements (a factor of 2 or 10) I
think we need to do less compiling – hence incremental CI.


Simon

From: ghc-devs  On Behalf Of John Ericson
Sent: 22 February 2021 05:53
To: ghc-devs 
Subject: Re: On CI

I'm not opposed to some effort going into this, but I would
strongly opposite putting all our effort there. Incremental CI can
cut multiple hours to < mere minutes, especially with the test
suite being embarrassingly parallel. There simply no way
optimizations to the compiler independent from sharing a cache
between CI runs can get anywhere close to that return on investment.

(FWIW, I'm also skeptical that the people complaining about GHC
performance know what's hurting them most. For example, after
non-incrementality, the next slowest thing is linking, which
is...not done by GHC! But all that is a separate conversation.)

John

On 2/19/21 2:42 PM, Richard Eisenberg wrote:

There are some good ideas here, but I want to throw out
another one: put all our effort into reducing compile times.
There is a loud plea to do this on Discourse

<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdiscourse.haskell.org%2Ft%2Fcall-for-ideas-forming-a-technical-agenda%2F1901%2F24=04%7C01%7Csimonpj%40microsoft.com%7C9d7043627f5042598e5b08d8d6f648c4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637495701691120329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000=1CV0MEVUZpbAbmKAWTIiqLgjft7IbN%2BCSnvB3W3iX%2FU%3D=0>,
and it would both solve these CI problems and also help
everyone else.

This isn't to say to stop exploring the ideas here. But since
time is mostly fixed, tackling compilation times in general
may be the best way out of this. Ben's survey of other
projects (thanks!) shows that we're way, way behind in how
long our CI takes to run.

Richard



On Feb 19, 2021, at 7:20 AM, Sebastian Graf  wrote:

Recompilation avoidance

    I think in order to cache more in CI, we first have to
invest some time in fixing recompilation avoidance in our
bootstrapped build system.

I just tested on a hadrian perf ticky build: Adding one
line of *comment* in the compiler causes

  * a (

Re: On CI

2021-02-22 Thread Spiwack, Arnaud
Let me know if I'm talking nonsense, but I believe that we are building
both stages for each architecture and flavour. Do we need to build two
stages everywhere? What stops us from building a single stage? And if
anything, what can we change to get into a situation where we can?

Even better than reusing builds incrementally is not building at all.

On Mon, Feb 22, 2021 at 10:09 AM Simon Peyton Jones via ghc-devs <
ghc-devs@haskell.org> wrote:

> Incremental CI can cut multiple hours to < mere minutes, especially with
> the test suite being embarrassingly parallel. There simply no way
> optimizations to the compiler independent from sharing a cache between CI
> runs can get anywhere close to that return on investment.
>
> I rather agree with this.  I don’t think there is much low-hanging fruit
> on compile times, aside from coercion-zapping which we are working on
> anyway.  If we got a 10% reduction in compile time we’d be over the moon,
> but our users would barely notice.
>
>
>
> To get truly substantial improvements (a factor of 2 or 10) I think we
> need to do less compiling – hence incremental CI.
>
>
> Simon
>
>
>
> From: ghc-devs  On Behalf Of John Ericson
> Sent: 22 February 2021 05:53
> To: ghc-devs 
> Subject: Re: On CI
>
>
>
> I'm not opposed to some effort going into this, but I would strongly
> opposite putting all our effort there. Incremental CI can cut multiple
> hours to < mere minutes, especially with the test suite being
> embarrassingly parallel. There simply no way optimizations to the compiler
> independent from sharing a cache between CI runs can get anywhere close to
> that return on investment.
>
> (FWIW, I'm also skeptical that the people complaining about GHC
> performance know what's hurting them most. For example, after
> non-incrementality, the next slowest thing is linking, which is...not done
> by GHC! But all that is a separate conversation.)
>
> John
>
> On 2/19/21 2:42 PM, Richard Eisenberg wrote:
>
> There are some good ideas here, but I want to throw out another one: put
> all our effort into reducing compile times. There is a loud plea to do this
> on Discourse
> <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdiscourse.haskell.org%2Ft%2Fcall-for-ideas-forming-a-technical-agenda%2F1901%2F24=04%7C01%7Csimonpj%40microsoft.com%7C9d7043627f5042598e5b08d8d6f648c4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637495701691120329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000=1CV0MEVUZpbAbmKAWTIiqLgjft7IbN%2BCSnvB3W3iX%2FU%3D=0>,
> and it would both solve these CI problems and also help everyone else.
>
>
>
> This isn't to say to stop exploring the ideas here. But since time is
> mostly fixed, tackling compilation times in general may be the best way out
> of this. Ben's survey of other projects (thanks!) shows that we're way, way
> behind in how long our CI takes to run.
>
>
>
> Richard
>
>
>
> On Feb 19, 2021, at 7:20 AM, Sebastian Graf  wrote:
>
>
>
> Recompilation avoidance
>
>
>
> I think in order to cache more in CI, we first have to invest some time in
> fixing recompilation avoidance in our bootstrapped build system.
>
>
>
> I just tested on a hadrian perf ticky build: Adding one line of *comment*
> in the compiler causes
>
>- a (pretty slow, yet negligible) rebuild of the stage1 compiler
>- 2 minutes of RTS rebuilding (Why do we have to rebuild the RTS? It
>doesn't depend in any way on the change I made)
>- apparent full rebuild the libraries
>- apparent full rebuild of the stage2 compiler
>
> That took 17 minutes, a full build takes ~45minutes. So there definitely
> is some caching going on, but not nearly as much as there could be.
>
> I know there have been great and boring efforts on compiler determinism in
> the past, but either it's not good enough or our build system needs fixing.
>
> I think a good first step to assert would be to make sure that the hash of
> the stage1 compiler executable doesn't change if I only change a comment.
>
> I'm aware there probably is stuff going on, like embedding configure dates
> in interface files and executables, that would need to go, but if possible
> this would be a huge improvement.
>
>
>
> On the other hand, we can simply tack on a [skip ci] to the commit
> message, as I did for
> https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4975
> <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.haskell.org%2Fghc%2Fghc%2F-%2Fmerge_requests%2F4975=04%7C01%7Csimonpj%40microsoft.com%7C9d7043627f5042598e5b08d8d6f648c4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637495701691130329

RE: On CI

2021-02-22 Thread Simon Peyton Jones via ghc-devs
Incremental CI can cut multiple hours to < mere minutes, especially with the 
test suite being embarrassingly parallel. There simply no way optimizations to 
the compiler independent from sharing a cache between CI runs can get anywhere 
close to that return on investment.
I rather agree with this.  I don't think there is much low-hanging fruit on 
compile times, aside from coercion-zapping which we are working on anyway.  If 
we got a 10% reduction in compile time we'd be over the moon, but our users 
would barely notice.

To get truly substantial improvements (a factor of 2 or 10) I think we need to 
do less compiling - hence incremental CI.

Simon

From: ghc-devs  On Behalf Of John Ericson
Sent: 22 February 2021 05:53
To: ghc-devs 
Subject: Re: On CI


I'm not opposed to some effort going into this, but I would strongly opposite 
putting all our effort there. Incremental CI can cut multiple hours to < mere 
minutes, especially with the test suite being embarrassingly parallel. There 
simply no way optimizations to the compiler independent from sharing a cache 
between CI runs can get anywhere close to that return on investment.

(FWIW, I'm also skeptical that the people complaining about GHC performance 
know what's hurting them most. For example, after non-incrementality, the next 
slowest thing is linking, which is...not done by GHC! But all that is a 
separate conversation.)

John
On 2/19/21 2:42 PM, Richard Eisenberg wrote:
There are some good ideas here, but I want to throw out another one: put all 
our effort into reducing compile times. There is a loud plea to do this on 
Discourse<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdiscourse.haskell.org%2Ft%2Fcall-for-ideas-forming-a-technical-agenda%2F1901%2F24=04%7C01%7Csimonpj%40microsoft.com%7C9d7043627f5042598e5b08d8d6f648c4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637495701691120329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000=1CV0MEVUZpbAbmKAWTIiqLgjft7IbN%2BCSnvB3W3iX%2FU%3D=0>,
 and it would both solve these CI problems and also help everyone else.

This isn't to say to stop exploring the ideas here. But since time is mostly 
fixed, tackling compilation times in general may be the best way out of this. 
Ben's survey of other projects (thanks!) shows that we're way, way behind in 
how long our CI takes to run.

Richard


On Feb 19, 2021, at 7:20 AM, Sebastian Graf  wrote:

Recompilation avoidance

I think in order to cache more in CI, we first have to invest some time in 
fixing recompilation avoidance in our bootstrapped build system.

I just tested on a hadrian perf ticky build: Adding one line of *comment* in 
the compiler causes

  *   a (pretty slow, yet negligible) rebuild of the stage1 compiler
  *   2 minutes of RTS rebuilding (Why do we have to rebuild the RTS? It 
doesn't depend in any way on the change I made)
  *   apparent full rebuild the libraries
  *   apparent full rebuild of the stage2 compiler
That took 17 minutes, a full build takes ~45minutes. So there definitely is 
some caching going on, but not nearly as much as there could be.
I know there have been great and boring efforts on compiler determinism in the 
past, but either it's not good enough or our build system needs fixing.
I think a good first step to assert would be to make sure that the hash of the 
stage1 compiler executable doesn't change if I only change a comment.
I'm aware there probably is stuff going on, like embedding configure dates in 
interface files and executables, that would need to go, but if possible this 
would be a huge improvement.

On the other hand, we can simply tack on a [skip ci] to the commit message, as 
I did for 
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4975<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.haskell.org%2Fghc%2Fghc%2F-%2Fmerge_requests%2F4975=04%7C01%7Csimonpj%40microsoft.com%7C9d7043627f5042598e5b08d8d6f648c4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637495701691130329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000=bgT0LeZXjF%2BMklzctvZL6WaVpaddN7%2FSpojcEXGXv7Q%3D=0>.
 Variants like [skip tests] or [frontend] could help to identify which tests to 
run by default.

Lean

I had a chat with a colleague about how they do CI for Lean. Apparently, CI 
turnaround time including tests is generally 25 minutes (~15 minutes for the 
build) for a complete pipeline, testing 6 different OSes and configurations in 
parallel: 
https://github.com/leanprover/lean4/actions/workflows/ci.yml<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fleanprover%2Flean4%2Factions%2Fworkflows%2Fci.yml=04%7C01%7Csimonpj%40microsoft.com%7C9d7043627f5042598e5b08d8d6f648c4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637495701691140326%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik

Re: On CI

2021-02-21 Thread John Ericson
I'm not opposed to some effort going into this, but I would strongly 
oppose putting all our effort there. Incremental CI can cut multiple 
hours down to mere minutes, especially with the test suite being 
embarrassingly parallel. There is simply no way that optimizations to the 
compiler, independent of sharing a cache between CI runs, can get 
anywhere close to that return on investment.


(FWIW, I'm also skeptical that the people complaining about GHC 
performance know what's hurting them most. For example, after 
non-incrementality, the next slowest thing is linking, which is...not 
done by GHC! But all that is a separate conversation.)


John

On 2/19/21 2:42 PM, Richard Eisenberg wrote:
There are some good ideas here, but I want to throw out another one: 
put all our effort into reducing compile times. There is a loud plea 
to do this on Discourse 
<https://discourse.haskell.org/t/call-for-ideas-forming-a-technical-agenda/1901/24>, 
and it would both solve these CI problems and also help everyone else.


This isn't to say to stop exploring the ideas here. But since time is 
mostly fixed, tackling compilation times in general may be the best 
way out of this. Ben's survey of other projects (thanks!) shows that 
we're way, way behind in how long our CI takes to run.


Richard

On Feb 19, 2021, at 7:20 AM, Sebastian Graf  wrote:


Recompilation avoidance

I think in order to cache more in CI, we first have to invest some 
time in fixing recompilation avoidance in our bootstrapped build system.


I just tested on a hadrian perf ticky build: Adding one line of 
*comment* in the compiler causes


  * a (pretty slow, yet negligible) rebuild of the stage1 compiler
  * 2 minutes of RTS rebuilding (Why do we have to rebuild the RTS?
It doesn't depend in any way on the change I made)
  * apparent full rebuild the libraries
  * apparent full rebuild of the stage2 compiler

That took 17 minutes, a full build takes ~45minutes. So there 
definitely is some caching going on, but not nearly as much as there 
could be.
I know there have been great and boring efforts on compiler 
determinism in the past, but either it's not good enough or our build 
system needs fixing.
I think a good first step to assert would be to make sure that the 
hash of the stage1 compiler executable doesn't change if I only 
change a comment.
I'm aware there probably is stuff going on, like embedding configure 
dates in interface files and executables, that would need to go, but 
if possible this would be a huge improvement.


On the other hand, we can simply tack on a [skip ci] to the commit 
message, as I did for 
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4975 
<https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4975>. Variants 
like [skip tests] or [frontend] could help to identify which tests to 
run by default.


Lean

I had a chat with a colleague about how they do CI for Lean. 
Apparently, CI turnaround time including tests is generally 25 
minutes (~15 minutes for the build) for a complete pipeline, testing 
6 different OSes and configurations in parallel: 
https://github.com/leanprover/lean4/actions/workflows/ci.yml 
<https://github.com/leanprover/lean4/actions/workflows/ci.yml>
They utilise ccache to cache the clang-based C++-backend, so that 
they only have to re-run the front- and middle-end. In effect, they 
take advantage of the fact that the "function" clang, in contrast to 
the "function" stage1 compiler, stays the same.
It's hard to achieve that for GHC, where a complete compiler pipeline 
comes as one big, fused "function": An external tool can never be 
certain that a change to Parser.y could not affect the CodeGen phase.


Inspired by Lean, the following is a bit inconcrete and imaginary, 
but maybe we could make it so that compiler phases "sign" parts of 
the interface file with the binary hash of the respective 
subcomponents of the phase?
E.g., if all the object files that influence CodeGen (that will later 
be linked into the stage1 compiler) result in a hash of 0xdeadbeef 
before and after the change to Parser.y, we know we can stop 
recompiling Data.List with the stage1 compiler when we see that the 
IR passed to CodeGen didn't change, because the last compile did 
CodeGen with a stage1 compiler with the same hash 0xdeadbeef. The 
0xdeadbeef hash is a proxy for saying "the function CodeGen stayed 
the same", so we can reuse its cached outputs.
Of course, that is utopian without a tool that does the "taint 
analysis" of which modules in GHC influence CodeGen. Probably just 
including all the transitive dependencies of GHC.CmmToAsm suffices, 
but probably that's too crude already. For another example, a change 
to GHC.Utils.Unique would probably entail a full rebuild of the 
compiler because it basically affects all compiler phases.
There are probably parallels with recompilation avoidance in a language with staged meta-programming.

Re: On CI

2021-02-19 Thread Richard Eisenberg
There are some good ideas here, but I want to throw out another one: put all 
our effort into reducing compile times. There is a loud plea to do this on 
Discourse 
<https://discourse.haskell.org/t/call-for-ideas-forming-a-technical-agenda/1901/24>,
 and it would both solve these CI problems and also help everyone else.

This isn't to say we should stop exploring the ideas here. But since time is mostly 
fixed, tackling compilation times in general may be the best way out of this. 
Ben's survey of other projects (thanks!) shows that we're way, way behind in 
how long our CI takes to run.

Richard

> On Feb 19, 2021, at 7:20 AM, Sebastian Graf  wrote:
> 
> Recompilation avoidance
> 
> I think in order to cache more in CI, we first have to invest some time in 
> fixing recompilation avoidance in our bootstrapped build system.
> 
> I just tested on a hadrian perf ticky build: Adding one line of *comment* in 
> the compiler causes
> a (pretty slow, yet negligible) rebuild of the stage1 compiler
> 2 minutes of RTS rebuilding (Why do we have to rebuild the RTS? It doesn't 
> depend in any way on the change I made)
> apparent full rebuild of the libraries
> apparent full rebuild of the stage2 compiler
> That took 17 minutes; a full build takes ~45 minutes. So there definitely is 
> some caching going on, but not nearly as much as there could be.
> I know there have been great and boring efforts on compiler determinism in 
> the past, but either it's not good enough or our build system needs fixing.
> I think a good first step would be to assert that the hash of the stage1 
> compiler executable doesn't change if I only change a comment.
> I'm aware there probably is stuff going on, like embedding configure dates in 
> interface files and executables, that would need to go, but if possible this 
> would be a huge improvement.
> 
> On the other hand, we can simply tack on a [skip ci] to the commit message, 
> as I did for https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4975. Variants like 
> [skip tests] or [frontend] could help to identify which tests to run by 
> default.
> 
> Lean
> 
> I had a chat with a colleague about how they do CI for Lean. Apparently, CI 
> turnaround time including tests is generally 25 minutes (~15 minutes for the 
> build) for a complete pipeline, testing 6 different OSes and configurations 
> in parallel: https://github.com/leanprover/lean4/actions/workflows/ci.yml
> They utilise ccache to cache the clang-based C++-backend, so that they only 
> have to re-run the front- and middle-end. In effect, they take advantage of 
> the fact that the "function" clang, in contrast to the "function" stage1 
> compiler, stays the same.
> It's hard to achieve that for GHC, where a complete compiler pipeline comes 
> as one big, fused "function": An external tool can never be certain that a 
> change to Parser.y could not affect the CodeGen phase.
> 
> Inspired by Lean, the following is a bit abstract and imaginary, but maybe 
> we could make it so that compiler phases "sign" parts of the interface file 
> with the binary hash of the respective subcomponents of the phase?
> E.g., if all the object files that influence CodeGen (that will later be 
> linked into the stage1 compiler) result in a hash of 0xdeadbeef before and 
> after the change to Parser.y, we know we can stop recompiling Data.List with 
> the stage1 compiler when we see that the IR passed to CodeGen didn't change, 
> because the last compile did CodeGen with a stage1 compiler with the same 
> hash 0xdeadbeef. The 0xdeadbeef hash is a proxy for saying "the function 
> CodeGen stayed the same", so we can reuse its cached outputs.
> Of course, that is utopian without a tool that does the "taint analysis" of 
> which modules in GHC influence CodeGen. Probably just including all the 
> transitive dependencies of GHC.CmmToAsm suffices, but probably that's too 
> crude already. For another example, a change to GHC.Utils.Unique would 
> probably entail a full rebuild of the compiler because it basically affects 
> all compiler phases.
> There are probably parallels with recompilation avoidance in a language with 
> staged meta-programming.
> 
> On Fri, 19 Feb 2021 at 11:42, Josef Svenningsson via ghc-devs 
> <ghc-devs@haskell.org> wrote:
> Doing "optimistic caching" like you suggest sounds very promising. A way to 
> regain more robustness would be as follows.
> If the build fails while building the libraries or the stage2 compiler, this 
> might be a false negative due to the optimistic caching. Th

Re: On CI

2021-02-19 Thread Sebastian Graf
Recompilation avoidance

I think in order to cache more in CI, we first have to invest some time in
fixing recompilation avoidance in our bootstrapped build system.

I just tested on a hadrian perf ticky build: Adding one line of *comment*
in the compiler causes

   - a (pretty slow, yet negligible) rebuild of the stage1 compiler
   - 2 minutes of RTS rebuilding (Why do we have to rebuild the RTS? It
   doesn't depend in any way on the change I made)
   - apparent full rebuild of the libraries
   - apparent full rebuild of the stage2 compiler

That took 17 minutes; a full build takes ~45 minutes. So there definitely is
some caching going on, but not nearly as much as there could be.
I know there have been great and boring efforts on compiler determinism in
the past, but either it's not good enough or our build system needs fixing.
I think a good first step would be to assert that the hash of the stage1
compiler executable doesn't change if I only change a comment.
I'm aware there probably is stuff going on, like embedding configure dates
in interface files and executables, that would need to go, but if possible
this would be a huge improvement.
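
As a rough illustration of that first step, the check could be as simple as
building stage1 twice around a comment-only change and comparing hashes. The
hadrian invocation, the stage1 binary path and the module touched below are
assumptions for the sake of the sketch, not the actual layout of a GHC
checkout:

    #!/usr/bin/env python3
    # Minimal sketch: build stage1, hash it, add a comment-only change,
    # rebuild, and compare. Paths and the hadrian target are assumed.
    import hashlib, subprocess, sys

    STAGE1 = "_build/stage1/bin/ghc"               # assumed output path
    PROBE  = "compiler/GHC/Core/Opt/Simplify.hs"   # any compiler module

    def build():
        subprocess.run(["hadrian/build", "-j", STAGE1], check=True)

    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    build()
    before = sha256(STAGE1)

    with open(PROBE, "a") as f:
        f.write("\n-- determinism probe, comment only\n")

    build()
    after = sha256(STAGE1)

    print("before:", before)
    print("after: ", after)
    sys.exit(0 if before == after else 1)

If that exit code can be made reliably zero, caching stage1 (and everything
built with it) across CI jobs becomes much more attractive.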

On the other hand, we can simply tack on a [skip ci] to the commit message,
as I did for https://gitlab.haskell.org/ghc/ghc/-/merge_requests/4975.
Variants like [skip tests] or [frontend] could help to identify which tests
to run by default.
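
To sketch how such tags could be interpreted on the CI side (the tag names and
the mapping to test selections below are invented for illustration, not an
existing GHC convention):

    # Sketch: read the head commit message and decide what to build and test.
    # The tags and the test selections are invented for illustration.
    import subprocess

    def head_commit_message():
        return subprocess.run(["git", "log", "-1", "--pretty=%B"],
                              check=True, capture_output=True,
                              text=True).stdout.lower()

    def plan(msg):
        if "[skip ci]" in msg:
            return {"build": False, "tests": []}
        if "[skip tests]" in msg:
            return {"build": True, "tests": []}
        if "[frontend]" in msg:
            # Hypothetical subset of the testsuite for frontend-only changes.
            return {"build": True,
                    "tests": ["testsuite/tests/parser",
                              "testsuite/tests/rename",
                              "testsuite/tests/typecheck"]}
        return {"build": True, "tests": ["everything"]}

    if __name__ == "__main__":
        print(plan(head_commit_message()))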

Lean

I had a chat with a colleague about how they do CI for Lean. Apparently, CI
turnaround time including tests is generally 25 minutes (~15 minutes for
the build) for a complete pipeline, testing 6 different OSes and
configurations in parallel:
https://github.com/leanprover/lean4/actions/workflows/ci.yml
They utilise ccache to cache the clang-based C++-backend, so that they only
have to re-run the front- and middle-end. In effect, they take advantage of
the fact that the "function" clang, in contrast to the "function" stage1
compiler, stays the same.
It's hard to achieve that for GHC, where a complete compiler pipeline comes
as one big, fused "function": An external tool can never be certain that a
change to Parser.y could not affect the CodeGen phase.

Inspired by Lean, the following is a bit abstract and imaginary, but
maybe we could make it so that compiler phases "sign" parts of the
interface file with the binary hash of the respective subcomponents of the
phase?
E.g., if all the object files that influence CodeGen (that will later be
linked into the stage1 compiler) result in a hash of 0xdeadbeef before and
after the change to Parser.y, we know we can stop recompiling Data.List
with the stage1 compiler when we see that the IR passed to CodeGen didn't
change, because the last compile did CodeGen with a stage1 compiler with
the same hash 0xdeadbeef. The 0xdeadbeef hash is a proxy for saying "the
function CodeGen stayed the same", so we can reuse its cached outputs.
Of course, that is utopian without a tool that does the "taint analysis" of
which modules in GHC influence CodeGen. Probably just including all the
transitive dependencies of GHC.CmmToAsm suffices, but probably that's too
crude already. For another example, a change to GHC.Utils.Unique would
probably entail a full rebuild of the compiler because it basically affects
all compiler phases.
There are probably parallels with recompilation avoidance in a language
with staged meta-programming.
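
To make the hashing idea a little more concrete, here is a toy sketch of
keying a phase's cached outputs on the hash of the object files that make up
that phase. Which object files belong to which phase (the "taint analysis"
above) is exactly the part that is hand-waved away here:

    # Toy sketch of "sign a phase with the hash of its subcomponents" and use
    # that hash, together with the hash of the IR fed to the phase, as a
    # cache key. The attribution of object files to phases is assumed.
    import hashlib, os, shutil

    def phase_hash(object_files):
        h = hashlib.sha256()
        for path in sorted(object_files):
            with open(path, "rb") as f:
                h.update(f.read())
        return h.hexdigest()

    def lookup(cache_dir, phase, phase_h, input_h):
        path = os.path.join(cache_dir, f"{phase}-{phase_h}-{input_h}")
        return path if os.path.exists(path) else None

    def record(cache_dir, phase, phase_h, input_h, produced):
        os.makedirs(cache_dir, exist_ok=True)
        shutil.copyfile(produced,
                        os.path.join(cache_dir, f"{phase}-{phase_h}-{input_h}"))

If the CodeGen hash and the IR fed to it are both unchanged, the cached object
file can be reused instead of re-running the backend.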

On Fri, 19 Feb 2021 at 11:42, Josef Svenningsson via ghc-devs <
ghc-devs@haskell.org> wrote:

> Doing "optimistic caching" like you suggest sounds very promising. A way
> to regain more robustness would be as follows.
> If the build fails while building the libraries or the stage2 compiler,
> this might be a false negative due to the optimistic caching. Therefore,
> evict the "optimistic caches" and restart building the libraries. That way
> we can validate that the build failure was a true build failure and not
> just due to the aggressive caching scheme.
>
> Just my 2p
>
> Josef
>
> --
> *From:* ghc-devs  on behalf of Simon Peyton
> Jones via ghc-devs 
> *Sent:* Friday, February 19, 2021 8:57 AM
> *To:* John Ericson ; ghc-devs <
> ghc-devs@haskell.org>
> *Subject:* RE: On CI
>
>
>1. Building and testing happen together. When tests fail
>spuriously, we also have to rebuild GHC in addition to re-running the
>tests. That's pure waste.
>https://gitlab.haskell.org/ghc/ghc/-/issues/13897
>

Re: On CI

2021-02-19 Thread Josef Svenningsson via ghc-devs
Doing "optimistic caching" like you suggest sounds very promising. A way to 
regain more robustness would be as follows.
If the build fails while building the libraries or the stage2 compiler, this 
might be a false negative due to the optimistic caching. Therefore, evict the 
"optimistic caches" and restart building the libraries. That way we can 
validate that the build failure was a true build failure and not just due to 
the aggressive caching scheme.

Just my 2p

Josef


From: ghc-devs  on behalf of Simon Peyton Jones 
via ghc-devs 
Sent: Friday, February 19, 2021 8:57 AM
To: John Ericson ; ghc-devs 

Subject: RE: On CI


  1.  Building and testing happen together. When tests fail spuriously, we 
also have to rebuild GHC in addition to re-running the tests. That's pure 
waste. https://gitlab.haskell.org/ghc/ghc/-/issues/13897 tracks this more or 
less.

I don’t get this.  We have to build GHC before we can test it, don’t we?

2.  We don't cache between jobs.

This is, I think, the big one.   We endlessly build the exact same binaries.

There is a problem, though.  If we make *any* change in GHC, even a trivial 
refactoring, its binary will change slightly.  So now any caching build system 
will assume that anything built by that GHC must be rebuilt – we can’t use the 
cached version.  That includes all the libraries and the stage2 compiler.  So 
caching can save all the preliminaries (building the initial Cabal, and large 
chunk of stage1, since they are built with the same bootstrap compiler) but 
after that we are dead.

I don’t know any robust way out of this.  That small change in the source code 
of GHC might be trivial refactoring, or it might introduce a critical 
mis-compilation which we really want to see in its build products.

However, for smoke-testing MRs, on every architecture, we could perhaps cut 
corners.  (Leaving Marge to do full diligence.)  For example, we could declare 
that if we have the result of compiling library module X.hs with the stage1 GHC 
in the last full commit in master, then we can re-use that build product rather 
than compiling X.hs with the MR’s slightly modified stage1 GHC.  That *might* 
be wrong; but it’s usually right.

Anyway, there are big wins to be had here.

Simon







From: ghc-devs  On Behalf Of John Ericson
Sent: 19 February 2021 03:19
To: ghc-devs 
Subject: Re: On CI



I am also wary of us deferring checking whole platforms and what not. I 
think that's just kicking the can down the road, and will result in more 
variance and uncertainty. It might be alright for those authoring PRs, but it 
will make Ben's job keeping the system running even more grueling.

Before getting into these complex trade-offs, I think we should focus on the 
cornerstone issue that CI isn't incremental.

  1.  Building and testing happen together. When tests fail spuriously, we 
also have to rebuild GHC in addition to re-running the tests. That's pure 
waste. https://gitlab.haskell.org/ghc/ghc/-/issues/13897 tracks this more or 
less.
  2.  We don't cache between jobs. Shake and Make do not enforce dependency 
soundness, nor cache-correctness when the build plan itself changes, and this 
has made it hard/impossible to do safely. Naively this only helps with stage 
1 and not stage 2, but if we have separate stage 1 and --freeze1 stage 2 
builds, both can be incremental. Yes, this is also lossy, but I only see it 
leading to false failures, not false acceptances (if we can also test the stage 
1 build), so I consider it safe. MRs that only work with a slow full build 
because of ABI changes can so indicate.

The second, main part is quite hard to tackle, but I strongly believe 
incrementality is what we need most, and what we should remain focused on.

John
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


RE: On CI

2021-02-19 Thread Ben Gamari
Simon Peyton Jones via ghc-devs  writes:

>>   1. Building and testing happen together. When tests fail
>>   spuriously, we also have to rebuild GHC in addition to re-running
>>   the tests. That's pure waste.
>>   https://gitlab.haskell.org/ghc/ghc/-/issues/13897 tracks this more
>>   or less.

> I don't get this.  We have to build GHC before we can test it, don't we?

>> 2.  We don't cache between jobs.

> This is, I think, the big one.   We endlessly build the exact same binaries.
> There is a problem, though. If we make *any* change in GHC, even a
> trivial refactoring, its binary will change slightly. So now any
> caching build system will assume that anything built by that GHC must
> be rebuilt - we can't use the cached version. That includes all the
> libraries and the stage2 compiler. So caching can save all the
> preliminaries (building the initial Cabal, and large chunk of stage1,
> since they are built with the same bootstrap compiler) but after that
> we are dead.
>
> I don't know any robust way out of this. That small change in the
> source code of GHC might be trivial refactoring, or it might introduce
> a critical mis-compilation which we really want to see in its build
> products.
>
> However, for smoke-testing MRs, on every architecture, we could
> perhaps cut corners. (Leaving Marge to do full diligence.) For
> example, we could declare that if we have the result of compiling
> library module X.hs with the stage1 GHC in the last full commit in
> master, then we can re-use that build product rather than compiling
> X.hs with the MR's slightly modified stage1 GHC. That *might* be
> wrong; but it's usually right.
>
The question is: what happens if it *is* wrong?

There are three answers here:

 a. Allowing the build pipeline to pass despite a build/test failure
eliminates most of the benefit of running the job to begin with as
allow-failure jobs tend to be ignored.

 b. Making the pipeline fail leaves the contributor to pick up the pieces of a
failure that they may or may not be responsible for, which sounds
frustrating indeed.

 c. Retry the build, but this time from scratch. This is a tantalizing option
but carries the risk that we end up doing *more* work than we do now
(namely, if all jobs end up running both builds)

The only tenable option here in my opinion is (c). It's ugly, but may be
viable.
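
A small sketch of what (c) could look like as a wrapper around the build step;
the build command and the cache location are placeholders, since they depend
on how the caching would actually be implemented:

    # Sketch of option (c): try the optimistically cached build first; on
    # failure, evict the cache and retry from scratch so a failure is never
    # blamed on stale build products. Commands and paths are placeholders.
    import shutil, subprocess, sys

    CACHE_DIR = "_build"                         # placeholder cache location
    BUILD_CMD = ["hadrian/build", "-j", "test"]  # placeholder build+test step

    def run():
        return subprocess.run(BUILD_CMD).returncode

    rc = run()
    if rc != 0:
        print("cached build failed; evicting cache and rebuilding from scratch")
        shutil.rmtree(CACHE_DIR, ignore_errors=True)
        rc = run()

    sys.exit(rc)

The worst case is indeed two full builds, which is why this only pays off if
the cached path succeeds most of the time.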

Cheers,

- Ben



signature.asc
Description: PGP signature
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


RE: On CI

2021-02-19 Thread Simon Peyton Jones via ghc-devs
  1.  Building and testing happen together. When tests fail spuriously, we 
also have to rebuild GHC in addition to re-running the tests. That's pure 
waste. https://gitlab.haskell.org/ghc/ghc/-/issues/13897 tracks this more or 
less.
I don't get this.  We have to build GHC before we can test it, don't we?
2.  We don't cache between jobs.
This is, I think, the big one.   We endlessly build the exact same binaries.
There is a problem, though.  If we make *any* change in GHC, even a trivial 
refactoring, its binary will change slightly.  So now any caching build system 
will assume that anything built by that GHC must be rebuilt - we can't use the 
cached version.  That includes all the libraries and the stage2 compiler.  So 
caching can save all the preliminaries (building the initial Cabal, and large 
chunk of stage1, since they are built with the same bootstrap compiler) but 
after that we are dead.
I don't know any robust way out of this.  That small change in the source code 
of GHC might be trivial refactoring, or it might introduce a critical 
mis-compilation which we really want to see in its build products.
However, for smoke-testing MRs, on every architecture, we could perhaps cut 
corners.  (Leaving Marge to do full diligence.)  For example, we could declare 
that if we have the result of compiling library module X.hs with the stage1 GHC 
in the last full commit in master, then we can re-use that build product rather 
than compiling X.hs with the MR's slightly modified stage1 GHC.  That *might* 
be wrong; but it's usually right.
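
As a toy sketch of that corner-cutting rule (reuse a module's build product
from the last full master build whenever its source is unchanged, even though
the MR's stage1 compiler differs slightly); the cache layout and the hashing
are invented for the example, and the whole thing is knowingly unsound, which
is why Marge would still do full diligence:

    # Toy sketch: reuse X.hs's object file from the last full master build if
    # the source is unchanged, skipping recompilation with the MR's stage1.
    # Cache layout and the notion of "unchanged" are deliberately simplistic.
    import hashlib, os, shutil

    def source_hash(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def maybe_reuse(module_src, object_out, master_cache):
        cached = os.path.join(master_cache, source_hash(module_src) + ".o")
        if os.path.exists(cached):
            shutil.copyfile(cached, object_out)   # reuse master's build product
            return True
        return False                              # fall back to a real compile
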
Anyway, there are big wins to be had here.
Simon



From: ghc-devs  On Behalf Of John Ericson
Sent: 19 February 2021 03:19
To: ghc-devs 
Subject: Re: On CI


I am also wary of us deferring checking whole platforms and what not. I 
think that's just kicking the can down the road, and will result in more 
variance and uncertainty. It might be alright for those authoring PRs, but it 
will make Ben's job keeping the system running even more grueling.

Before getting into these complex trade-offs, I think we should focus on the 
cornerstone issue that CI isn't incremental.

  1.  Building and testing happen together. When tests fail spuriously, we 
also have to rebuild GHC in addition to re-running the tests. That's pure 
waste. https://gitlab.haskell.org/ghc/ghc/-/issues/13897 tracks this more or 
less.
  2.  We don't cache between jobs. Shake and Make do not enforce dependency 
soundness, nor cache-correctness when the build plan itself changes, and this 
has made it hard/impossible to do safely. Naively this only helps with stage 
1 and not stage 2, but if we have separate stage 1 and --freeze1 stage 2 
builds, both can be incremental. Yes, this is also lossy, but I only see it 
leading to false failures, not false acceptances (if we can also test the stage 
1 build), so I consider it safe. MRs that only work with a slow full build 
because of ABI changes can so indicate.
The second, main part is quite hard to tackle, but I strongly believe 
incrementality is what we need most, and what we should remain focused on.

John
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-02-18 Thread John Ericson
I am also wary of us deferring checking whole platforms and what not. 
I think that's just kicking the can down the road, and will result in 
more variance and uncertainty. It might be alright for those authoring 
PRs, but it will make Ben's job keeping the system running even more 
grueling.


Before getting into these complex trade-offs, I think we should focus on 
the cornerstone issue that CI isn't incremental.


1. Building and testing happen together. When tests fail spuriously,
   we also have to rebuild GHC in addition to re-running the tests.
   That's pure waste. https://gitlab.haskell.org/ghc/ghc/-/issues/13897
   tracks this more or less.
2. We don't cache between jobs. Shake and Make do not enforce
   dependency soundness, nor cache-correctness when the build plan
   itself changes, and this has made it hard/impossible to do safely.
   Naively this only helps with stage 1 and not stage 2, but if we have
   separate stage 1 and --freeze1 stage 2 builds, both can be
   incremental. Yes, this is also lossy, but I only see it leading to
   false failures, not false acceptances (if we can also test the stage
   1 build), so I consider it safe. MRs that only work with a slow full
   build because of ABI changes can so indicate.

The second, main part is quite hard to tackle, but I strongly believe 
incrementality is what we need most, and what we should remain focused on.


John

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-02-18 Thread Ben Gamari
Moritz Angermann  writes:

> At this point I believe we have ample Linux build capacity. Darwin looks
> pretty good as well; the ~4 M1s we have should in principle also be able to
> build x86_64-darwin at acceptable speeds. Although on Big Sur only.
>
> The aarch64-Linux story is a bit constrained by the availability of powerful
> and fast CI machines but probably bearable for the time being. I doubt anyone
> really
> looks at those jobs anyway as they are permitted to fail.

For the record, I look at this once in a while to make sure that they
haven't broken (and usually pick off one or two failures in the
process).

> If aarch64 would become a bottleneck, I’d be inclined to just disable
> them. With the NCG soon this will likely become much more bearable as
> well, even though we might want to run the nightly llvm builds.
>
> To be frank, I don’t see 9.2 happening in two weeks with the current CI.
>
I'm not sure what you mean. Is this in reference to your own 9.2-slated
work or the release as a whole?

Cheers,

- Ben


signature.asc
Description: PGP signature
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: On CI

2021-02-18 Thread Ben Gamari
Apologies for the latency here. This thread has required a fair amount of
reflection.

Sebastian Graf  writes:

> Hi Moritz,
>
> I, too, had my gripes with CI turnaround times in the past. Here's a
> somewhat radical proposal:
>
>- Run "full-build" stage builds only on Marge MRs. Then we can assign to
>Marge much earlier, but probably have to do a bit more of (manual)
>bisecting of spoiled Marge batches.
>   - I hope this gets rid of a bit of the friction of small MRs. I
>   recently caught myself wanting to do a bunch of small, independent, but
>   related changes as part of the same MR, simply because it's such a 
> hassle
>   to post them in individual MRs right now and also because it
>   steals so much CI capacity.
>
>- Regular MRs should still have the ability to easily run individual
>builds of what is now the "full-build" stage, similar to how we can run
>optional "hackage" builds today. This is probably useful to pin down the
>reason for a spoiled Marge batch.


I am torn here. For most of my non-trivial patches I personally don't
mind long turnarounds: I walk away and return a day later to see whether
anything failed. Spurious failures due to fragile tests make this a bit
tiresome, but this is a problem that we are gradually solving (by fixing
bugs and marking tests as fragile).

However, I agree that small MRs are currently rather painful. On the
other hand, diagnosing failed Marge batches is *also* rather tiresome. I
am worried that by deferring full validation of MRs we will only
exacerbate this problem. Furthermore, I worry that by deferring full
validation we run the risk of rather *increasing* the MR turnaround
time, since there are entire classes of issues that wouldn't be caught
until the MR made it to Marge.

Ultimately it's unclear to me whether this proposal would help or hurt.
Nevertheless, I am willing to try it. However, if we go this route we
should consider what can be done to reduce the incidence of failed Marge
batches.

One problem that I'm particularly worried about is that of tests with
OS-dependent expected output (e.g. `$test_name.stdout-mingw32`). I find
that people (understandably) forget to update these when updating test
output. I suspect that this will be a frequent source of failed Marge
batches if we defer full validation. I can see a few ways that would
mitigate this:

 * eliminate platform-dependent output files
 * introduce a linter that fails if a patch touches a test with
   platform-dependent output but doesn't update all of its output files
   (a rough sketch follows after this list)
 * always run the full-build stage on MRs that touch tests with
   platform-dependent output files
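
A rough sketch of that linter (how the changed files are obtained and the set
of platform suffixes are assumptions made for the example):

    # Sketch of the lint: if an MR touches an expected-output file that has
    # platform-specific variants, require that every existing variant of that
    # test's output was touched too. Diff base and suffixes are assumptions.
    import subprocess, sys
    from pathlib import Path

    SUFFIXES = ("-mingw32", "-darwin", "-linux")   # assumed platform suffixes

    def changed_files():
        out = subprocess.run(
            ["git", "diff", "--name-only", "origin/master...HEAD"],
            check=True, capture_output=True, text=True).stdout
        return {Path(p) for p in out.splitlines() if p}

    def base_output(path):
        name = path.name
        for suf in SUFFIXES:
            if name.endswith(suf):
                name = name[: -len(suf)]
        if ".stdout" in name or ".stderr" in name:
            return path.with_name(name)
        return None

    changed = changed_files()
    bases = {b for b in (base_output(p) for p in changed) if b is not None}

    problems = []
    for base in bases:
        variants = [base] + list(base.parent.glob(base.name + "-*"))
        for v in variants:
            if v.exists() and v not in changed:
                problems.append(f"{base}: variant {v} was not updated")

    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)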

Regardless of whether we implement Sebastian's proposal, one smaller
measure we could implement to help the problem of small MRs is to
introduce some sort of mechanism to mark MRs as "trivial" (e.g. a label
or a commit/MR description keyword), which results in the `full-build`
being skipped for that MR. Perhaps this would be helpful?


> Another frustrating aspect is that if you want to merge an n-sized chain of
> dependent changes individually, you have to
>
>- Open an MR for each change (initially the last change will be
>comprised of n commits)
>- Review first change, turn pipeline green   (A)
>- Assign to Marge, wait for batch to be merged   (B)
>- Review second change, turn pipeline green
>- Assign to Marge, wait for batch to be merged
>- ... and so on ...
>
> Note that this (A) incurs many context switches for the dev and the latency of
> *at least* one run of CI.
> And then (B) incurs the latency of *at least* one full-build, if you're
> lucky and the batch succeeds. I've recently seen batches that were
> resubmitted by Marge at least 5 times due to spurious CI failures and
> timeouts. I think this is a huge factor for latency.
>
> Although after (A), I should just pop the patch off my mental stack,
> that isn't particularly true, because Marge keeps on reminding me when a
> stack fails or succeeds, both of which require at least some attention from
> me: Failed 2 times => Make sure it was spurious, Succeeds => Rebase next
> change.
>
> Maybe we can also learn from other projects like Rust, GCC or clang, which
> I haven't had a look at yet.
>
I did a bit of digging on this.

 * Rust: It appears that Rust's CI scheme is somewhat similar to what
   you proposed above. They do relatively minimal validation of MRs
   (e.g. https://github.com/rust-lang/rust/runs/1905017693),
   with a full-validation for merges
   (e.g. https://github.com/rust-lang-ci/rust/runs/1925049948). The latter
   usually takes between 3 and 4 hours, with some jobs taking 5 hours.

 * GCC: As far as I can tell, gcc doesn't actually have any (functional)
   continuous integration. Discussions with contr

Re: On CI

2021-02-18 Thread Moritz Angermann
I'm glad to report that my math was off. But it was off only because I
assumed that we'd successfully build all
windows configurations, which we of course don't. Thus some builds fail
faster.

Sylvain also provided a windows machine temporarily, until it expired.
This led to a slew of new windows wibbles.
The CI script Ben wrote, and generously used to help set up the new
builder, seems to assume an older Git install,
and thus a path was broken, which thanks to GitLab led to the brilliant
error of just stalling.
Next up, because we use msys2's pacman to provision the windows builders,
and pacman essentially gives us
symbols for packages to install, we ended up getting a newer autoconf onto
the new builder (and I assume this
will happen with any other builders we add as well). This new autoconf
(which I've also ran into on the M1s) doesn't
like our configure.ac/aclocal.m4 anymore and barfs; I wasn't able to figure
out how to force pacman to install an
older version and *not* give it some odd version suffix (which prevents it
from working as a drop-in replacement).

In any case we *must* update our autoconf files. So I guess the time is now.


On Wed, Feb 17, 2021 at 6:58 PM Moritz Angermann 
wrote:

> At this point I believe we have ample Linux build capacity. Darwin looks
> pretty good as well; the ~4 M1s we have should in principle also be able to
> build x86_64-darwin at acceptable speeds. Although on Big Sur only.
>
> The aarch64-Linux story is a bit constrained by the availability of powerful
> and fast CI machines but probably bearable for the time being. I doubt anyone
> really
> looks at those jobs anyway as they are permitted to fail. If aarch64 would
> become a bottleneck, I’d be inclined to just disable them. With the NCG
> soon this will likely become much more bearable as well, even though we
> might want to run the nightly llvm builds.
>
> To be frank, I don’t see 9.2 happening in two weeks with the current CI.
>
> If we subtract aarch64-linux and windows builds we could probably do a
> full run in less than three hours maybe even less. And that is mostly
> because we have a serialized pipeline. I have discussed some ideas with Ben
> on prioritizing the first few stages by the faster ci machines to
> effectively fail fast and provide feedback.
>
> But yes. Working on ghc right now is quite painful due to long and
> unpredictable CI times.
>
> Cheers,
>  Moritz
>
> On Wed, 17 Feb 2021 at 6:31 PM, Sebastian Graf 
> wrote:
>
>> Hi Moritz,
>>
>> I, too, had my gripes with CI turnaround times in the past. Here's a
>> somewhat radical proposal:
>>
>>- Run "full-build" stage builds only on Marge MRs. Then we can assign
>>to Marge much earlier, but probably have to do a bit more of (manual)
>>bisecting of spoiled Marge batches.
>>   - I hope this gets rid of a bit of the friction of small MRs. I
>>   recently caught myself wanting to do a bunch of small, independent, but
>>   related changes as part of the same MR, simply because it's such a 
>> hassle
>>   to post them in individual MRs right now and also because it steals so 
>> much
>>   CI capacity.
>>- Regular MRs should still have the ability to easily run individual
>>builds of what is now the "full-build" stage, similar to how we can run
>>optional "hackage" builds today. This is probably useful to pin down the
>>reason for a spoiled Marge batch.
>>- The CI capacity we free up can probably be used to run a perf build
>>(such as the fedora release build) on the "build" stage (the one where we
>>currently run stack-hadrian-build and the validate-deb9-hadrian build), in
>>parallel.
>>- If we decide against the latter, a micro-optimisation could be to
>>cache the build artifacts of the "lint-base" build and continue the build
>>in the validate-deb9-hadrian build of the "build" stage.
>>
>> The usefulness of this approach depends on how many MRs cause metric
>> changes on different architectures.
>>
>> Another frustrating aspect is that if you want to merge an n-sized chain
>> of dependent changes individually, you have to
>>
>>- Open an MR for each change (initially the last change will be
>>comprised of n commits)
>>- Review first change, turn pipeline green   (A)
>>- Assign to Marge, wait for batch to be merged   (B)
>>- Review second change, turn pipeline green
>>- Assign to Marge, wait for batch to be merged
>>- ... and so on ...
>>
>> Note that (A) incurs many context switches for the dev and the latency of
>> *at least* one run of CI.
>> And then (B) in

Re: On CI

2021-02-17 Thread Moritz Angermann
At this point I believe we have ample Linux build capacity. Darwin looks
pretty good as well; the ~4 M1s we have should in principle also be able to
build x86_64-darwin at acceptable speeds. Although on Big Sur only.

The aarch64-Linux story is a bit constrained by the availability of powerful
and fast CI machines but probably bearable for the time being. I doubt anyone
really
looks at those jobs anyway as they are permitted to fail. If aarch64 would
become a bottleneck, I’d be inclined to just disable them. With the NCG
soon this will likely become much more bearable as well, even though we
might want to run the nightly llvm builds.

To be frank, I don’t see 9.2 happening in two weeks with the current CI.

If we subtract aarch64-linux and windows builds we could probably do a full
run in less than three hours maybe even less. And that is mostly because we
have a serialized pipeline. I have discussed some ideas with Ben on
prioritizing the first few stages by the faster ci machines to effectively
fail fast and provide feedback.

But yes. Working on ghc right now is quite painful due to long and
unpredictable CI times.

Cheers,
 Moritz

On Wed, 17 Feb 2021 at 6:31 PM, Sebastian Graf  wrote:

> Hi Moritz,
>
> I, too, had my gripes with CI turnaround times in the past. Here's a
> somewhat radical proposal:
>
>- Run "full-build" stage builds only on Marge MRs. Then we can assign
>to Marge much earlier, but probably have to do a bit more of (manual)
>bisecting of spoiled Marge batches.
>   - I hope this gets rid of a bit of the friction of small MRs. I
>   recently caught myself wanting to do a bunch of small, independent, but
>   related changes as part of the same MR, simply because it's such a 
> hassle
>   to post them in individual MRs right now and also because it steals so 
> much
>   CI capacity.
>- Regular MRs should still have the ability to easily run individual
>builds of what is now the "full-build" stage, similar to how we can run
>optional "hackage" builds today. This is probably useful to pin down the
>reason for a spoiled Marge batch.
>- The CI capacity we free up can probably be used to run a perf build
>(such as the fedora release build) on the "build" stage (the one where we
>currently run stack-hadrian-build and the validate-deb9-hadrian build), in
>parallel.
>- If we decide against the latter, a micro-optimisation could be to
>cache the build artifacts of the "lint-base" build and continue the build
>in the validate-deb9-hadrian build of the "build" stage.
>
> The usefulness of this approach depends on how many MRs cause metric
> changes on different architectures.
>
> Another frustrating aspect is that if you want to merge an n-sized chain
> of dependent changes individually, you have to
>
>- Open an MR for each change (initially the last change will be
>comprised of n commits)
>- Review first change, turn pipeline green   (A)
>- Assign to Marge, wait for batch to be merged   (B)
>- Review second change, turn pipeline green
>    - Assign to Marge, wait for batch to be merged
>- ... and so on ...
>
> Note that (A) incurs many context switches for the dev and the latency of
> *at least* one run of CI.
> And then (B) incurs the latency of *at least* one full-build, if you're
> lucky and the batch succeeds. I've recently seen batches that were
> resubmitted by Marge at least 5 times due to spurious CI failures and
> timeouts. I think this is a huge factor for latency.
>
> Although after (A), I should just pop the patch off my mental stack,
> that isn't particularly true, because Marge keeps on reminding me when a
> stack fails or succeeds, both of which require at least some attention from
> me: Failed 2 times => Make sure it was spurious, Succeeds => Rebase next
> change.
>
> Maybe we can also learn from other projects like Rust, GCC or clang, which
> I haven't had a look at yet.
>
> Cheers,
> Sebastian
>
On Wed, 17 Feb 2021 at 09:11, Moritz Angermann <
moritz.angerm...@gmail.com> wrote:
>
>> Friends,
>>
>> I've been looking at CI recently again, as I was facing CI turnaround
>> times of 9-12hs; and this just keeps dragging out and making progress hard.
>>
>> The pending pipeline currently has 2 darwin, and 15 windows builds
>> waiting. Windows builds on average take ~220minutes. We have five builders,
>> so we can expect this queue to be done in ~660 minutes assuming perfect
>> scheduling and good performance. That is 11hs! The next windows build can
>> be started in 11hs. Please check my math and tell me I'm wrong!
>>
>> If you submit a MR today, with some luck, you'l

Re: On CI

2021-02-17 Thread Sebastian Graf
Hi Moritz,

I, too, had my gripes with CI turnaround times in the past. Here's a
somewhat radical proposal:

   - Run "full-build" stage builds only on Marge MRs. Then we can assign to
   Marge much earlier, but probably have to do a bit more of (manual)
   bisecting of spoiled Marge batches.
  - I hope this gets rid of a bit of the friction of small MRs. I
  recently caught myself wanting to do a bunch of small, independent, but
  related changes as part of the same MR, simply because it's such a hassle
  to post them in individual MRs right now and also because it
steals so much
  CI capacity.
   - Regular MRs should still have the ability to easily run individual
   builds of what is now the "full-build" stage, similar to how we can run
   optional "hackage" builds today. This is probably useful to pin down the
   reason for a spoiled Marge batch.
   - The CI capacity we free up can probably be used to run a perf build
   (such as the fedora release build) on the "build" stage (the one where we
   currently run stack-hadrian-build and the validate-deb9-hadrian build), in
   parallel.
   - If we decide against the latter, a micro-optimisation could be to
   cache the build artifacts of the "lint-base" build and continue the build
   in the validate-deb9-hadrian build of the "build" stage.

The usefulness of this approach depends on how many MRs cause metric
changes on different architectures.

Another frustrating aspect is that if you want to merge an n-sized chain of
dependent changes individually, you have to

   - Open an MR for each change (initially the last change will be
   comprised of n commits)
   - Review first change, turn pipeline green   (A)
   - Assign to Marge, wait for batch to be merged   (B)
   - Review second change, turn pipeline green
   - Assign to Marge, wait for batch to be merged
   - ... and so on ...

Note that (A) incurs many context switches for the dev and the latency of
*at least* one run of CI.
And then (B) incurs the latency of *at least* one full-build, if you're
lucky and the batch succeeds. I've recently seen batches that were
resubmitted by Marge at least 5 times due to spurious CI failures and
timeouts. I think this is a huge factor for latency.

Although after (A), I should just pop the patch off my mental stack,
that isn't particularly true, because Marge keeps on reminding me when a
stack fails or succeeds, both of which require at least some attention from
me: Failed 2 times => Make sure it was spurious, Succeeds => Rebase next
change.

Maybe we can also learn from other projects like Rust, GCC or clang, which
I haven't had a look at yet.

Cheers,
Sebastian

On Wed, 17 Feb 2021 at 09:11, Moritz Angermann <
moritz.angerm...@gmail.com> wrote:

> Friends,
>
> I've been looking at CI recently again, as I was facing CI turnaround
> times of 9-12hs; and this just keeps dragging out and making progress hard.
>
> The pending pipeline currently has 2 darwin, and 15 windows builds
> waiting. Windows builds on average take ~220minutes. We have five builders,
> so we can expect this queue to be done in ~660 minutes assuming perfect
> scheduling and good performance. That is 11hs! The next windows build can
> be started in 11hs. Please check my math and tell me I'm wrong!
>
> If you submit a MR today, with some luck, you'll be able to know if it
> will be mergeable some time tomorrow. At which point you can assign it to
> marge, and marge, if you are lucky and the set of patches she tries to
> merge together is mergeable, will merge your work into master probably some
> time on Friday. If a job fails, well you have to start over again.
>
> What are our options here? Ben has been pretty clear about not wanting a
> broken commit for windows to end up in the tree, and I'm there with him.
>
> Cheers,
>  Moritz
> ___
> ghc-devs mailing list
> ghc-devs@haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


On CI

2021-02-17 Thread Moritz Angermann
Friends,

I've been looking at CI recently again, as I was facing CI turnaround times
of 9-12hs; and this just keeps dragging out and making progress hard.

The pending pipeline currently has 2 darwin, and 15 windows builds waiting.
Windows builds on average take ~220minutes. We have five builders, so we
can expect this queue to be done in ~660 minutes assuming perfect
scheduling and good performance. That is 11hs! The next windows build can
be started in 11hs. Please check my math and tell me I'm wrong!
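
The arithmetic checks out under those assumptions; spelled out, using the
numbers quoted above:

    # Sanity check of the queue estimate above.
    queued = 15          # pending Windows builds
    minutes_each = 220   # average Windows build time
    builders = 5

    total = queued * minutes_each / builders
    print(f"~{total:.0f} minutes, i.e. ~{total / 60:.0f} hours")  # ~660 min, ~11 h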

If you submit a MR today, with some luck, you'll be able to know if it will
be mergeable some time tomorrow. At which point you can assign it to marge,
and marge, if you are lucky and the set of patches she tries to merge
together is mergeable, will merge your work into master probably some time
on Friday. If a job fails, well you have to start over again.

What are our options here? Ben has been pretty clear about not wanting a
broken commit for windows to end up in the tree, and I'm there with him.

Cheers,
 Moritz
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Reduction in Windows CI capacity

2021-02-08 Thread Ben Gamari
tl;dr. GHC's CI capacity will be reduced due to a loss of sponsorship,
   particularly in Windows runner capacity. Help wanted in finding
   additional capacity.


Hi all,

For many years Google X has generously donated Google Compute Engine
resources to GHC's CI infrastructure. We all owe a debt of gratitude to
Google X for providing us with what has undoubtedly amounted to tens of
thousands of dollars of computational capacity over the years. I would
especially like to thank Greg Steuck, whose advocacy made this donation
possible.

Of course, organizational priorities understandably change and Google X
will be unable to continue their sponsorship in the future. This puts us
in a bit of a tricky situation; Google Compute Engine is currently the
home of nearly 100 cores worth of CI capacity, including:

  * roughly 20% of our x86-64/Linux capacity
  * our only x86-64/FreeBSD runner
  * all five of our x86-64 Windows runners

While the Linux runners are fairly easy to replace, the Windows capacity
is a bit harder since Windows cloud capacity is quite expensive (IIRC
nearly half of the cost of our Windows GCE instances is put towards the
license).

In the short term I can cover for some of this lost capacity by bringing
up a Windows runner using our generous donation from Packet [1].
However, I am extremely wary of outspending our welcome on Packet's
infrastructure and therefore we will need to accept a small reduction in
capacity for a bit while we work out a more sustainable path forward. We
will have to see how things go but it may be necessary to disable the
Windows jobs on (non-Marge) merge request validation pipelines.

I am looking into various options for again reaching our previous
capacity, but this is an area where you might be able to help:

 * Familiarity with Windows licensing. Unfortunately the details of Windows
   licensing for virtualization purposes are a bit tricky. I suspect
   that the cheapest way forward is a single Windows Server license on a
   large machine but if you are familiar with Windows licensing in this
   setting, please do be in touch.

 * Providing Windows licenses. If you know of an organization
   that may be able to donate Windows licenses either in-kind or via
   financial support, please do be in touch.

 * Providing Windows cloud instances. If you know of an organization
   that may be able to donate Windows cloud instances, do holler.

As always, we welcome any hardware or cloud instance contributions. Do
be in touch if you may be in an position to help out.

Cheers,

- Ben


[1] https://www.packet.com/


signature.asc
Description: PGP signature
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


RE: Allowing Windows CI to fail

2020-02-03 Thread Ben Gamari
Simon Peyton Jones via ghc-devs  writes:

> Ben
>
> This sounds like a good decision to me, thanks.
>
> Is there a possibility to have a slow CI-on-windows job (not part of
> the "this must pass before merging" step), which will slowly, but
> reliably, fail if the Windows build fails? E.g. does it help to make
> the build be 100% sequential?
>
Sadly that won't fix the underlying problem.

> Or is there currently no way to build GHC at all on Windows in a way
> that won't fail? (That would be surprising to me. Until relatively
> recently I was *only* building on Windows.)
>
There is no way to build GHC that won't have a chance of failing. Indeed
Phyx and I also find it quite surprising how the probability of failure
seems to be higher now than in the past. However, we also both agree
that the status quo, when it works, works only accidentally (if the
win32 API documentation is to be believed).

What is especially intriguing is the fact that mingw32 gnu make should
also be affected by the same `exec` issue that we are struggling with, yet
it does none of the job object headstands that we are doing and still
*appears* to be quite reliable. Tamar had a hypothesis for why this
might be that he will test when he has time.

Cheers,

- Ben



signature.asc
Description: PGP signature
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


RE: Allowing Windows CI to fail

2020-02-03 Thread Simon Peyton Jones via ghc-devs
Ben

This sounds like a good decision to me, thanks.

Is there a possibility to have a slow CI-on-windows job (not part of the "this 
must pass before merging" step), which will slowly, but reliably, fail if the 
Windows build fails? E.g. does it help to make the build be 100% sequential?

Or is there currently no way to build GHC at all on Windows in a way that won't 
fail?  (That would be surprising to me.  Until relatively recently I was *only* 
building on Windows.)

Simon

| -Original Message-
| From: ghc-devs  On Behalf Of Ben Gamari
| Sent: 03 February 2020 16:03
| To: GHC developers 
| Subject: Allowing Windows CI to fail
| 
| Hi everyone,
| 
| After multiple weeks of effort struggling to get Windows CI into a stable
| condition I'm sorry to say that we're going to need to revert to allowing
| it to fail for a bit longer. The status quo is essentially holding up the
| entire merge queue and we still seem quite far from resolving the issues.
| 
| I have summarised the current state-of-play in #1. In short, the gcc
| toolchain likely can't be used reliably on Windows due to its ubiquitous
| use of `exec`, which cannot be reliably implemented on Windows.
| 
| Switching to LLVM as our native toolchain was my (initially promising)
| last-ditch attempt at avoiding this issue but sadly this looks to be a long
| road. My current attempt is stuck on an inscrutable loader error.
| 
| For the short-term, I am afraid I have run out of time for this effort.
| My current plan is to merge what I can from my wip/windows-ci branch but
| again enable the Windows CI jobs' allow_failure flag so that its unreliable
| nature doesn't hold up otherwise-passing CI jobs.
| 
| While it's unfortunate that we still lack reliable CI on Windows, I think
| the efforts of the last few weeks were quite worthwhile. We now
| have:
| 
|  * A much better understanding of the issues affecting us on Windows
|  * Significantly better documentation and automation for producing our
|    mingw toolchain artifacts
|  * better scripting for setting up Windows CI runners
|  * fixed several bugs in the ghc-jailbreak library used to work
|    around the Windows MAX_PATH limitation
| 
| Many thanks to Tamar Christina for his many hours of patient help.
| Without him, GHC's Windows support would be in significantly worse shape
| than it is.
| 
| Users of GHC should note that the CI issues we are struggling with *do
| not* affect compiled code. These bugs manifest only as (rare) failed
| compilations (particularly when building GHC itself); however, once
| compilation succeeds the program that results is correct and reliable.
| 
| Cheers,
| 
| - Ben
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Allowing Windows CI to fail

2020-02-03 Thread Ben Gamari
Hi everyone,

After multiple weeks of effort struggling to get Windows CI into a stable
condition I'm sorry to say that we're going to need to revert to
allowing it to fail for a bit longer. The status quo is essentially
holding up the entire merge queue and we still seem quite far from
resolving the issues.

I have summarised the current state-of-play in #1. In short, the
gcc toolchain likely can't be used reliably on Windows due to its
ubiquitous use of `exec`, which cannot be reliably implemented on
Windows.

Switching to LLVM as our native toolchain was my (initially promising)
last-ditch attempt at avoiding this issue but sadly this looks to be a
long road. My current attempt is stuck on an inscrutable loader error.

For the short-term, I am afraid I have run out of time for this effort.
My current plan is to merge what I can from my wip/windows-ci branch but
again enable the Windows CI jobs' allow_failure flag so that its
unreliable nature doesn't hold up otherwise-passing CI jobs.

While it's unfortunate that we still lack reliable CI on Windows,
I think the efforts of the last few weeks were quite worthwhile. We now
have:

 * A much better understanding of the issues affecting us on Windows
 * Significantly better documentation and automation for producing our
   mingw toolchain artifacts
 * better scripting for setting up Windows CI runners
 * fixed several bugs in the ghc-jailbreak library used to work
   around the Windows MAX_PATH limitation

Many thanks to Tamar Christina for his many hours of patient help.
Without him, GHC's Windows support would be in significantly worse shape
than it is.

Users of GHC should note that the CI issues we are struggling with *do
not* affect compiled code. These bugs manifest only as (rare) failed
compilations (particularly when building GHC itself); however, once
compilation succeeds the program that results is correct and reliable.

Cheers,

- Ben


signature.asc
Description: PGP signature
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


CI fails in git submodule update --init --recursive for 386-linux-deb9

2019-10-13 Thread Roland Senn
Hi 

After pushing some commits, while running a CI pipeline, I persistently
get the following error for the validate-i386-linux-deb9 step:

$ git submodule update --init --recursive
fatal: Unable to create
'/builds/RolandSenn/ghc/.git/modules/libraries/Cabal/index.lock': File
exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
Unable to checkout '63331c95ed15cc7e3d83850d308dc3a86a8c3c76' in
submodule path 'libraries/Cabal'

See https://gitlab.haskell.org/RolandSenn/ghc/pipelines/11311 .

How can I fix this? I have no access to this 386 linux machine, so I'm
unable to delete the file.

Roland
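
For persistent runner checkouts, one common workaround for this class of
failure is to clear stale Git lock files before the submodule update runs; a
sketch of such a cleanup step (it assumes nothing else touches the checkout
while the job runs):

    # Sketch of a pre-build cleanup step for a persistent CI checkout: remove
    # stale *.lock files a crashed git process may have left under .git, then
    # run the submodule update. Only safe if the checkout is otherwise idle.
    import os, subprocess

    def remove_stale_locks(repo_root="."):
        git_dir = os.path.join(repo_root, ".git")
        for dirpath, _dirs, files in os.walk(git_dir):
            for name in files:
                if name.endswith(".lock"):
                    path = os.path.join(dirpath, name)
                    print("removing stale lock:", path)
                    os.remove(path)

    if __name__ == "__main__":
        remove_stale_locks()
        subprocess.run(["git", "submodule", "update", "--init", "--recursive"],
                       check=True)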
  
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Is your MR failing CI test T9630 or haddock.base?

2019-05-27 Thread David Eichmann

Try rebasing!

Due to some unfortunate circumstances the performance tests (T9630 and 
haddock.base) became fragile. This should be fixed, but you need to 
rebase off of the latest master (at least 
c931f2561207aa06f1750827afbb68fbee241c6f) for the tests to pass.


Happy Hacking,

David Eichmann

--
David Eichmann, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com

Registered in England & Wales, OC335890
118 Wymering Mansions, Wymering Road, London W9 2NF, England

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI on forked projects: Darwin woes

2019-05-15 Thread Carter Schonwald
Yeah.  That’s my current theory.  It doesn’t help that the queue length
isn’t visible

On Mon, May 13, 2019 at 8:43 AM Ben Gamari  wrote:

> Carter Schonwald  writes:
>
> > Cool.  I recommend irc and devs list plus urls / copies of error
> messages.
> >
> > Hard to debug timeout if we don’t have the literal url or error messages
> shared !
> >
> For what it's worth I suspect these timeouts are simply due to the fact
> that we are somewhat lacking in Darwin builder capacity. There are
> rarely fewer than five builds queued to run on our two Darwin machines
> and this number can sometimes spike to much higher than the machines can
> run in the 10-hour build timeout.
>
> Cheers,
>
> - Ben
>
>
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI on forked projects: Darwin woes

2019-05-13 Thread Ben Gamari
Carter Schonwald  writes:

> Cool.  I recommend irc and devs list plus urls / copies of error messages.
>
> Hard to debug timeout if we don’t have the literal url or error messages 
> shared !
>
For what it's worth I suspect these timeouts are simply due to the fact
that we are somewhat lacking in Darwin builder capacity. There are
rarely fewer than five builds queued to run on our two Darwin machines
and this number can sometimes spike to much higher than the machines can
run in the 10-hour build timeout.

Cheers,

- Ben



signature.asc
Description: PGP signature
___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: CI on forked projects: Darwin woes

2019-05-12 Thread Carter Schonwald
Cool.  I recommend irc and devs list plus urls / copies of error messages.

Hard to debug timeout if we don’t have the literal url or error messages shared 
!

-Carter


From: Kevin Buhr 
Sent: Sunday, May 12, 2019 11:01 AM
To: Carter Schonwald
Cc: Iavor Diatchki
Subject: Re: CI on forked projects: Darwin woes

Thanks!  I'll send a note if it starts happening again.


On 5/12/19 7:23 AM, Carter Schonwald wrote:
>
[ . . . ]
> Next time you hit a failure could you share with the devs list and or
> #ghc irc ?

--
Kevin Buhr 

___
ghc-devs mailing list
ghc-devs@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

