On Tue, Mar 08, 2022 at 04:40:40PM +0100, Tim Düsterhus wrote:
> > > Please don't. You always want to run a clean build, otherwise you're going
> > > to get super-hard-to-debug failures if some object file is accidentally
> > > taken from the cache instead of a clean build.
> > 
> > What risk do you have in mind, for example? I personally think that
> > vtest is sufficiently self-contained to present almost zero risk. We
> > could possibly even pre-populate it and decide to rebuild only if it's
> > not able to dump its help page.
> 
> This was a reply to "cache the build of HAProxy", so unrelated to VTest. As
> the HAProxy build is what we want to test, it's important to always perform
> a clean build.

Yes, my point was about VTest. However, you made me think of a very good
reason for caching haproxy builds as well :-)  Quite often, some VTest
fails randomly, usually due to timing. And at the moment it's impossible
to restart the tests without rebuilding everything. Sometimes I end up
clicking "restart all jobs" 2-3 times in a row before getting a valid
run, and it really takes ages. For this alone it would be really useful
to cache the haproxy builds so that re-running the jobs only runs vtest.

The way I see it, an efficient process would be this:
  1) prepare OS images
  2) retrieve cached dependencies if any, and check for their freshness,
     otherwise build them
  3) retrieve the cached haproxy if any and check its freshness, otherwise
     build it (note: every single push will cause a rebuild). If we can
     include the branch name and/or tag name in the key, even better
     (see the sketch after this list).
  4) retrieve cached VTest if any, and check for its freshness,
     otherwise build it
  5) run VTest
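
To make the idea a bit more concrete, here is a rough sketch of what
step 3 could look like in GitHub Actions terms. It's only an
illustration, not our actual workflow: the binary path, the
"matrix.name" variable, the make invocation and the reg-tests call are
guesses that would need adjusting to what the jobs really do:

    - name: Cache haproxy binary
      id: cache-haproxy
      uses: actions/cache@v3
      with:
        # assuming the freshly built binary sits at the repo root
        path: ./haproxy
        # a new push gives a new sha => rebuild; re-running the very
        # same job keeps the same sha => cache hit, no rebuild
        key: haproxy-${{ runner.os }}-${{ matrix.name }}-${{ github.ref_name }}-${{ github.sha }}

    - name: Build haproxy
      if: steps.cache-haproxy.outputs.cache-hit != 'true'
      run: make -j$(nproc) TARGET=linux-glibc   # plus the per-matrix options

    - name: Run VTest
      run: make reg-tests   # the only part that runs on a restarted job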

This way, a push to a branch will cause a rebuild, but restarting the exact
same test will not. This would finally allow us to double-check unstable
VTest reports. Right now we only rediscover them when stumbling upon an old
browser tab that was left open all day.

> Looking at this: https://github.com/haproxy/haproxy/actions/runs/1952139455,
> the fastest build is the gcc no features build with 1:33. Even if we
> subtract the 8 seconds for VTest, then that's more than I'd be willing to
> synchronously wait.

I understand, but that time starts to add up when you have to redo all of
this just because a single vtest timed out on an overloaded VM.

> The slowest are the QUICTLS builds with 6:21, because
> QUICTLS is not currently cached.

Which is an excellent reason for also caching QUICTLS, as is already done
for libressl or openssl3, I don't remember which (maybe both).
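
For example (again, only a sketch: the install prefix, the version
variable and the build script name below are made up), QUICTLS could be
restored the same way, keyed on its version so that it's only rebuilt
when we bump it:

    - name: Cache QUICTLS
      id: cache-quictls
      uses: actions/cache@v3
      with:
        path: ~/opt/quictls                    # hypothetical install prefix
        key: quictls-${{ env.QUICTLS_VERSION }}-${{ runner.os }}

    - name: Build QUICTLS
      if: steps.cache-quictls.outputs.cache-hit != 'true'
      run: ./scripts/build-quictls.sh          # hypothetical build script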

> FWIW: Even 6:21 is super fast compared to other projects. I've recently
> contributed to PHP and CI results appear after 20 minutes or so.

The fact that others take even longer than we do is not an excuse for
us to be just as bad :-)

You & Ilya were among those insisting on a CI to improve our quality,
and it works! It works so well that developers are now impatient to see
the results to be sure they didn't break $RANDOM_OS. This morning I merged
David's build fix for kfreebsd and started to work on the backports. When
I finished, the build looked OK, but apparently I was still looking at a
previous result, or maybe Cirrus lags more, I don't know; in the end I
thought everything was good and pushed the backport to 2.5. Only later did
I notice that I was receiving more "failed" mails than usual; I looked at
it and there it was: the patch broke freebsd after I had already pushed
the backport to two branches. It's not dramatic, but this way of working
increases the context-switch rate, and I often find myself making more
mistakes, or at least performing fewer checks, when switching all the
time between branches and tests: once you see an error, you switch back
to the original branch, fix the issue, push again, switch back to the
branch you were on, get notified about another issue (still unrelated to
what you're doing), and so on. It's mentally difficult (at least for me).
Being able to shorten the time between a push and a result would leave me
less time to sink into another subject and would destroy a bit less of
what remains of my brain.

None of this is critical at all, the quality has already improved, but it
seems to me that for a small cost we can significantly smooth some of the
remaining rough edges.

> In any case I believe that our goal should be that one does not need to
> check the CI, because no bad commits make it into the repository.

That's utopian and contrary to basic principles of development, because
it denies the very existence of bugs. What matters is that we limit the
number of issues, we limit their duration, we remain as transparent as
possible about the fixes for these issues, and we limit the impact and
exposure of the undetected ones. The mental effort needed to go through
heavy processes is huge and is a primary cause of errors. I've lost count
of the number of times I've made mistakes when pushing to a temporary
branch to test some code, for example, because all this adds more
complexity, more things to think about, which is incompatible with
dealing with 10 open issues at once and working on backports in 3
branches. Fortunately "push -f" is not blocked and can sometimes be used
as a last resort immediately after discovering a mistake (I think I used
it once over the last year, which is already once too many, but better
than ruining everything).

I think our process works reasonably well and the amount of bad stuff
that gets pushed remains low thanks to the mandatory local VTest runs.
And it works so well that when someone merges something that breaks a
less common platform, everyone complains until it is fixed. This is a
good indication that the CI does add value for everyone, particularly
for the combinations of platforms and build options.

So the CI is not strictly necessary, but it's much appreciated and has
created some addiction, and the reward is that its users now want to
make it even better.

> Unfortunately we still have some flaky tests, that needlessly take up
> attention, because one will need to check the red builds.

It's not dramatic, but given the failure rate of the tests it requires
restarting everything quite frequently, and that part is annoying because
it derails our attention after we've already switched to something else.

> If those are gone,
> then I expect the vast majority of the commits to be green so that it only
> catches mistakes in strange configurations that one cannot easily test
> locally.

This is already the case for the vast majority of them. I think that if we
eliminate the random vtest failures, we're at around 95-98% successful
pushes. The remaining failures are caused by occasional download errors
when installing dependencies, or by a variable referenced outside some
obscure ifdef combination that only strikes on one platform.

> Of course if you think that the 8 seconds will improve your workflow, then
> by all means: Commit it. From my POV in this case the cost is larger than
> the benefit. Caching is one of the hard things in computer science :-)

No, caching is hard when done *optimally*. I'm not interested at all in
an optimal cache. That's the same principle we applied to haproxy's RAM
cache: if in doubt, do not cache, to keep it hassle-free. A cache is very
useful when it's 100% safe. I guess you haven't disabled your CPU's cache
in a very long time! I remember the era, 30 years ago, when buggy CPU
caches sometimes needed to be disabled for certain applications. Nowadays
if you disable your CPU cache you get something that behaves like a 386...
That's proof that caching done right can be both safe and extremely
useful. Translating this back to haproxy builds, for me it means: "if at
any point there is any doubt that something that might affect the build
result might have changed, better rebuild". And that's true for all
components. With that, if we can shrink the build time of successful
builds by 15-20% and of restarted jobs by 80%, that would be awesome and
should not add any maintenance burden (and if for any reason it does, we
can revert).
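
In cache-key terms (still purely hypothetical, the matrix fields below
are made up), that simply means putting the exact commit plus everything
outside the tree that can influence the result into the key, and not
using any restore-keys fallback, so that any doubt is a miss and a miss
is a rebuild:

    - uses: actions/cache@v3
      with:
        path: ./haproxy
        # the exact commit, plus the image, compiler and SSL lib used by
        # this job: if any of them changes, the key changes and we rebuild
        key: haproxy-${{ matrix.os }}-${{ matrix.cc }}-${{ matrix.ssl }}-${{ github.sha }}
        # no restore-keys on purpose: a partial match is a doubt,
        # and a doubt means a rebuild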

Now rest assured that I won't roll on the floor crying about this, but
I think that this is heading in the right direction. And with more CI
time available, we can even hope to test more combinations and squash
more bugs before they reach users. That's another aspect to think about.
I'm really happy to see that the time between versions .0 and .1
increases from release to release, despite the fact that we add a lot
more code, and more complex code. For now the CI scales well with the
code base, and I'm interested in continuing to help it scale even better.

Regards,
Willy
