On Thu, Nov 16, 2023 at 1:40 PM Dumitru Ceara <dce...@redhat.com> wrote:

> On 11/16/23 13:05, Ilya Maximets wrote:
> > On 11/16/23 10:17, Dumitru Ceara wrote:
> >> On 11/16/23 10:02, Dumitru Ceara wrote:
> >>> On 11/16/23 09:51, Dumitru Ceara wrote:
> >>>> On 11/16/23 09:44, Ales Musil wrote:
> >>>>>> On Thu, Nov 16, 2023 at 9:34 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>>>>
> >>>>>> On 11/16/23 00:12, Ilya Maximets wrote:
> >>>>>>> On 11/15/23 22:20, Mark Michelson wrote:
> >>>>>>>> On 11/15/23 12:02, Ilya Maximets wrote:
> >>>>>>>>> On 11/15/23 17:00, Dumitru Ceara wrote:
> >>>>>>>>>> All of the above were changed to track the latest available
> >>>>>>>>>> releases.  Initially that seemed like a good idea, but in
> >>>>>>>>>> practice a new release would potentially (silently) cause CI
> >>>>>>>>>> runs that used to pass on given stable versions to
> >>>>>>>>>> unexpectedly start failing.
> >>>>>>>>>>
> >>>>>>>>>> To address that, this series pins all versions we can
> >>>>>>>>>> control, allowing us to use different values for different
> >>>>>>>>>> branches.
> >>>>>>>>>>
> >>>>>>>>>> NOTE: the last commit of this series will look slightly
> >>>>>>>>>> different on stable branches, e.g., on branches <= 23.06 we
> >>>>>>>>>> need Python <= 3.11 and Fedora 37.
> >>>>>>>>>>
> >>>>>>>>>> The first commit of the series bumps the OVS submodule to
> >>>>>>>>>> include the latest fixes to errors reported by recent
> >>>>>>>>>> versions of flake8.
> >>>>>>>>>>
> >>>>>>>>>> Changes in v2:
> >>>>>>>>>> - added first patch to bump OVS submodule
> >>>>>>>>>> - addressed Ales' review comments:
> >>>>>>>>>>    - moved the logic to determine which image to use out of
> >>>>>>>>>>      ci.sh; it's now in the workflow itself.
> >>>>>>>>>>    - moved the setting of OVN_IMAGE into the same block with
> >>>>>>>>>>      all related env vars
> >>>>>>>>>> - added a note for maintainers in the release process
> >>>>>>>>>>   documentation to mention that the new workflow will likely
> >>>>>>>>>>   have to be triggered manually (at least once) when branching
> >>>>>>>>>>   for a new release.  GitHub doesn't allow periodic jobs on
> >>>>>>>>>>   non-default branches.
> >>>>>>>>>
> >>>>>>>>> IMHO, that sounds horrible to maintain.  Can we just build
> >>>>>>>>> things per run?
> >>>>>>>>
> >>>>>>>> What if we set up the "Containers" action to run on branch
> >>>>>>>> creation?  Would that adequately substitute for the manual run
> >>>>>>>> that creates the image?
> >>>>>>>>
> >>>>>>>> If we only have to run the action once each time a branch is
> >>>>>>>> created, I don't think that qualifies as "horrible" to maintain.
> >>>>>>>> This could be lumped into our branch creation scripts.  However,
> >>>>>>>> if it can be automated as I suggested above, that would be
> >>>>>>>> better.
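For reference, a minimal trigger sketch for such a Containers workflow
(untested; the cron value is illustrative, the rest keeps the manual
trigger mentioned in the cover letter):

    on:
      create:              # fires when a branch (or tag) is created
      workflow_dispatch:   # manual fallback for maintainers
      schedule:
        - cron: '0 0 * * 0' # periodic rebuild; GitHub only runs scheduled
                            # jobs on the default branch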
> >>>>>>>
> >>>>>>> The job will never be re-run, meaning we'll be using the same
> >>>>>>> container for many years.  If we ever need to update this
> >>>>>>> container, it will have to be done manually for each branch,
> >>>>>>> except for main.
> >>>>>>>
> >>>>>>> Also, in the patch, the Cirrus CI is set to use the container
> >>>>>>> from main, so this will need to be adjusted for each branch and
> >>>>>>> will need to become part of the release patches (?).
> >>>>>>>
> >>>>>>> Another issue with the current implementation of containers is
> >>>>>>> that it's a pain for external contributors to change
> >>>>>>> dependencies, because they'll have to modify the CI in multiple
> >>>>>>> places in order to build and use their own containers.
> >>>>>>> And with these new changes it will become worse.  For example, I
> >>>>>>> never use 'main' or 'branch-*' branches in my own forks.  And CI
> >>>>>>> depends on the branch name now, and it will fall back to an
> >>>>>>> unmaintained old-style-named image.  It might not be easy to
> >>>>>>> figure out why all your builds are failing, especially if you're
> >>>>>>> new to the project.
> >>>>>>>
> >>>>>>
> >>>>>> We could try to make it build the container image every time on
> >>>>>> stable branches and avoid any manual steps.
> >
> > But the branch name will still be a problem.  Every time people use any
> > branch name other than 'main' for their own development, they'll get an
> > outdated container, unless they change the CI scripts.  Or am I missing
> > something?
> >
>
> Yeah, we'd have to make it build the container image every time on
> non-default branches.  That's a problem because the Fedora container
> takes so long to build.  So we'd have to separate the two types of
> container builds.  Seems like a lot of trouble to go through indeed.
>
> >>>>>>
> >>>>>>>>
> >>>>>>>>> How much value do these containers actually have?
> >>>>>>
> >>>>>> I think you mentioned the benefits below already:
> >>>>>> - we can share the artifact (container image) with Cirrus CI; the
> >>>>>>   aarch64 container build takes significantly longer
> >>>>>> - we can share the artifact with downstream CIs
> >>>>>>
> >>>>>>>>
> >>>>>>>> As patch 2 mentions, it allows us to use different distro bases
> >>>>>>>> for each version.
> >>>>>>>
> >>>>>>> This can be done by simply specifying 'container: name' in the
> >>>>>>> job; GHA will fetch and use that container.
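Something like this, for example (rough sketch; the image name and build
script path are illustrative only):

    jobs:
      build:
        runs-on: ubuntu-latest
        container: ghcr.io/ovn-org/ovn-tests:fedora-37
        steps:
          - uses: actions/checkout@v3
          - run: ./.ci/linux-build.sh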
> >>>>>>>
> >>>>>>>> But I also think that having the container image cached allows
> >>>>>>>> for quicker test runs, since we don't have to reconstruct the
> >>>>>>>> environment each time we run the tests.
> >>>>>>>
> >>>>>>> It takes less than 2 minutes to create a full x86 environment.
> >>>>>>> We spend more time re-building OVS and OVN for every variant of
> >>>>>>> tests, even though most of them use the same compiler flags.
> >>>>>>>
> >>>>>>> Anyway, there is no need to push the container into a registry;
> >>>>>>> we can build it and pass it to the next job via artifacts.  I
> >>>>>>> don't think spending an extra 1-2 minutes per workflow (not job,
> >>>>>>> workflow) is a significant problem if it allows us to just build
> >>>>>>> whatever the current dependencies are and not think about what
> >>>>>>> kind of outdated build from which branch is going to be used.
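A rough sketch of that artifact hand-off (untested; the job names and the
Dockerfile path are illustrative):

    jobs:
      container:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - run: |
              docker build -t ovn-ci -f utilities/containers/fedora/Dockerfile .
              docker save ovn-ci -o ovn-ci.tar
          - uses: actions/upload-artifact@v3
            with:
              name: ovn-ci-image
              path: ovn-ci.tar
      tests:
        needs: container
        runs-on: ubuntu-latest
        steps:
          - uses: actions/download-artifact@v3
            with:
              name: ovn-ci-image
          - run: docker load -i ovn-ci.tar   # image is now usable in this job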
> >>>>>>>
> >>>>>>> The only downside of this is that we can't share the artifact
> >>>>>>> with Cirrus CI or some downstream CIs, but that also doesn't seem
> >>>>>>> to be a significant issue.  1-2 minutes per workflow is not that
> >>>>>>> much.  Building the aarch64 container takes a lot more time, but
> >>>>>>> switching it from Fedora to Ubuntu brings the build down to about
> >>>>>>> 10 minutes, so it might be reasonable to just rebuild it
> >>>>>>> on-demand as well.
> >>>>>>>
> >>>>>>
> >>>>>> Ales, why do we need the Cirrus CI image to be Fedora-based?
> >>>>>>
> >>>>>
> >>>>> We don't, but it's way easier for me to navigate.  I'm not even
> >>>>> sure how/if the Ubuntu one works for aarch64.  We can switch that
> >>>>> one; however, I don't see any clear benefit in doing so.  Also, one
> >>>>> thing to note: even if it's 1-2
> >>>>
> >>>> Well, it takes way less time to build; that might be an argument
> >>>> for it.
> >
> > FWIW, Fedora takes an hour to build, while Ubuntu only takes ~10
> > minutes.
> >
> >>>>
> >>>>> minutes (it's probably slightly more), we are losing the ability
> >>>>> to have reproducible builds.  It's very easy today to take the
> >>>>> image and reproduce the same issue, IMO.
> >>>>>
> >>>>
> >>>> That's a good point; reproducible builds are important.  Storing
> >>>> the image as a GitHub CI artifact would still allow us to download
> >>>> it and load it locally, but it's way easier if it's just stored in
> >>>> the ghcr.io registry.
> >
> > How many times did you use this for debugging in the past year?
> > If it's not a day-to-day churn, it might be worth adding an extra
> > step of downloading an artifact.
> >
>
> I'm not saying it's hard to use the artifact directly, just that it's
> more convenient to get the image from the registry.  I don't use it that
> much, but when I do need it, it's way easier for me to do a regular
> "podman/docker run" pointing at the registry than to figure out how
> to download and load the image from the artifact.  Anyhow, it's a minor
> annoyance and can be automated.
>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> Not a strong opinion, but the branch name thing might be the
> >>>>>>> most annoying, if I understand correctly.
> >>>>>>
> >>>>>> In the (very) short term, however, my problem is that I can't
> >>>>>> merge any bug fix in OVN because I can't run any CI at the moment.
> >>>>>> That's mostly because different versions of the OVN (and, via the
> >>>>>> submodule, OVS) Python code trigger different flake8 warnings with
> >>>>>> different flake8/Python/distro versions.
> >>>>>>
> >>>>>> A quick fix for now could be to bump the OVS submodule again to
> >>>>>> fdbf0bb2aed5 ("flake8: Fix E721 check failures.") (which is what
> >>>>>> patch 1/3 is doing), but on _all_ stable branches.  That's not
> >>>>>> ideal because that commit (and its backports) are not part of any
> >>>>>> release yet.
> >>>>>>
> >>>>>> I wanted to avoid doing that, but it seems we don't have an easy
> >>>>>> alternative, so I'm starting to think we should just go for the
> >>>>>> OVS submodule bump and try to find a better solution for the
> >>>>>> container workflow in the longer term.
> >>>>>>
> >>>
> >>> Just in case I wasn't clear, what I meant here is applying patch 1/3
> >>> to main, and its corresponding stable branch backports bumping the
> >>> submodule to the tips of OVS branch-3.2, branch-3.1 and branch-3.0.
> >
> > Bumping the submodule to the tips of OVS stable branches seems OK to me.
> >
>
> OK.
>
> >>>
> >>
> >> ... and we need to pin the Python version to 3.11 on branches <= 23.03.
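Assuming those workflows use actions/setup-python, the pin would be
roughly this (sketch only):

    # Pin the interpreter so new Python/flake8 releases can't silently
    # break stable branches.
    - uses: actions/setup-python@v4
      with:
        python-version: '3.11'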
> >
> > In general, patches 1 and 3 of this set are fine (I didn't test).
>
> OK, I'll prepare v3 and backports and post them soon.
>
> > The main problem, IMO, is the branch name dependency that makes
> > day-to-day development harder.
> >
>
> I think it's still useful to be able to pin container versions on stable
> branches and make them re-usable for other types of (downstream) CI.
>
> OTOH, if we have a dockerfile we use for building these CI container
> images, maybe other CI systems (e.g., the Red Hat downstream CI) could
> just use the dockerfile to periodically pre-build their own environment.
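E.g., a downstream job could be as simple as this sketch (the schedule and
the Dockerfile path are illustrative):

    on:
      schedule:
        - cron: '0 3 * * 1'  # weekly environment rebuild
    jobs:
      image:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
            with:
              repository: ovn-org/ovn   # build straight from upstream
          - run: docker build -f utilities/containers/fedora/Dockerfile -t ovn-ci .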
>
> Ales, what do you think?
>


It's possible, but it's also very easy for that dockerfile to fall behind.
Currently we test both images.  If we switch to Ubuntu for Cirrus CI and
disable automatic builds of Fedora, I'm afraid that no one will care and
the Fedora image will just become unmaintained.


>
> >>
> >>>>>> What do you guys think?
> >>>>>>
> >>>>>
> >>>>> I'm afraid that we will constantly chase those ghosts of different
> >>>>> dependency problems for various branches, and I'm not sure what the
> >>>>> proper way to handle that would be.  The unfortunate thing is that
> >>>>> we don't directly control OVS, so some of those "hacks" might
> >>>>> always be necessary to keep things working.
> >>>>>
> >>>>
> >>>> Isn't that a different discussion?  Partially because OVN uses
> >>>> "internal" OVS structures directly.
> >>>>
> >>
> >
>
>
Thanks,
Ales

-- 

Ales Musil

Senior Software Engineer - OVN Core

Red Hat EMEA <https://www.redhat.com>

amu...@redhat.com