On 11/16/23 13:05, Ilya Maximets wrote:
> On 11/16/23 10:17, Dumitru Ceara wrote:
>> On 11/16/23 10:02, Dumitru Ceara wrote:
>>> On 11/16/23 09:51, Dumitru Ceara wrote:
>>>> On 11/16/23 09:44, Ales Musil wrote:
>>>>> On Thu, Nov 16, 2023 at 9:34 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>
>>>>>> On 11/16/23 00:12, Ilya Maximets wrote:
>>>>>>> On 11/15/23 22:20, Mark Michelson wrote:
>>>>>>>> On 11/15/23 12:02, Ilya Maximets wrote:
>>>>>>>>> On 11/15/23 17:00, Dumitru Ceara wrote:
>>>>>>>>>> All of the above were changed to track the latest available releases.
>>>>>>>>>> Initially that seemed like a good idea, but in practice a new release
>>>>>>>>>> would potentially (silently) cause CI runs that used to pass on given
>>>>>>>>>> stable versions to unexpectedly start failing.
>>>>>>>>>>
>>>>>>>>>> To address that, this series pins all versions we can control, allowing
>>>>>>>>>> us to use different values for different branches.
>>>>>>>>>>
>>>>>>>>>> NOTE: the last commit of this series will look slightly different on
>>>>>>>>>> stable branches, e.g., on branches <= 23.06 we need Python <= 3.11
>>>>>>>>>> and Fedora 37.
>>>>>>>>>>
>>>>>>>>>> The first commit of the series bumps the OVS submodule to include the
>>>>>>>>>> latest fixes to errors reported by recent versions of flake8.
>>>>>>>>>>
>>>>>>>>>> Changes in v2:
>>>>>>>>>> - added first patch to bump OVS submodule
>>>>>>>>>> - addressed Ales' review comments:
>>>>>>>>>>    - moved the logic to determine which image to use out of ci.sh;
>>>>>>>>>>      it's now in the workflow itself.
>>>>>>>>>>    - moved the setting of OVN_IMAGE into the same block as all the
>>>>>>>>>>      related env vars
>>>>>>>>>> - added a note for maintainers in the release process documentation
>>>>>>>>>>    to mention that the new workflow will likely have to be triggered
>>>>>>>>>>    manually (at least once) when branching for a new release.  GitHub
>>>>>>>>>>    doesn't allow periodic jobs on non-default branches.
>>>>>>>>>
>>>>>>>>> IMHO, that sounds horrible to maintain.  Can we just build things per
>>>>>>>>> run?
>>>>>>>>
>>>>>>>> What if we set up the "Containers" action to run on branch creation?
>>>>>>>> Would that adequately substitute for the manual run that creates the
>>>>>>>> image?
>>>>>>>>
>>>>>>>> If we only have to run the action once each time a branch is created, I
>>>>>>>> don't think that qualifies as "horrible" to maintain. This could be
>>>>>>>> lumped into our branch creation scripts. However, if it can be automated
>>>>>>>> as I suggested above, that would be better.
>>>>>>>
>>>>>>> The job will never be re-run, meaning we'll be using the same container
>>>>>>> for many years.  If this container ever needs an update, it will have to
>>>>>>> be done manually for each branch, except for main.
>>>>>>>
>>>>>>> Also, in the patch, Cirrus CI is set to use the container from main,
>>>>>>> so this will need to be adjusted for each branch and will presumably
>>>>>>> need to become part of the release patches (?).
>>>>>>>
>>>>>>> Another issue with the current implementation of containers is that it's
>>>>>>> a pain for an external contributor to change dependencies, because they'll
>>>>>>> have to modify the CI in multiple places in order to build and use their
>>>>>>> own containers.
>>>>>>> And with these new changes it will become worse.  For example, I never use
>>>>>>> 'main' or 'branch-*' branches in my own forks.  CI now depends on the
>>>>>>> branch name, and it will fall back to an unmaintained old-style-named
>>>>>>> image.  It might not be easy to figure out why all your builds are
>>>>>>> failing, especially if you're new to the project.
>>>>>>>
>>>>>>
>>>>>> We could try to make it build the container image every time on stable
>>>>>> branches and avoid any manual steps.
> 
> But the branch name will still be a problem.  Every time people use any
> branch name other than 'main' for their own development, they'll get an
> outdated container, unless they change the CI scripts.  Or am I missing
> something?
> 

Yeah, we'd have to make it build the container image every time on
non-default branches.  That's a problem because the Fedora container
takes so long to build, so we'd have to separate the two types of
container builds.  Seems like a lot of trouble to go through indeed.
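
For reference, the branch-name fallback being discussed boils down to
something like the sketch below (illustrative only, with made-up image
names, not the exact workflow code):

    # Sketch of the image selection; GITHUB_REF_NAME is provided by
    # GitHub Actions, the image names are illustrative.
    case "$GITHUB_REF_NAME" in
        main|branch-*)
            # Release branches get a pinned, pre-built image.
            OVN_IMAGE="ghcr.io/ovn-org/ovn-tests:$GITHUB_REF_NAME"
            ;;
        *)
            # Anything else falls back to the old-style image, which is
            # the failure mode Ilya describes above.
            OVN_IMAGE="ovn-org/ovn-tests:latest"
            ;;
    esac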

>>>>>>
>>>>>>>>
>>>>>>>>> How much value do these containers actually have?
>>>>>>
>>>>>> I think you mentioned the benefits further down already:
>>>>>> - we can share the artifact (container image) with Cirrus CI; the aarch64
>>>>>>   container build takes significantly longer
>>>>>> - we can share the artifact with downstream CIs
>>>>>>
>>>>>>>>
>>>>>>>> As patch 2 mentions, it allows us to use different distro bases for
>>>>>>>> each version.
>>>>>>>
>>>>>>> This can be done by simply specifying 'container: name' in the job;
>>>>>>> GHA will fetch and use that container.
>>>>>>>
>>>>>>>> But I also think that having the container image cached
>>>>>>>> allows for quicker test runs since we don't have to reconstruct the
>>>>>>>> environment each time we run the tests.
>>>>>>>
>>>>>>> It takes less than 2 minutes to create a full x86 environment.  We spend
>>>>>>> more time re-building OVS and OVN for every variant of tests, even though
>>>>>>> most of them are using the same compiler flags.
>>>>>>>
>>>>>>> Anyways, there is no need to push the container into a registry; we can
>>>>>>> build it and pass it to the next job via artifacts.  I don't think
>>>>>>> spending an extra 1-2 minutes per workflow (not job, workflow) is a
>>>>>>> significant problem if it allows us to just build whatever the current
>>>>>>> dependencies are and not think about what kind of outdated build from
>>>>>>> which branch is going to be used.
>>>>>>>
>>>>>>> The only downside of this is that we can't share the artifact with Cirrus
>>>>>>> CI or some downstream CIs, but that also doesn't seem to be a significant
>>>>>>> issue.  1-2 minutes per workflow is not that much.  Building the aarch64
>>>>>>> container takes a lot more time, but switching it from Fedora to Ubuntu
>>>>>>> brings the build down to only 10 minutes, so it might be reasonable to
>>>>>>> just rebuild it on-demand as well.
>>>>>>>
>>>>>>
>>>>>> Ales, why do we need the Cirrus CI image to be Fedora based?
>>>>>>
>>>>>
>>>>> We don't, but it's way easier for me to navigate. I'm not even sure how/if
>>>>> the Ubuntu one works for aarch64. We can switch that one, but I don't see
>>>>> any clear benefit in doing so. Also, one thing to note: even if it's 1-2
>>>>
>>>> Well, it takes way less time to build; that might be an argument for it.
> 
> FWIW, Fedora takes an hour to build, Ubuntu only takes ~10 minutes.
> 
>>>>
>>>>> minutes (it's probably slightly more), we are losing the ability to have
>>>>> reproducible builds. It's very easy today to take the image and reproduce
>>>>> the same issue, IMO.
>>>>>
>>>>
>>>> That's a good point, reproducible builds are important.  Storing the
>>>> image as a GitHub CI artifact would still allow us to download it and load
>>>> it locally, but it's way easier if it's just stored in the ghcr.io registry.
> 
> How many times did you use this for debugging in the past year?
> If it's not day-to-day churn, it might be worth adding an extra
> step of downloading an artifact.
> 

I'm not saying it's hard to use the artifact directly, just that it's
more convenient to get the image from the registry.  I don't use it that
much, but when I do need it, it's way easier for me to do a regular
"podman/docker run" pointing at the registry than to figure out how
to download and load the image from the artifact.  Anyhow, it's a minor
annoyance and can be automated.
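
Concretely, the difference is something like this (the image and artifact
names below are made up for illustration, untested):

    # Today: pull the pre-built image straight from the registry.
    podman run -it --rm ghcr.io/ovn-org/ovn-tests:fedora /bin/bash

    # With a CI artifact we'd first have to fetch and load it, e.g.:
    gh run download <run-id> -n ovn-tests-image
    podman load -i ovn-tests-image.tar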

>>>>
>>>>>
>>>>>>
>>>>>>> Not a strong opinion, but the branch name thing might be the most
>>>>>>> annoying part, if I understand correctly.
>>>>>>>
>>>>>>
>>>>>> In the (very) short term, however, my problem is that I can't merge any
>>>>>> bug fix in OVN because I can't run any CI at the moment.  That's mostly
>>>>>> because different versions of the OVN (and, via the submodule, OVS) Python
>>>>>> code trigger different flake8 warnings with different
>>>>>> flake8/python/distro versions.
>>>>>>
>>>>>> A quick fix for now could be to bump the OVS submodule again to
>>>>>> fdbf0bb2aed5 ("flake8: Fix E721 check failures.") (which is what patch
>>>>>> 1/3 is doing) but on _all_ stable branches.  That's not ideal because
>>>>>> that commit (and its backports) are not part of any release yet.
>>>>>>
>>>>>> I wanted to avoid doing that, but it seems we don't have an easy
>>>>>> alternative, so I'm starting to think we should just go for the OVS
>>>>>> submodule bump and try to find a better solution for the container
>>>>>> workflow in the longer term.
>>>>>>
>>>
>>> Just in case I wasn't clear, what I meant here is applying patch 1/3 to
>>> main and its corresponding stable branch backports, bumping the submodule
>>> to the tips of OVS branch-3.2, branch-3.1 and branch-3.0.
> 
> Bumping the submodule to the tips of OVS stable branches seems OK to me.
> 

OK.
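
(For the record, the bump itself is the usual submodule dance; this is an
untested sketch that assumes the submodule lives at ./ovs, and the commit
to check out is simply the tip of the corresponding OVS stable branch:

    cd ovs
    git fetch origin
    git checkout origin/branch-3.2   # or branch-3.1 / branch-3.0
    cd ..
    git add ovs
    git commit -s -m "ovs: Bump submodule to the tip of branch-3.2.")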

>>>
>>
>> ... and we need to pin the Python version to 3.11 on branches <= 23.03.
> 
> In general, patches 1 and 3 of this set are fine (I didn't test).

OK, I'll prepare v3 and backports and post them soon.

> The main problem, IMO, is the branch name dependency that makes
> day-to-day development harder.
> 

I think it's still useful to be able to pin container versions on stable
branches and make them reusable for other types of (downstream) CI.

OTOH, if we have a Dockerfile we use for building these CI container
images, maybe other CI systems (e.g., the Red Hat downstream CI) could
just use the Dockerfile to periodically pre-build their own environment.
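
I.e., something as simple as this, run periodically on their side (the
path and tag below are illustrative, untested):

    podman build -t downstream/ovn-ci:fedora \
        -f <path-to-the-fedora-Dockerfile-in-the-ovn-tree> .

and then they would point their jobs at that image instead of ours.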

Ales, what do you think?

>>
>>>>>> What do you guys think?
>>>>>>
>>>>>
>>>>> I'm afraid that we will constantly be chasing these ghosts of different
>>>>> dependency problems across the various branches. And I'm not sure what the
>>>>> proper way to handle that would be. The unfortunate thing is that we don't
>>>>> directly control OVS, so some of those "hacks" might always be necessary to
>>>>> keep it working.
>>>>>
>>>>
>>>> Isn't that a different discussion?  Partially because OVN uses
>>>> "internal" OVS structures directly.
>>>>
>>
> 

