On 11/16/23 13:44, Ales Musil wrote:
> On Thu, Nov 16, 2023 at 1:40 PM Dumitru Ceara <dce...@redhat.com> wrote:
>
>> On 11/16/23 13:05, Ilya Maximets wrote:
>>> On 11/16/23 10:17, Dumitru Ceara wrote:
>>>> On 11/16/23 10:02, Dumitru Ceara wrote:
>>>>> On 11/16/23 09:51, Dumitru Ceara wrote:
>>>>>> On 11/16/23 09:44, Ales Musil wrote:
>>>>>>> On Thu, Nov 16, 2023 at 9:34 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>>>>
>>>>>>>> On 11/16/23 00:12, Ilya Maximets wrote:
>>>>>>>>> On 11/15/23 22:20, Mark Michelson wrote:
>>>>>>>>>> On 11/15/23 12:02, Ilya Maximets wrote:
>>>>>>>>>>> On 11/15/23 17:00, Dumitru Ceara wrote:
>>>>>>>>>>>> All of the above were changed to track the latest available
>>>>>>>>>>>> releases.  Initially that seemed like a good idea but in practice,
>>>>>>>>>>>> a new release would potentially (silently) cause CI runs that used
>>>>>>>>>>>> to pass on given stable versions to unexpectedly start failing.
>>>>>>>>>>>>
>>>>>>>>>>>> To address that this series pins all versions we can control,
>>>>>>>>>>>> allowing us to use different values for different branches.
>>>>>>>>>>>>
>>>>>>>>>>>> NOTE: the last commit of this series will look slightly different
>>>>>>>>>>>> on stable branches, e.g., on branches <= 23.06 we need
>>>>>>>>>>>> Python <= 3.11 and Fedora 37.
>>>>>>>>>>>>
>>>>>>>>>>>> The first commit of the series bumps the OVS submodule to include
>>>>>>>>>>>> the latest fixes to errors reported by recent versions of flake8.
>>>>>>>>>>>>
>>>>>>>>>>>> Changes in v2:
>>>>>>>>>>>> - added first patch to bump OVS submodule
>>>>>>>>>>>> - addressed Ales' review comments:
>>>>>>>>>>>>   - moved the logic to determine which image to use out of ci.sh;
>>>>>>>>>>>>     it's now in the workflow itself.
>>>>>>>>>>>>   - moved setting of OVN_IMAGE into the same block with all
>>>>>>>>>>>>     related env vars
>>>>>>>>>>>> - added a note for maintainers in the release process
>>>>>>>>>>>>   documentation to mention that the new workflow will likely have
>>>>>>>>>>>>   to be triggered manually (at least once) when branching for a
>>>>>>>>>>>>   new release.  GitHub doesn't allow periodic jobs on non-default
>>>>>>>>>>>>   branches.
>>>>>>>>>>>
>>>>>>>>>>> IMHO, that sounds horrible to maintain.  Can we just build things
>>>>>>>>>>> per run?
>>>>>>>>>>
>>>>>>>>>> What if we set up the "Containers" action to run on branch creation?
>>>>>>>>>> Would that adequately substitute for the manual run that creates the
>>>>>>>>>> image?
>>>>>>>>>>
>>>>>>>>>> If we only have to run the action once each time a branch is
>>>>>>>>>> created, I don't think that qualifies as "horrible" to maintain.
>>>>>>>>>> This could be lumped into our branch creation scripts.  However, if
>>>>>>>>>> it can be automated as I suggested above, that would be better.
>>>>>>>>>
>>>>>>>>> The job will never be re-run, meaning we'll be using the same
>>>>>>>>> container for many years.  If we ever need an update for this
>>>>>>>>> container, it will need to be done manually for each branch, except
>>>>>>>>> for main.
>>>>>>>>>
>>>>>>>>> Also, in the patch, the Cirrus CI is set to use the container from
>>>>>>>>> main, so this will need to be adjusted for each branch and will need
>>>>>>>>> to become part of the release patches (?).
>>>>>>>>>
>>>>>>>>> Another issue with the current implementation of containers is that
>>>>>>>>> it's a pain for an external contributor to change dependencies,
>>>>>>>>> because they'll have to modify the CI in multiple places in order to
>>>>>>>>> build and use their own containers.  And with these new changes it
>>>>>>>>> will become worse.  For example, I never use 'main' or 'branch-*'
>>>>>>>>> branches in my own forks.
>>>>>>>>> And CI depends on the branch name now, and it will fall back to an
>>>>>>>>> unmaintained old-style-named image.  It might not be easy to figure
>>>>>>>>> out why all your builds are failing, especially if you're new to the
>>>>>>>>> project.
>>>>>>>>
>>>>>>>> We could try to make it build the container image every time on
>>>>>>>> stable branches and avoid any manual steps.
>>>
>>> But the branch name will still be a problem.  Every time people use any
>>> branch name other than 'main' for their own development, they'll get an
>>> outdated container, unless they change the CI scripts.  Or am I missing
>>> something?
>>>
>>
>> Yeah, we'd have to make it build the container image every time on
>> non-default branches.  That's a problem because the Fedora container
>> takes so long.  So we'd have to separate the two types of container
>> builds.  Seems like a lot of trouble to go through indeed.
>>
>>>>>>>>>>> How much value do these containers actually have?
>>>>>>>>
>>>>>>>> I think you mentioned the benefits lower in the thread already:
>>>>>>>> - we can share the artifact (container image) with Cirrus CI; the
>>>>>>>>   aarch64 container build takes significantly longer
>>>>>>>> - we can share the artifact with downstream CIs
>>>>>>>>
>>>>>>>>>> As patch 2 mentions, it allows us to use different distro bases for
>>>>>>>>>> each version.
>>>>>>>>>
>>>>>>>>> This can be done by simply specifying 'container: name' in the job;
>>>>>>>>> GHA will fetch and use that container.
>>>>>>>>>
>>>>>>>>>> But I also think that having the container image cached allows for
>>>>>>>>>> quicker test runs since we don't have to reconstruct the
>>>>>>>>>> environment each time we perform test runs.
>>>>>>>>>
>>>>>>>>> It takes less than 2 minutes to create a full x86 environment.
>>>>>>>>> We spend more time re-building OVS and OVN for every variant of
>>>>>>>>> tests, even though most of them are using the same compiler flags.
>>>>>>>>>
>>>>>>>>> Anyways, there is no need to push the container into the registry; we
>>>>>>>>> can build it and pass it to the next job via artifacts.  I don't
>>>>>>>>> think spending an extra 1-2 minutes per workflow (not job, workflow)
>>>>>>>>> is a significant problem, if it allows us to just build whatever the
>>>>>>>>> current dependencies are and not think about what kind of outdated
>>>>>>>>> build from which branch is going to be used.
>>>>>>>>>
>>>>>>>>> The only downside of this is that we can't share the artifact with
>>>>>>>>> Cirrus CI or some downstream CIs, but that also doesn't seem to be a
>>>>>>>>> significant issue.  1-2 minutes per workflow is not that much.
>>>>>>>>> Building the aarch64 container takes a lot more time, but switching
>>>>>>>>> it from Fedora to Ubuntu brings the build down to only 10 minutes,
>>>>>>>>> so it might be reasonable to just rebuild it on demand as well.
>>>>>>>>
>>>>>>>> Ales, why do we need the Cirrus CI image to be Fedora based?
>>>>>>>
>>>>>>> We don't, but it's way easier for me to navigate.  I'm not even sure
>>>>>>> how/if the Ubuntu one works for aarch64.  We can switch that one;
>>>>>>> however, I don't see any clear benefit in doing so.  Also, one thing
>>>>>>> to note: even if it's 1-2
>>>>>>
>>>>>> Well, it takes way less time to build, that might be an argument for it.
>>>
>>> FWIW, Fedora takes an hour to build, Ubuntu only takes ~10 minutes.
>>>
>>>>>>
>>>>>>> minutes (it's probably slightly more), we are losing the ability to
>>>>>>> have reproducible builds.  It's very easy today to take the image and
>>>>>>> reproduce the same issue IMO.
>>>>>>
>>>>>> That's a good point, reproducible builds are important.
>>>>>> Storing the image as a GitHub CI artifact would still allow us to
>>>>>> download it and load it locally, but it's way easier if it's just
>>>>>> stored in the ghcr.io registry.
>>>
>>> How many times did you use this for debugging in the past year?
>>> If it's not a day-to-day churn, it might be worth adding an extra
>>> step of downloading an artifact.
>>>
>>
>> I'm not saying it's hard to use the artifact directly, just that it's
>> more convenient to get the image from the registry.  I don't use it that
>> much, but when I do need it, it's way easier for me to do a regular
>> "podman/docker run" pointing it to the registry than to figure out how
>> to download and load the image from the artifact.  Anyhow, it's a minor
>> annoyance and can be automated.
>>
>>>>>>>>> Not a strong opinion, but the branch name thing might be the most
>>>>>>>>> annoying, if I understand correctly.
>>>>>>>>
>>>>>>>> In the (very) short term, however, my problem is that I can't merge
>>>>>>>> any bug fix in OVN because I can't run any CI at the moment.  Mostly
>>>>>>>> that's because different versions of OVN (and OVS via submodule)
>>>>>>>> Python code trigger different flake8 warnings with different
>>>>>>>> flake8/Python/distro versions.
>>>>>>>>
>>>>>>>> A quick fix for that for now could be to bump the OVS submodule again
>>>>>>>> to fdbf0bb2aed5 ("flake8: Fix E721 check failures.") (which is what
>>>>>>>> patch 1/3 is doing) but on _all_ stable branches.  That's not ideal
>>>>>>>> because that commit (and its backports) are not part of any release
>>>>>>>> yet.
>>>>>>>>
>>>>>>>> I wanted to avoid doing that, but it seems we don't have an easy
>>>>>>>> alternative, so I'm starting to think we should just go for the OVS
>>>>>>>> submodule bump and try to find a better solution for the container
>>>>>>>> workflow in the longer term.
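[Editor's note: for reference, the quick fix described above, bumping the OVS submodule to a newer commit such as fdbf0bb2aed5 and recording the new pointer, boils down to a handful of git commands. The sketch below is self-contained so it can run anywhere: it builds scratch stand-ins for the OVS upstream and the OVN tree; in the real tree the submodule path is simply ovs/ and the target commit would come from the tip of the relevant OVS stable branch.]

```shell
set -e
work=$(mktemp -d)
# Commit with a fixed identity so the sketch works in a clean environment.
git_c() { git -c user.email=ci@example.com -c user.name=ci "$@"; }

# Stand-in for the OVS upstream repository, with an old and a new commit.
git init -q "$work/ovs-upstream"
git_c -C "$work/ovs-upstream" commit -q --allow-empty -m "old release"
old=$(git -C "$work/ovs-upstream" rev-parse HEAD)
git_c -C "$work/ovs-upstream" commit -q --allow-empty \
    -m "flake8: Fix E721 check failures."
fix=$(git -C "$work/ovs-upstream" rev-parse HEAD)

# Stand-in for the OVN tree, with the OVS submodule pinned at the old commit.
git init -q "$work/ovn"
cd "$work/ovn"
git_c commit -q --allow-empty -m "Initial commit."
git_c -c protocol.file.allow=always submodule --quiet add \
    "$work/ovs-upstream" ovs
git -C ovs checkout -q "$old"
git add ovs
git_c commit -qm "Pin OVS at the old release."

# The actual bump: fetch, check out the wanted commit, record the new pointer.
git -C ovs fetch -q origin
git -C ovs checkout -q "$fix"
git add ovs
git_c commit -qm "ovs: Bump submodule to include the flake8 fixes."
echo "submodule now at: $(git -C ovs rev-parse HEAD)"
```

[Pointed at the real repositories, this is roughly what patch 1/3 and its stable-branch backports do.]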
>>>>> Just in case I wasn't clear, what I meant here is applying patch 1/3 to
>>>>> main and its corresponding stable branch backports, bumping the
>>>>> submodule to the tips of OVS branch-3.2, branch-3.1 and branch-3.0.
>>>
>>> Bumping the submodule to the tips of OVS stable branches seems OK to me.
>>>
>>
>> OK.
>>
>>>> .. and we need to pin the Python version to 3.11 on branches <= 23.03.
>>>
>>> In general, patches 1 and 3 of this set are fine (I didn't test).
>>
>> OK, I'll prepare v3 and backports and post them soon.
>>
>>> The main problem, IMO, is the branch name dependency that makes
>>> day-to-day development harder.
>>>
>>
>> I think it's still useful to be able to pin container versions on stable
>> branches and make them re-usable for other types of (downstream) CI.
>>
>> OTOH, if we have a dockerfile we use for building these CI container
>> images, maybe other CI systems (e.g., the RedHat downstream CI) could
>> just use the dockerfile to periodically pre-build their own environment.
>>
>> Ales, what do you think?
>>
>
> It's possible, but also very easy for that dockerfile to fall behind.
> Currently we test both images.  If we switch to Ubuntu for Cirrus CI and
> disable automatic builds of Fedora, I'm afraid that no one will care and
> Fedora will be just unmaintained.
>
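[Editor's note: keeping the Fedora image exercised could be done with a scheduled workflow that runs the tests inside it. A hypothetical sketch follows; the image name, tag scheme, and the ci.sh entry point are illustrative assumptions, not the project's actual layout, and as noted earlier in the thread GitHub only triggers schedule events on the default branch, so stable branches would still rely on manual workflow_dispatch runs.]

```yaml
name: Periodic Fedora tests
on:
  schedule:
    - cron: '0 2 * * 0'    # weekly; only runs on the default branch
  workflow_dispatch: {}    # allow manual runs, e.g. on stable branches

jobs:
  fedora:
    runs-on: ubuntu-latest
    # Hypothetical image name and tag; adjust to the real registry layout.
    container: ghcr.io/ovn-org/ovn-tests:fedora-main
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: build and test
        run: ./ci.sh    # hypothetical entry point
```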
Could we run a set of periodic GitHub workflows with Fedora as the container
image?  Or does that become even more complex?

>>>>>>>> What do you guys think?
>>>>>>>
>>>>>>> I'm afraid that we will constantly chase those ghosts of different
>>>>>>> dependency problems for various branches.  And I'm not sure what would
>>>>>>> be the proper way to handle that.  The unfortunate thing is that we
>>>>>>> don't control OVS directly, so some of those "hacks" might always be
>>>>>>> necessary to have it working.
>>>>>>
>>>>>> Isn't that a different discussion?  Partially because OVN uses
>>>>>> "internal" OVS structures directly.
>>>>>>

> Thanks,
> Ales

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev