On 11/16/23 10:17, Dumitru Ceara wrote:
> On 11/16/23 10:02, Dumitru Ceara wrote:
>> On 11/16/23 09:51, Dumitru Ceara wrote:
>>> On 11/16/23 09:44, Ales Musil wrote:
>>>> On Thu, Nov 16, 2023 at 9:34 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>
>>>>> On 11/16/23 00:12, Ilya Maximets wrote:
>>>>>> On 11/15/23 22:20, Mark Michelson wrote:
>>>>>>> On 11/15/23 12:02, Ilya Maximets wrote:
>>>>>>>> On 11/15/23 17:00, Dumitru Ceara wrote:
>>>>>>>>> All of the above were changed to track the latest available releases.
>>>>>>>>> Initially that seemed like a good idea but in practice, a new release would
>>>>>>>>> potentially (silently) cause CI runs that used to pass on given stable
>>>>>>>>> versions to unexpectedly start failing.
>>>>>>>>>
>>>>>>>>> To address that, this series pins all versions we can control, allowing us
>>>>>>>>> to use different values for different branches.
>>>>>>>>>
>>>>>>>>> NOTE: the last commit of this series will look slightly different on
>>>>>>>>> stable branches, e.g., on branches <= 23.06 we need Python <= 3.11 and
>>>>>>>>> Fedora 37.
>>>>>>>>>
>>>>>>>>> The first commit of the series bumps the OVS submodule to include the
>>>>>>>>> latest fixes to errors reported by recent versions of flake8.
>>>>>>>>>
>>>>>>>>> Changes in v2:
>>>>>>>>> - added first patch to bump OVS submodule
>>>>>>>>> - addressed Ales' review comments:
>>>>>>>>>    - moved the logic to determine which image to use out of ci.sh;
>>>>>>>>>      it's now in the workflow itself.
>>>>>>>>>    - moved setting of OVN_IMAGE in the same block with all related
>>>>>>>>>      env vars
>>>>>>>>> - added a note for maintainers in the release process documentation
>>>>>>>>>    to mention that the new workflow will likely have to be triggered
>>>>>>>>>    manually (at least once) when branching for a new release.  GitHub
>>>>>>>>>    doesn't allow periodic jobs on non-default branches.
>>>>>>>>
>>>>>>>> IMHO, that sounds horrible to maintain.  Can we just build things per run?
>>>>>>>
>>>>>>> What if we set up the "Containers" action to run on branch creation?
>>>>>>> Would that adequately substitute for the manual run that creates the image?
>>>>>>>
>>>>>>> If we only have to run the action once each time a branch is created, I
>>>>>>> don't think that qualifies as "horrible" to maintain. This could be
>>>>>>> lumped into our branch creation scripts. However, if it can be automated
>>>>>>> as I suggested above, that would be better.
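
For reference, a branch-creation trigger would be something along these
lines in the workflow (untested sketch; the file name and trigger layout
are just for illustration):

  # .github/workflows/containers.yml
  on:
    create:             # fires when a branch or tag is created
    workflow_dispatch:  # keep the manual trigger as a fallback
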
>>>>>>
>>>>>> The job will never be re-run, meaning we'll be using the same container
>>>>>> for many years.  If the container ever needs an update, it will have to
>>>>>> be done manually for each branch, except for main.
>>>>>>
>>>>>> Also, in the patch, Cirrus CI is set to use the container from main,
>>>>>> so this will need to be adjusted for each branch and will presumably
>>>>>> have to become part of the release patches (?).
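
For context, this is the kind of hardcoded reference I mean; a rough
sketch of the Cirrus config (the task and image names may differ from the
actual ones):

  # .cirrus.yml
  test_task:
    arm_container:
      image: ghcr.io/ovn-org/ovn-tests:main   # pinned to main
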
>>>>>>
>>>>>> Another issue with the current implementation of containers is that
>>>>>> it's a pain for an external contributor to change dependencies, because
>>>>>> they'll have to modify the CI in multiple places in order to build and
>>>>>> use their own containers.
>>>>>> And with these new changes it will become worse.  For example, I never
>>>>>> use 'main' or 'branch-*' branches in my own forks.  CI depends on the
>>>>>> branch name now, and it will fall back to an unmaintained old-style-named
>>>>>> image.  It might not be easy to figure out why all your builds are
>>>>>> failing, especially if you're new to the project.
>>>>>>
>>>>>
>>>>> We could try to make it build the container image every time on stable
>>>>> branches and avoid any manual steps.

But the branch name will still be a problem.  Every time people use any
branch name other than 'main' for their own development, they'll get an
outdated container, unless they change the CI scripts.  Or am I missing
something?
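
To make the failure mode concrete, the image selection is roughly of this
shape (illustrative sketch only, not the actual script; OWNER stands for
the repository owner):

  case "$GITHUB_REF_NAME" in
      main|branch-*)
          OVN_IMAGE="ghcr.io/$OWNER/ovn-tests:$GITHUB_REF_NAME" ;;
      *)
          # any other branch name falls back to the old-style image
          OVN_IMAGE="ghcr.io/$OWNER/ovn-tests:latest" ;;
  esac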

>>>>>
>>>>>>>
>>>>>>>> How much value do these containers actually have?
>>>>>
>>>>> I think you mentioned the benefits below already:
>>>>> - we can share the artifact (container image) with Cirrus CI; the
>>>>> aarch64 container build takes significantly longer
>>>>> - we can share the artifact with downstream CIs
>>>>>
>>>>>>>
>>>>>>> As patch 2 mentions, it allows us to use different distro bases for
>>>>>>> each version.
>>>>>>
>>>>>> This can be done by simply specifying 'container: name' in the job;
>>>>>> GHA will fetch and use that container.
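
I.e., something like this at the job level (the image name below is just
illustrative):

  jobs:
    build:
      runs-on: ubuntu-latest
      container: ghcr.io/ovn-org/ovn-tests:fedora-37
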
>>>>>>
>>>>>>> But I also think that having the container image cached allows for
>>>>>>> quicker test runs, since we don't have to reconstruct the environment
>>>>>>> each time we run the tests.
>>>>>>
>>>>>> It takes less than 2 minutes to create a full x86 environment.  We spend
>>>>>> more time re-building OVS and OVN for every variant of tests, even though
>>>>>> most of them are using the same compiler flags.
>>>>>>
>>>>>> Anyways, there is no need to push the container into a registry; we
>>>>>> can build it and pass it to the next job via artifacts.  I don't think
>>>>>> spending an extra 1-2 minutes per workflow (not job, workflow) is a
>>>>>> significant problem if it allows us to just build whatever the current
>>>>>> dependencies are and not think about what kind of outdated build from
>>>>>> which branch is going to be used.
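
Roughly like this (untested sketch; the action versions, job names and
build context are just for illustration):

  jobs:
    container:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v3
        - run: |
            docker build -t ovn-tests .
            docker save ovn-tests -o ovn-tests.tar
        - uses: actions/upload-artifact@v3
          with:
            name: ovn-tests-image
            path: ovn-tests.tar

    tests:
      needs: container
      runs-on: ubuntu-latest
      steps:
        - uses: actions/download-artifact@v3
          with:
            name: ovn-tests-image
        - run: docker load -i ovn-tests.tar
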
>>>>>>
>>>>>> The only downside of this is that we can't share the artifact with
>>>>>> Cirrus CI or some downstream CIs, but that doesn't seem to be a
>>>>>> significant issue either.  1-2 minutes per workflow is not that much.
>>>>>> Building the aarch64 container takes a lot more time, but switching it
>>>>>> from Fedora to Ubuntu brings the build down to about 10 minutes, so it
>>>>>> might be reasonable to just rebuild it on demand as well.
>>>>>>
>>>>>
>>>>> Ales, why do we need the Cirrus CI image to be Fedora-based?
>>>>>
>>>>
>>>> We don't, but it's way easier for me to navigate.  I'm not even sure
>>>> how/if the Ubuntu one works for aarch64.  We can switch that one, though
>>>> I don't see any clear benefit in doing so.  Also, one thing to note: even
>>>> if it's 1-2
>>>
>>> Well, it takes way less time to build; that might be an argument for it.

FWIW, the Fedora image takes an hour to build; the Ubuntu one only takes
~10 minutes.

>>>
>>>> minutes (it's probably slightly more), we are losing the ability to have
>>>> reproducible builds.  Today it's very easy to take the image and reproduce
>>>> the same issue, IMO.
>>>>
>>>
>>> That's a good point; reproducible builds are important.  Storing the
>>> image as a GitHub CI artifact would still allow us to download it and
>>> load it locally, but it's way easier if it's just stored in the ghcr.io
>>> registry.

How many times did you use this for debugging in the past year?
If it's not a day-to-day thing, it might be worth adding the extra
step of downloading an artifact.
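
E.g., with the GitHub CLI, something like (artifact name illustrative):

  gh run download <run-id> -n ovn-tests-image
  docker load -i ovn-tests.tar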

>>>
>>>>
>>>>>
>>>>>> Not a strong opinion, but the branch name thing might be the most
>>>>>> annoying part, if I understand correctly.
>>>>>>
>>>>>
>>>>> In the (very) short term, however, my problem is that I can't merge any
>>>>> bug fix in OVN because I can't run any CI at the moment.  That's mostly
>>>>> because different versions of the OVN (and, via the submodule, OVS)
>>>>> Python code trigger different flake8 warnings with different
>>>>> flake8/Python/distro versions.
>>>>>
>>>>> A quick fix for now could be to bump the OVS submodule again to
>>>>> fdbf0bb2aed5 ("flake8: Fix E721 check failures.") (which is what patch
>>>>> 1/3 is doing), but on _all_ stable branches.  That's not ideal because
>>>>> that commit (and its backports) are not part of any release yet.
>>>>>
>>>>> I wanted to avoid doing that, but it seems we don't have an easy
>>>>> alternative, so I'm starting to think we should just go for the OVS
>>>>> submodule bump and try to find a better solution for the container
>>>>> workflow in the longer term.
>>>>>
>>
>> Just in case I wasn't clear, what I meant here is applying patch 1/3 to
>> main, plus its corresponding stable branch backports bumping the submodule
>> to the tips of OVS branch-3.2, branch-3.1 and branch-3.0.

Bumping the submodule to the tips of the OVS stable branches seems OK to me.
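
Roughly, on each branch, something along these lines (sketch):

  cd ovs
  git fetch origin
  git checkout origin/branch-3.2   # tip of the matching OVS stable branch
  cd ..
  git add ovs
  git commit -s -m "ovs: Bump submodule to the tip of branch-3.2."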

>>
> 
> .. and we need to pin the Python version to 3.11 on branches <=23.03.
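
That part should be as simple as something like this in the workflow
(sketch):

  - uses: actions/setup-python@v4
    with:
      python-version: '3.11'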

In general, patches 1 and 3 of this set are fine (I didn't test).
The main problem, IMO, is the branch name dependency that makes
day-to-day development harder.

> 
>>>>> What do you guys think?
>>>>>
>>>>
>>>> I'm afraid that we will constantly be chasing these ghosts of different
>>>> dependency problems on various branches, and I'm not sure what the proper
>>>> way to handle that would be.  The unfortunate thing is that we don't
>>>> directly control OVS, so some of those "hacks" might always be necessary
>>>> to keep things working.
>>>>
>>>
>>> Isn't that a different discussion?  Partially because OVN uses
>>> "internal" OVS structures directly.
>>>
> 
