Re: [vpp-dev] random failures in CI

Klement Sekera via lists.fd.io Mon, 08 Jun 2026 09:22:42 -0700

Hi Dave,

thanks for the information - that's helpful. When I have some spare cycles,
I'll take a look at the notorious CI offenders.


I've CC-ed Peter and hope you guys can take it from here.

Thanks,
Klement

On Mon, Jun 8, 2026 at 5:44 PM Dave Wallace via lists.fd.io <dwallacelf=
[email protected]> wrote:

> Hi Klement,
>
> On 6/8/26 10:56 AM, Klement Sekera via lists.fd.io wrote:
>
> Hi Dave,
>
> I'd say this explains why improving CPU utilisation by the test framework
> makes things worse. If the runtime environment is unpredictable, bugs (I
> believe we are seeing bugs) will get exposed more.
>
> I'm not sure how I can help with anything concerning nomad or vexxhost - I
> don't know what these names mean. I could guess, but I'm not going to. My
> current email environment doesn't suggest a contact for Peter Mikus, so I'm
> leaving that up to you.
>
> Peter's email address is available via gerrit:
> https://gerrit.fd.io/r/q/owner:peter.mikus%2540icloud.com
>
> What could help me move forward with exploring all this is the ability to
> retrieve test artifacts from the test CI runs - how does one do that? I see
> the logs, and they do mention gzipping artifacts, but I don't see a way of
> getting a hold of them.
>
> Test artifacts are uploaded to AWS S3 -- for any given workflow, you will
> find the link to the artifacts in the "AWS Publish Logs" or "AWS Publish
> Artifacts" steps (same link is in both steps).  For example, in the test
> failure for 46002, this is the URL for the failing debian12 workflow:
>
> https://github.com/FDio/vpp/actions/runs/27020468660/job/79747148480
>
> Expanding the 'AWS Publish Logs' step (scroll down below the 'VPP Make
> Test' output) you will find this URL to the artifacts in AWS S3:
>
>
> https://logs.fd.io/vex-yul-rot-jenkins-1/gha-vpp-gerrit-patchset/verify-master-2026_06_05_142213_UTC-gerrit-46002-3/verify-maketest-release-builder-debian12-prod-x86_64/
>
> It might also be useful to merge https://gerrit.fd.io/r/c/vpp/+/46002 so
> that we get coredump visibility under systemd.
>
> I'm in the process of deploying the fix for the RETRIES env var for
> multi-worker OS tests (i.e. debian12) and will rebase this gerrit change
> once the fix is available in the production GHA CI.
>
> Thanks,
> -daw-
>
> Thanks,
> Klement
>
> On Sat, Jun 6, 2026 at 3:51 AM Dave Wallace via lists.fd.io <dwallacelf=
> [email protected]> <[email protected]> wrote:
>
>
> Hi Klement,
>
> The docker containers are orchestrated on a Nomad cluster that is located in
> Vexxhost along with the CSIT testbeds.  The containers are not pinned to
> cpus because the Jenkins Nomad plugin did not allow it (thus this is
> historical). Since the CI was recently migrated to github actions with
> remote runners running on the same Nomad cluster, we might be able to do
> cpu pinning.  Please reach out to Peter Mikus for his input on how this
> might be added, as he is the author of the Nomad GHA dispatcher and primary
> maintainer of the Nomad cluster.
>
> Note that most of the issues arise on the debian12 container, because that
> is the only instance where make test is run with multi-worker enabled on
> VPP.
>
> I noticed that there were still failures with the stack of gerrit changes
> that you submitted. Thus for now, to unblock the CI, I submitted a patch
> with tag_fixme_debian12 for the failing testcases (46009).  To verify a
> fix for this issue, the tag should be removed from all testcases that are
> skipped using it.
>
> Thanks for your efforts to fix this issue.
> -daw-
>
> On 6/5/26 1:02 AM, Klement Sekera via lists.fd.io wrote:
>
> Hi,
>
> Ok that sheds some light. Are you saying that in CI there is a box which runs
> N dockers at once and inside each of these there is a make test job? Are
> these dockers CPU pinned? If not, then that would be my first suspect -
> more than one docker on the same physical CPU with the test framework doing
> more pinning than before could make it more prone to timing issues.
>
> When tuning the patches I was sometimes hitting reproducible failures in CI 
> and I think I saw one of the logs mentioning CPU frequency of 0.3GHz which I
> found dubious, while at the same time it showed 128 available CPUs. I 
> couldn’t understand why it uses TEST_JOBS=4. But if the CI runs N of these at 
> once, then that would make some sense. Maybe we could try sprinting in serial 
> instead of walking in parallel to see if patterns emerge?
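
One way to check from inside a container whether the process is restricted to a subset of CPUs, as a rough sketch. It uses os.sched_getaffinity, which is Linux-only and reflects affinity masks and cpusets but not CPU time quotas, so an "unpinned" result does not rule out throttling or noisy neighbours. The function name cpu_visibility is made up for illustration.

```python
import os


def cpu_visibility():
    """Return (logical CPUs reported by the OS, CPUs this process may run on)."""
    total = os.cpu_count()
    # Linux-only: the set of CPU ids the current process is allowed to use.
    allowed = len(os.sched_getaffinity(0))
    return total, allowed


total, allowed = cpu_visibility()
print(f"cpu_count={total} affinity={allowed} restricted={allowed < total}")
```

If both numbers are equal (for example 128 and 128), the container sees every CPU on the host and competes for them with whatever else runs there.
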
>
> Thanks,
> Klement
>
>
> On 5 Jun 2026, at 03:59, Dave Wallace via lists.fd.io 
> <[email protected]> <[email protected]> 
> <[email protected]> <[email protected]> wrote:
>
> Klement,
>
> The failures are not random, they are intermittent.  A number of tests are 
> being skipped (e.g. @tag_fixme_debian12), because they have repeatedly failed 
> in an intermittent and un-reproducible pattern in the CI on non-related 
> patchsets.
>
> Previous investigations failed to reproduce the issue when run over 1000s of 
> iterations on individual servers (both bare-metal and inside the docker 
> executor containers used in the CI).  I have long suspected that there are 
> 'noisy neighbor' cpu pinning issues when a large number of docker containers 
> running verify jobs are packed onto a single nomad client.
>
> For the past several months, the number of non-related intermittent job 
> failures has been very low since the 'usual suspects' were elided from 
> running on debian 12 where the majority of said failures had been occurring.  
> For whatever reason, the latest 'make test' changes have exacerbated the 
> issue.
>
> All of the '@tag_fixme_*' testcases which are elided from per-patch testing 
> represent technical debt which has been neglected for a very long time.  Any 
> help you can provide to address this technical debt is most appreciated.
>
> Thanks,
> -daw-
>
>
> On 6/4/26 13:43, Klement Sekera via lists.fd.io wrote:
> Hey,
>
> Could also be this one: 
> https://gerrit.fd.io/r/c/vpp/+/45918/5
>
> These patches don’t really change test behavior, only scheduling. Before 
> 45918, with TEST_JOBS > 1, the pipeline would get underutilized “randomly”, 
> due to scheduling at most one test class per finished test class. So if a 
> 1-cpu class followed 4-cpu class, then 3 cpus would sit idle. With this 
> patch, the pipeline is refilled properly.
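
A minimal sketch of the scheduling difference described above, not the actual test framework code. It assumes a strict FIFO queue of test classes, each given as a hypothetical (cpus, duration) pair, and that every class fits on the machine; simulate and its parameters are illustrative names only.

```python
import heapq


def simulate(classes, total_cpus, refill_fully):
    """Return (makespan, idle_cpu_time) for a FIFO list of (cpus, duration).

    refill_fully=False: start at most one queued class per finished class.
    refill_fully=True:  after each finish, start queued classes while they fit.
    """
    queue = list(classes)
    running = []  # min-heap of (finish_time, cpus)
    free = total_cpus
    now = 0.0
    busy = 0.0  # cpu-time actually used by test classes

    def start_next():
        # Start the head of the queue if it fits into the free CPUs.
        nonlocal free, busy
        if queue and queue[0][0] <= free:
            cpus, duration = queue.pop(0)
            free -= cpus
            busy += cpus * duration
            heapq.heappush(running, (now + duration, cpus))
            return True
        return False

    # Initial fill of the pipeline (same for both policies).
    while start_next():
        pass

    while running:
        now, cpus = heapq.heappop(running)
        free += cpus
        if refill_fully:
            while start_next():
                pass
        else:
            start_next()

    return now, total_cpus * now - busy


# One 4-cpu class followed by four 1-cpu classes on a 4-cpu budget.
classes = [(4, 10), (1, 10), (1, 10), (1, 10), (1, 10)]
print("one per finish:", simulate(classes, 4, refill_fully=False))
print("full refill:   ", simulate(classes, 4, refill_fully=True))
```

In this toy case the one-per-finish policy runs the 1-cpu classes back to back with 3 cpus idle, which is effectively a solo run, while the refill policy runs them side by side.
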
>
> I also noticed that any extra cpus (like for vcl tests) were unaccounted for 
> - that’s what the later patch fixes.
>
> If the failures are “random”, then it means the tests are flaky and either 
> need to be fixed or marked for solo run as a temporary(?!) measure. I’d bet $.25 
> that these were fake-solo-run before due to pipeline underutilization.
>
> Regards,
> Klement
>
>
> On 4 Jun 2026, at 19:30, Dave Wallace via lists.fd.io 
> <[email protected]> <[email protected]> 
> <[email protected]> <[email protected]> wrote:
>
> Ole/Klement,
>
> Can you please help triage these new intermittent / non-patch related test 
> failures?
>
> The frequency of intermittent / non-patch related test failures has spiked in
> the CI ever since Ole merged the batch of Klement's test updates in gerrit.
>
> Here's some more that I encountered on my CI monitoring gerrit change [0]:
> https://github.com/FDio/vpp/actions/runs/26828951273/job/79105790972
> https://github.com/FDio/vpp/actions/runs/26916331530/job/79436593001
>
> Thanks,
> -daw-
> [0] https://gerrit.fd.io/r/c/vpp/+/45941
>
> On 6/4/26 11:58, Matus Fabian -X (matfabia - PANTHEON TECHNOLOGIES@Cisco) via 
> lists.fd.io wrote:
> Hi,
>
> Today I noticed excess random failures, not related to patch, of make test in 
> CI across different jobs on a couple of patches.
> Some examples:
> https://github.com/FDio/vpp/actions/runs/26952741597/job/79522258003
> https://github.com/FDio/vpp/actions/runs/26935962088/job/79468040688
> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773670
> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773732
>
> Regards,
> Matus
>

View/Reply Online (#27051): https://lists.fd.io/g/vpp-dev/message/27051
