Re: [vpp-dev] random failures in CI

Dave Wallace via lists.fd.io Mon, 08 Jun 2026 08:44:49 -0700

Hi Klement,

On 6/8/26 10:56 AM, Klement Sekera via lists.fd.io wrote:

Hi Dave,


I'd say this explains why improving CPU utilisation by test framework makes
it worse. If the runtime environment is unpredictable, bugs (I believe we
are seeing bugs) will get exposed more.

I'm not sure how I can help with anything concerning nomad or vexxhost - I
don't know what these names mean. I could guess, but I'm not going to. My
current email environment doesn't suggest a contact for Peter Mikus, so I'm
leaving that up to you.

Peter's email address is available via gerrit: https://gerrit.fd.io/r/q/owner:peter.mikus%2540icloud.com


What could help me move forward with exploring all this is the ability to
retrieve test artifacts from the test CI runs - how does one do that? I see
the logs, and they do mention gzipping artifacts, but I don't see a way of
getting a hold of them.

Test artifacts are uploaded to AWS S3 -- for any given workflow, you will find the link to the artifacts in the "AWS Publish Logs" or "AWS Publish Artifacts" steps (same link is in both steps). For example, in the test failure for 46002, this is the URL for the failing debian12 workflow:


https://github.com/FDio/vpp/actions/runs/27020468660/job/79747148480

Expanding the 'AWS Publish Logs' step (scroll down below the 'VPP Make Test' output) you will find this URL to the artifacts in AWS S3:


https://logs.fd.io/vex-yul-rot-jenkins-1/gha-vpp-gerrit-patchset/verify-master-2026_06_05_142213_UTC-gerrit-46002-3/verify-maketest-release-builder-debian12-prod-x86_64/
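
If the step output has been saved to a local text file, the link can also be pulled out with a few lines of Python. This is only a minimal sketch: the function name find_artifact_urls and the wording of the sample log line are hypothetical, and only the logs.fd.io URL itself comes from the run above.

```python
import re

# Matches any logs.fd.io link; the surrounding log line format is an assumption.
LOGS_URL = re.compile(r"https://logs\.fd\.io/\S+")


def find_artifact_urls(step_log: str) -> list[str]:
    """Return the distinct logs.fd.io URLs found in the text, in first-seen order."""
    urls = []
    for match in LOGS_URL.findall(step_log):
        url = match.rstrip(".,)'\"")  # drop punctuation that may trail the link
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Hypothetical log excerpt; only the URL is taken from the failing run.
    sample = (
        "uploaded to https://logs.fd.io/vex-yul-rot-jenkins-1/"
        "gha-vpp-gerrit-patchset/"
        "verify-master-2026_06_05_142213_UTC-gerrit-46002-3/"
        "verify-maketest-release-builder-debian12-prod-x86_64/\n"
    )
    for url in find_artifact_urls(sample):
        print(url)
```

Since the same link appears in both publish steps, the de-duplication keeps the result to a single entry when both step outputs are concatenated.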


It might also be useful to merge https://gerrit.fd.io/r/c/vpp/+/46002 so
that we get coredump visibility under systemd.

I'm in the process of deploying the fix for the RETRIES env var for multi-worker OS tests (i.e. debian12) and will rebase this gerrit change once the fix is available in the production GHA CI.


Thanks,
-daw-


Thanks,
Klement

On Sat, Jun 6, 2026 at 3:51 AM Dave Wallace via lists.fd.io<dwallacelf= 
[email protected]> wrote:

Hi Klement,

The docker containers are orchestrated on a Nomad cluster that is located in
Vexxhost along with the CSIT testbeds.  The containers are not pinned to
cpus due to the fact that the Jenkins Nomad plugin did not allow it (thus
this is historical). Since the CI was recently migrated to github actions
with remote runners running on the same Nomad cluster, we might be able to
do cpu pinning.  Please reach out to Peter Mikus for his input on how this
might be added as he is the author of the Nomad GHA dispatcher and primary
maintainer of the Nomad cluster.
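
To illustrate what pinning could look like, here is a minimal sketch that splits a host's CPU ids into disjoint ranges, one per container, in the string form accepted by Docker's --cpuset-cpus option. The function name cpuset_ranges and the sizes used are hypothetical, and whether the Nomad GHA dispatcher can pass such a setting through is an open question.

```python
def cpuset_ranges(total_cpus: int, cpus_per_container: int) -> list[str]:
    """Split CPU ids 0..total_cpus-1 into disjoint, contiguous ranges.

    Each returned string (e.g. "0-3") is one container's share.
    CPUs that do not fill a whole share are left unassigned.
    """
    if cpus_per_container <= 0:
        raise ValueError("cpus_per_container must be positive")
    ranges = []
    last_start = total_cpus - cpus_per_container
    for first in range(0, last_start + 1, cpus_per_container):
        last = first + cpus_per_container - 1
        ranges.append(f"{first}-{last}" if last > first else str(first))
    return ranges


if __name__ == "__main__":
    # Hypothetical sizes: an 8-cpu host with 4 cpus per container.
    for cpuset in cpuset_ranges(8, 4):
        print(f"docker run --cpuset-cpus={cpuset} ...")
```

With disjoint ranges, no two containers share a CPU, which is the property that the 'noisy neighbor' suspicion is about.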

Note that most of the issues arise on the debian12 container, because that
is the only instance where make test is run with multi-worker enabled on
VPP.

I noticed that there were still failures with the stack of gerrit changes
that you submitted. Thus for now, to unblock the CI, I submitted a patch
with tag_fixme_debian12 for the failing testcases (46009).  In order to
verify the fix for this issue, the tag should be removed from all testcases
that are skipped using it.

Thanks for your efforts to fix this issue.
-daw-

On 6/5/26 1:02 AM, Klement Sekera via lists.fd.io wrote:

Hi,

Ok that sheds some light. Are you saying that in CI there is a box which runs N
dockers at once and inside each of these there is a make test job? Are these
dockers CPU pinned? If not, then that would be my first suspect - more than
one docker on the same physical CPU with the test framework doing more pinning
than before could make it more prone to timing issues.

When tuning the patches I was sometimes hitting reproducible failures in CI and
I think I saw one of the logs mentioning a CPU frequency of 0.3GHz which I found
dubious, while at the same time it showed 128 available CPUs. I couldn’t
understand why it uses TEST_JOBS=4. But if the CI runs N of these at once, then 
that would make some sense. Maybe we could try sprinting in serial instead of 
walking in parallel to see if patterns emerge?

Thanks,
Klement


On 5 Jun 2026, at 03:59, Dave Wallace via 
lists.fd.io<[email protected]> 
<[email protected]> wrote:

Klement,

The failures are not random, they are intermittent.  A number of tests are 
being skipped (e.g. @tag_fixme_debian12), because they have repeatedly failed 
in an intermittent and un-reproducible pattern in the CI on non-related 
patchsets.

Previous investigations failed to reproduce the issue when run over 1000s of 
iterations on individual servers (both bare-metal and inside the docker 
executor containers used in the CI).  I have long suspected that there are 
'noisy neighbor' cpu pinning issues when a large number of docker containers 
running verify jobs are packed onto a single nomad client.

For the past several months, the number of non-related intermittent job
failures has been very low since the 'usual suspects' were elided from running
on debian12 where the majority of said failures had been occurring.  For
whatever reason, the latest 'make test' changes have exacerbated the issue.

All of the '@tag_fixme_*' testcases which are elided from per-patch testing 
represent technical debt which has been neglected for a very long time.  Any 
help you can provide to address this technical debt is most appreciated.

Thanks,
-daw-


On 6/4/26 13:43, Klement Sekera via lists.fd.io wrote:
Hey,

Could also be this one: https://gerrit.fd.io/r/c/vpp/+/45918/5

These patches don’t really change test behavior, only scheduling. Before 45918, 
with TEST_JOBS > 1, the pipeline would get underutilized “randomly”, due to 
scheduling at most one test class per finished test class. So if a 1-cpu class 
followed 4-cpu class, then 3 cpus would sit idle. With this patch, the pipeline is 
refilled properly.
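
A minimal sketch of that difference, assuming a simplified model in which each test class only declares how many cpus it needs (the function refill and its parameters are hypothetical, not the framework's actual code):

```python
from collections import deque


def refill(free_cpus, pending, one_per_finish):
    """Start pending test classes after some class has finished.

    free_cpus      -- cpus available right now
    pending        -- deque of cpu requirements, in queue order
    one_per_finish -- True models the old behavior (at most one new class
                      per finished class); False keeps starting classes
                      while the head of the queue still fits
    Returns (started, idle_cpus).
    """
    started = []
    while pending and pending[0] <= free_cpus:
        need = pending.popleft()
        free_cpus -= need
        started.append(need)
        if one_per_finish:
            break
    return started, free_cpus


if __name__ == "__main__":
    # A 4-cpu class just finished; four 1-cpu classes are waiting.
    print(refill(4, deque([1, 1, 1, 1]), one_per_finish=True))   # ([1], 3)
    print(refill(4, deque([1, 1, 1, 1]), one_per_finish=False))  # ([1, 1, 1, 1], 0)
```

In the first call only one class is started and 3 cpus stay idle; in the second the freed cpus are all put back to work.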

I also noticed that any extra cpus (like for vcl tests) were unaccounted for - 
that’s what the later patch fixes.

If the failures are “random”, then it means the tests are flaky and either need
to be fixed or marked for solo run as a temporary(?!) measure. I’d bet $.25 that
these were fake-solo-run before due to pipeline underutilization.

Regards,
Klement


On 4 Jun 2026, at 19:30, Dave Wallace via 
lists.fd.io<[email protected]> 
<[email protected]> wrote:

Ole/Klement,

Can you please help triage these new intermittent / non-patch related test 
failures?

The frequency of intermittent / non-patch related test failures has spiked in
the CI ever since Ole merged the batch of Klement's test updates in gerrit.

Here's some more that I encountered on my CI monitoring gerrit change [0]:
https://github.com/FDio/vpp/actions/runs/26828951273/job/79105790972
https://github.com/FDio/vpp/actions/runs/26916331530/job/79436593001

Thanks,
-daw-
[0] https://gerrit.fd.io/r/c/vpp/+/45941

On 6/4/26 11:58, Matus Fabian -X (matfabia - PANTHEON TECHNOLOGIES@Cisco) via 
lists.fd.io wrote:
Hi,

Today I noticed excess random failures, not related to patch, of make test in 
CI across different jobs on a couple of patches.
Some examples:
https://github.com/FDio/vpp/actions/runs/26952741597/job/79522258003
https://github.com/FDio/vpp/actions/runs/26935962088/job/79468040688
https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773670
https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773732

Regards,
Matus
