Hi Dave, thanks for the information - that's helpful. When I have some spare cycles, I'll take a look at the notorious CI offenders.
I've CC-ed Peter and hope you guys can take it from here.

Thanks,
Klement

On Mon, Jun 8, 2026 at 5:44 PM Dave Wallace via lists.fd.io wrote:

> Hi Klement,
>
> On 6/8/26 10:56 AM, Klement Sekera via lists.fd.io wrote:
>
> Hi Dave,
>
> I'd say this explains why improving CPU utilisation by test framework makes
> it worse. If the runtime environment is unpredictable, bugs (I believe we
> are seeing bugs) will get exposed more.
>
> I'm not sure how I can help with anything concerning nomad or vexxhost - I
> don't know what these names mean. I could guess, but I'm not going to. My
> current email environment doesn't suggest a contact for Peter Mikus, so I'm
> leaving that up to you.
>
> Peter's email address is available via gerrit:
> https://gerrit.fd.io/r/q/owner:peter.mikus%2540icloud.com
>
> What could help me move forward with exploring all this is the ability to
> retrieve test artifacts from the test CI runs - how does one do that? I see
> the logs, and they do mention gzipping artifacts, but I don't see a way of
> getting a hold of them.
>
> Test artifacts are uploaded to AWS S3 -- for any given workflow, you will
> find the link to the artifacts in the "AWS Publish Logs" or "AWS Publish
> Artifacts" steps (same link is in both steps). For example, in the test
> failure for 46002, this is the URL for the failing debian12 workflow:
>
> https://github.com/FDio/vpp/actions/runs/27020468660/job/79747148480
>
> Expanding the 'AWS Publish Logs' step (scroll down below the 'VPP Make
> Test' output) you will find this URL to the artifacts in AWS S3:
>
> https://logs.fd.io/vex-yul-rot-jenkins-1/gha-vpp-gerrit-patchset/verify-master-2026_06_05_142213_UTC-gerrit-46002-3/verify-maketest-release-builder-debian12-prod-x86_64/
>
> It might also be useful to merge https://gerrit.fd.io/r/c/vpp/+/46002 so
> that we get coredump visibility under systemd.
>
> I'm in the process of deploying the fix for the RETRIES env var for
> multi-worker OS tests (i.e. debian12) and will rebase this gerrit change
> once the fix is available in the production GHA CI.
>
> Thanks,
> -daw-
>
> Thanks,
> Klement
>
> On Sat, Jun 6, 2026 at 3:51 AM Dave Wallace via lists.fd.io wrote:
>
> Hi Klement,
>
> The docker containers are orchestrated on a Nomad cluster that is located
> in Vexxhost along with the CSIT testbeds. The containers are not pinned to
> cpus due to the fact that the Jenkins Nomad plugin did not allow it (thus
> this is historical). Since the CI was recently migrated to github actions
> with remote runners running on the same Nomad cluster, we might be able to
> do cpu pinning. Please reach out to Peter Mikus for his input on how this
> might be added, as he is the author of the Nomad GHA dispatcher and primary
> maintainer of the Nomad cluster.
>
> Note that most of the issues arise on the debian12 container, because that
> is the only instance where make test is run with multi-worker enabled on
> VPP.
>
> I noticed that there were still failures with the stack of gerrit changes
> that you submitted. Thus for now, to unblock the CI, I submitted a patch
> with tag_fixme_debian12 for the failing testcases (46009). In order to
> verify the fix for this issue, the tag should be removed from all testcases
> that are skipped using it.
>
> Thanks for your efforts to fix this issue.
> -daw-
>
> On 6/5/26 1:02 AM, Klement Sekera via lists.fd.io wrote:
>
> Hi,
>
> Ok that sheds some light. Are you saying that in CI there is a box which
> runs N dockers at once and inside each of these there is a make test job?
> Are these dockers CPU pinned? If not, then that would be my first suspect -
> more than one docker on the same physical CPU with the test framework doing
> more pinning than before could make it more prone to timing issues.
>
> When tuning the patches I was sometimes hitting reproducible failures in CI
> and I think I saw one of the logs mentioning a CPU frequency of 0.3GHz,
> which I found dubious, while at the same time it showed 128 available CPUs.
> I couldn’t understand why it uses TEST_JOBS=4. But if the CI runs N of these
> at once, then that would make some sense. Maybe we could try sprinting in
> serial instead of walking in parallel to see if patterns emerge?
>
> Thanks,
> Klement
>
> On 5 Jun 2026, at 03:59, Dave Wallace via lists.fd.io wrote:
>
> Klement,
>
> The failures are not random, they are intermittent. A number of tests are
> being skipped (e.g. @tag_fixme_debian12), because they have repeatedly
> failed in an intermittent and un-reproducible pattern in the CI on
> non-related patchsets.
>
> Previous investigations failed to reproduce the issue when run over 1000s
> of iterations on individual servers (both bare-metal and inside the docker
> executor containers used in the CI). I have long suspected that there are
> 'noisy neighbor' cpu pinning issues when a large number of docker containers
> running verify jobs are packed onto a single nomad client.
>
> For the past several months, the number of non-related intermittent job
> failures has been very low since the 'usual suspects' were elided from
> running on debian 12 where the majority of said failures had been occurring.
> For whatever reason, the latest 'make test' changes have exacerbated the
> issue.
>
> All of the '@tag_fixme_*' testcases which are elided from per-patch testing
> represent technical debt which has been neglected for a very long time. Any
> help you can provide to address this technical debt is most appreciated.
>
> Thanks,
> -daw-
>
> On 6/4/26 13:43, Klement Sekera via lists.fd.io wrote:
>
> Hey,
>
> Could also be this one:
> https://gerrit.fd.io/r/c/vpp/+/45918/5
>
> These patches don’t really change test behavior, only scheduling. Before
> 45918, with TEST_JOBS > 1, the pipeline would get underutilized “randomly”,
> due to scheduling at most one test class per finished test class. So if a
> 1-cpu class followed a 4-cpu class, then 3 cpus would sit idle. With this
> patch, the pipeline is refilled properly.
>
> I also noticed that any extra cpus (like for vcl tests) were unaccounted
> for - that’s what the later patch fixes.
>
> If the failures are “random”, then it means the tests are flaky and either
> need to be fixed or marked for solo run as a temporary(?!) measure. I’d bet
> $.25 that these were fake-solo-run before due to pipeline underutilization.
>
> Regards,
> Klement
>
> On 4 Jun 2026, at 19:30, Dave Wallace via lists.fd.io wrote:
>
> Ole/Klement,
>
> Can you please help triage these new intermittent / non-patch related test
> failures?
>
> The frequency of intermittent / non-patch related test failures has spiked
> in the CI ever since Ole merged the batch of Klement's test updates in
> gerrit.
>
> Here's some more that I encountered on my CI monitoring gerrit change [0]:
> https://github.com/FDio/vpp/actions/runs/26828951273/job/79105790972
> https://github.com/FDio/vpp/actions/runs/26916331530/job/79436593001
>
> Thanks,
> -daw-
>
> [0] https://gerrit.fd.io/r/c/vpp/+/45941
>
> On 6/4/26 11:58, Matus Fabian -X (matfabia - PANTHEON TECHNOLOGIES@Cisco)
> via lists.fd.io wrote:
>
> Hi,
>
> Today I noticed excess random failures, not related to patch, of make test
> in CI across different jobs on a couple of patches.
> Some examples:
> https://github.com/FDio/vpp/actions/runs/26952741597/job/79522258003
> https://github.com/FDio/vpp/actions/runs/26935962088/job/79468040688
> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773670
> https://github.com/FDio/vpp/actions/runs/26937415597/job/79470773732
>
> Regards,
> Matus
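
The scheduling difference Klement describes in the thread (before 45918, at most one test class was started per finished test class; afterwards the pipeline is refilled) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the VPP test framework's code: the function name `simulate`, the strict FIFO queue, the initial fill, and the unit durations are all hypothetical choices made for the example.

```python
import heapq
from collections import deque


def simulate(classes, total_cpus, refill):
    """Toy model of running test classes on a fixed pool of CPUs.

    classes    -- list of (cpus_needed, duration), started in FIFO order
                  (assumption: no reordering to fit smaller classes first)
    total_cpus -- size of the CPU pool
    refill     -- False: start at most one class per finished class
                  True:  after a finish, start classes while they fit
    Returns (makespan, idle_cpu_time).
    """
    queue = deque(classes)
    running = []              # min-heap of (finish_time, cpus_held)
    free = total_cpus
    now = 0
    idle_cpu_time = 0
    budget = len(queue)       # assumption: the initial fill is unrestricted

    while queue or running:
        # Start queued classes while the head fits and the budget allows.
        started = 0
        while queue and started < budget and queue[0][0] <= free:
            cpus, duration = queue.popleft()
            free -= cpus
            heapq.heappush(running, (now + duration, cpus))
            started += 1

        if not running:
            raise ValueError("a test class needs more CPUs than the pool has")

        # Advance to the next finish and account for CPUs that sat idle.
        finish, cpus = heapq.heappop(running)
        idle_cpu_time += free * (finish - now)
        now = finish
        free += cpus
        budget = len(queue) if refill else 1

    return now, idle_cpu_time


# One 4-cpu class followed by four 1-cpu classes, all of duration 1.
example = [(4, 1), (1, 1), (1, 1), (1, 1), (1, 1)]
one_per_finish = simulate(example, total_cpus=4, refill=False)
refilled = simulate(example, total_cpus=4, refill=True)
print("one start per finish:", one_per_finish)
print("refill while it fits:", refilled)
```

In this toy model the one-start-per-finish policy leaves 3 CPUs idle after the 4-cpu class completes, which matches the example given in the thread. It also runs the 1-cpu classes one at a time, which is the kind of accidental solo run the thread speculates about.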
View/Reply Online (#27051): https://lists.fd.io/g/vpp-dev/message/27051
