Lukasz Wojciechowski <l.wojciec...@partner.samsung.com> writes:

> On 17.07.2020 at 17:19, David Marchand wrote:
>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
>> <david.march...@redhat.com> wrote:
>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yi...@intel.com>
>>> wrote:
>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
>>>>> Hello Harry and guys who touched the service code recently :-)
>>>>>
>>>>> I spotted a failure for the service UT in Travis:
>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
>>>>>
>>>>> I found only a single instance of this failure and tried to reproduce
>>>>> it with my usual "brute" active loop with no success so far.
>>>> +1, I wasn't able to reproduce it in my environment but observed it in the
>>>> Travis CI.
>>>>
>>>>> Any chance it could be due to recent changes?
>>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-0cc47a31ce4e-231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcommit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
>>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-0cc47a31ce4e-d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcommit%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
>>> I can see more occurrences of the issue in the CI.
>>> I just applied the patch changing the log level for test assert, in
>>> the hope it will help.
>> And... we just got one with logs:
>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
>>
>> EAL: Test assert service_lcore_attr_get line 396 failed:
>> lcore_attr_get() didn't get correct loop count (zero)
>>
>> It looks like a race between the service core still running and the
>> core resetting the loops attr.
>>
> Yes, it seems to be just lack of patience of the test. It should wait a
> bit for lcore to stop before resetting attrs.
> Something like this should help:
> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
>
>         rte_service_lcore_stop(slcore_id);
>
> +       /* wait for the service lcore to stop */
> +       rte_delay_ms(200);
> +
>         TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
>                           "Valid lcore_attr_reset_all() didn't return success");

Would an rte_eal_wait_lcore() call make sense here? Overall, I really dislike sleeps because they can hide racy synchronization points.
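
To make the concern concrete, here is a rough sketch of "wait for an explicit stop acknowledgement with a bounded timeout" as opposed to a fixed delay. It is generic pthreads plus C11 atomics, not the DPDK service API: worker_ctx, worker_stop_and_wait() and demo_reset_after_stop() are hypothetical names made up for illustration, and whether rte_eal_wait_lcore() is the right primitive for a service lcore is still the open question above.

```c
/* Generic sketch only: NOT the DPDK service API. All names are hypothetical. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

struct worker_ctx {
    atomic_bool run;     /* controller clears this to request a stop */
    atomic_bool active;  /* worker clears this after its final iteration */
    atomic_ullong loops; /* loop counter, standing in for the "loops" attr */
};

static void *worker_main(void *arg)
{
    struct worker_ctx *ctx = arg;

    while (atomic_load(&ctx->run))
        atomic_fetch_add(&ctx->loops, 1);

    /* Publish "stopped" only after the last counter update. */
    atomic_store(&ctx->active, false);
    return NULL;
}

/* Request a stop, then poll until the worker confirms it has stopped.
 * Returns 0 on success, -1 if it is still active after ~timeout_ms. */
int worker_stop_and_wait(struct worker_ctx *ctx, unsigned int timeout_ms)
{
    const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };

    assert(ctx != NULL);
    atomic_store(&ctx->run, false);
    for (unsigned int i = 0; i < timeout_ms; i++) {
        if (!atomic_load(&ctx->active))
            return 0;
        nanosleep(&one_ms, NULL);
    }
    return atomic_load(&ctx->active) ? -1 : 0;
}

/* Start a worker, stop it, reset the counter, and return the value read
 * back afterwards. Returns -1 if the thread could not be started or the
 * stop was not acknowledged in time. */
long long demo_reset_after_stop(void)
{
    const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };
    struct worker_ctx ctx;
    pthread_t tid;
    int rc;

    atomic_init(&ctx.run, true);
    atomic_init(&ctx.active, true);
    atomic_init(&ctx.loops, 0);

    if (pthread_create(&tid, NULL, worker_main, &ctx) != 0)
        return -1;

    /* Let the worker complete at least one loop before stopping it. */
    while (atomic_load(&ctx.loops) == 0)
        nanosleep(&one_ms, NULL);

    rc = worker_stop_and_wait(&ctx, 1000);
    if (rc == 0)
        atomic_store(&ctx.loops, 0); /* worker no longer writes the counter */

    /* Always join so ctx is not destroyed under a running thread. */
    pthread_join(tid, NULL);
    return rc == 0 ? (long long)atomic_load(&ctx.loops) : -1;
}
```

The point of the sketch is that the reset happens only after the worker itself has signalled that it is done, so a slow CI machine makes the test wait longer instead of failing, and a worker that never stops shows up as a timeout instead of being papered over by the delay.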