<snip>
> >>> Subject: Re: [dpdk-dev] Random failure in service_autotest
> >>>
> >>> Lukasz Wojciechowski <[email protected]> writes:
> >>>
> >>>> On 17.07.2020 at 17:19, David Marchand wrote:
> >>>>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> >>>>> <[email protected]> wrote:
> >>>>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit
> >>>>>> <[email protected]> wrote:
> >>>>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> >>>>>>>> Hello Harry and guys who touched the service code recently :-)
> >>>>>>>>
> >>>>>>>> I spotted a failure for the service UT in Travis:
> >>>>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >>>>>>>>
> >>>>>>>> I found only a single instance of this failure and tried to
> >>>>>>>> reproduce it with my usual "brute" active loop with no success so far.
> >>>>>>>
> >>>>>>> +1, I wasn't able to reproduce it in my environment either, but I observed
> >>>>>>> it in the Travis CI.
> >>>>>>>
> >>>>>>>> Any chance it could be due to recent changes?
> >>>>>>>> https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
> >>>>>>>> https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> >>>>>>
> >>>>>> I can see more occurrences of the issue in the CI.
> >>>>>> I just applied the patch changing the log level for test asserts,
> >>>>>> in the hope it will help.
> >>>>>
> >>>>> And... we just got one with logs:
> >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> >>>>>
> >>>>> EAL: Test assert service_lcore_attr_get line 396 failed:
> >>>>> lcore_attr_get() didn't get correct loop count (zero)
> >>>>>
> >>>>> It looks like a race between the service core still running and
> >>>>> the core resetting the loops attr.
> >>>>>
> >>>> Yes, it seems to be just a lack of patience in the test. It should
> >>>> wait a bit for the lcore to stop before resetting the attrs.
> >>>> Something like this should help:
> >>>> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> >>>>
> >>>>         rte_service_lcore_stop(slcore_id);
> >>>>
> >>>> +       /* wait for the service lcore to stop */
> >>>> +       rte_delay_ms(200);
> >>>> +
> >>>>         TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
> >>>>                         "Valid lcore_attr_reset_all() didn't return success");
> >>>
> >>> Would an rte_eal_wait_lcore make sense? Overall, I really dislike
> >>> sleeps because they can hide racy synchronization points.
> >>
> >> The rte_service_lcore_stop() operation changes the status to RUNSTATE_STOPPED.
> >> However, it does not terminate the service_run() procedure (there is
> >> a spinlock in the service_run() MT_UNSAFE path).
> >> So 'cs->loops' might still increase after calling
> >> rte_service_lcore_attr_reset_all(), which leads to this failure.
> >> I think if we move the loop counter update before the service_run()
> >> call, it can avoid this conflict.
> >>
> >> I cannot reproduce this issue on my testbed, so I am not sure whether
> >> something like this helps or not.
> >> Please check the code below.
> >>
> >> diff --git a/lib/librte_eal/common/rte_service.c
> >> b/lib/librte_eal/common/rte_service.c
> >> index 6a0e0ff..7b703dd 100644
> >> --- a/lib/librte_eal/common/rte_service.c
> >> +++ b/lib/librte_eal/common/rte_service.c
> >> @@ -464,6 +464,7 @@ service_runner_func(void *arg)
> >>  	while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
> >>  			RUNSTATE_RUNNING) {
> >>  		const uint64_t service_mask = cs->service_mask;
> >> +		cs->loops++;
> >>
> >>  		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> >>  			if (!service_valid(i))
> >> @@ -471,8 +472,6 @@ service_runner_func(void *arg)
> >>  			/* return value ignored as no change to code flow */
> >>  			service_run(i, cs, service_mask, service_get(i), 1);
> >>  		}
> >> -
> >> -		cs->loops++;
> >>  	}
> >>
> >>  	return 0;
> >
> > Changing how the implementation counts loops is one option - but changing
> > the implementation as a workaround for a unit test seems like the wrong way around.
>
> I agree. We should fix the test, not the service implementation. Of course it
> would be nice to do so without inserting sleeps, as they are only a workaround
> for true synchronization.

I think the service shutdown sequence in the test case is not correct. We cannot
call 'rte_service_lcore_stop' without first shutting down the service using
'rte_service_runstate_set' and 'rte_service_component_runstate_set'. The
'rte_service_lcore_stop' API has checks to validate that the service is not
currently running (among other things). In fact, the call to
'rte_service_lcore_stop' is returning -EBUSY in the test case today, and we are
not checking the return status.
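As a rough illustration of that last point, the test could at least assert on the
return value it currently drops. This is only a sketch of a fragment for
service_lcore_attr_get() in app/test/test_service_cores.c, reusing the existing
'slcore_id' variable and TEST_ASSERT_EQUAL macro; the exact assert message is
illustrative:

	/* sketch: surface the status that the test currently ignores */
	int ret = rte_service_lcore_stop(slcore_id);
	TEST_ASSERT_EQUAL(0, ret,
			"rte_service_lcore_stop() returned %d (likely -EBUSY: service still running)",
			ret);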
If I understand the intent of the test case correctly, the sequence of the calls
needs to be:

rte_service_runstate_set(id, 0)
rte_service_component_runstate_set(id, 0);
rte_service_may_be_active - loop till the service is not active
rte_service_lcore_stop(slcore_id);

(A rough test-code sketch of this sequence follows at the end of this message.)

> >
> > Honnappa's suggestion in the other reply seems to not work here, as the
> > "service_may_be_active" won't get updated if the service core is stopped
> > after running its final iteration..?

This is also a problem. It needs to be fixed by setting 'service_active_on_lcore'
immediately after the service completes running in 'service_run' (rather than
waiting for the next iteration of service_run).

> >
> > I'd prefer to see the implementation actually improve, or a workaround in
> > the test code itself.
> > For improving the implementation, I can suggest this patch:
> > http://patches.dpdk.org/patch/74493/
> >
> > That leaves us with 3 options;
> > 1) Add a sleep, allowing the service core to finish the delay() it runs as
> > part of its service, and give it some grace time to finish.
>
> I'm not a fan of sleeps, but the test code is already full of them and this
> would be the easiest fix.
>
> > 2) Relax the check in the unit test, to allow for value <= (expected + 1),
> > which would ignore the last cs->loops++ call.
> > (Another way to think about option 2 is to allow for the non-determinism
> > of the threading in the test.)
>
> I don't like this one.
>
> > 3) Use the linked patch to implement/add/fix a wait time in the stop() API.
>
> The patch adds new functionality to a public API function. I don't know if
> that is a good step, and I also don't like the solution itself. I think that
> hard-coding a 1000 * 1 ms = 1 s time for the service to finish is also bad, as
> we would be defining a static number in the code.
>
> My proposal is to keep the thread_active flag and add a new function for
> checking it. This way the final user (in our case the test app) would be able
> to check that flag in a loop by itself. The final user who delegated the
> service to run on an lcore knows what type of service it is and how long it
> should wait for the service to stop.
>
> Or we can provide a synchronous function using condition variables to know
> when the service has stopped, but that would also require changing the
> behavior of the API.
>
> > Strong opinions either way?
> >
> > Regards, -Harry
>
> --
> Lukasz Wojciechowski
> Principal Software Engineer
>
> Samsung R&D Institute Poland
> Samsung Electronics
> Office +48 22 377 88 25
> [email protected]
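For illustration, the shutdown sequence suggested above could look roughly like
this inside service_lcore_attr_get(). This is only a sketch: 'id' stands for
whatever service id variable the test already uses, 'slcore_id' is the existing
service lcore id, and the 100 x 1 ms polling bound is an arbitrary assumption,
not part of the proposal:

	/* Sketch only: disable the service before stopping the lcore, and poll
	 * rte_service_may_be_active() instead of sleeping for a fixed time.
	 */
	TEST_ASSERT_EQUAL(0, rte_service_runstate_set(id, 0),
			"Disabling the service failed");
	TEST_ASSERT_EQUAL(0, rte_service_component_runstate_set(id, 0),
			"Disabling the service component failed");

	/* bounded wait for the in-flight service_run() invocation to drain */
	int i;
	for (i = 0; i < 100 && rte_service_may_be_active(id); i++)
		rte_delay_ms(1);
	TEST_ASSERT_EQUAL(0, rte_service_may_be_active(id),
			"Service still reported active after 100ms");

	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
			"Stopping the service lcore failed");

Note that, as discussed above, this polling only becomes reliable once
'service_active_on_lcore' is updated right after the service finishes running in
service_run(), rather than on the next loop iteration.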

