<snip>
> >>> Subject: Re: [dpdk-dev] Random failure in service_autotest
> >>>
> >>> Lukasz Wojciechowski <[email protected]> writes:
> >>>
> >>>> On 17.07.2020 at 17:19, David Marchand wrote:
> >>>>> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> >>>>> <[email protected]> wrote:
> >>>>>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit
> >>>>>> <[email protected]> wrote:
> >>>>>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> >>>>>>>> Hello Harry and guys who touched the service code recently :-)
> >>>>>>>>
> >>>>>>>> I spotted a failure for the service UT in Travis:
> >>>>>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >>>>>>>>
> >>>>>>>> I found only a single instance of this failure and tried to
> >>>>>>>> reproduce it with my usual "brute" active loop with no success so far.
> >>>>>>>
> >>>>>>> +1, I wasn't able to reproduce it in my environment either, but I observed
> >>>>>>> it in the Travis CI.
> >>>>>>>
> >>>>>>>> Any chance it could be due to recent changes?
> >>>>>>>> https://git.dpdk.org/dpdk/commit/?id=f3c256b621262e581d3edcca383df83875ab7ebe
> >>>>>>>> https://git.dpdk.org/dpdk/commit/?id=048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> >>>>>>
> >>>>>> I can see more occurrences of the issue in the CI.
> >>>>>> I just applied the patch changing the log level for test asserts,
> >>>>>> in the hope it will help.
> >>>>>
> >>>>> And... we just got one with logs:
> >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> >>>>>
> >>>>> EAL: Test assert service_lcore_attr_get line 396 failed:
> >>>>> lcore_attr_get() didn't get correct loop count (zero)
> >>>>>
> >>>>> It looks like a race between the service core still running and
> >>>>> the core resetting the loops attr.
> >>>>>
> >>>> Yes, it seems to be just a lack of patience in the test. It should
> >>>> wait a bit for the lcore to stop before resetting the attrs.
> >>>> Something like this should help:
> >>>> @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> >>>>
> >>>>         rte_service_lcore_stop(slcore_id);
> >>>>
> >>>> +       /* wait for the service lcore to stop */
> >>>> +       rte_delay_ms(200);
> >>>> +
> >>>>         TEST_ASSERT_EQUAL(0, rte_service_lcore_attr_reset_all(slcore_id),
> >>>>                         "Valid lcore_attr_reset_all() didn't return success");
> >>>
> >>> Would an rte_eal_wait_lcore make sense? Overall, I really dislike
> >>> sleeps because they can hide racy synchronization points.
> >>
> >> The rte_service_lcore_stop() operation changes the status to RUNSTATE_STOPPED.
> >> However, it does not terminate the service_run() procedure (there is
> >> a spinlock in the service_run() MT_UNSAFE path).
> >> So 'cs->loops' might still increase after calling
> >> rte_service_lcore_attr_reset_all(), which leads to this failure.
> >> I think if we move the loop counter update before the service_run()
> >> call, it can avoid this conflict.
> >>
> >> I cannot reproduce this issue on my testbed, so I am not sure whether
> >> something like this helps or not.
> >> Please check the code below.
> >>
> >> diff --git a/lib/librte_eal/common/rte_service.c
> >> b/lib/librte_eal/common/rte_service.c
> >> index 6a0e0ff..7b703dd 100644
> >> --- a/lib/librte_eal/common/rte_service.c
> >> +++ b/lib/librte_eal/common/rte_service.c
> >> @@ -464,6 +464,7 @@ service_runner_func(void *arg)
> >>  	while (__atomic_load_n(&cs->runstate, __ATOMIC_ACQUIRE) ==
> >>  			RUNSTATE_RUNNING) {
> >>  		const uint64_t service_mask = cs->service_mask;
> >> +		cs->loops++;
> >>
> >>  		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
> >>  			if (!service_valid(i))
> >> @@ -471,8 +472,6 @@ service_runner_func(void *arg)
> >>  			/* return value ignored as no change to code flow */
> >>  			service_run(i, cs, service_mask, service_get(i), 1);
> >>  		}
> >> -
> >> -		cs->loops++;
> >>  	}
> >>
> >>  	return 0;
> >
> > Changing how the implementation counts loops is one option - but changing
> > the implementation as a workaround for a unit test seems like the wrong way around.
>
> I agree. We should fix the test, not the service implementation. Of course it
> would be nice to do so without inserting sleeps, as they are only a workaround
> for true synchronization.

I think the service shutdown sequence in the test case is not correct. We cannot
call 'rte_service_lcore_stop' without first shutting down the service using
'rte_service_runstate_set' and 'rte_service_component_runstate_set'. The
'rte_service_lcore_stop' API has checks to validate that the service is not
currently running (among other things). In fact, the call to
'rte_service_lcore_stop' is returning -EBUSY in the test case today, and we are
not checking the return status.
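As a rough illustration of that last point, the test could at least assert on the
return value it currently drops. This is only a sketch of a fragment for
service_lcore_attr_get() in app/test/test_service_cores.c, reusing the existing
'slcore_id' variable and TEST_ASSERT_EQUAL macro; the exact assert message is
illustrative:

	/* sketch: surface the status that the test currently ignores */
	int ret = rte_service_lcore_stop(slcore_id);
	TEST_ASSERT_EQUAL(0, ret,
			"rte_service_lcore_stop() returned %d (likely -EBUSY: service still running)",
			ret);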
If I understand the intent of the test case correctly, the sequence of the calls
needs to be:

rte_service_runstate_set(id, 0)
rte_service_component_runstate_set(id, 0);
rte_service_may_be_active - loop till the service is not active
rte_service_lcore_stop(slcore_id);

(A rough test-code sketch of this sequence follows at the end of this message.)

> >
> > Honnappa's suggestion in the other reply seems to not work here, as the
> > "service_may_be_active" won't get updated if the service core is stopped
> > after running its final iteration..?

This is also a problem. It needs to be fixed by setting 'service_active_on_lcore'
immediately after the service completes running in 'service_run' (rather than
waiting for the next iteration of service_run).

> >
> > I'd prefer to see the implementation actually improve, or a workaround in
> > the test code itself.
> > For improving the implementation, I can suggest this patch:
> > http://patches.dpdk.org/patch/74493/
> >
> > That leaves us with 3 options;
> > 1) Add a sleep, allowing the service core to finish the delay() it runs as
> > part of its service, and give it some grace time to finish.
>
> I'm not a fan of sleeps, but the test code is already full of them and this
> would be the easiest fix.
>
> > 2) Relax the check in the unit test, to allow for value <= (expected + 1),
> > which would ignore the last cs->loops++ call.
> > (Another way to think about option 2 is to allow for the non-determinism
> > of the threading in the test.)
>
> I don't like this one.
>
> > 3) Use the linked patch to implement/add/fix a wait time in the stop() API.
>
> The patch adds new functionality to a public API function. I don't know if
> that is a good step, and I also don't like the solution itself. I think that
> hard-coding a 1000 * 1 ms = 1 s time for the service to finish is also bad, as
> we would be defining a static number in the code.
>
> My proposal is to keep the thread_active flag and add a new function for
> checking it. This way the final user (in our case the test app) would be able
> to check that flag in a loop by itself. The final user who delegated the
> service to run on an lcore knows what type of service it is and how long it
> should wait for the service to stop.
>
> Or we can provide a synchronous function using condition variables to know
> when the service has stopped, but that would also require changing the
> behavior of the API.
>
> > Strong opinions either way?
> >
> > Regards, -Harry
>
> --
> Lukasz Wojciechowski
> Principal Software Engineer
>
> Samsung R&D Institute Poland
> Samsung Electronics
> Office +48 22 377 88 25
> [email protected]
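For illustration, the shutdown sequence suggested above could look roughly like
this inside service_lcore_attr_get(). This is only a sketch: 'id' stands for
whatever service id variable the test already uses, 'slcore_id' is the existing
service lcore id, and the 100 x 1 ms polling bound is an arbitrary assumption,
not part of the proposal:

	/* Sketch only: disable the service before stopping the lcore, and poll
	 * rte_service_may_be_active() instead of sleeping for a fixed time.
	 */
	TEST_ASSERT_EQUAL(0, rte_service_runstate_set(id, 0),
			"Disabling the service failed");
	TEST_ASSERT_EQUAL(0, rte_service_component_runstate_set(id, 0),
			"Disabling the service component failed");

	/* bounded wait for the in-flight service_run() invocation to drain */
	int i;
	for (i = 0; i < 100 && rte_service_may_be_active(id); i++)
		rte_delay_ms(1);
	TEST_ASSERT_EQUAL(0, rte_service_may_be_active(id),
			"Service still reported active after 100ms");

	TEST_ASSERT_EQUAL(0, rte_service_lcore_stop(slcore_id),
			"Stopping the service lcore failed");

Note that, as discussed above, this polling only becomes reliable once
'service_active_on_lcore' is updated right after the service finishes running in
service_run(), rather than on the next loop iteration.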

