"Van Haaren, Harry" <harry.van.haa...@intel.com> writes:

>> -----Original Message-----
>> From: Van Haaren, Harry
>> Sent: Tuesday, November 26, 2019 1:19 PM
>> To: Aaron Conole <acon...@redhat.com>; Thomas Monjalon <tho...@monjalon.net>
>
> <snip>
>
>> > EAL: Test assert service_lcore_en_dis_able line 487 failed: Ex-service
>> core
>> > function call had no effect.
>> >
>> > So I'll spend some time in this area, it seems.
>> 
>> 
>> The below diff makes it 100% reproducible here, failing every time.
>> 
>> It seems like the main thread is returning, before the service thread has
>> returned.
>> 
>> The rte_eal_mp_wait_lcore() call seems to not wait on the service-core,
>> which allows
>> the main thread to read the "service_remote_launch_flag" value as 0 (before
>> the service-thread writes it to 1).
>> 
>> Adding the delay between the service launch and service write being
>> performed makes this issue much much more likely to occur - so the above
>> description I have confidence in.
>> 
>> What I'm not clear on (yet) is why the eal_mp_wait_lcore() isn't waiting...
>> 
>> -H
>> 
>> 
>> diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
>> index 9fe38f5e0..846ad00d1 100644
>> --- a/app/test/test_service_cores.c
>> +++ b/app/test/test_service_cores.c
>> @@ -445,6 +445,7 @@ static int
>>  service_remote_launch_func(void *arg)
>>  {
>>         RTE_SET_USED(arg);
>> +       rte_delay_ms(100);
>>         service_remote_launch_flag = 1;
>>         return 0;
>>  }
>
> Diff below seems to fix the problem here; Aaron would you test the below fix 
> in your setup for a while too?
> I have a loop running here attempting to reproduce - but before 100% failures 
> and so far 100% passes with the added wait_lcore() call.
>
>
> diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
> index 9fe38f5e0..62ffedb19 100644
> --- a/app/test/test_service_cores.c
> +++ b/app/test/test_service_cores.c
> @@ -445,6 +445,7 @@ static int
>  service_remote_launch_func(void *arg)
>  {
>         RTE_SET_USED(arg);
> +       rte_delay_ms(100);
>         service_remote_launch_flag = 1;
>         return 0;
>  }
> @@ -483,6 +484,7 @@ service_lcore_en_dis_able(void)
>         int ret = rte_eal_remote_launch(service_remote_launch_func, NULL,
>                                         slcore_id);
>         TEST_ASSERT_EQUAL(0, ret, "Ex-service core remote launch failed.");
> +       rte_eal_wait_lcore(slcore_id);
>         rte_eal_mp_wait_lcore();

Ahh, I see.  Actually, this brings up a question - is the intent for
mp_wait_lcore to cycle through the service cores as well?  Because IIUC,
the issue will be the lcore will be set to ROLE_RTE normally, but
service cores will do: ROLE_SERVICE and then the wait cannot work.

If the idea is that mp_wait_lcore should work (and looking at the test,
it seems like it is the intent?) then it will need to cycle through
service cores, too.  If the intent is that it shouldn't, then we should
remove those calls from the test application to prevent developer from
misunderstanding.

Either way, the documentation for `rte_service_lcore_start` is a bit too
ambiguous and needs to reflect whether the mp_wait_lcore should work.  I
think either it should (which means updating rte_get_next_lcore to
include ROLE_SERVICE), or none of the lcore functions should work, and
we should have an rte_service...() equivalent that should be used.

>         TEST_ASSERT_EQUAL(1, service_remote_launch_flag,
>                         "Ex-service core function call had no effect.");

Reply via email to