Aaron Conole <acon...@redhat.com> writes:

> "Van Haaren, Harry" <harry.van.haa...@intel.com> writes:
>
>>> -----Original Message-----
>>> From: Aaron Conole [mailto:acon...@redhat.com]
>>> Sent: Wednesday, September 4, 2019 8:56 PM
>>> To: David Marchand <david.march...@redhat.com>
>>> Cc: Van Haaren, Harry <harry.van.haa...@intel.com>; dev@dpdk.org
>>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest failing
>>>
>>> David Marchand <david.march...@redhat.com> writes:
>>>
>>> > On Wed, Sep 4, 2019 at 12:04 PM David Marchand
>>> > <david.march...@redhat.com> wrote:
>>> >>
>>> >> On Wed, Sep 4, 2019 at 11:42 AM Van Haaren, Harry
>>> >> <harry.van.haa...@intel.com> wrote:
>>> >> >
>>> >> > > -----Original Message-----
>>> >> > > From: Aaron Conole [mailto:acon...@redhat.com]
>>> >> > > Sent: Tuesday, September 3, 2019 3:46 PM
>>> >> > > To: Van Haaren, Harry <harry.van.haa...@intel.com>
>>> >> > > Cc: dev@dpdk.org
>>> >> > > Subject: [BUG] service_lcore_en_dis_able from service_autotest failing
>>> >> > >
>>> >> > > Hi Harry,
>>> >> >
>>> >> > Hey Aaron,
>>> >> >
>>> >> > > I noticed as part of series_6218
>>> >> > > (http://patches.dpdk.org/project/dpdk/list/?series=6218) that the travis
>>> >> > > build had a single failure, in service_autotest but it doesn't seem
>>> >> > > related to the series at all.
>>> >> > >
>>> >> > > https://travis-ci.com/ovsrobot/dpdk/jobs/230358460
>>> >> > >
>>> >> > > Not sure if there's some kind of debugging we can add or look at to help
>>> >> > > diagnose failures when they occur.  Do you have time to have a look?
>>> >> >
>>> >> > Thanks for flagging this.
>>> >> >
>>> >> > I've just re-run the unit tests here multiple times to see if I can
>>> >> > reproduce something strange, no luck on reproducing the issue.
>>> >> >
>>> >> > Attempted with clang-6 and clang-7 (travis error on clang-7),
>>> >> > still no issues found.
>>> >> >
>>> >> > Building with Clang-7 and Shared libs (instead of default static)
>>> >> > still no issues found.
>>> >> >
>>> >> > If somebody can reproduce please send an update to here and I'll
>>> >> > attempt to replicate that setup. Right now I can't reproduce the issue.
>>> >>
>>> >> You have to be patient, but I caught it on my laptop:
>>> >>
>>> >
>>> > Ok, and now with the logs:
>>> >
>>> >
>>> > # time (log=/tmp/$$.log; while true; do echo service_autotest |taskset
>>> > -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 -l 0-1 >>$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f
>>> > $log)
>>> > EAL: Detected lcore 0 as core 0 on socket 0
>>> > EAL: Detected lcore 1 as core 1 on socket 0
>>> > EAL: Detected lcore 2 as core 2 on socket 0
>> <snip verbose logs>
>>> > EAL: probe driver: 8086:15d7 net_e1000_em
>>> > EAL: Not managed by a supported kernel driver, skipped
>>> > EAL: Module /sys/module/vfio not found! error 2 (No such file or
>>> > directory)
>>> > APP: HPET is not enabled, using TSC as default timer
>>> > RTE>>service_autotest
>>> > + ------------------------------------------------------- +
>>> > + Test Suite : service core test suite
>>> > + ------------------------------------------------------- +
>>> > + TestCase [ 0] : unregister_all succeeded
>>> > + TestCase [ 1] : service_name succeeded
>>> > + TestCase [ 2] : service_get_by_name succeeded
>>> > Service dummy_service Summary
>>> > dummy_service: stats 1 calls 0 cycles 0 avg: 0
>>> > Service dummy_service Summary
>>> > dummy_service: stats 0 calls 0 cycles 0 avg: 0
>>> > + TestCase [ 3] : service_dump succeeded
>>> > + TestCase [ 4] : service_attr_get succeeded
>>> > + TestCase [ 5] : service_lcore_attr_get succeeded
>>> > + TestCase [ 6] : service_probe_capability succeeded
>>> > + TestCase [ 7] : service_start_stop succeeded
>>> > + TestCase [ 8] : service_lcore_add_del skipped
>>> > + TestCase [ 9] : service_lcore_start_stop succeeded
>>> > EAL: Test assert service_lcore_en_dis_able line 488 failed: Ex-service
>>> > core function call had no effect.
>>> > + TestCase [10] : service_lcore_en_dis_able failed
>>> > + TestCase [11] : service_mt_unsafe_poll skipped
>>> > + TestCase [12] : service_mt_safe_poll skipped
>>> > + TestCase [13] : service_app_lcore_mt_safe succeeded
>>> > + TestCase [14] : service_app_lcore_mt_unsafe succeeded
>>> > + TestCase [15] : service_may_be_active succeeded
>>> > + ------------------------------------------------------- +
>>> > + Test Suite Summary
>>> > + Tests Total :       16
>>> > + Tests Skipped :      3
>>> > + Tests Executed :    16
>>> > + Tests Unsupported:   0
>>> > + Tests Passed :      12
>>> > + Tests Failed :       1
>>> > + ------------------------------------------------------- +
>>> > Test Failed
>>> > RTE>>EAL: request: mp_malloc_sync
>>> > EAL: Heap on socket 0 was shrunk by 2MB
>>> >
>>> > real    2m42.884s
>>> > user    5m1.902s
>>> > sys     0m2.208s
>>>
>>> I can confirm - takes about 1m to fail.
>>
>>
>> Hi Aaron and David,
>>
>> I've been attempting to reproduce this, still no errors here.
>>
>> Given the nature of service-cores, and the difficulty to reproduce
>> here this feels like a race-condition - one that may not exist in all
>> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
>>
>> I'm using Meson to build, so reproducing using this instead of the command
>> as provided above. There should be no difference in reproducing due to this:
>
> The command runs far more iterations than meson does (I think).
>
> I still see it periodically occur in the travis environment.
>
> I did see at least one missing memory barrier (I believe).  Please
> review the following code change (and if you agree I can submit it
> formally):
>
> -----
> --- a/lib/librte_eal/common/eal_common_launch.c
> +++ b/lib/librte_eal/common/eal_common_launch.c
> @@ -21,8 +21,10 @@
>  int
>  rte_eal_wait_lcore(unsigned slave_id)
>  {
> -	if (lcore_config[slave_id].state == WAIT)
> +	if (lcore_config[slave_id].state == WAIT) {
> +		rte_rmb();
>  		return 0;
> +	}
>
>  	while (lcore_config[slave_id].state != WAIT &&
>  	       lcore_config[slave_id].state != FINISHED)
> -----
>
> This is because in lib/librte_eal/linux/eal/eal_thread.c:
>
> -----
> 	/* when a service core returns, it should go directly to WAIT
> 	 * state, because the application will not lcore_wait() for it.
> 	 */
> 	if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> 		lcore_config[lcore_id].state = WAIT;
> 	else
> 		lcore_config[lcore_id].state = FINISHED;
> -----
>
> NOTE that the service core skips the rte_eal_wait_lcore() code from
> making the FINISHED->WAIT transition.  So I think at least that read
> barrier will be needed (maybe I miss the pairing, though?).
>
> Additionally, I'm wondering if there is an additional write or sync
> barrier needed to ensure that some of the transitions are properly
> recorded when using lcore as a service lcore function.  The fact that
> this only happens occasionally tells me that it's either a race (which
> is possible... because the variable update in the test might not be
> sync'd across cores or something), or some other missing
> synchronization.
>
>> $ meson test service_autotest --repeat 50
>>
>> 1/1 DPDK:fast-tests / service_autotest OK    3.86 s
>> 1/1 DPDK:fast-tests / service_autotest OK    3.87 s
>> ...
>> 1/1 DPDK:fast-tests / service_autotest OK    3.84 s
>>
>> OK:       50
>> FAIL:      0
>> SKIP:      0
>> TIMEOUT:   0
>>
>> I'll keep it running for a few hours but I have little faith if it only
>> takes 1 minute on your machines...
>
> Please try the flat command.
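One more thought on the pairing question above: if the slave exit path quoted
from eal_thread.c really has no write barrier before the state store (the
snippet may simply omit one that already exists in the file), the counterpart
to the rte_rmb() could look roughly like the following.  This is only an
untested sketch against the quoted snippet, not a reviewed patch:

-----
	/* hypothetical pairing for the rte_rmb() in rte_eal_wait_lcore():
	 * make the service function's stores visible before publishing
	 * the WAIT state that the early-return path tests
	 */
	rte_wmb();

	/* when a service core returns, it should go directly to WAIT
	 * state, because the application will not lcore_wait() for it.
	 */
	if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
		lcore_config[lcore_id].state = WAIT;
	else
		lcore_config[lcore_id].state = FINISHED;
-----

Without something like that, a master taking the early return can, at least
in principle, observe state == WAIT before the service function's other
stores are visible.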
Not sure if you've had any time to look at this.  I think there's a
change we can make, but I'm not sure how it fits in the overall service
lcore design.

The proposal is to use a pthread_cond variable to block the thread that
requested the service function to run.  The service function merely
sets the condition.  The requesting thread does a timed wait (up to
5s?) and reports an error if the timeout is exceeded.  Otherwise, it
unblocks and can assume that the test passes.  A rough sketch follows
below.

WDYT?  I think it works better than the racy code in the test case for
now.

>> Regards, -Harry
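Roughly what I have in mind -- an untested sketch only, with made-up names
(cond_service_func, wait_for_service_call, cb_*), and the wiring into
service_autotest (launching the function on the service lcore, reporting a
timeout through the usual test-assert path) left out:

-----
#include <pthread.h>
#include <time.h>

/* hypothetical test-local state, not the current test code */
static pthread_mutex_t cb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cb_cond = PTHREAD_COND_INITIALIZER;
static int cb_seen;

/* runs on the service lcore: only records and signals that it ran */
static int
cond_service_func(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&cb_lock);
	cb_seen = 1;
	pthread_cond_signal(&cb_cond);
	pthread_mutex_unlock(&cb_lock);
	return 0;
}

/* test side: block for up to 5 seconds instead of spinning on a flag */
static int
wait_for_service_call(void)
{
	struct timespec ts;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += 5;

	pthread_mutex_lock(&cb_lock);
	while (!cb_seen && rc == 0)
		rc = pthread_cond_timedwait(&cb_cond, &cb_lock, &ts);
	pthread_mutex_unlock(&cb_lock);

	return cb_seen ? 0 : -1;	/* -1: the call never happened */
}
-----

The timed wait bounds the test at 5 seconds rather than relying on timing,
and the mutex/condvar pair provides the cross-core synchronization that the
current shared-flag check depends on implicitly.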