> -----Original Message-----
> From: Aaron Conole [mailto:acon...@redhat.com]
> Sent: Monday, October 14, 2019 3:54 PM
> To: Van Haaren, Harry <harry.van.haa...@intel.com>
> Cc: David Marchand <david.march...@redhat.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from
> service_autotest failing
>
> Aaron Conole <acon...@redhat.com> writes:
>
> > "Van Haaren, Harry" <harry.van.haa...@intel.com> writes:
> >
> >>> -----Original Message-----
> >>> From: Aaron Conole [mailto:acon...@redhat.com]
> >>> Sent: Wednesday, September 4, 2019 8:56 PM
> >>> To: David Marchand <david.march...@redhat.com>
> >>> Cc: Van Haaren, Harry <harry.van.haa...@intel.com>; dev@dpdk.org
> >>> Subject: Re: [dpdk-dev] [BUG] service_lcore_en_dis_able from service_autotest
> >>> failing
<snip lots of backlog>
> >>> > real    2m42.884s
> >>> > user    5m1.902s
> >>> > sys     0m2.208s
> >>>
> >>> I can confirm - takes about 1m to fail.
> >>
> >>
> >> Hi Aaron and David,
> >>
> >> I've been attempting to reproduce this, still no errors here.
> >>
> >> Given the nature of service-cores, and the difficulty to reproduce
> >> here this feels like a race-condition - one that may not exist in all
> >> binaries. Can you describe your compiler/command setup? (gcc 7.4.0 here).
> >>
> >> I'm using Meson to build, so reproducing using this instead of the command
> >> as provided above. There should be no difference in reproducing due to this:
> >
> > The command runs far more iterations than meson does (I think).
> >
> > I still see it periodically occur in the travis environment.
> >
> > I did see at least one missing memory barrier (I believe). Please
> > review the following code change (and if you agree I can submit it
> > formally):
> >
> > -----
> > --- a/lib/librte_eal/common/eal_common_launch.c
> > +++ b/lib/librte_eal/common/eal_common_launch.c
> > @@ -21,8 +21,10 @@
> >  int
> >  rte_eal_wait_lcore(unsigned slave_id)
> >  {
> > -	if (lcore_config[slave_id].state == WAIT)
> > +	if (lcore_config[slave_id].state == WAIT) {
> > +		rte_rmb();
> >  		return 0;
> > +	}
> >
> >  	while (lcore_config[slave_id].state != WAIT &&
> >  	       lcore_config[slave_id].state != FINISHED)
> > -----
> >
> > This is because in lib/librte_eal/linux/eal/eal_thread.c:
> >
> > -----
> > 	/* when a service core returns, it should go directly to WAIT
> > 	 * state, because the application will not lcore_wait() for it.
> > 	 */
> > 	if (lcore_config[lcore_id].core_role == ROLE_SERVICE)
> > 		lcore_config[lcore_id].state = WAIT;
> > 	else
> > 		lcore_config[lcore_id].state = FINISHED;
> > -----
> >
> > NOTE that the service core skips the rte_eal_wait_lcore() code from
> > making the FINISHED->WAIT transition. So I think at least that read
> > barrier will be needed (maybe I miss the pairing, though?).
> >
> > Additionally, I'm wondering if there is an additional write or sync
> > barrier needed to ensure that some of the transitions are properly
> > recorded when using lcore as a service lcore function. The fact that
> > this only happens occasionally tells me that it's either a race (which
> > is possible... because the variable update in the test might not be
> > sync'd across cores or something), or some other missing
> > synchronization.
> >
> >> $ meson test service_autotest --repeat 50
> >>
> >> 1/1 DPDK:fast-tests / service_autotest        OK       3.86 s
> >> 1/1 DPDK:fast-tests / service_autotest        OK       3.87 s
> >> ...
> >> 1/1 DPDK:fast-tests / service_autotest        OK       3.84 s
> >>
> >> OK:      50
> >> FAIL:    0
> >> SKIP:    0
> >> TIMEOUT: 0
> >>
> >> I'll keep it running for a few hours but I have little faith if it only
> >> takes 1 minute on your machines...
> >
> > Please try the flat command.
>
> Not sure if you've had any time to look at this.
Apologies for the delay in response - I've run the existing tests a few
thousand times during the week, with one reproduction. That's not enough
for confidence in a debug/fix for me.

> I think there's a change we can make, but not sure about how it fits in
> the overall service lcore design.

This suggestion is only changing the test code, correct?

> The proposal is to use a pthread_cond variable which blocks the thread
> requesting the service function to run. The service function merely
> sets the condition. The requesting thread does a timed wait (up to 5s?)
> and if the timeout is exceeded can throw an error. Otherwise, it will
> unblock and can assume that the test passes. WDYT? I think it works
> better than the racy code in the test case for now.

The idea/concept above is right, but I think that's what the test is
approximating anyway? The main thread does an "mp_wait_lcore()" until the
service core has returned, essentially a blocking call. The test fails if
the flag is not == 1 (as that indicates a failure in launching an
application function on a previously-used-as-service-core lcore thread).

I think your RMB suggestion is likely to be correct, but I'd like to dig
into it a bit more.

Thanks for the ping on this thread.
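
P.S. For concreteness, below is roughly how I read the pthread_cond
proposal - an untested sketch only, and the names (dummy_service_func,
wait_for_service_done, service_done_*) are placeholders rather than the
actual test code:

-----
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t service_done_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t service_done_cond = PTHREAD_COND_INITIALIZER;
static int service_done; /* protected by service_done_lock */

/* stand-in for the function launched on the (ex-)service lcore */
static int
dummy_service_func(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&service_done_lock);
	service_done = 1;
	pthread_cond_signal(&service_done_cond);
	pthread_mutex_unlock(&service_done_lock);
	return 0;
}

static void *
service_thread(void *arg)
{
	dummy_service_func(arg);
	return NULL;
}

/* requesting thread: block up to 5s for the service function to flag
 * completion; return -1 on timeout so the test can report an error
 */
static int
wait_for_service_done(void)
{
	struct timespec deadline;
	int ret = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += 5;

	pthread_mutex_lock(&service_done_lock);
	while (!service_done && ret == 0)
		ret = pthread_cond_timedwait(&service_done_cond,
					     &service_done_lock, &deadline);
	pthread_mutex_unlock(&service_done_lock);

	return service_done ? 0 : -1;
}

int
main(void)
{
	pthread_t worker;

	/* stands in for rte_eal_remote_launch() onto the lcore under test */
	pthread_create(&worker, NULL, service_thread, NULL);

	if (wait_for_service_done() != 0)
		printf("timed out waiting for service function\n");
	else
		printf("service function completed\n");

	pthread_join(worker, NULL);
	return 0;
}
-----

If we go that route, the mutex/cond pair also hides the ordering inside
pthreads, so the test itself would stop being sensitive to the barrier
question - but the underlying lcore state transition still would be.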
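
P.P.S. On the "maybe I miss the pairing" question: in the abstract, the
rte_rmb() in your patch would want a matching write barrier on the lcore
thread side, along the lines of the pseudo-code below. This is only to
illustrate the ordering I have in mind, not a quote of the actual
eal_thread_loop code:

-----
/* writer: slave/service lcore, just before publishing its new state */
lcore_config[lcore_id].ret = ret;      /* payload stores first          */
rte_wmb();                             /* make payload globally visible */
lcore_config[lcore_id].state = WAIT;   /* ...then publish the state     */

/* reader: rte_eal_wait_lcore() on the master lcore */
if (lcore_config[slave_id].state == WAIT) {
	rte_rmb();                     /* don't read payload before state */
	return 0;
}
-----

If the write side already orders its stores like that, your rmb should be
the missing half; if not, we probably need barriers on both sides.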