David Marchand <david.march...@redhat.com> writes:

> On Tue, Oct 8, 2019 at 7:06 PM Aaron Conole <acon...@redhat.com> wrote:
>>
>> Ruifeng Wang <ruifeng.w...@arm.com> writes:
>>
>> > Distributor and worker threads rely on data structs in cache line
>> > for synchronization. The shared data structs were not protected.
>> > This caused deadlock issue on weaker memory ordering platforms as
>> > aarch64.
>> > Fix this issue by adding memory barriers to ensure synchronization
>> > among cores.
>> >
>> > Bugzilla ID: 342
>> > Fixes: 775003ad2f96 ("distributor: add new burst-capable library")
>> > Cc: sta...@dpdk.org
>> >
>> > Signed-off-by: Ruifeng Wang <ruifeng.w...@arm.com>
>> > Reviewed-by: Gavin Hu <gavin...@arm.com>
>> > ---
>>
>> I see a failure in the distributor_autotest (on one of the builds):
>>
>> 64/82 DPDK:fast-tests / distributor_autotest FAIL 0.37 s (exit
>> status 255 or signal 127 SIGinvalid)
>>
>> --- command ---
>>
>> DPDK_TEST='distributor_autotest'
>> /home/travis/build/ovsrobot/dpdk/build/app/test/dpdk-test -l 0-1
>> --file-prefix=distributor_autotest
>>
>> --- stdout ---
>>
>> EAL: Probing VFIO support...
>> APP: HPET is not enabled, using TSC as default timer
>> RTE>>distributor_autotest
>> === Basic distributor sanity tests ===
>> Worker 0 handled 32 packets
>> Sanity test with all zero hashes done.
>> Worker 0 handled 32 packets
>> Sanity test with non-zero hashes done
>> === testing big burst (single) ===
>> Sanity test of returned packets done
>> === Sanity test with mbuf alloc/free (single) ===
>> Sanity test with mbuf alloc/free passed
>> Too few cores to run worker shutdown test
>> === Basic distributor sanity tests ===
>> Worker 0 handled 32 packets
>> Sanity test with all zero hashes done.
>> Worker 0 handled 32 packets
>> Sanity test with non-zero hashes done
>> === testing big burst (burst) ===
>> Sanity test of returned packets done
>> === Sanity test with mbuf alloc/free (burst) ===
>> Line 326: Packet count is incorrect, 1048568, expected 1048576
>> Test Failed
>> RTE>>
>>
>> --- stderr ---
>>
>> EAL: Detected 2 lcore(s)
>> EAL: Detected 1 NUMA nodes
>> EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
>> EAL: Selected IOVA mode 'PA'
>> EAL: No available hugepages reported in hugepages-1048576kB
>>
>> -------
>>
>> Not sure how to help debug further. I'll re-start the job to see if
>> it 'clears' up - but I guess there may be a delicate synchronization
>> somewhere that needs to be accounted.
>
> Idem, and with the same loop I used before, it can be caught quickly.
>
> # time (log=/tmp/$$.log; while true; do echo distributor_autotest
> |taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8
> -l 0-1 >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm
> -f $log)

Probably good to document it, yes. It seems to be a good technique for
reproducing failures.

> [snip]
>
> RTE>>distributor_autotest
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was expanded by 2MB
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: alloc_pages_on_heap(): couldn't allocate physically contiguous space
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was expanded by 8MB
> === Basic distributor sanity tests ===
> Worker 0 handled 32 packets
> Sanity test with all zero hashes done.
> Worker 0 handled 32 packets
> Sanity test with non-zero hashes done
> === testing big burst (single) ===
> Sanity test of returned packets done
>
> === Sanity test with mbuf alloc/free (single) ===
> Sanity test with mbuf alloc/free passed
>
> Too few cores to run worker shutdown test
> === Basic distributor sanity tests ===
> Worker 0 handled 32 packets
> Sanity test with all zero hashes done.
> Worker 0 handled 32 packets
> Sanity test with non-zero hashes done
> === testing big burst (burst) ===
> Sanity test of returned packets done
>
> === Sanity test with mbuf alloc/free (burst) ===
> Line 326: Packet count is incorrect, 1048568, expected 1048576
> Test Failed
> RTE>>
> real 0m36.668s
> user 1m7.293s
> sys 0m1.560s
>
> Could be worth running this loop on all tests? (not talking about the
> CI, it would be a manual effort to catch lurking issues).
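
If the loop quoted above were to be written up for reuse with other tests,
it could perhaps be wrapped in a small shell helper. The following is only
a rough sketch: the function name repeat_until_fail and the cap on the
number of runs are my own assumptions (the original loop runs until the
first failure with no limit), and nothing like this exists in the tree.

```shell
# Hypothetical helper (not part of DPDK): rerun a command until its output
# no longer contains a success marker, then print the failing run's output.
#
# Usage: repeat_until_fail MAX_RUNS MARKER COMMAND [ARGS...]
# Returns 1 on the first failing run, 0 if all MAX_RUNS runs succeeded.
#
# Example, mirroring the loop quoted above (paths as in that loop):
#   repeat_until_fail 1000 'Test OK' sh -c \
#     'echo distributor_autotest | taskset -c 0-1 \
#      ./build-gcc-static/app/test/dpdk-test --log-level "*:8" -l 0-1'
repeat_until_fail() {
    max_runs=$1
    marker=$2
    shift 2

    # Per-invocation log file, so concurrent loops do not clash.
    log=$(mktemp) || return 2
    run=0

    while [ "$run" -lt "$max_runs" ]; do
        run=$((run + 1))

        # Capture both stdout and stderr, as the original loop does.
        "$@" >"$log" 2>&1

        # A run counts as failed when the marker is missing from the output.
        if ! grep -q "$marker" "$log"; then
            echo "run $run failed; output follows:"
            cat "$log"
            rm -f "$log"
            return 1
        fi
    done

    rm -f "$log"
    echo "no failure in $max_runs runs"
    return 0
}
```

Capping the run count is just so that a test which never fails does not
spin forever when this is run by hand across the whole test list.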