Hi Zhihong,

On 11/04/2016 08:20 AM, Wang, Zhihong wrote:
>
>
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
>> Sent: Thursday, November 3, 2016 4:11 PM
>> To: Wang, Zhihong <zhihong.wang at intel.com>; Yuanhan Liu
>> <yuanhan.liu at linux.intel.com>
>> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
>> <ppfister at cisco.com>; Xie, Huawei <huawei.xie at intel.com>;
>> dev at dpdk.org; vkaplans at redhat.com; mst at redhat.com
>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
>> support to the TX path
>>
>> On 11/02/2016 11:51 AM, Maxime Coquelin wrote:
>>>
>>> On 10/31/2016 11:01 AM, Wang, Zhihong wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
>>>>> Sent: Friday, October 28, 2016 3:42 PM
>>>>> To: Wang, Zhihong <zhihong.wang at intel.com>; Yuanhan Liu
>>>>> <yuanhan.liu at linux.intel.com>
>>>>> Cc: stephen at networkplumber.org; Pierre Pfister (ppfister)
>>>>> <ppfister at cisco.com>; Xie, Huawei <huawei.xie at intel.com>;
>>>>> dev at dpdk.org; vkaplans at redhat.com; mst at redhat.com
>>>>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
>>>>> support to the TX path
>>>>>
>>>>> On 10/28/2016 02:49 AM, Wang, Zhihong wrote:
>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
>>>>>>>> Sent: Thursday, October 27, 2016 6:46 PM
>>>>>>>> To: Maxime Coquelin <maxime.coquelin at redhat.com>
>>>>>>>> Cc: Wang, Zhihong <zhihong.wang at intel.com>;
>>>>>>>> stephen at networkplumber.org; Pierre Pfister (ppfister)
>>>>>>>> <ppfister at cisco.com>; Xie, Huawei <huawei.xie at intel.com>;
>>>>>>>> dev at dpdk.org; vkaplans at redhat.com; mst at redhat.com
>>>>>>>> Subject: Re: [dpdk-dev] [PATCH v4] vhost: Add indirect descriptors
>>>>>>>> support to the TX path
>>>>>>>>
>>>>>>>> On Thu, Oct 27, 2016 at 12:35:11PM +0200, Maxime Coquelin wrote:
>>>>>>>>>>
>>>>>>>>>> On 10/27/2016 12:33 PM, Yuanhan Liu wrote:
>>>>>>>>>>>> On Thu, Oct 27, 2016 at 11:10:34AM +0200, Maxime Coquelin
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Hi Zhihong,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/27/2016 11:00 AM, Wang, Zhihong wrote:
>>>>>>>>>>>>>>>> Hi Maxime,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Seems the indirect desc feature is causing serious
>>>>>>>>>>>>>>>> performance degradation on the Haswell platform, about
>>>>>>>>>>>>>>>> 20% drop for both mrg=on and mrg=off (--txqflags=0xf00,
>>>>>>>>>>>>>>>> non-vector version), both iofwd and macfwd.
>>>>>>>>>>>>>> I tested PVP (with macswap on guest) and Txonly/Rxonly on
>>>>>>>>>>>>>> an Ivy Bridge platform, and didn't face such a drop.
>>>>>>>>>>>>
>>>>>>>>>>>> I was actually wondering whether that may be the cause. I
>>>>>>>>>>>> tested it with my IvyBridge server as well, and saw no drop.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe you should find a similar platform (Haswell) and have
>>>>>>>>>>>> a try?
>>>>>>>>>> Yes, that's why I asked Zhihong whether he could test Txonly in
>>>>>>>>>> the guest, to see if the issue is reproducible like this.
>>>>>>>>
>>>>>>>> I have no Haswell box, otherwise I could do a quick test for you.
>>>>>>>> IIRC, he tried to disable the indirect_desc feature, and then the
>>>>>>>> performance recovered. So it's likely that indirect_desc is the
>>>>>>>> culprit here.
>>>>>>>>
>>>>>>>>>> It will be easier for me to find a Haswell machine if it does
>>>>>>>>>> not have to be connected back to back to a HW/SW packet
>>>>>>>>>> generator.
>>>>>> In fact a simple loopback test will also do, without pktgen.
>>>>>>
>>>>>> Start testpmd in both host and guest, and do "start" in one
>>>>>> and "start tx_first 32" in the other.
>>>>>>
>>>>>> Perf drop is about 24% in my test.
>>>>>>
>>>>>
>>>>> Thanks, I never tried this test.
>>>>> I managed to find a Haswell platform (Intel(R) Xeon(R) CPU E5-2699 v3
>>>>> @ 2.30GHz), and can reproduce the problem with the loop test you
>>>>> mention. I see a performance drop of about 10% (8.94Mpps/8.08Mpps).
>>>>> Out of curiosity, what are the numbers you get with your setup?
>>>>
>>>> Hi Maxime,
>>>>
>>>> Let's align our test case to RC2, mrg=on, loopback, on Haswell.
>>>> My results below:
>>>> 1. indirect=1: 5.26 Mpps
>>>> 2. indirect=0: 6.54 Mpps
>>>>
>>>> It's about a 24% drop.
>>> OK, so on my side, same setup on Haswell:
>>> 1. indirect=1: 7.44 Mpps
>>> 2. indirect=0: 8.18 Mpps
>>>
>>> Still a 10% drop in my case with mrg=on.
>>>
>>> The strange thing with both of our figures is that they are below what
>>> I obtain with my SandyBridge machine. The SB CPU freq is 4% higher,
>>> but that doesn't explain the gap between the measurements.
>>>
>>> I'm continuing the investigations on my side.
>>> Maybe we should fix a deadline, and decide to disable indirect in the
>>> Virtio PMD if the root cause is not identified/fixed at some point?
>>>
>>> Yuanhan, what do you think?
>>
>> I have done some measurements using perf, and now understand better
>> what happens.
>>
>> With indirect descriptors, I can see a cache miss when fetching the
>> descriptors in the indirect table. Actually, this is expected, so
>> we prefetch the first desc as soon as possible, but still not soon
>> enough to make it transparent.
>> In the direct descriptors case, the desc in the virtqueue seems to
>> remain in the cache from its previous use, so we have a hit.
>>
>> That said, in a realistic use case, I think we should not have a hit,
>> even with direct descriptors.
>> Indeed, the test case uses testpmd on the guest side with forwarding
>> set to IO mode. It means the packet content is never accessed by the
>> guest.
>>
>> In my experiments, I usually set the "macswap" forwarding mode, which
>> swaps src and dest MAC addresses in the packet. I find it more
>> realistic, because I don't see the point in sending packets to the
>> guest if they are not accessed (not even their headers).
>>
>> I tried the test case again, this time setting the forwarding mode to
>> macswap in the guest. This time, I get the same performance with both
>> direct and indirect (indirect is even a little better with a small
>> optimization, consisting in systematically prefetching the first 2
>> descs, as we know they are contiguous).
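For reference, the prefetch idea boils down to something like the sketch
below. This is only illustrative, not the actual lib/librte_vhost dequeue
code: "desc_sketch" just mirrors the virtio vring_desc layout, and
"prefetch_indirect_table()" is a hypothetical helper name.

#include <stdint.h>
#include <rte_prefetch.h>

/* Same layout as the virtio ring descriptor (struct vring_desc). */
struct desc_sketch {
	uint64_t addr;
	uint32_t len;
	uint16_t flags;
	uint16_t next;
};

/*
 * The indirect table is provided by the guest and is usually not
 * cache-hot on the host, so prefetch its first two entries as soon as
 * its host-virtual address is known, to hide part of the initial miss.
 */
static inline void
prefetch_indirect_table(const struct desc_sketch *descs,
			uint32_t nr_descs)
{
	/* The entries are contiguous in memory. */
	rte_prefetch0(&descs[0]);
	if (nr_descs > 1)
		rte_prefetch0(&descs[1]);
}

And the macswap forwarding mentioned above is nothing more than swapping
the two MAC addresses in the Ethernet header, which forces the guest to
actually read and write the packet. A sketch using the 16.11-era
rte_ether.h names ("macswap_one" is just an illustrative name; testpmd's
macswap mode does the equivalent):

#include <rte_ether.h>
#include <rte_mbuf.h>

/* Swap source and destination MAC addresses of one packet. */
static inline void
macswap_one(struct rte_mbuf *pkt)
{
	struct ether_hdr *eth =
		rte_pktmbuf_mtod(pkt, struct ether_hdr *);
	struct ether_addr tmp;

	ether_addr_copy(&eth->d_addr, &tmp);
	ether_addr_copy(&eth->s_addr, &eth->d_addr);
	ether_addr_copy(&tmp, &eth->s_addr);
}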
>
>
> Hi Maxime,
>
> I did a little more macswap testing and found out more stuff here:

Thanks for doing more tests.

>
> 1. I did the loopback test on another HSW machine with the same H/W,
>    and indirect_desc on and off seem to have close perf
>
> 2. So I checked the gcc version:
>
>    * Previous: gcc version 6.2.1 20160916 (Fedora 24)
>
>    * New: gcc version 5.4.0 20160609 (Ubuntu 16.04.1 LTS)

On my side, I tested with RHEL7.3:
 - gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)

It certainly contains some backports from newer GCC versions.

>
>    On the previous one, indirect_desc has a 20% drop
>
> 3. Then I compiled the binary on Ubuntu and scp'd it to Fedora, and as
>    expected I got the same perf as on Ubuntu and the perf gap
>    disappeared, so gcc is definitely one factor here
>
> 4. Then I used the Ubuntu binary on Fedora for the PVP test; the perf
>    gap came back again, matching the Fedora binary results:
>    indirect_desc causes about a 20% drop

Let me know if I understand correctly:

Loopback test with macswap:
 - gcc version 6.2.1: 20% perf drop
 - gcc version 5.4.0: no drop

PVP test with macswap:
 - gcc version 6.2.1: 20% perf drop
 - gcc version 5.4.0: 20% perf drop

>
> So in all, could you try PVP traffic on HSW to see how it works?

Sadly, the HSW machine I borrowed does not have another device connected
back to back on its 10G port. I can only test PVP with SNB machines
currently.

>
>
>> Do you agree we should assume that the packet (header and/or buf) will
>> always be accessed by the guest application?
>> If so, do you agree we should keep indirect descs enabled, and maybe
>> update the test cases?
>
>
> I agree with you that the mac/macswap test is more realistic and makes
> more sense for real applications.

Thanks,
Maxime