[dpdk-dev] Performance regression in DPDK 1.8/2.0
Hi Paul,

> -----Original Message-----
> From: Paul Emmerich [mailto:emmericp at net.in.tum.de]
> Sent: Tuesday, April 28, 2015 12:48 PM
> To: De Lara Guarch, Pablo
> Cc: Pavel Odintsov; dev at dpdk.org
> Subject: Re: [dpdk-dev] Performance regression in DPDK 1.8/2.0
>
> Hi,
>
> De Lara Guarch, Pablo:
> > Could you tell me which changes you made here? I see you are using
> > simple tx code path on 1.8.0, but with the default values, you should
> > be using vector tx, unless you have changed anything in the tx
> > configuration.
>
> Sorry, I might have written that down wrong or read the output wrong.
> I did not modify the l2fwd example.
>
> > So, just for clarification, for l2fwd you used E3-1230 v2 (Ivy Bridge),
> > at 1.6 GHz or 3.3 GHz?
>
> At 1.6 GHz, as it is simply too fast at 3.3 GHz ;)
>
> I'll probably write a minimal example that shows my problem with tx only
> sometime next week. I just used the l2fwd example to illustrate my point
> with a 'builtin' example.

Thanks for the clarification. I tested it on Ivy Bridge as well, and I could
not reproduce the issue. Make sure that you use vector rx/tx anyway, to get
best performance (you should be seeing better performance, since l2fwd in
1.8/2.0 uses both vector rx/tx).

Thanks,
Pablo

> Paul
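Whether the vector paths are even compiled in can be checked in the generated
DPDK configuration. A minimal sketch, assuming the default
x86_64-native-linuxapp-gcc target and that the option is named as in the
1.8/2.0 config files:

  grep IXGBE_INC_VECTOR x86_64-native-linuxapp-gcc/.config
  # CONFIG_RTE_IXGBE_INC_VECTOR=y means the ixgbe vector rx/tx code paths
  # are available; the PMD still only selects them if the queue setup
  # allows it.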
[dpdk-dev] Performance regression in DPDK 1.8/2.0
Hi,

De Lara Guarch, Pablo:
> Could you tell me which changes you made here? I see you are using simple tx
> code path on 1.8.0, but with the default values, you should be using vector
> tx, unless you have changed anything in the tx configuration.

Sorry, I might have written that down wrong or read the output wrong.
I did not modify the l2fwd example.

> So, just for clarification, for l2fwd you used E3-1230 v2 (Ivy Bridge),
> at 1.6 GHz or 3.3 GHz?

At 1.6 GHz, as it is simply too fast at 3.3 GHz ;)

I'll probably write a minimal example that shows my problem with tx only
sometime next week. I just used the l2fwd example to illustrate my point
with a 'builtin' example.

Paul
[dpdk-dev] Performance regression in DPDK 1.8/2.0
Hi,

Matthew Hall:
> Not sure if it's relevant or not, but there was another mail claiming PCIe
> MSI-X wasn't necessarily working in DPDK 2.x. Not sure if that could be
> causing slowdowns when there are drastic volumes of 64-byte packets causing
> a lot of PCI activity.

Interrupts should not be relevant here.

> Also, you are mentioning some specific patches were involved... so I have to
> ask if anybody tried git bisect yet or not. Maybe easier than trying to
> guess at the answer.

I have not yet tried to bisect it, but that's the next step on my todo list*.
The mbuf patch was just an educated guess to start a discussion. I hoped that
I was just doing something obvious wrong, and/or that someone could point me
to performance regression tests that were executed to prove that the mbuf
patch does not affect performance. However, there don't seem to be any
'official' performance regression tests, are there?

Paul

* I probably won't be able to get to it until next week, though, as I have to
finish the paper about my packet generator.
[dpdk-dev] Performance regression in DPDK 1.8/2.0
Hi,

sorry, I mixed up the hardware I used for my tests.

Paul Emmerich:
> CPU: Intel(R) Xeon(R) CPU E3-1230 v2
> TurboBoost and HyperThreading disabled.
> Frequency fixed at 3.30 GHz via acpi_cpufreq.

The CPU frequency was fixed at 1.60 GHz to enforce a CPU bottleneck.

My original post said that I used a Xeon E5-2620 v3 at 1.2 GHz; this is
incorrect. The calculation for Cycles/Pkt in the original post used the
correct 1.6 GHz figure, though.

(I used the E5 CPU for the evaluation of my packet generator performance
with 1.7.1/2.0.0, not for the l2fwd test.)

Sorry for the confusion.

Paul
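As a quick sanity check, the Cycles/Pkt column in the original post is indeed
consistent with the corrected 1.6 GHz clock; the arithmetic is simply
frequency divided by throughput:

  echo "scale=2; 1600000000 / 18840000" | bc   # 1.7.1 -> 84.92
  echo "scale=2; 1600000000 / 16780000" | bc   # 1.8.0 -> 95.35
  echo "scale=2; 1600000000 / 16400000" | bc   # 2.0.0 -> 97.56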
[dpdk-dev] Performance regression in DPDK 1.8/2.0
On Tue, Apr 28, 2015 at 12:28:34AM +0200, Paul Emmerich wrote:
> Let me know if you need any additional information.
> I'd also be interested in the configuration that resulted in the 20% speed-
> up that was mentioned in the original mbuf patch
>
> Paul

The speed-up would be for apps that were doing RX of scattered packets, i.e.
packets split across multiple mbufs. Before 1.8, this was handled by a scalar
function which was rather slow compared to the fast-path vector function. In
1.8 we introduced a new vector function which supports scattered packets - it
still isn't as fast as the non-scattered packet RX function, but it was a good
improvement over the older version.

/Bruce
[dpdk-dev] Performance regression in DPDK 1.8/2.0
On Tue, Apr 28, 2015 at 12:43:16PM +0200, Paul Emmerich wrote:
> Hi,
>
> sorry, I mixed up the hardware I used for my tests.
>
> Paul Emmerich:
> > CPU: Intel(R) Xeon(R) CPU E3-1230 v2
> > TurboBoost and HyperThreading disabled.
> > Frequency fixed at 3.30 GHz via acpi_cpufreq.
>
> The CPU frequency was fixed at 1.60 GHz to enforce a CPU bottleneck.
>
> My original post said that I used a Xeon E5-2620 v3 at 1.2 GHz; this is
> incorrect. The calculation for Cycles/Pkt in the original post used the
> correct 1.6 GHz figure, though.
>
> (I used the E5 CPU for the evaluation of my packet generator performance
> with 1.7.1/2.0.0, not for the l2fwd test.)
>
> Sorry for the confusion.
>
> Paul

Thanks for the update - we are investigating.

/Bruce
[dpdk-dev] Performance regression in DPDK 1.8/2.0
> -----Original Message-----
> From: Richardson, Bruce
> Sent: Tuesday, April 28, 2015 11:55 AM
> To: Paul Emmerich
> Cc: De Lara Guarch, Pablo; dev at dpdk.org
> Subject: Re: [dpdk-dev] Performance regression in DPDK 1.8/2.0
>
> On Tue, Apr 28, 2015 at 12:43:16PM +0200, Paul Emmerich wrote:
> > Hi,
> >
> > sorry, I mixed up the hardware I used for my tests.
> >
> > Paul Emmerich:
> > > CPU: Intel(R) Xeon(R) CPU E3-1230 v2
> > > TurboBoost and HyperThreading disabled.
> > > Frequency fixed at 3.30 GHz via acpi_cpufreq.
> >
> > The CPU frequency was fixed at 1.60 GHz to enforce a CPU bottleneck.
> >
> > My original post said that I used a Xeon E5-2620 v3 at 1.2 GHz; this is
> > incorrect. The calculation for Cycles/Pkt in the original post used the
> > correct 1.6 GHz figure, though.
> >
> > (I used the E5 CPU for the evaluation of my packet generator performance
> > with 1.7.1/2.0.0, not for the l2fwd test.)

Thanks for the update. So, just for clarification, for l2fwd you used the
E3-1230 v2 (Ivy Bridge), at 1.6 GHz or 3.3 GHz?

Pablo

> >
> > Sorry for the confusion.
> >
> > Paul
>
> Thanks for the update - we are investigating.
>
> /Bruce
[dpdk-dev] Performance regression in DPDK 1.8/2.0
> -----Original Message-----
> From: Paul Emmerich [mailto:emmericp at net.in.tum.de]
> Sent: Monday, April 27, 2015 11:29 PM
> To: De Lara Guarch, Pablo
> Cc: Pavel Odintsov; dev at dpdk.org
> Subject: Re: [dpdk-dev] Performance regression in DPDK 1.8/2.0
>
> Hi,
>
> Pablo:
> > Could you tell me how you got the L1 cache miss ratio? Perf?
>
> perf stat -e L1-dcache-loads,L1-dcache-misses l2fwd ...
>
> > Could you provide more information on how you run the l2fwd app,
> > in order to try to reproduce the issue:
> > - L2fwd Command line
>
> ./build/l2fwd -c 3 -n 2 -- -p 3 -q 2
>
> > - L2fwd initialization (to check memory/CPU/NICs)
>
> I unfortunately did not save the output, but I wrote down the important
> parts:
>
> 1.7.1: no output regarding rx/tx code paths as init debug wasn't enabled
> 1.8.0 and 2.0.0: simple tx code path, vector rx
>
> Hardware:
>
> CPU: Intel(R) Xeon(R) CPU E3-1230 v2
> TurboBoost and HyperThreading disabled.
> Frequency fixed at 3.30 GHz via acpi_cpufreq.
>
> NIC: X540-T2
>
> Memory: Dual Channel DDR3 1333 MHz, 4x 4GB
>
> > Did you change the l2fwd app between versions? L2fwd uses simple rx on
> > 1.7.1, whereas it uses vector rx on 2.0 (enable IXGBE_DEBUG_INIT to
> > check it).
>
> Yes, I had to update l2fwd when going from 1.7.1 to 1.8.0. However, the
> changes in the app were minimal.

Could you tell me which changes you made here? I see you are using the simple
tx code path on 1.8.0, but with the default values, you should be using vector
tx, unless you have changed anything in the tx configuration.
Not sure also if you are using the simple tx code path on 1.7.1 then, plus
scattered rx. (Without changing the l2fwd app, I use scattered rx and vector
tx.)

Thanks!
Pablo

> 1.8.0 and 2.0.0 used vector rx. Disabling vector rx via the DPDK .config
> file causes another 30% performance loss, so I kept it enabled.
>
> > Which packet format/size did you use? Does your traffic generator take
> > into account the Inter-packet gap?
>
> 64 Byte packets, full line rate on both ports, i.e. 14.88 Mpps per port.
> The packet's content doesn't matter as l2fwd doesn't look at it, but it was
> just some random stuff: EthType 0x1234.
>
> Let me know if you need any additional information.
> I'd also be interested in the configuration that resulted in the 20% speed-
> up that was mentioned in the original mbuf patch
>
> Paul
[dpdk-dev] Performance regression in DPDK 1.8/2.0
Hi,

Pablo:
> Could you tell me how you got the L1 cache miss ratio? Perf?

perf stat -e L1-dcache-loads,L1-dcache-misses l2fwd ...

> Could you provide more information on how you run the l2fwd app,
> in order to try to reproduce the issue:
> - L2fwd Command line

./build/l2fwd -c 3 -n 2 -- -p 3 -q 2

> - L2fwd initialization (to check memory/CPU/NICs)

I unfortunately did not save the output, but I wrote down the important parts:

1.7.1: no output regarding rx/tx code paths as init debug wasn't enabled
1.8.0 and 2.0.0: simple tx code path, vector rx

Hardware:

CPU: Intel(R) Xeon(R) CPU E3-1230 v2
TurboBoost and HyperThreading disabled.
Frequency fixed at 3.30 GHz via acpi_cpufreq.

NIC: X540-T2

Memory: Dual Channel DDR3 1333 MHz, 4x 4GB

> Did you change the l2fwd app between versions? L2fwd uses simple rx on 1.7.1,
> whereas it uses vector rx on 2.0 (enable IXGBE_DEBUG_INIT to check it).

Yes, I had to update l2fwd when going from 1.7.1 to 1.8.0. However, the
changes in the app were minimal.

1.8.0 and 2.0.0 used vector rx. Disabling vector rx via the DPDK .config file
causes another 30% performance loss, so I kept it enabled.

> Which packet format/size did you use? Does your traffic generator take into
> account the Inter-packet gap?

64 Byte packets, full line rate on both ports, i.e. 14.88 Mpps per port.
The packet's content doesn't matter as l2fwd doesn't look at it, but it was
just some random stuff: EthType 0x1234.

Let me know if you need any additional information.
I'd also be interested in the configuration that resulted in the 20% speed-up
that was mentioned in the original mbuf patch.

Paul
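The miss ratio quoted in the thread is presumably just L1-dcache-misses
divided by L1-dcache-loads from the perf output. A sketch of measuring it over
a fixed window on an already-running forwarder (assuming the process is
actually named l2fwd):

  perf stat -e L1-dcache-loads,L1-dcache-misses -p $(pidof l2fwd) sleep 10
  # miss ratio = L1-dcache-misses / L1-dcache-loads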
[dpdk-dev] Performance regression in DPDK 1.8/2.0
On Apr 27, 2015, at 3:28 PM, Paul Emmerich wrote:
> Let me know if you need any additional information.
> I'd also be interested in the configuration that resulted in the 20% speed-
> up that was mentioned in the original mbuf patch

Not sure if it's relevant or not, but there was another mail claiming PCIe
MSI-X wasn't necessarily working in DPDK 2.x. Not sure if that could be
causing slowdowns when there are drastic volumes of 64-byte packets causing a
lot of PCI activity.

Also, you are mentioning that some specific patches were involved... so I have
to ask if anybody has tried git bisect yet or not. Maybe easier than trying to
guess at the answer.

Matthew.
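For reference, a bisect between the two releases would look roughly like this;
a sketch, assuming a checkout of the DPDK git tree with the release tags
v1.7.1 and v2.0.0, plus a rebuild and rerun of the l2fwd measurement at each
step:

  git bisect start
  git bisect bad v2.0.0    # the slow release
  git bisect good v1.7.1   # the fast release
  # rebuild DPDK + l2fwd, rerun the throughput test, then mark the result:
  git bisect good          # or: git bisect bad
  git bisect reset         # when finished

Note that, as mentioned in the thread, the l2fwd example itself needed small
updates across the 1.7.1 -> 1.8.0 API change, so the test may need per-step
tweaks rather than a fully automated `git bisect run`.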
[dpdk-dev] Performance regression in DPDK 1.8/2.0
Hi,

> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Pavel Odintsov
> Sent: Monday, April 27, 2015 9:07 AM
> To: Paul Emmerich
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] Performance regression in DPDK 1.8/2.0
>
> Hello!
>
> I executed a deep test of Paul's toolkit and can confirm the performance
> degradation in 2.0.0.
>
> On Sun, Apr 26, 2015 at 9:50 PM, Paul Emmerich wrote:
> > Hi,
> >
> > I'm working on a DPDK-based packet generator [1] and I recently tried to
> > upgrade from DPDK 1.7.1 to 2.0.0.
> > However, I noticed that DPDK 1.7.1 is about 25% faster than 2.0.0 for my
> > use case.
> >
> > So I ran some basic performance tests on the l2fwd example with DPDK
> > 1.7.1, 1.8.0 and 2.0.0.
> > I used an Intel Xeon E5-2620 v3 CPU clocked down to 1.2 GHz in order to
> > ensure that the CPU and not the network bandwidth is the bottleneck.
> > I configured l2fwd to forward between two interfaces of an X540 NIC using
> > only a single CPU core (-q2) and measured the following throughput under
> > full bidirectional load:
> >
> > Version  TP [Mpps]  Cycles/Pkt
> > 1.7.1    18.84      84.925690021
> > 1.8.0    16.78      95.351609058
> > 2.0.0    16.40      97.56097561
> >
> > DPDK 1.7.1 is about 15% faster in this scenario. The obvious suspect is
> > the new mbuf structure introduced in DPDK 1.8, so I profiled L1 cache
> > misses:
> >
> > Version  L1 miss ratio
> > 1.7.1     6.5%
> > 1.8.0    13.8%
> > 2.0.0    13.4%
> >
> > FWIW the performance results with my packet generator on the same 1.2 GHz
> > CPU core are:
> >
> > Version  TP [Mpps]  L1 cache miss ratio
> > 1.7      11.77      4.3%
> > 2.0       9.5       8.4%

Could you tell me how you got the L1 cache miss ratio? Perf?

> >
> > The discussion about the original patch [2] which introduced the new mbuf
> > structure addresses this potential performance degradation and mentions
> > that it is somehow mitigated.
> > It even claims a 20% *increase* in performance in a specific scenario.
> > However, that doesn't seem to be the case for either l2fwd or my packet
> > generator.
> >
> > Any ideas how to fix this? A 25% loss in throughput prevents me from
> > upgrading to DPDK 2.0.0. I need the new lcore features and the 40 GBit
> > driver updates, so I can't stay on 1.7.1 forever.

Could you provide more information on how you run the l2fwd app, in order to
try to reproduce the issue:
- L2fwd Command line
- L2fwd initialization (to check memory/CPU/NICs)

Did you change the l2fwd app between versions? L2fwd uses simple rx on 1.7.1,
whereas it uses vector rx on 2.0 (enable IXGBE_DEBUG_INIT to check it).

Last question, I assume you use your traffic generator to get all those
numbers. Which packet format/size did you use? Does your traffic generator
take into account the Inter-packet gap?

Thanks!
Pablo

> >
> > Paul
> >
> > [1] https://github.com/emmericp/MoonGen
> > [2] http://comments.gmane.org/gmane.comp.networking.dpdk.devel/5155
>
> --
> Sincerely yours, Pavel Odintsov
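Enabling the init debug output Pablo refers to is a build-time option; a
minimal sketch, assuming the option is still named and located as in the
1.8/2.0 config files:

  sed -i 's/CONFIG_RTE_LIBRTE_IXGBE_DEBUG_INIT=n/CONFIG_RTE_LIBRTE_IXGBE_DEBUG_INIT=y/' \
      config/common_linuxapp
  # rebuild DPDK and l2fwd; the ixgbe PMD then logs at startup which rx/tx
  # function (simple, vector, scattered) it selected for each queue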
[dpdk-dev] Performance regression in DPDK 1.8/2.0
Hi,

I'm working on a DPDK-based packet generator [1] and I recently tried to
upgrade from DPDK 1.7.1 to 2.0.0.
However, I noticed that DPDK 1.7.1 is about 25% faster than 2.0.0 for my use
case.

So I ran some basic performance tests on the l2fwd example with DPDK 1.7.1,
1.8.0 and 2.0.0.
I used an Intel Xeon E5-2620 v3 CPU clocked down to 1.2 GHz in order to ensure
that the CPU and not the network bandwidth is the bottleneck.
I configured l2fwd to forward between two interfaces of an X540 NIC using only
a single CPU core (-q2) and measured the following throughput under full
bidirectional load:

Version  TP [Mpps]  Cycles/Pkt
1.7.1    18.84      84.925690021
1.8.0    16.78      95.351609058
2.0.0    16.40      97.56097561

DPDK 1.7.1 is about 15% faster in this scenario. The obvious suspect is the
new mbuf structure introduced in DPDK 1.8, so I profiled L1 cache misses:

Version  L1 miss ratio
1.7.1     6.5%
1.8.0    13.8%
2.0.0    13.4%

FWIW the performance results with my packet generator on the same 1.2 GHz CPU
core are:

Version  TP [Mpps]  L1 cache miss ratio
1.7      11.77      4.3%
2.0       9.5       8.4%

The discussion about the original patch [2] which introduced the new mbuf
structure addresses this potential performance degradation and mentions that
it is somehow mitigated.
It even claims a 20% *increase* in performance in a specific scenario.
However, that doesn't seem to be the case for either l2fwd or my packet
generator.

Any ideas how to fix this? A 25% loss in throughput prevents me from upgrading
to DPDK 2.0.0. I need the new lcore features and the 40 GBit driver updates,
so I can't stay on 1.7.1 forever.

Paul

[1] https://github.com/emmericp/MoonGen
[2] http://comments.gmane.org/gmane.comp.networking.dpdk.devel/5155
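For anyone trying to reproduce this, the setup boils down to building each
DPDK release and running the bundled l2fwd example on the two X540 ports; a
rough sketch, assuming the classic make-based build of that era and the
command line Paul gives elsewhere in the thread:

  export RTE_SDK=$(pwd)
  export RTE_TARGET=x86_64-native-linuxapp-gcc
  make install T=$RTE_TARGET            # build DPDK 1.7.1 / 1.8.0 / 2.0.0
  cd examples/l2fwd && make             # build the example against it
  ./build/l2fwd -c 3 -n 2 -- -p 3 -q 2  # forward between the two ports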