It seems I have found the cause, but I still don't understand the reason.
So, let me describe my setup a bit further. I installed VMware Workstation on 
my laptop. It has a mobile i5 CPU: 2 cores with hyperthreading, so effectively 
4 logical cores.
In VMware I assigned 1 CPU with one core each to the C1 and C2 nodes; BR has 
one CPU with 4 cores allocated (the maximum possible value).

If I execute 'basicfwd' or the multi-process master (and two clients) on any 
of cores 2, 3 or 4, then the ping is answered immediately (less than 0.5ms) 
and the transfer speed is high right away (starting from ~30MB/s and finishing 
at around 80-90MB/s with basicfwd and test-pmd alike *).
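
(For reference, basicfwd's data path is basically a tight polling loop like 
the one below. This is only a rough sketch from memory, not the exact example 
source; BURST_SIZE and the two-port numbering are just placeholders.)

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void
forward_loop(void)
{
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t port, nb_rx, nb_tx, i;

        for (;;) {
                for (port = 0; port < 2; port++) {
                        /* poll the RX queue; may return 0..BURST_SIZE packets */
                        nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
                        if (nb_rx == 0)
                                continue;
                        /* forward to the peer port (0 <-> 1) */
                        nb_tx = rte_eth_tx_burst(port ^ 1, 0, bufs, nb_rx);
                        /* free anything the TX queue did not accept */
                        for (i = nb_tx; i < nb_rx; i++)
                                rte_pktmbuf_free(bufs[i]);
                }
        }
}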

If I allocate them on core 1 (the clients on any other cores), then the ping 
behaves as I originally described: 1sec delays. When I tried to transfer a 
bigger file (I used scp) it started really slowly (some 16-32KB/s) and 
sometimes even stalled. Later on it got faster, as Matthew wrote, but it never 
went above 20-30MB/s.

test-pmd worked from the beginning. This is because test-pmd requires 2 cores 
to be defined and I always passed '-c 3'. Checking with top, it could be seen 
that it always used CPU#2 (top showed the second CPU at 100% utilization).
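
If I understand the EAL coremask and test-pmd correctly, '-c 3' is hex 0x3 
(binary 11), i.e. lcores 0 and 1, and since the first lcore acts as the master 
core the forwarding thread always ran on lcore 1, which is why top showed the 
second CPU busy. So, roughly (other options abbreviated, '-n 4' is just my 
memory-channel setting):

./testpmd -c 0x3 -n 4 -- -i    # lcores 0 and 1 -> forwarding on CPU#2
./testpmd -c 0xC -n 4 -- -i    # lcores 2 and 3 only, core 1 never touched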

Can anyone tell me the reason for this behavior? Using CPU 1 there are huge 
latencies; using the other CPUs everything works as expected...
Checking on the laptop (Windows Task Manager), it could be seen that none of 
the VMs was driving any single CPU of my laptop to 100%. The 100% utilization 
of the DPDK processes was somehow spread across the physical CPU cores, so no 
single core was used exclusively by a VM. Why is the situation different when 
I use the first CPU on BR rather than the others? It doesn't seem that C1 and 
C2 are blocking that CPU. In any case, the host OS already uses all the cores 
(though not heavily).


Rashmin, thanks for the docs. I think I had already seen that one but didn't 
take it seriously. I thought latency tuning in VMware ESXi only makes sense 
when one wants to go from 5ms down to 0.5ms, but I was seeing 1000ms latency 
at low load... I will check whether those parameters apply to Workstation at 
all.


*) Top speed of the multi-process master-client example was around 20-30 
MB/s, reached immediately. I think this is a normal limitation because the 
processes have to talk to each other through shared memory, so it is slower 
anyway. I didn't test its speed when the master process was bound to core 1.
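
(Just to illustrate what I mean by the shared-memory handoff: roughly, the 
server process enqueues the received mbufs onto a ring that the client process 
dequeues from. This is only a rough sketch from memory, not the actual example 
code; the error handling and burst sizes are simplified.)

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

/* server/master process: receive a burst and hand each mbuf to the client */
static void
server_loop(uint16_t port, struct rte_ring *to_client)
{
        struct rte_mbuf *bufs[32];
        uint16_t n, i;

        for (;;) {
                n = rte_eth_rx_burst(port, 0, bufs, 32);
                for (i = 0; i < n; i++)
                        if (rte_ring_enqueue(to_client, bufs[i]) != 0)
                                rte_pktmbuf_free(bufs[i]);  /* ring full: drop */
        }
}

/* client process: dequeue from the shared ring and transmit */
static void
client_loop(uint16_t port, struct rte_ring *from_server)
{
        void *obj;
        struct rte_mbuf *m;

        for (;;) {
                if (rte_ring_dequeue(from_server, &obj) == 0) {
                        m = obj;
                        if (rte_eth_tx_burst(port, 0, &m, 1) == 0)
                                rte_pktmbuf_free(m);
                }
        }
}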

Sandor

-----Original Message-----
From: ext Patel, Rashmin N [mailto:rashmin.n.pa...@intel.com] 
Sent: Thursday, June 25, 2015 10:56 PM
To: Matthew Hall; Vass, Sandor (Nokia - HU/Budapest)
Cc: dev at dpdk.org
Subject: RE: [dpdk-dev] VMXNET3 on vmware, ping delay

For tuning ESXi and vSwitch for latency-sensitive workloads, I recall the 
following paper published by VMware, which you can try out: 
https://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf

The overall latency in this setup (VMware and a DPDK VM using VMXNET3) lies in 
the vmware-native-driver/vmkernel/vmxnet3-backend/vmx-emulation threads in 
ESXi. So you can tune ESXi (as explained in the above white paper) and/or make 
sure that these important threads are not starved, which improves latency and 
throughput in some cases of this setup.

Thanks,
Rashmin

-----Original Message-----
From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Matthew Hall
Sent: Thursday, June 25, 2015 8:19 AM
To: Vass, Sandor (Nokia - HU/Budapest)
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] VMXNET3 on vmware, ping delay

On Thu, Jun 25, 2015 at 09:14:53AM +0000, Vass, Sandor (Nokia - HU/Budapest) 
wrote:
> According to my understanding each packet should go through BR as fast 
> as possible, but it seems that rte_eth_rx_burst retrieves packets only 
> when there are at least 2 packets on the RX queue of the NIC. At least 
> most of the time, as there are cases (rarely, according to my console 
> log) when it retrieves only 1 packet, and sometimes 3 packets are 
> retrieved...

By default DPDK is optimized for throughput, not latency. Try a test with 
heavier traffic.

There is also some work going on now on a DPDK interrupt-driven mode, which 
will behave more like traditional Ethernet drivers rather than poll-mode 
drivers.

Though I'm not an expert on it, there are also a number of ways to optimize 
for latency, which hopefully some others can discuss... or you could search 
the archives / web site / Intel tuning documentation.

Matthew.
