On Thu, Jun 16, 2016 at 10:19 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
>
> On 6/16/16, 3:16 PM, "dev on behalf of Wiles, Keith" <dev-bounces at dpdk.org 
> on behalf of keith.wiles at intel.com> wrote:
>
>>
>>On 6/16/16, 3:00 PM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>>
>>>On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith <keith.wiles at intel.com> 
>>>wrote:
>>>> On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>>>>
>>>>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles at intel.com> 
>>>>>wrote:
>>>>>>
>>>>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces at 
>>>>>> dpdk.org on behalf of keith.wiles at intel.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>>>>>>>
>>>>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles at 
>>>>>>>>intel.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Right now I do not know what the issue is with the system. Could be 
>>>>>>>>> too many Rx/Tx ring pairs per port and limiting the memory in the 
>>>>>>>>> NICs, which is why you get better performance when you have 8 core 
>>>>>>>>> per port. I am not really seeing the whole picture and how DPDK is 
>>>>>>>>> configured to help more. Sorry.
>>>>>>>>
>>>>>>>>I doubt that there is a limitation with running 16 cores per port vs 8
>>>>>>>>cores per port: I've tried with two different machines connected
>>>>>>>>back to back, each with one X710 port and 16 cores running on that
>>>>>>>>port, and in that case our performance doubled as expected.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Maybe seeing the DPDK command line would help.
>>>>>>>>
>>>>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>>>>./warp17 -c 0xFFFFFFFFF3   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>>>>
>>>>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>>>>split between ports. In this case we end up with:
>>>>>>>>
>>>>>>>>warp17> show port map
>>>>>>>>Port 0[socket: 0]:
>>>>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>>>>
>>>>>>>>Port 1[socket: 1]:
>>>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
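
For readers not familiar with the masks above: each --qmap value is simply
"port.hex_core_mask", where every set bit N assigns lcore N to that port.
Below is a tiny sketch (illustrative only, not the actual warp17 parsing
code) showing how the two masks decode into the lcore lists shown above:

#include <stdio.h>
#include <stdint.h>

/* Decode a qmap-style hex core mask: bit N set means lcore N is
 * assigned to the port. */
static void
print_qmap(unsigned int port, uint64_t core_mask)
{
        printf("Port %u:", port);
        for (unsigned int lcore = 0; lcore < 64; lcore++)
                if (core_mask & (UINT64_C(1) << lcore))
                        printf(" %u", lcore);
        printf("\n");
}

int main(void)
{
        /* The two masks from the command line above. */
        print_qmap(0, UINT64_C(0x003FF003F0)); /* lcores 4-9, 20-29   */
        print_qmap(1, UINT64_C(0x0FC00FFC00)); /* lcores 10-19, 30-35 */
        return 0;
}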
>>>>>>>
>>>>>>>On each socket you have 10 physical cores, or 20 lcores per socket, for
>>>>>>>40 lcores total.
>>>>>>>
>>>>>>>The above is listing the LCORES (or hyper-threads), not COREs, which
>>>>>>>I understand some like to treat as interchangeable. The problem is
>>>>>>>that hyper-threads are logically interchangeable, but not performance
>>>>>>>wise. If you have two run-to-completion threads on a single physical
>>>>>>>core, each on a different hyper-thread of that core [0,1], then the
>>>>>>>second lcore or thread (1) on that physical core will only get at most
>>>>>>>about 20-30% of the CPU cycles. Normally it is much less, unless you
>>>>>>>tune the code to make sure the two threads are not contending for the
>>>>>>>internal execution units, but some internal execution units are always
>>>>>>>shared.
>>>>>>>
>>>>>>>To get the best performance when hyper-threading is enabled, do not run
>>>>>>>both threads on a single physical core; run only one of them (hyper-thread 0).
>>>>>>>
>>>>>>>The table below lists the physical core id and its lcore ids on each
>>>>>>>socket. Use the first lcore of each pair for the best
>>>>>>>performance:
>>>>>>>Core 1 [1, 21]    [11, 31]
>>>>>>>Use lcore 1 or 11 depending on the socket you are on.
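
For anyone who wants to cross-check which lcores are hyper-thread siblings
without using cpu_layout.py, here is a minimal sketch that reads the standard
Linux sysfs topology files (plain sysfs, not a DPDK API; error handling kept
to a minimum):

#include <stdio.h>

/* Print the socket (package) and physical core id for every CPU the
 * kernel exposes; lcores sharing the same package_id + core_id are
 * hyper-thread siblings. */
int main(void)
{
        for (int cpu = 0; ; cpu++) {
                char path[128];
                int pkg, core;
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                         cpu);
                f = fopen(path, "r");
                if (f == NULL)
                        break;          /* no more CPUs */
                if (fscanf(f, "%d", &pkg) != 1)
                        pkg = -1;
                fclose(f);

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
                f = fopen(path, "r");
                if (f == NULL)
                        break;
                if (fscanf(f, "%d", &core) != 1)
                        core = -1;
                fclose(f);

                printf("lcore %d -> socket %d, core %d\n", cpu, pkg, core);
        }
        return 0;
}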
>>>>>>>
>>>>>>>The configuration below will most likely give the best performance and
>>>>>>>utilization of your system, if I got the values right:
>>>>>>>
>>>>>>>./warp17 -c 0x00000FFFe0   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>>>>
>>>>>>>Port 0[socket: 0]:
>>>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>>>
>>>>>>>8 cores on the first socket, leaving lcores 0-1 for Linux.
>>>>>>
>>>>>> 9 cores, leaving the first core (two lcores) for Linux
>>>>>>>
>>>>>>>Port 1[socket: 1]:
>>>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>
>>>>>>>All 10 cores on the second socket.
>>>>>
>>>>>The values were almost right :) That's because we reserve the
>>>>>first two lcores passed to DPDK for our own management code.
>>>>>I was aware that lcores are not physical cores, so we don't expect
>>>>>performance to scale linearly with the number of lcores. However, if
>>>>>there's a chance that one hyper-thread can run while its sibling is
>>>>>stalled, we'd like to take advantage of those cycles if possible.
>>>>>
>>>>>Leaving that aside I just ran two more tests while using only one of
>>>>>the two hwthreads in a core.
>>>>>
>>>>>a. 2 ports on different sockets with 8 cores/port:
>>>>>./build/warp17 -c 0xFF3FF   -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>>>>>-- --qmap 0.0x3FC --qmap 1.0xFF000
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>
>>>>>Port 1[socket: 1]:
>>>>>   Core 12[socket:1] (Tx: 0, Rx: 0)
>>>>>   Core 13[socket:1] (Tx: 1, Rx: 1)
>>>>>   Core 14[socket:1] (Tx: 2, Rx: 2)
>>>>>   Core 15[socket:1] (Tx: 3, Rx: 3)
>>>>>   Core 16[socket:1] (Tx: 4, Rx: 4)
>>>>>   Core 17[socket:1] (Tx: 5, Rx: 5)
>>>>>   Core 18[socket:1] (Tx: 6, Rx: 6)
>>>>>   Core 19[socket:1] (Tx: 7, Rx: 7)
>>>>>
>>>>>This gives a session setup rate of only 2M sessions/s.
>>>>>
>>>>>b. 2 ports on socket 0 with 4 cores/port:
>>>>>./build/warp17 -c 0x3FF   -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>>>>>--qmap 0.0x3C0 --qmap 1.0x03C
>>>>
>>>> One more thing to try: change the -m 32768 to --socket-mem 16384,16384 to
>>>> make sure the memory is split between the sockets. You may need to remove
>>>> the /dev/hugepages/* files or wherever you put them.
>>>>
>>>> What is the DPDK -n option set to on your system? Mine is set to '-n 4'.
>>>>
>>>
>>>I tried with --socket-mem 16384,16384 but it doesn't make any
>>>difference. In any case, we call rte_malloc_socket for everything that
>>>might be accessed in the fast path, and the mempools are per-core and
>>>created with the correct socket-id. Even when starting with '-m 32768'
>>>I see that 16 hugepages get allocated on each of the sockets.
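
For reference, the per-core pools are created roughly along these lines; the
snippet below is only a minimal sketch (pool size, cache size and the naming
scheme are illustrative, not the actual warp17 code), but it uses the
standard DPDK calls for pinning a pool to the port's socket:

#include <stdio.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_ethdev.h>

/* Illustrative sizes, not the real configuration. */
#define PKT_MBUF_COUNT 8192
#define PKT_MBUF_CACHE 256

/* One mbuf pool per (lcore, port) pair, allocated on the socket the
 * port is attached to, so fast-path alloc/free never crosses QPI and
 * pools are never shared between cores. */
static struct rte_mempool *
pkt_pool_create(unsigned int lcore_id, uint8_t port_id)
{
        char name[RTE_MEMPOOL_NAMESIZE];
        int socket_id = rte_eth_dev_socket_id(port_id);

        snprintf(name, sizeof(name), "mbuf-pool-c%u-p%u",
                 lcore_id, (unsigned int)port_id);
        return rte_pktmbuf_pool_create(name, PKT_MBUF_COUNT, PKT_MBUF_CACHE,
                                       0, RTE_MBUF_DEFAULT_BUF_SIZE,
                                       socket_id);
}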
>>>
>>>On the test server I have 4 memory channels so '-n 4'.
>>>
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 6[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 7[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 8[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 9[socket:0] (Tx: 3, Rx: 3)
>>>>>
>>>>>Port 1[socket: 0]:
>>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>
>>I do not know now. It seems like something else is going on here that we have 
>>not identified.
>
> Maybe VTune or some other performance profiling tool would be the next
> step here.
>

Thanks for the patience, Keith.
I'll try some profiling and see where it takes us from there. I'll
update this thread when I have some new info.

Regards,
Dumitru

>>
>>>>>
>>>>>Surprisingly, this gives a session setup rate of 3M sessions/s!
>>>>>
>>>>>The packet processing cores are totally independent and only access
>>>>>local-socket memory/ports.
>>>>>There is no locking or atomic variable access in the fast path of our code.
>>>>>The mbuf pools are not shared between cores handling the same port, so
>>>>>there should be no contention when allocating/freeing mbufs.
>>>>>In this specific test scenario all the cores handling port 0 are
>>>>>essentially executing the same code (TCP clients), and so are the
>>>>>cores handling port 1 (TCP servers).
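
One thing we can also double-check on our side (purely illustrative, not a
known cause here) is whether traffic really is spread evenly across the
queues of each port. A minimal sketch using the standard per-queue counters
from rte_eth_stats_get() (note they only cover the first
RTE_ETHDEV_QUEUE_STAT_CNTRS queues):

#include <stdio.h>
#include <inttypes.h>
#include <rte_ethdev.h>

/* Dump per-queue RX/TX packet counts plus the drop counters for one
 * port; a skewed distribution or growing imissed/rx_nombuf would show
 * up immediately. */
static void
print_queue_stats(uint8_t port_id, uint16_t nb_queues)
{
        struct rte_eth_stats stats;
        uint16_t q;

        rte_eth_stats_get(port_id, &stats);

        for (q = 0; q < nb_queues && q < RTE_ETHDEV_QUEUE_STAT_CNTRS; q++)
                printf("port %u q %u: rx %" PRIu64 " tx %" PRIu64 "\n",
                       (unsigned int)port_id, (unsigned int)q,
                       stats.q_ipackets[q], stats.q_opackets[q]);

        printf("port %u: imissed %" PRIu64 " rx_nombuf %" PRIu64 "\n",
               (unsigned int)port_id, stats.imissed, stats.rx_nombuf);
}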
>>>>>
>>>>>Do you have any tips about what other things to check for?
>>>>>
>>>>>Thanks,
>>>>>Dumitru
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>>++Keith
>>>>>>>
>>>>>>>>
>>>>>>>>Just for reference, the cpu_layout script shows:
>>>>>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>>>>>============================================================
>>>>>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>>>>============================================================
>>>>>>>>
>>>>>>>>cores =  [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>>>>sockets =  [0, 1]
>>>>>>>>
>>>>>>>>        Socket 0        Socket 1
>>>>>>>>        --------        --------
>>>>>>>>Core 0  [0, 20]         [10, 30]
>>>>>>>>Core 1  [1, 21]         [11, 31]
>>>>>>>>Core 2  [2, 22]         [12, 32]
>>>>>>>>Core 3  [3, 23]         [13, 33]
>>>>>>>>Core 4  [4, 24]         [14, 34]
>>>>>>>>Core 8  [5, 25]         [15, 35]
>>>>>>>>Core 9  [6, 26]         [16, 36]
>>>>>>>>Core 10 [7, 27]         [17, 37]
>>>>>>>>Core 11 [8, 28]         [18, 38]
>>>>>>>>Core 12 [9, 29]         [19, 39]
>>>>>>>>
>>>>>>>>I know it might be complicated to figure out exactly what's happening
>>>>>>>>in our setup with our own code, so please let me know if you need
>>>>>>>>additional information.
>>>>>>>>
>>>>>>>>I appreciate the help!
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Dumitru