On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
> On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>
>>On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles at intel.com>
>>wrote:
>>>
>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces at
>>> dpdk.org on behalf of keith.wiles at intel.com> wrote:
>>>
>>>>
>>>>On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>>>>
>>>>>On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles at intel.com>
>>>>>wrote:
>>>>>
>>>>>>
>>>>>> Right now I do not know what the issue is with the system. Could be too
>>>>>> many Rx/Tx ring pairs per port and limiting the memory in the NICs,
>>>>>> which is why you get better performance when you have 8 core per port. I
>>>>>> am not really seeing the whole picture and how DPDK is configured to
>>>>>> help more. Sorry.
>>>>>
>>>>>I doubt that there is a limitation wrt running 16 cores per port vs 8
>>>>>cores per port as I've tried with two different machines connected
>>>>>back to back each with one X710 port and 16 cores on each of them
>>>>>running on that port. In that case our performance doubled as
>>>>>expected.
>>>>>
>>>>>>
>>>>>> Maybe seeing the DPDK command line would help.
>>>>>
>>>>>The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>./warp17 -c 0xFFFFFFFFF3 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>>--qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>
>>>>>Our own qmap args allow the user to control exactly how cores are
>>>>>split between ports.
>>>>>In this case we end up with:
>>>>>
>>>>>warp17> show port map
>>>>>Port 0[socket: 0]:
>>>>>   Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>   Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>   Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>   Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>   Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>   Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>   Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>   Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>   Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>   Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>   Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>   Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>   Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>   Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>   Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>   Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>
>>>>>Port 1[socket: 1]:
>>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>   Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>   Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>   Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>   Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>   Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>   Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>
>>>>On each socket you have 10 physical cores or 20 lcores per socket for 40
>>>>lcores total.
>>>>
>>>>The above is listing the LCORES (or hyper-threads) and not COREs, which I
>>>>understand some like to think they are interchangeable. The problem is the
>>>>hyper-threads are logically interchangeable, but not performance wise. If
>>>>you have two run-to-completion threads on a single physical core each on a
>>>>different hyper-thread of that core [0,1], then the second lcore or thread
>>>>(1) on that physical core will only get at most about 30-20% of the CPU
>>>>cycles.
>>>>Normally it is much less, unless you tune the code to make sure
>>>>each thread is not trying to share the internal execution units, but some
>>>>internal execution units are always shared.
>>>>
>>>>To get the best performance when hyper-threading is enabled, do not run
>>>>both threads on a single physical core; run only hyper-thread 0.
>>>>
>>>>The table below lists the physical core id and each of the
>>>>lcore ids per socket. Use the first lcore per socket for the best
>>>>performance:
>>>>Core 1 [1, 21] [11, 31]
>>>>Use lcore 1 or 11 depending on the socket you are on.
>>>>
>>>>The info below is most likely the best performance and utilization of your
>>>>system, if I got the values right:
>>>>
>>>>./warp17 -c 0x00000FFFe0 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 --
>>>>--qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>
>>>>Port 0[socket: 0]:
>>>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>
>>>>8 cores on first socket leaving 0-1 lcores for Linux.
>>>
>>> 9 cores and leaving the first core or two lcores for Linux
>>>>
>>>>Port 1[socket: 1]:
>>>>   Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>   Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>   Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>   Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>   Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>   Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>   Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>   Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>   Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>   Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>
>>>>All 10 cores on the second socket.
>>
>>The values were almost right :) But that's because we reserve the
>>first two lcores that are passed to dpdk for our own management part.
>>I was aware that lcores are not physical cores so we don't expect
>>performance to scale linearly with the number of lcores. However, if
>>there's a chance that another hyperthread can run while the paired one
>>is stalling we'd like to take advantage of those cycles if possible.
>>
>>Leaving that aside I just ran two more tests while using only one of
>>the two hwthreads in a core.
>>
>>a. 2 ports on different sockets with 8 cores/port:
>>./build/warp17 -c 0xFF3FF -m 32768 -w 0000:81:00.3 -w 0000:01:00.3
>>-- --qmap 0.0x3FC --qmap 1.0xFF000
>>warp17> show port map
>>Port 0[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>   Core 6[socket:0] (Tx: 4, Rx: 4)
>>   Core 7[socket:0] (Tx: 5, Rx: 5)
>>   Core 8[socket:0] (Tx: 6, Rx: 6)
>>   Core 9[socket:0] (Tx: 7, Rx: 7)
>>
>>Port 1[socket: 1]:
>>   Core 12[socket:1] (Tx: 0, Rx: 0)
>>   Core 13[socket:1] (Tx: 1, Rx: 1)
>>   Core 14[socket:1] (Tx: 2, Rx: 2)
>>   Core 15[socket:1] (Tx: 3, Rx: 3)
>>   Core 16[socket:1] (Tx: 4, Rx: 4)
>>   Core 17[socket:1] (Tx: 5, Rx: 5)
>>   Core 18[socket:1] (Tx: 6, Rx: 6)
>>   Core 19[socket:1] (Tx: 7, Rx: 7)
>>
>>This gives a session setup rate of only 2M sessions/s.
>>
>>b. 2 ports on socket 0 with 4 cores/port:
>>./build/warp17 -c 0x3FF -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 --
>>--qmap 0.0x3C0 --qmap 1.0x03C
>
> One more thing to try: change the -m 32768 to --socket-mem 16384,16384 to make
> sure the memory is split between the sockets. You may need to remove the
> /dev/hugepages/* files or wherever you put them.
>
> What is the dpdk -n option set to on your system? Mine is set to '-n 4'.
>
I tried with --socket-mem 16384,16384 but it doesn't make any difference.
We call rte_malloc_socket anyway for everything that might be accessed in
the fast path, and the mempools are per-core and created with the correct
socket-id. Even when starting with '-m 32768' I see that 16 hugepages get
allocated on each of the sockets.

On the test server I have 4 memory channels so '-n 4'.

>>warp17> show port map
>>Port 0[socket: 0]:
>>   Core 6[socket:0] (Tx: 0, Rx: 0)
>>   Core 7[socket:0] (Tx: 1, Rx: 1)
>>   Core 8[socket:0] (Tx: 2, Rx: 2)
>>   Core 9[socket:0] (Tx: 3, Rx: 3)
>>
>>Port 1[socket: 0]:
>>   Core 2[socket:0] (Tx: 0, Rx: 0)
>>   Core 3[socket:0] (Tx: 1, Rx: 1)
>>   Core 4[socket:0] (Tx: 2, Rx: 2)
>>   Core 5[socket:0] (Tx: 3, Rx: 3)
>>
>>Surprisingly this gives a session setup rate of 3M sess/s!!
>>
>>The packet processing cores are totally independent and only access
>>local socket memory/ports.
>>There is no locking or atomic variable access in fast path in our code.
>>The mbuf pools are not shared between cores handling the same port so
>>there should be no contention when allocating/freeing mbufs.
>>In this specific test scenario all the cores handling port 0 are
>>essentially executing the same code (TCP clients) and the cores on
>>port 1 as well (TCP servers).
>>
>>Do you have any tips about what other things to check for?
>>
>>Thanks,
>>Dumitru
>>
>>>>
>>>>++Keith
>>>>
>>>>>
>>>>>Just for reference, the cpu_layout script shows:
>>>>>$ $RTE_SDK/tools/cpu_layout.py
>>>>>============================================================
>>>>>Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>============================================================
>>>>>
>>>>>cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>sockets = [0, 1]
>>>>>
>>>>>          Socket 0        Socket 1
>>>>>          --------        --------
>>>>>Core 0    [0, 20]         [10, 30]
>>>>>Core 1    [1, 21]         [11, 31]
>>>>>Core 2    [2, 22]         [12, 32]
>>>>>Core 3    [3, 23]         [13, 33]
>>>>>Core 4    [4, 24]         [14, 34]
>>>>>Core 8    [5, 25]         [15, 35]
>>>>>Core 9    [6, 26]         [16, 36]
>>>>>Core 10   [7, 27]         [17, 37]
>>>>>Core 11   [8, 28]         [18, 38]
>>>>>Core 12   [9, 29]         [19, 39]
>>>>>
>>>>>I know it might be complicated to figure out exactly what's happening
>>>>>in our setup with our own code so please let me know if you need
>>>>>additional information.
>>>>>
>>>>>I appreciate the help!
>>>>>
>>>>>Thanks,
>>>>>Dumitru
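
For reference, the hex masks used in this thread can be decoded with a short script to see which lcores each one selects and whether both hyper-threads of a physical core end up in use. This is only a sketch: it assumes that sibling hyper-threads are lcores n and n + 20, as in the cpu_layout.py output quoted above, and the helper names mask_to_lcores and sibling_pairs are made up for illustration (they are not part of DPDK or warp17).

```python
# Sketch: decode hex core masks and flag hyper-thread siblings.
# Assumption: siblings are lcores (n, n + 20), which holds for the
# cpu_layout.py output quoted in this thread but not for every machine.


def mask_to_lcores(mask):
    """Return the sorted lcore ids whose bits are set in a hex string or int mask."""
    value = int(mask, 16) if isinstance(mask, str) else mask
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def sibling_pairs(lcores, offset=20):
    """Return (lcore, sibling) pairs where both hyper-threads of a core are selected."""
    used = set(lcores)
    return [(n, n + offset) for n in sorted(used) if n + offset in used]


if __name__ == "__main__":
    masks = [
        ("first run, port 0 qmap", "0x003FF003F0"),
        ("test a, port 0 qmap", "0x3FC"),
        ("test a, port 1 qmap", "0xFF000"),
    ]
    for label, mask in masks:
        lcores = mask_to_lcores(mask)
        print(label, lcores, "shared physical cores:", sibling_pairs(lcores))
```

With the masks above, the first run's port 0 qmap selects lcores 4-9 and 20-29, so six physical cores run both of their hyper-threads, while the two qmaps from test a select lcores 2-9 and 12-19 with no siblings in use.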