On Thu, Jun 16, 2016 at 10:19 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
>
> On 6/16/16, 3:16 PM, "dev on behalf of Wiles, Keith" <dev-bounces at dpdk.org on behalf of keith.wiles at intel.com> wrote:
>
>>
>> On 6/16/16, 3:00 PM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>>
>>> On Thu, Jun 16, 2016 at 9:33 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
>>>> On 6/16/16, 1:20 PM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>>>>
>>>>> On Thu, Jun 16, 2016 at 6:59 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
>>>>>>
>>>>>> On 6/16/16, 11:56 AM, "dev on behalf of Wiles, Keith" <dev-bounces at dpdk.org on behalf of keith.wiles at intel.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> On 6/16/16, 11:20 AM, "Take Ceara" <dumitru.ceara at gmail.com> wrote:
>>>>>>>
>>>>>>>> On Thu, Jun 16, 2016 at 5:29 PM, Wiles, Keith <keith.wiles at intel.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Right now I do not know what the issue is with the system. It could be too many Rx/Tx ring pairs per port limiting the memory in the NICs, which is why you get better performance with 8 cores per port. I am not really seeing the whole picture and how DPDK is configured to help more. Sorry.
>>>>>>>>
>>>>>>>> I doubt that there is a limitation when running 16 cores per port vs. 8 cores per port, as I've tried with two different machines connected back to back, each with one X710 port and 16 cores running on that port. In that case our performance doubled as expected.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Maybe seeing the DPDK command line would help.
>>>>>>>>
>>>>>>>> The command line I use with ports 01:00.3 and 81:00.3 is:
>>>>>>>> ./warp17 -c 0xFFFFFFFFF3 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 -- --qmap 0.0x003FF003F0 --qmap 1.0x0FC00FFC00
>>>>>>>>
>>>>>>>> Our own qmap args allow the user to control exactly how cores are split between ports.
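
Assuming the --qmap value is interpreted as a plain lcore bitmask (which is consistent with the port maps quoted below), the two masks on that command line select lcores 4-9 plus 20-29 for port 0 and lcores 10-19 plus 30-35 for port 1. A minimal stand-alone C sketch of that decoding; print_lcores is an illustrative helper, not a WARP17 or DPDK function:

    #include <stdio.h>
    #include <stdint.h>

    /* Illustrative only: decode a hex lcore bitmask such as the EAL -c
     * coremask or a --qmap value into the lcore ids it selects. */
    static void print_lcores(const char *label, uint64_t mask)
    {
        printf("%s:", label);
        for (unsigned i = 0; i < 64; i++)
            if (mask & (1ULL << i))
                printf(" %u", i);
        printf("\n");
    }

    int main(void)
    {
        print_lcores("port 0", 0x003FF003F0ULL);  /* lcores 4-9 and 20-29  */
        print_lcores("port 1", 0x0FC00FFC00ULL);  /* lcores 10-19 and 30-35 */
        return 0;
    }

The decoded lcore lists match the per-port core assignments shown in the port map that follows.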
>>>>>>>> In this case we end up with:
>>>>>>>>
>>>>>>>> warp17> show port map
>>>>>>>> Port 0[socket: 0]:
>>>>>>>>    Core 4[socket:0] (Tx: 0, Rx: 0)
>>>>>>>>    Core 5[socket:0] (Tx: 1, Rx: 1)
>>>>>>>>    Core 6[socket:0] (Tx: 2, Rx: 2)
>>>>>>>>    Core 7[socket:0] (Tx: 3, Rx: 3)
>>>>>>>>    Core 8[socket:0] (Tx: 4, Rx: 4)
>>>>>>>>    Core 9[socket:0] (Tx: 5, Rx: 5)
>>>>>>>>    Core 20[socket:0] (Tx: 6, Rx: 6)
>>>>>>>>    Core 21[socket:0] (Tx: 7, Rx: 7)
>>>>>>>>    Core 22[socket:0] (Tx: 8, Rx: 8)
>>>>>>>>    Core 23[socket:0] (Tx: 9, Rx: 9)
>>>>>>>>    Core 24[socket:0] (Tx: 10, Rx: 10)
>>>>>>>>    Core 25[socket:0] (Tx: 11, Rx: 11)
>>>>>>>>    Core 26[socket:0] (Tx: 12, Rx: 12)
>>>>>>>>    Core 27[socket:0] (Tx: 13, Rx: 13)
>>>>>>>>    Core 28[socket:0] (Tx: 14, Rx: 14)
>>>>>>>>    Core 29[socket:0] (Tx: 15, Rx: 15)
>>>>>>>>
>>>>>>>> Port 1[socket: 1]:
>>>>>>>>    Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>>    Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>>    Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>>    Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>>    Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>>    Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>>    Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>>    Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>>    Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>>    Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>>    Core 30[socket:1] (Tx: 10, Rx: 10)
>>>>>>>>    Core 31[socket:1] (Tx: 11, Rx: 11)
>>>>>>>>    Core 32[socket:1] (Tx: 12, Rx: 12)
>>>>>>>>    Core 33[socket:1] (Tx: 13, Rx: 13)
>>>>>>>>    Core 34[socket:1] (Tx: 14, Rx: 14)
>>>>>>>>    Core 35[socket:1] (Tx: 15, Rx: 15)
>>>>>>>
>>>>>>> On each socket you have 10 physical cores, i.e. 20 lcores per socket and 40 lcores total.
>>>>>>>
>>>>>>> The above is listing the LCORES (or hyper-threads) and not COREs, which I understand some like to think of as interchangeable. The problem is that hyper-threads are logically interchangeable, but not performance-wise. If you have two run-to-completion threads on a single physical core, each on a different hyper-thread of that core [0,1], then the second lcore or thread (1) on that physical core will only get at most about 20-30% of the CPU cycles. Normally it is much less, unless you tune the code to make sure the two threads are not contending for the internal execution units, but some internal execution units are always shared.
>>>>>>>
>>>>>>> To get the best performance when hyper-threading is enabled, do not run both threads on a single physical core; run only on hyper-thread 0 of each core.
>>>>>>>
>>>>>>> The table below lists the physical core id and each of the lcore ids per socket. Use the first lcore per socket for the best performance:
>>>>>>> Core 1    [1, 21]    [11, 31]
>>>>>>> Use lcore 1 or 11 depending on the socket you are on.
>>>>>>>
>>>>>>> The info below most likely gives the best performance and utilization of your system, if I got the values right.
>>>>>>>
>>>>>>> ./warp17 -c 0x00000FFFe0 -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 -- --qmap 0.0x00000003FE --qmap 1.0x00000FFE00
>>>>>>>
>>>>>>> Port 0[socket: 0]:
>>>>>>>    Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>>>    Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>>>    Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>>>    Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>>>    Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>>>    Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>>>    Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>>>    Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>>>
>>>>>>> 8 cores on the first socket, leaving 0-1 lcores for Linux.
>>>>>>
>>>>>> 9 cores, and leaving the first core or two lcores for Linux.
>>>>>>>
>>>>>>> Port 1[socket: 1]:
>>>>>>>    Core 10[socket:1] (Tx: 0, Rx: 0)
>>>>>>>    Core 11[socket:1] (Tx: 1, Rx: 1)
>>>>>>>    Core 12[socket:1] (Tx: 2, Rx: 2)
>>>>>>>    Core 13[socket:1] (Tx: 3, Rx: 3)
>>>>>>>    Core 14[socket:1] (Tx: 4, Rx: 4)
>>>>>>>    Core 15[socket:1] (Tx: 5, Rx: 5)
>>>>>>>    Core 16[socket:1] (Tx: 6, Rx: 6)
>>>>>>>    Core 17[socket:1] (Tx: 7, Rx: 7)
>>>>>>>    Core 18[socket:1] (Tx: 8, Rx: 8)
>>>>>>>    Core 19[socket:1] (Tx: 9, Rx: 9)
>>>>>>>
>>>>>>> All 10 cores on the second socket.
>>>>>
>>>>> The values were almost right :) But that's because we reserve the first two lcores that are passed to DPDK for our own management part. I was aware that lcores are not physical cores, so we don't expect performance to scale linearly with the number of lcores. However, if there's a chance that another hyper-thread can run while the paired one is stalling, we'd like to take advantage of those cycles if possible.
>>>>>
>>>>> Leaving that aside, I just ran two more tests while using only one of the two hardware threads in each core.
>>>>>
>>>>> a. 2 ports on different sockets with 8 cores/port:
>>>>> ./build/warp17 -c 0xFF3FF -m 32768 -w 0000:81:00.3 -w 0000:01:00.3 -- --qmap 0.0x3FC --qmap 1.0xFF000
>>>>> warp17> show port map
>>>>> Port 0[socket: 0]:
>>>>>    Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>    Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>    Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>    Core 5[socket:0] (Tx: 3, Rx: 3)
>>>>>    Core 6[socket:0] (Tx: 4, Rx: 4)
>>>>>    Core 7[socket:0] (Tx: 5, Rx: 5)
>>>>>    Core 8[socket:0] (Tx: 6, Rx: 6)
>>>>>    Core 9[socket:0] (Tx: 7, Rx: 7)
>>>>>
>>>>> Port 1[socket: 1]:
>>>>>    Core 12[socket:1] (Tx: 0, Rx: 0)
>>>>>    Core 13[socket:1] (Tx: 1, Rx: 1)
>>>>>    Core 14[socket:1] (Tx: 2, Rx: 2)
>>>>>    Core 15[socket:1] (Tx: 3, Rx: 3)
>>>>>    Core 16[socket:1] (Tx: 4, Rx: 4)
>>>>>    Core 17[socket:1] (Tx: 5, Rx: 5)
>>>>>    Core 18[socket:1] (Tx: 6, Rx: 6)
>>>>>    Core 19[socket:1] (Tx: 7, Rx: 7)
>>>>>
>>>>> This gives a session setup rate of only 2M sessions/s.
>>>>>
>>>>> b. 2 ports on socket 0 with 4 cores/port:
>>>>> ./build/warp17 -c 0x3FF -m 32768 -w 0000:02:00.0 -w 0000:03:00.0 -- --qmap 0.0x3C0 --qmap 1.0x03C
>>>>
>>>> One more thing to try: change the -m 32768 to --socket-mem 16384,16384 to make sure the memory is split between the sockets. You may need to remove the /dev/hugepages/* files or wherever you put them.
>>>>
>>>> What is the DPDK -n option set to on your system? Mine is set to "-n 4".
>>>>
>>>
>>> I tried with --socket-mem 16384,16384 but it doesn't make any difference. We call rte_malloc_socket anyway for everything that might be accessed in the fast path, and the mempools are per-core and created with the correct socket-id. Even when starting with '-m 32768' I see that 16 hugepages get allocated on each of the sockets.
>>>
>>> On the test server I have 4 memory channels, so '-n 4'.
>>>
>>>>> warp17> show port map
>>>>> Port 0[socket: 0]:
>>>>>    Core 6[socket:0] (Tx: 0, Rx: 0)
>>>>>    Core 7[socket:0] (Tx: 1, Rx: 1)
>>>>>    Core 8[socket:0] (Tx: 2, Rx: 2)
>>>>>    Core 9[socket:0] (Tx: 3, Rx: 3)
>>>>>
>>>>> Port 1[socket: 0]:
>>>>>    Core 2[socket:0] (Tx: 0, Rx: 0)
>>>>>    Core 3[socket:0] (Tx: 1, Rx: 1)
>>>>>    Core 4[socket:0] (Tx: 2, Rx: 2)
>>>>>    Core 5[socket:0] (Tx: 3, Rx: 3)
>>
>> I do not know now. It seems like something else is going on here that we have not identified.
>
> Maybe VTune or some other type of performance debugging tool would be the next step here.
>
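
The NUMA-local layout described above (per-core mempools created with the correct socket id, rte_malloc_socket for anything touched in the fast path) would look roughly like the sketch below. This is only an illustration of the pattern, not WARP17's actual code: the create_local_pool helper, the pool name, and the sizes are made up, while rte_pktmbuf_pool_create, rte_eth_dev_socket_id, and rte_lcore_to_socket_id are standard DPDK calls.

    #include <stdio.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    /* Illustrative only: one mbuf pool per (port, lcore) pair, allocated on
     * the NUMA socket of the port it feeds so that fast-path mbuf accesses
     * stay local. Sizes and names are placeholders, not WARP17's values. */
    static struct rte_mempool *
    create_local_pool(unsigned lcore_id, uint8_t port_id)
    {
        char name[RTE_MEMPOOL_NAMESIZE];
        int socket = rte_eth_dev_socket_id(port_id);

        if (socket < 0)
            socket = (int)rte_lcore_to_socket_id(lcore_id); /* unknown NUMA node: fall back */

        snprintf(name, sizeof(name), "mbuf_p%u_c%u", (unsigned)port_id, lcore_id);

        /* 16K mbufs, 512-entry per-lcore cache, default data room size */
        return rte_pktmbuf_pool_create(name, 16384, 512, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE, socket);
    }

With pools created this way, a worker lcore only ever allocates and frees mbufs on its own socket, which matches the "no contention when allocating/freeing mbufs" claim later in the thread.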
Thanks for the patience, Keith. I'll try some profiling and see where it takes us from there. I'll update this thread when I have some new info.

Regards,
Dumitru

>>
>>>>>
>>>>> Surprisingly, this gives a session setup rate of 3M sess/s!!
>>>>>
>>>>> The packet processing cores are totally independent and only access local socket memory/ports.
>>>>> There is no locking or atomic variable access in the fast path in our code.
>>>>> The mbuf pools are not shared between cores handling the same port, so there should be no contention when allocating/freeing mbufs.
>>>>> In this specific test scenario all the cores handling port 0 are essentially executing the same code (TCP clients), and the cores on port 1 as well (TCP servers).
>>>>>
>>>>> Do you have any tips about what other things to check for?
>>>>>
>>>>> Thanks,
>>>>> Dumitru
>>>>>
>>>>>>>
>>>>>>> ++Keith
>>>>>>>
>>>>>>>>
>>>>>>>> Just for reference, the cpu_layout script shows:
>>>>>>>> $ $RTE_SDK/tools/cpu_layout.py
>>>>>>>> ============================================================
>>>>>>>> Core and Socket Information (as reported by '/proc/cpuinfo')
>>>>>>>> ============================================================
>>>>>>>>
>>>>>>>> cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
>>>>>>>> sockets = [0, 1]
>>>>>>>>
>>>>>>>>          Socket 0    Socket 1
>>>>>>>>          --------    --------
>>>>>>>> Core 0   [0, 20]     [10, 30]
>>>>>>>> Core 1   [1, 21]     [11, 31]
>>>>>>>> Core 2   [2, 22]     [12, 32]
>>>>>>>> Core 3   [3, 23]     [13, 33]
>>>>>>>> Core 4   [4, 24]     [14, 34]
>>>>>>>> Core 8   [5, 25]     [15, 35]
>>>>>>>> Core 9   [6, 26]     [16, 36]
>>>>>>>> Core 10  [7, 27]     [17, 37]
>>>>>>>> Core 11  [8, 28]     [18, 38]
>>>>>>>> Core 12  [9, 29]     [19, 39]
>>>>>>>>
>>>>>>>> I know it might be complicated to figure out exactly what's happening in our setup with our own code, so please let me know if you need additional information.
>>>>>>>>
>>>>>>>> I appreciate the help!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Dumitru
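
One cheap thing to look at before (or alongside) the profiling step, offered here only as a suggestion beyond what the thread covers, is whether the slower configurations are simply dropping packets at the NIC. A minimal sketch using the standard rte_eth_stats_get call; the dump_port_stats helper and its output format are illustrative, not part of WARP17:

    #include <inttypes.h>
    #include <stdio.h>
    #include <string.h>
    #include <rte_ethdev.h>

    /* Illustrative helper: dump basic per-port counters so that RX drops
     * (imissed) or mbuf allocation failures (rx_nombuf) show up quickly,
     * before reaching for a profiler such as VTune. */
    static void dump_port_stats(uint8_t port_id)
    {
        struct rte_eth_stats st;

        memset(&st, 0, sizeof(st));
        rte_eth_stats_get(port_id, &st);

        printf("port %u: ipackets=%" PRIu64 " opackets=%" PRIu64
               " imissed=%" PRIu64 " ierrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
               (unsigned)port_id, st.ipackets, st.opackets,
               st.imissed, st.ierrors, st.rx_nombuf);
    }

A growing imissed counter would point at the RX queues not being drained fast enough, while rx_nombuf would point back at the mbuf pool sizing rather than at CPU placement.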