Hi,

I've checked it further; based on Stephen's suggestion I've tried perf top as well. The results were the same: it spends a lot of time in that part of the code, and there is a high number of branch misses (BR_MISS_PRED_RETIRED) around there too. I've also started to strip down miniflow_extract() to remove parts which are not relevant to this very simple test case. I've removed the metadata-checking branches and the "size < sizeof(struct eth_header)" check, removed the size check from emc_processing, and placed log messages in flow_extract and netdev_flow_key_from_flow to make sure the excessive time spent in miniflow_extract is not because these two are somehow calling it. That way I've ruled out all of the branches preceding this instruction. Oddly, the high sample count has now moved down a few instructions:
...
dp_packet_reset_offsets
  5113eb:       b8 ff ff ff ff          mov    $0xffffffff,%eax
  5113f0:       66 89 8f 86 00 00 00    mov    %cx,0x86(%rdi)
  5113f7:       c6 87 81 00 00 00 00    movb   $0x0,0x81(%rdi)
  5113fe:       66 89 87 82 00 00 00    mov    %ax,0x82(%rdi)
data_pull
  511405:       48 8d 4d 0c             lea    0xc(%rbp),%rcx
dp_packet_reset_offsets
  511409:       66 89 97 84 00 00 00    mov    %dx,0x84(%rdi)
memcpy
  511410:       48 8b 45 00             mov    0x0(%rbp),%rax
  511414:       48 89 46 18             mov    %rax,0x18(%rsi)
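
(For reference, the load/store pair at 511410/511414 is the inlined copy of the Ethernet addresses into the miniflow; below is a rough C equivalent I wrote for illustration, not the exact OVS source.)

    #include <stdint.h>
    #include <string.h>

    /* Sketch: copy dst MAC (6 bytes) + src MAC (6 bytes) from the packet
     * data into the u64-aligned miniflow value area on the stack.  A
     * compiler typically emits this as one 8-byte and one 4-byte mov,
     * which is what shows up at 511410/511414 here (and at
     * 51147a..511488 in the dump from my previous mail below). */
    static inline void
    push_eth_addrs(uint64_t *miniflow_values, const void *pkt_data)
    {
        memcpy(miniflow_values, pkt_data, 12);
    }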

This last store, at 511414, moves the first 8 bytes of the MAC addresses (coming from 0x0(%rbp)) to 0x18(%rsi), which is basically the memory pointed to by the "struct miniflow *dst" parameter; it is allocated on the stack by emc_processing. I couldn't find any branch which could cause this miss, so I then checked the PMD stats:

pmd thread numa_id 0 core_id 1:
        emc hits:4395834176
        megaflow hits:1
        miss:1
        lost:0
        polling cycles:166083129380 (16.65%)
        processing cycles:831536059972 (83.35%)
        avg cycles per packet: 226.95 (997619189352/4395834178)
        avg processing cycles per packet: 189.16 (831536059972/4395834178)

So everything hits the EMC; when I measured the change of that counter over 10 seconds, the result was around 13.3 Mpps as well. The cycle statistics show that it should be able to handle more than 15M packets per second, yet it doesn't receive that much, while with the non-vector PMD it can max out the link.
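
(A quick back-of-the-envelope check of where "more than 15M" comes from; this assumes the 3.2 GHz clock of the i5-4570 mentioned below and that the PMD counters are in core cycles:)

    #include <stdio.h>

    int main(void)
    {
        const double hz = 3.2e9;                                    /* assumed core clock */
        const double cyc_per_pkt  = 997619189352.0 / 4395834178.0;  /* ~226.95, incl. polling */
        const double proc_per_pkt = 831536059972.0 / 4395834178.0;  /* ~189.16, processing only */

        printf("bound incl. polling:   %.1f Mpps\n", hz / cyc_per_pkt / 1e6);   /* ~14.1 */
        printf("bound processing only: %.1f Mpps\n", hz / proc_per_pkt / 1e6);  /* ~16.9 */
        return 0;
    }
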
Any more suggestions?

Regards,

Zoltan


On 21/08/15 19:05, Zoltan Kiss wrote:
Hi,

I've set up a simple packet forwarding perf test on a dual-port 10G
82599ES: one port receives 64-byte UDP packets, the other sends them
out, and one core is used. I've used the latest OVS with DPDK 2.1, and
the first result was only 13.2 Mpps, which was a bit far from the 13.9
I'd seen last year with the same test. The first thing I changed was to
revert to the old behaviour regarding this issue:

http://permalink.gmane.org/gmane.comp.networking.dpdk.devel/22731

So instead of the new default I've passed 2048 + RTE_PKTMBUF_HEADROOM.
That increased the performance to 13.5 Mpps, but to figure out what's
wrong I started to play with the receive functions. First I disabled the
vector PMD, but ixgbe_recv_pkts_bulk_alloc() was even worse, only 12.5
Mpps. Then I enabled scattered RX, and with
ixgbe_recv_pkts_lro_bulk_alloc() I managed to get 13.98 Mpps, which is,
I guess, as close as possible to the 14.2 Mpps line rate (on my HW at
least, with one core).
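
(To be concrete, "passing 2048 + RTE_PKTMBUF_HEADROOM" means the data
room size given to the mbuf pool; here is a minimal sketch with a
made-up pool name and sizes, not the actual netdev-dpdk code:)

    #include <rte_mbuf.h>

    /* Sketch: create the pktmbuf pool with an explicit data room of
     * 2048 + RTE_PKTMBUF_HEADROOM bytes instead of the new default
     * discussed in the thread linked above. */
    static struct rte_mempool *
    make_test_pool(int socket_id)
    {
        return rte_pktmbuf_pool_create("test_pool", 16384, 256,
                                       0 /* priv_size */,
                                       2048 + RTE_PKTMBUF_HEADROOM,
                                       socket_id);
    }
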
Does anyone have a good explanation for why the vector PMD performs so
significantly worse? I would expect that on a 3.2 GHz i5-4570 one core
should be able to reach ~14 Mpps; SG and the vector PMD shouldn't make a
difference.
I've tried to look into it with oprofile, but the results were quite
strange: 35% of the samples were from miniflow_extract, the part where
parse_vlan calls data_pull to jump after the MAC addresses. The oprofile
snippet (1M samples):

   511454 19        0.0037  flow.c:511
   511458 149       0.0292  dp-packet.h:266
   51145f 4264      0.8357  dp-packet.h:267
   511466 18        0.0035  dp-packet.h:268
   51146d 43        0.0084  dp-packet.h:269
   511474 172       0.0337  flow.c:511
   51147a 4320      0.8467  string3.h:51
   51147e 358763   70.3176  flow.c:99
   511482 2        3.9e-04  string3.h:51
   511485 3060      0.5998  string3.h:51
   511488 1693      0.3318  string3.h:51
   51148c 2933      0.5749  flow.c:326
   511491 47        0.0092  flow.c:326

And the corresponding disassembled code:

   511454:       49 83 f9 0d             cmp    r9,0xd
   511458:       c6 83 81 00 00 00 00    mov    BYTE PTR [rbx+0x81],0x0
   51145f:       66 89 83 82 00 00 00    mov    WORD PTR [rbx+0x82],ax
   511466:       66 89 93 84 00 00 00    mov    WORD PTR [rbx+0x84],dx
   51146d:       66 89 8b 86 00 00 00    mov    WORD PTR [rbx+0x86],cx
   511474:       0f 86 af 01 00 00       jbe    511629 <miniflow_extract+0x279>
   51147a:       48 8b 45 00             mov    rax,QWORD PTR [rbp+0x0]
   51147e:       4c 8d 5d 0c             lea    r11,[rbp+0xc]
   511482:       49 89 00                mov    QWORD PTR [r8],rax
   511485:       8b 45 08                mov    eax,DWORD PTR [rbp+0x8]
   511488:       41 89 40 08             mov    DWORD PTR [r8+0x8],eax
   51148c:       44 0f b7 55 0c          movzx  r10d,WORD PTR [rbp+0xc]
   511491:       66 41 81 fa 81 00       cmp    r10w,0x81
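
(For context, data_pull here is just the pointer advance past the two
MAC addresses; a sketch of its shape from memory, which may differ
slightly from the exact lib/flow.c in this tree:)

    #include <stddef.h>

    /* Sketch: return the current data pointer and advance it by 'size'
     * bytes (12 in this case, i.e. the two MAC addresses), shrinking the
     * remaining packet size accordingly; this roughly corresponds to the
     * lea at 51147e above. */
    static inline const void *
    data_pull(const void **datap, size_t *sizep, size_t size)
    {
        const char *data = *datap;
        *datap = data + size;
        *sizep -= size;
        return data;
    }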

My only explanation for this so far is that I misunderstand something
about the oprofile results.

Regards,

Zoltan