On 5/17/19 3:04 PM, David Marchand wrote:


On Fri, May 17, 2019 at 2:23 PM Maxime Coquelin <maxime.coque...@redhat.com> wrote:

    Some OVS-DPDK PVP benchmarks show a performance drop
    when switching from DPDK v17.11 to v18.11.

    With the addition of packed ring layout support,
    rte_vhost_enqueue_burst and rte_vhost_dequeue_burst
    became very large, yet only part of their instructions
    is executed, as a virtqueue uses either the packed or
    the split ring layout.

    This series aims at reducing I-cache pressure,
    first by un-inlining the split and packed ring paths,
    but also by moving parts considered cold into
    dedicated functions (dirty page logging, fragmented
    descriptor buffer management added for CVE-2018-1059).
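
As a rough illustration of the hot/cold split described above (a sketch
only, not code from the series): the rarely taken branch is moved into a
function marked noinline and cold, so the hot loop only carries a call
instruction. The names dirty_log, log_dirty_page and enqueue_burst are
hypothetical, and the attributes are GCC/Clang specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dirty-page bitmap; stands in for the real vhost log. */
struct dirty_log {
    uint8_t *bitmap;   /* one bit per page */
    size_t nr_pages;
};

/*
 * Cold path: kept out of line so its instructions do not sit in the
 * I-cache next to the hot loop. Assumed to run only during migration.
 */
static __attribute__((noinline, cold)) void
log_dirty_page(struct dirty_log *log, size_t page)
{
    assert(page < log->nr_pages);
    log->bitmap[page / 8] |= (uint8_t)(1u << (page % 8));
}

/* Hot path: the logging branch is hinted as unlikely and is only a call. */
static inline size_t
enqueue_burst(struct dirty_log *log, const size_t *pages, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        /* ... per-packet work would go here ... */
        if (__builtin_expect(log != NULL, 0))
            log_dirty_page(log, pages[i]);
    }
    return count;
}
```

Passing a NULL log models the common case where logging is disabled and
the cold function is never entered.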

    With the series applied, the size of the enqueue and
    dequeue split paths is more than halved:

    +---------+--------------------+---------------------+
    | Version | Enqueue split path |  Dequeue split path |
    +---------+--------------------+---------------------+
    | v19.05  | 16461B             | 25521B              |
    | +series | 7286B              | 11285B              |
    +---------+--------------------+---------------------+
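
For reference, the relative reduction can be worked out from the table
with a small helper (reduction_pct is a hypothetical name used only for
this note):

```c
#include <assert.h>

/* Percentage by which `after` is smaller than `before`. */
static double
reduction_pct(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

reduction_pct(16461, 7286) gives about 55.7 for the enqueue path and
reduction_pct(25521, 11285) gives about 55.8 for the dequeue path.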

    Using the perf tool to monitor the iTLB-load-misses event
    while running a PVP benchmark with testpmd as the vswitch,
    we can see that the number of iTLB misses is reduced:

    - v19.05:
    # perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10

      Performance counter stats for 'CPU(s) 2,3' (10 runs):

             2,438      iTLB-load-miss                   ( +- 13.43% )

            10.00058928 +- 0.00000336 seconds time elapsed  ( +-  0.00% )

    - +series:
    # perf stat --repeat 10  -C 2,3  -e iTLB-load-miss -- sleep 10

      Performance counter stats for 'CPU(s) 2,3' (10 runs):

                55      iTLB-load-miss                   ( +- 10.08% )

            10.00059466 +- 0.00000283 seconds time elapsed  ( +-  0.00% )

    The series also forces the inlining of some rte_memcpy
    helpers: once packed ring support was added, some of
    them were no longer inlined but emitted as standalone
    functions in the virtio_net object file, which was not
    expected.
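
A minimal sketch of what forcing inlining looks like with GCC/Clang,
assuming nothing about the actual patch: plain "static inline" is only a
hint, while the always_inline attribute (which DPDK wraps in its
__rte_always_inline macro) takes the decision away from the compiler.
The helpers mov16 and copy_blocks16 below are hypothetical stand-ins,
not the real rte_memcpy code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for DPDK's __rte_always_inline macro. */
#define ALWAYS_INLINE inline __attribute__((always_inline))

/*
 * Hypothetical fixed-size copy helper. With plain `static inline` the
 * compiler may emit it as a real function once its callers grow;
 * always_inline removes that choice.
 */
static ALWAYS_INLINE void
mov16(uint8_t *dst, const uint8_t *src)
{
    memcpy(dst, src, 16);
}

/* Copies `len` bytes in 16-byte blocks; `len` must be a multiple of 16. */
static ALWAYS_INLINE void
copy_blocks16(uint8_t *dst, const uint8_t *src, size_t len)
{
    assert(len % 16 == 0);
    for (size_t off = 0; off < len; off += 16)
        mov16(dst + off, src + off);
}
```

One way to check the effect is to list the symbols of the resulting
object file with nm and confirm the helper names no longer show up.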

    Finally, the series simplifies descriptor buffer
    prefetching by doing it in the recently introduced
    descriptor buffer mapping function.
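
The idea of prefetching at mapping time could be sketched as follows.
This is an assumption-laden illustration, not the vhost code: the
mem_region structure and map_and_prefetch function are hypothetical, a
single region replaces the real guest memory table, and the GCC/Clang
builtin is used where DPDK code would typically call rte_prefetch0().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single guest memory region; real vhost keeps a table. */
struct mem_region {
    uint64_t guest_base;
    uint64_t size;
    uint8_t *host_base;
};

/*
 * Translate a guest address to a host pointer and prefetch the buffer
 * in the same place, so callers need no prefetch calls of their own.
 * Returns NULL when the range is not fully covered by the region.
 */
static inline void *
map_and_prefetch(const struct mem_region *r, uint64_t gpa, uint64_t len)
{
    assert(r != NULL);
    if (gpa < r->guest_base || len > r->size ||
        gpa - r->guest_base > r->size - len)
        return NULL;

    uint8_t *hva = r->host_base + (gpa - r->guest_base);
    __builtin_prefetch(hva, 0, 3); /* read access, keep in all cache levels */
    return hva;
}
```

Keeping the prefetch next to the address translation means every mapped
buffer is prefetched exactly once, whichever path requested the mapping.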

    Maxime Coquelin (4):
       vhost: un-inline dirty pages logging functions
       vhost: do not inline packed and split functions
       vhost: do not inline unlikely fragmented buffers code
       vhost: simplify descriptor's buffer prefetching

    root (1):
       eal/x86: force inlining of all memcpy and mov helpers


root ? "oops" :-)

Indeed... Oops!



--
David Marchand
