Initializing the dp_packet's metadata can be a hot spot, especially
for very simple pipelines.  Therefore improving the code here can
sometimes make a difference.

Using memcpy instead of a plain assignment helps GCC and clang generate
faster code. Here's a comparison of the compiler-generated code (GCC 4.8)
with and without this commit.

BEFORE (assignment)                 |     AFTER (memcpy)

c8:  add    $0x8,%r8                |   d8:  mov    (%rsi),%r8
     mov    (%rcx),%r9              |        mov    (%rdx),%rdi
     mov    (%rbx),%r11d            |        add    $0x1,%ecx
     mov    %r10,%rcx               |        add    $0x8,%rsi
     cmp    %rsi,%r8                |        cmp    -0x870(%rbp),%ecx
     lea    0x88(%r9),%rdi          |        mov    %rdi,0x88(%r8)
     rep    stos %rax,%es:(%rdi)    |        mov    0x8(%rdx),%rdi
     mov    %r11d,0xb8(%r9)         |        lea    0x88(%r8),%rax
     mov    %r8,%rcx                |        mov    %rdi,0x90(%r8)
     jne    c8                      |        mov    0x10(%rdx),%rdi
                                    |        mov    %rdi,0x98(%r8)
                                    |        mov    0x18(%rdx),%rdi
                                    |        mov    %rdi,0xa0(%r8)
                                    |        mov    0x20(%rdx),%r8
                                    |        mov    %r8,0x20(%rax)
                                    |        mov    0x28(%rdx),%r8
                                    |        mov    %r8,0x28(%rax)
                                    |        mov    0x30(%rdx),%r8
                                    |        mov    %r8,0x30(%rax)
                                    |        jl     d8

The old code uses a 'rep stos' and fetches the 'port_no' value from
the 'port' member on every iteration ('mov (%rbx),%r11d'), while the
new code uses a series of mov operations to accomplish everything.
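
The two patterns being compared can be sketched in isolation as below.
This is only an illustration: 'struct demo_md', 'DEMO_MD_INITIALIZER',
'struct demo_packet' and 'struct demo_port' are hypothetical stand-ins,
not the real OVS definitions, and the generated code will depend on the
compiler and flags.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct pkt_metadata.
 * The real structure has a different layout. */
struct demo_md {
    uint64_t regs[4];
    uint32_t recirc_id;
    uint32_t dp_hash;
    uint32_t pkt_mark;
    uint32_t in_port;
};

/* Hypothetical stand-in for PKT_METADATA_INITIALIZER: every member is
 * zero except 'in_port'. */
#define DEMO_MD_INITIALIZER(PORT) \
    (struct demo_md) { .in_port = (PORT) }

struct demo_packet {
    struct demo_md md;
};

struct demo_port {
    uint32_t port_no;
};

/* Old pattern: the initializer is rebuilt and assigned on every
 * iteration, so 'port->port_no' is read inside the loop. */
void
demo_init_assign(struct demo_packet **packets, int cnt,
                 const struct demo_port *port)
{
    int i;

    for (i = 0; i < cnt; i++) {
        packets[i]->md = DEMO_MD_INITIALIZER(port->port_no);
    }
}

/* New pattern: the metadata is built once outside the loop and then
 * copied into each packet with memcpy. */
void
demo_init_memcpy(struct demo_packet **packets, int cnt,
                 const struct demo_port *port)
{
    const struct demo_md md = DEMO_MD_INITIALIZER(port->port_no);
    int i;

    for (i = 0; i < cnt; i++) {
        memcpy(&packets[i]->md, &md, sizeof md);
    }
}
```

Both functions leave every packet with identical metadata; only the
shape of the loop body that the compiler sees is different.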

I can measure a throughput improvement of ~7% on a single flow phy-phy
test with 64-byte UDP packets.

The improvement has been observed on an Intel Xeon Sandy Bridge (2012)
and on an Intel Xeon Westmere (2010).

Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com>
---
 lib/dpif-netdev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index f1d65f5..7d55997 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2507,13 +2507,16 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
     error = netdev_rxq_recv(rxq, packets, &cnt);
     cycles_count_end(pmd, PMD_CYCLES_POLLING);
     if (!error) {
+        const struct pkt_metadata md = PKT_METADATA_INITIALIZER(port->port_no);
         int i;
 
         *recirc_depth_get() = 0;
 
         /* XXX: initialize md in netdev implementation. */
         for (i = 0; i < cnt; i++) {
-            packets[i]->md = PKT_METADATA_INITIALIZER(port->port_no);
+            /* Use a memcpy instead of an assignment because it helps GCC and
+             * clang generate better code (even if the call gets inlined) */
+            memcpy(&packets[i]->md, &md, sizeof md);
         }
         cycles_count_start(pmd);
         dp_netdev_input(pmd, packets, cnt);
-- 
2.1.4
