Initializing the dp_packet's metadata can be a hot spot, especially for very simple pipelines. Therefore improving the code here can sometimes make a difference.
Using memcpy instead of a plain assignment helps GCC and clang generate
faster code. Here's a comparison of the compiler-generated code (GCC 4.8)
with and without this commit.

BEFORE (assignment)              | AFTER (memcpy)
c8: add  $0x8,%r8                | d8: mov  (%rsi),%r8
    mov  (%rcx),%r9              |     mov  (%rdx),%rdi
    mov  (%rbx),%r11d            |     add  $0x1,%ecx
    mov  %r10,%rcx               |     add  $0x8,%rsi
    cmp  %rsi,%r8                |     cmp  -0x870(%rbp),%ecx
    lea  0x88(%r9),%rdi          |     mov  %rdi,0x88(%r8)
    rep stos %rax,%es:(%rdi)     |     mov  0x8(%rdx),%rdi
    mov  %r11d,0xb8(%r9)         |     lea  0x88(%r8),%rax
    mov  %r8,%rcx                |     mov  %rdi,0x90(%r8)
    jne  c8                      |     mov  0x10(%rdx),%rdi
                                 |     mov  %rdi,0x98(%r8)
                                 |     mov  0x18(%rdx),%rdi
                                 |     mov  %rdi,0xa0(%r8)
                                 |     mov  0x20(%rdx),%r8
                                 |     mov  %r8,0x20(%rax)
                                 |     mov  0x28(%rdx),%r8
                                 |     mov  %r8,0x28(%rax)
                                 |     mov  0x30(%rdx),%r8
                                 |     mov  %r8,0x30(%rax)
                                 |     jl   d8

The old code uses a 'rep stos' and fetches the 'port_no' value from the
'port' member at every iteration ('mov (%rbx),%r11d'), while the new code
uses a series of mov operations to accomplish everything.

I can measure a throughput improvement of ~7% on a single-flow phy-phy test
with 64-byte UDP packets. The improvement has been observed on an Intel
Xeon Sandy Bridge (2012) and on an Intel Xeon Westmere (2010).

Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com>
---
 lib/dpif-netdev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index f1d65f5..7d55997 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2507,13 +2507,16 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
     error = netdev_rxq_recv(rxq, packets, &cnt);
     cycles_count_end(pmd, PMD_CYCLES_POLLING);
     if (!error) {
+        const struct pkt_metadata md = PKT_METADATA_INITIALIZER(port->port_no);
         int i;
 
         *recirc_depth_get() = 0;
 
         /* XXX: initialize md in netdev implementation. */
         for (i = 0; i < cnt; i++) {
-            packets[i]->md = PKT_METADATA_INITIALIZER(port->port_no);
+            /* Use a memcpy instead of an assignment because it helps GCC and
+             * clang generate better code (even if the call gets inlined) */
+            memcpy(&packets[i]->md, &md, sizeof md);
         }
         cycles_count_start(pmd);
         dp_netdev_input(pmd, packets, cnt);
-- 
2.1.4
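
For readers who want to try the two patterns outside the OVS tree, here is a minimal standalone sketch. All names in it (struct demo_md, DEMO_MD_INITIALIZER, struct demo_packet, demo_init_assign, demo_init_memcpy) are hypothetical stand-ins, not the real struct pkt_metadata or PKT_METADATA_INITIALIZER, and the field layout is invented for illustration. Whether the two loops compile to different code depends on the compiler and its version, so treat this as something to inspect with objdump rather than a guaranteed speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the packet metadata: every field is zero
 * after initialization except the input port. */
struct demo_md {
    uint32_t recirc_id;
    uint32_t dp_hash;
    uint64_t tunnel[4];
    uint32_t in_port;
};

/* Hypothetical stand-in for the initializer macro (a compound literal). */
#define DEMO_MD_INITIALIZER(PORT) (struct demo_md) { .in_port = (PORT) }

struct demo_packet {
    struct demo_md md;
};

/* "Before" pattern: the initializer is evaluated and assigned on every
 * iteration, so the compiler may zero the struct and reload the port
 * number each time through the loop. */
static void
demo_init_assign(struct demo_packet **pkts, size_t cnt,
                 const uint32_t *port_no)
{
    assert(pkts && port_no);
    for (size_t i = 0; i < cnt; i++) {
        pkts[i]->md = DEMO_MD_INITIALIZER(*port_no);
    }
}

/* "After" pattern: build one const template before the loop, then copy
 * it into each packet with memcpy. */
static void
demo_init_memcpy(struct demo_packet **pkts, size_t cnt,
                 const uint32_t *port_no)
{
    const struct demo_md md = DEMO_MD_INITIALIZER(*port_no);

    assert(pkts && port_no);
    for (size_t i = 0; i < cnt; i++) {
        memcpy(&pkts[i]->md, &md, sizeof md);
    }
}
```

Both functions leave every named field of the metadata in the same state, so the choice between them is purely about the generated code. Compiling with optimization enabled and disassembling the two loops is one way to check what a given compiler does with each form.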