On 13 Mar 2019, at 10:19, Lollita Liu 
<lollita....@ericsson.com> wrote:

Hi, Damjan.

We found that in vlib_buffer_free_inline with CLIB_HAVE_VEC128, buffers are
always freed one by one, never via the 4-packet batch fast path. The behavior is
the same whether it is reached via avf_device_class_tx_fn or via error_drop.
       This adds about 20 CPU cycles in my test case; the pain point is
clib_atomic_sub_fetch. Perf result attached for your reference.

  27.54%  avf_plugin.so         [.] avf_device_class_tx_fn_avx2
  20.89%  libvnet.so.19.04      [.] ip4_rewrite_node_fn_avx2
  13.12%  libvnet.so.19.04      [.] ip4_lookup_node_fn_avx2
  12.47%  dpdk_plugin.so        [.] ipsec_tx_input_node_fn_avx2
   8.14%  libvlib.so.19.04      [.] vlib_worker_loop
   4.65%  libvlib.so.19.04      [.] dispatch_pending_node
   4.64%  libvnet.so.19.04      [.] vnet_interface_output_node
   3.15%  dpdk_plugin.so        [.] ipsec_encrypt_tx_node_fn_avx2
   1.60%  dpdk_plugin.so        [.] dpdk_crypto_input_node_fn_avx2
   1.31%  libvlib.so.19.04      [.] vlib_put_next_frame
   1.02%  libvlib.so.19.04      [.] vlib_get_next_frame_internal
   0.44%  libvppinfra.so.19.04  [.] mspace_usable_size_with_delta
   0.44%  libvppinfra.so.19.04  [.] mspace_usable_size
   0.29%  libc-2.23.so          [.] __memmove_avx_unaligned
   0.15%  libvnet.so.19.04      [.] vlib_put_next_frame@plt
   0.15%  libvnet.so.19.04      [.] vnet_get_main
   0.00%  [kernel]              [k] native_write_msr


       │            if (clib_atomic_sub_fetch (&b[0]->ref_count, 1) == 0)
  0.29 │        lock   subb $0x1,0xc(%rax)
20.48 │     ┌──jne    1955
       │     │  mov    %ecx,%eax
       │     │  add    $0x1,%ecx
       │     │        {
       │     │          vlib_buffer_copy_template (b[0], &bt);
  0.87 │192b:│  mov    -0x4b0(%rbp),%rdx
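
For context, a minimal standalone model of that per-buffer cost (a sketch, not
the real VPP code: the buffer layout is reduced to the ref_count field, and
clib_atomic_sub_fetch is approximated with the GCC builtin it wraps; the exact
memory order used by vppinfra may differ):

  #include <stdint.h>
  #include <stdio.h>

  typedef struct { uint8_t ref_count; } fake_buffer_t;

  /* stand-in for vppinfra's clib_atomic_sub_fetch */
  #define clib_atomic_sub_fetch(a, b) \
    __atomic_sub_fetch ((a), (b), __ATOMIC_ACQ_REL)

  int
  main (void)
  {
    fake_buffer_t b = { .ref_count = 1 };

    /* every buffer freed on the one_by_one path pays this atomic
       read-modify-write, the "lock subb" in the annotation above */
    if (clib_atomic_sub_fetch (&b.ref_count, 1) == 0)
      printf ("last reference, buffer can be recycled\n");

    return 0;
  }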

       After checking the source code, I think the implementation is different
with and without CLIB_HAVE_VEC128.

       Without CLIB_HAVE_VEC128, as long as the four packets are not part of a
buffer chain, have ref_count equal to 1, and were allocated from the same
buffer pool, they are freed in a batch.
       With CLIB_HAVE_VEC128, even under those conditions, the sum will not be
0, and the code goes to one_by_one.

       My suggestion is to set ref_count in flags_refs_mask to ~1; then if
ref_count is 0 or 1, the code takes the fast path.

At this point ref_count cannot be 0.

This is not exactly the same as the behavior without CLIB_HAVE_VEC128, but it
looks like the simplest fix:

  vlib_buffer_t flags_refs_mask = {
    .flags = VLIB_BUFFER_NEXT_PRESENT,
    .ref_count = ~1
  };
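
To illustrate the effect of the mask, here is a self-contained scalar model of
the batch test (a sketch only: the real code tests the first 16 bytes of each
vlib_buffer_t with a single u8x16 operation, the flag bit below is a
placeholder standing in for VLIB_BUFFER_NEXT_PRESENT, and the current mask is
taken to be .ref_count = ~0, which matches the always-nonzero sum described
above):

  #include <stdint.h>
  #include <stdio.h>

  /* placeholder bit value, standing in for VLIB_BUFFER_NEXT_PRESENT */
  #define NEXT_PRESENT (1 << 0)

  typedef struct
  {
    uint32_t flags;
    uint8_t ref_count;
  } hdr_t;

  /* non-zero sum => at least one buffer fails the test => one_by_one */
  static uint32_t
  batch_test (hdr_t *b[4], uint32_t flags_mask, uint8_t refs_mask)
  {
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++)
      sum |= (b[i]->flags & flags_mask) | (b[i]->ref_count & refs_mask);
    return sum;
  }

  int
  main (void)
  {
    hdr_t h = { .flags = 0, .ref_count = 1 };	/* plain, solely owned */
    hdr_t *b[4] = { &h, &h, &h, &h };

    /* current mask (.ref_count = ~0): sum != 0 even when ref_count
       is 1, so the batch path is never taken */
    printf ("mask ~0: sum = %u\n", batch_test (b, NEXT_PRESENT, 0xff));

    /* suggested mask (.ref_count = ~1): ref_count 0 or 1 ANDs to 0,
       so these buffers take the 4-packet batch path */
    printf ("mask ~1: sum = %u\n", batch_test (b, NEXT_PRESENT, (uint8_t) ~1));

    return 0;
  }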

Makes sense to me, and it should behave the same, as ref_count == 0 is not
possible at this point.
Can you please submit the patch? I know you already did it as part of a bigger
one which I half-merged.
Going forward, please make one fix per patch.

Thanks

--
Damjan


