Hi Eli,

I have some concerns about how we plug the packet state recovery logic into 
dfc_processing() here. My comments are inline below.

I'm concerned that we are hurting performance in the partial HWOL case, since 
this patchset introduces new direct (non-inline) per-packet function calls into 
the software datapath. We can reduce the performance impact here; see the 
specific suggestions below.

Thanks,
Cian

> -----Original Message-----
> From: Eli Britstein <el...@nvidia.com>
> Sent: Wednesday 23 June 2021 16:53
> To: d...@openvswitch.org; Ilya Maximets <i.maxim...@ovn.org>
> Cc: Gaetan Rivet <gaet...@nvidia.com>; Majd Dibbiny <m...@nvidia.com>; 
> Sriharsha Basavapatna
> <sriharsha.basavapa...@broadcom.com>; Hemal Shah <hemal.s...@broadcom.com>; 
> Ivan Malov
> <ivan.ma...@oktetlabs.ru>; Ferriter, Cian <cian.ferri...@intel.com>; Eli 
> Britstein <el...@nvidia.com>;
> Finn, Emma <emma.f...@intel.com>; Kovacevic, Marko <marko.kovace...@intel.com>
> Subject: [PATCH V7 06/13] dpif-netdev: Add HW miss packet state recover logic.
> 
> Recover the packet if it was partially processed by the HW. Fallback to
> lookup flow by mark association.
> 
> Signed-off-by: Eli Britstein <el...@nvidia.com>
> Reviewed-by: Gaetan Rivet <gaet...@nvidia.com>
> Acked-by: Sriharsha Basavapatna <sriharsha.basavapa...@broadcom.com>
> Tested-by: Emma Finn <emma.f...@intel.com>
> Tested-by: Marko Kovacevic <marko.kovace...@intel.com>
> Signed-off-by: Ilya Maximets <i.maxim...@ovn.org>
> ---
>  lib/dpif-netdev.c | 45 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 8fa7eb6d4..d5b7d5b73 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -114,6 +114,7 @@ COVERAGE_DEFINE(datapath_drop_invalid_port);
>  COVERAGE_DEFINE(datapath_drop_invalid_bond);
>  COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
>  COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
> +COVERAGE_DEFINE(datapath_drop_hw_miss_recover);
> 
>  /* Protects against changes to 'dp_netdevs'. */
>  static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER;
> @@ -7062,6 +7063,39 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
>  }
> 
> +static struct tx_port * pmd_send_port_cache_lookup(
> +    const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
> +
> +static inline int
> +dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
> +                  odp_port_t port_no,
> +                  struct dp_packet *packet,
> +                  struct dp_netdev_flow **flow)
> +{
> +    struct tx_port *p;
> +    uint32_t mark;
> +


Start of full HWOL recovery block

> +    /* Restore the packet if HW processing was terminated before completion. 
> */
> +    p = pmd_send_port_cache_lookup(pmd, port_no);
> +    if (OVS_LIKELY(p)) {
> +        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
> +
> +        if (err && err != EOPNOTSUPP) {
> +            COVERAGE_INC(datapath_drop_hw_miss_recover);
> +            return -1;
> +        }
> +    }

End of full HWOL recovery block

IIUC, the above is only relevant to full HWOL, where the packet is partially 
processed by the HW. In a partial HWOL test case, we see a performance drop as 
a result of the above code: throughput after the patchset is applied is 0.94x 
what it was before.

We should branch over this code in the partial HWOL scenario, where we don't 
need the calls to pmd_send_port_cache_lookup() and 
netdev_hw_miss_packet_recover(). We want this branch to be transparent to the 
user. Since both full and partial HWOL fall under the 
other_config:hw-offload=true switch, we might need a configure-time check of 
NIC capabilities, or something similar, to branch over the full HWOL packet 
recovery code. Does this make sense?
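To illustrate the shape of what I mean, a minimal sketch follows. It assumes a 
per-port capability flag probed once (at port configure time, say), so that the 
per-packet path reduces to a predictable branch rather than two function calls. 
All names here (netdev_stub, supports_hw_miss_recover, process_packet) are 
illustrative stand-ins, not the real OVS API:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a netdev with a capability flag probed once at
 * configure time, e.g. when hw-offload is enabled on the port. */
struct netdev_stub {
    bool supports_hw_miss_recover;
};

/* Cheap per-packet predicate: a single load, no function calls into
 * the offload layer. */
static inline bool
hw_miss_recovery_needed(const struct netdev_stub *netdev)
{
    return netdev->supports_hw_miss_recover;
}

/* Per-packet fast path: partial HWOL ports skip the recovery call
 * entirely.  'recover_calls' stands in for the real call to
 * netdev_hw_miss_packet_recover(). */
static int
process_packet(const struct netdev_stub *netdev, int *recover_calls)
{
    if (hw_miss_recovery_needed(netdev)) {
        ++*recover_calls;
    }
    return 0;
}
```

With this shape, a partial-HWOL port never pays for the full-HWOL recovery 
path, while a full-HWOL port still reaches it on every packet.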


perf top output showing cycles spent per function in my partial HWOL scenario. 
We can see netdev_hw_miss_packet_recover(), 
netdev_offload_dpdk_hw_miss_packet_recover() and netdev_is_flow_api_enabled() 
consuming cycles:
    28.79%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_input__
    13.72%  pmd-c01/id:8  ovs-vswitchd        [.] parse_tcp_flags
    11.07%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_recv_pkts_vec_avx2
    10.94%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_xmit_fixed_burst_vec_avx2
     7.48%  pmd-c01/id:8  ovs-vswitchd        [.] cmap_find
     4.94%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_hw_miss_packet_recover
     4.77%  pmd-c01/id:8  ovs-vswitchd        [.] dp_execute_output_action
     3.81%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
     3.40%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_send
     2.49%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_eth_send
     1.16%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_rxq_recv
     0.90%  pmd-c01/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
     0.75%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
     0.68%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_is_flow_api_enabled
     0.55%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_offload_dpdk_hw_miss_packet_recover

> +
> +    /* If no mark, no flow to find. */
> +    if (!dp_packet_has_flow_mark(packet, &mark)) {
> +        *flow = NULL;
> +        return 0;
> +    }
> +
> +    *flow = mark_to_flow_find(pmd, mark);
> +    return 0;
> +}
> +
>  /* Try to process all ('cnt') the 'packets' using only the datapath flow 
> cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the
> @@ -7106,7 +7140,6 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>          struct dp_netdev_flow *flow;
> -        uint32_t mark;
> 
>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>              dp_packet_delete(packet);
> @@ -7125,9 +7158,13 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>              pkt_metadata_init(&packet->md, port_no);
>          }
> 
> -        if ((*recirc_depth_get() == 0) &&
> -            dp_packet_has_flow_mark(packet, &mark)) {
> -            flow = mark_to_flow_find(pmd, mark);
> +        if (netdev_is_flow_api_enabled() && *recirc_depth_get() == 0) {

Here we have a per-packet call to netdev_is_flow_api_enabled(). I think 
netdev_is_flow_api_enabled() should be inlined if it's going to be called per 
packet. We can see from the "perf top" output above that it isn't inlined, 
since it shows up as a separate function.

> +            if (OVS_UNLIKELY(dp_netdev_hw_flow(pmd, port_no, packet, 
> &flow))) {
> +                /* Packet restoration failed and it was dropped, do not
> +                 * continue processing.
> +                 */
> +                continue;
> +            }
>              if (OVS_LIKELY(flow)) {
>                  tcp_flags = parse_tcp_flags(packet);
>                  if (OVS_LIKELY(batch_enable)) {
> --
> 2.28.0.2311.g225365fb51

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev