On 1/17/2021 9:48 AM, Sriharsha Basavapatna wrote:
On Sun, Jan 10, 2021 at 12:25 PM Eli Britstein <el...@nvidia.com> wrote:

On 1/7/2021 9:10 PM, Sriharsha Basavapatna wrote:
Hi Eli,

On Mon, Dec 28, 2020 at 11:43 PM Eli Britstein <el...@nvidia.com> wrote:
Netdev datapath offloads are done in a separate thread, using messaging
between the threads. With port removal there is a race between the offload
thread removing the offloaded rules and the actual port removal, so some
rules will not be removed.
Can you provide details on this sequence, to get more clarity on how
some rules will not be removed?
A deletion sequence, from the dpif point of view, starts with dpif_port_del.
It calls dpif->dpif_class->port_del (implemented by
dpif_netdev_port_del) and netdev_ports_remove.

flow_mark_flush is called
(dpif_netdev_port_del->do_del_port->reconfigure_datapath->reload_affected_pmds).
This function only posts deletion requests to the queue
(queue_netdev_flow_del), to be later handled by the offload thread.

When the offload thread wakes up to handle the request
(dp_netdev_flow_offload_del->mark_to_flow_disassociate), it looks up the
netdev object with netdev_ports_get in the port_to_netdev map, racing
with the execution of netdev_ports_remove, which removes that mapping.

If the mapping is removed before the handling, the removal of the HW
rule won't take place, leaking it and the related allocated memory.
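
To make the race concrete, here is a minimal, self-contained model of it
(the two flags only stand in for the port_to_netdev entry and the
offloaded rte_flow; this is an illustration, not the OVS code):

#include <stdbool.h>
#include <stdio.h>

static bool port_mapped = true;       /* stands for the port_to_netdev entry */
static bool hw_rule_installed = true; /* stands for the offloaded rte_flow   */

/* Main thread: the dpif_port_del() path. */
static void
port_del(void)
{
    /* flow_mark_flush() only enqueues deletion requests ... */
    printf("queued flow delete request\n");
    /* ... and netdev_ports_remove() drops the mapping right away. */
    port_mapped = false;
}

/* Offload thread: the dp_netdev_flow_offload_del() path, possibly
 * running only after the mapping is already gone. */
static void
offload_thread_handle_del(void)
{
    if (!port_mapped) {
        /* The netdev lookup fails: the HW rule and its bookkeeping leak. */
        printf("netdev not found, skipping HW rule removal\n");
        return;
    }
    hw_rule_installed = false;
}

int
main(void)
{
    port_del();                  /* mapping removed first ...      */
    offload_thread_handle_del(); /* ... so this delete is a no-op. */
    printf("hw rule leaked: %s\n", hw_rule_installed ? "yes" : "no");
    return 0;
}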
Thanks for the details; the above commit message could be reworded to
be clearer, like this:

With port removal there is a race -->

"With port removal,  there is a race due to which by the time the flow
delete request is processed by the offload thread, the corresponding
netdev would have already been removed (in dpif_port_del()).  And the
offload thread fails to find the netdev in
mark_to_flow_disassociate(). Thus the flow is not deleted from HW."
The series got merged.

In OVS the offload objects are not freed (memory
leak).
Since dp_netdev_free_flow_offload() frees the offload object, I'm not
sure how the memory leak happens?
As netdev_offload_dpdk_flow_del is never called in this case, the
objects in ufid_to_rte_flow that should have been freed by
ufid_to_rte_flow_disassociate are leaked (and the rte_flows are not
destroyed).
Ok;  the commit message could be more specific about the objects that
are leaked.
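
For reference, a small, assumption-labelled model of the ufid_to_rte_flow
bookkeeping mentioned above (the real entries live in a cmap in
netdev-offload-dpdk.c; the struct layout and allocation here are only a
sketch):

#include <stdlib.h>

struct ufid_to_rte_flow_data {
    void *rte_flow;              /* HW rule handle (opaque in this sketch) */
    /* ... ufid key, hash node, etc. in the real code ...                  */
};

static struct ufid_to_rte_flow_data *
ufid_to_rte_flow_associate(void *rte_flow)
{
    struct ufid_to_rte_flow_data *data = calloc(1, sizeof *data);

    if (data) {
        data->rte_flow = rte_flow;   /* inserted into the map */
    }
    return data;
}

static void
ufid_to_rte_flow_disassociate(struct ufid_to_rte_flow_data *data)
{
    free(data);                      /* never reached when the netdev is gone */
}

int
main(void)
{
    int dummy_rte_flow;
    struct ufid_to_rte_flow_data *data =
        ufid_to_rte_flow_associate(&dummy_rte_flow);

    /* If the netdev lookup fails, this call never happens and 'data'
     * (and the rte_flow it tracks) are leaked: */
    ufid_to_rte_flow_disassociate(data);
    return 0;
}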

In HW, whether the rules remain depends on PMD behavior.
A few related questions:

1. In Patch-1, netdev_flow_flush() is called from do_del_port().
Doesn't it violate the layering rules, since port deletion shouldn't
be directly involved with netdev/flow-related operations (which would
otherwise be processed as a part of datapath reconfig)?
netdev/flow operations are called from the dpif-netdev layer, which
includes do_del_port. We can call the offload functions under
dp->port_mutex, which is locked in that function.

Which layering rules are violated?
With this change, a port directly operates on the flows, as opposed to
letting the normal datapath routines delete the flow in the context of
a specific pmd thread.

Before this change, queue_netdev_flow_del() was the only entry point
for flow deletion, and it was called either by flow_mark_flush() for a
given pmd thread or by dp_netdev_pmd_remove_flow(), again for a given
pmd thread. But with the new change, the flow gets deleted without the
context of a pmd thread.  Also, we are now entirely bypassing the
offload thread.
The change only removes the offloads of the port. It doesn't handle removing the flows themselves.
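
For clarity, a rough, self-contained sketch of the ordering this change
goes for (flush the port's HW offloads before the port-to-netdev mapping
is dropped); the names and structure here are illustrative, not the
merged patch:

#include <assert.h>
#include <stdbool.h>

struct port {
    bool mapped;                 /* stands for the port_to_netdev entry */
    int n_hw_rules;              /* stands for the offloaded rte_flows  */
};

static void
flush_port_offloads(struct port *port)
{
    /* Destroy every HW rule of the port while its netdev is still
     * reachable; this models what netdev_flow_flush() is asked to do. */
    port->n_hw_rules = 0;
}

static void
del_port(struct port *port)
{
    flush_port_offloads(port);   /* flush first, under the port mutex ... */
    port->mapped = false;        /* ... then drop the mapping.            */
}

int
main(void)
{
    struct port port = { true, 3 };

    del_port(&port);
    assert(port.n_hw_rules == 0);   /* nothing left to leak */
    return 0;
}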
2. Since the rte_flow is deleted first, doesn't it leave the
dpif_netdev_flow in an inconsistent state, since the flow is still
active from the pmd thread's perspective (the megaflow_to_mark and
mark_to_flow cmap entries are still valid, indicating that it is
offloaded)?
That is a sequence not related to this patch set; it occurs normally.

Currently there are two types of offloads (both are ingress):

- Partial. Packets are handled by SW, with or without offload. Packets
are fully and correctly processed by the SW either if they don't have a
mark or if their mark is not found by mark_to_flow_find.

- Full. Packets will be handled correctly by the SW as they arrive with
no mark.

The mark<->megaflow mapping disassociation is handled correctly even if
the rte_flow is not removed.

However, indeed, this is a good point to take into consideration when
implementing egress offloading.
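
As an illustration of why a stale mark is harmless on ingress, here is a
minimal model (assumptions only; this is not the dpif-netdev fast path):
if a packet carries no mark, or its mark no longer maps to a flow, the
SW datapath simply does the full classification itself.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flow;                              /* opaque in this sketch */

static const struct flow *
mark_to_flow_find(uint32_t mark)
{
    (void) mark;
    return NULL;                          /* mapping already removed */
}

static void
process_packet(bool has_mark, uint32_t mark)
{
    const struct flow *flow = has_mark ? mark_to_flow_find(mark) : NULL;

    if (!flow) {
        /* Fall back to the regular (EMC/dpcls/upcall) lookup path. */
        printf("full SW classification\n");
        return;
    }
    printf("short-cut via mark\n");
}

int
main(void)
{
    process_packet(true, 42);   /* stale mark -> regular SW path        */
    process_packet(false, 0);   /* fully offloaded packets have no mark */
    return 0;
}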
I understand how full and partial offloads work w.r.t. the mark. That
was not my point here. For an offloaded flow (partial or full), entries
are added to the "megaflow_to_mark" and "mark_to_flow" cmaps after the
flow is successfully offloaded by rte. Those entries get removed, when
the flow gets deleted, through dp_netdev_flow_offload_del(). The
cmap entries are removed at the same time the flow is deleted from
rte. But with this change, the deletion of a flow occurs in two
separate sequences, leaving the entries in the cmaps (for some time)
while there's no corresponding offloaded rte_flow.

Instead of flushing flows from do_del_port(), why can't we just wait
for the netdev reference count to drop, since you would have taken a
reference on the netdev for each rte_flow [while offloading, in
ufid_to_rte_flow_associate()]?
Let reconfigure_datapath()->reload_affected_pmds() issue delete
requests to the offload thread; delete requests get processed by the
offload thread and the reference on the netdev is released for each
flow. Wait on the refcnt to drop in do_del_port() before returning.
This way, you go through the normal steps to delete each flow (via
offload thread), also ensuring that the cmap entries and the rte_flow
are deleted together.
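
A hypothetical sketch of this alternative (per-flow references on the
netdev, released by the offload thread, with do_del_port() waiting for
them to drain); the names and the busy wait are purely illustrative:

#include <sched.h>
#include <stdatomic.h>

static atomic_int netdev_refcnt = 1;   /* the port's own reference */

static void
flow_offloaded(void)        /* e.g. taken in ufid_to_rte_flow_associate() */
{
    atomic_fetch_add(&netdev_refcnt, 1);
}

static void
flow_deleted(void)          /* offload thread, per processed delete request */
{
    atomic_fetch_sub(&netdev_refcnt, 1);
}

static void
wait_for_flows(void)        /* would be called from do_del_port() */
{
    while (atomic_load(&netdev_refcnt) > 1) {
        sched_yield();      /* wait for the offload thread to catch up */
    }
}

int
main(void)
{
    flow_offloaded();       /* one flow offloaded -> refcnt 2           */
    flow_deleted();         /* offload thread processed its delete      */
    wait_for_flows();       /* returns immediately: only the port's ref */
    return 0;
}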

As explained, it doesn't matter whether the mark cmaps and the rte_flows are removed "together" or not, as is the case with this change.

This approach is valid, but it also has some cons. The do_del_port thread would need to wait for the offload thread, because the port mapping must not be removed before the offload requests are handled. Furthermore, queuing such requests may eventually take longer, as they will be queued (and handled) after any already pending requests. As this series is already merged, any change should be done on top of it.


3. Also, how does this work with partially offloaded flows (the same
megaflow mapped to multiple pmd threads)?
This is also not related to this patch set. The flow is ref-counted:
upon the first ref the rte_flow is created, and it is destroyed with
the last unref (see mark_to_flow_disassociate).
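
A simplified sketch of that refcount (names are illustrative, not the
mark_to_flow code): the first association installs the HW rule and the
last disassociation destroys it, so several pmd threads can share one
partially offloaded megaflow.

#include <stdio.h>

struct mark_entry {
    unsigned int refcnt;
    int hw_rule;                 /* stands for the rte_flow handle */
};

static void
mark_associate(struct mark_entry *e)
{
    if (e->refcnt++ == 0) {
        e->hw_rule = 1;          /* create the HW rule on the first ref */
    }
}

static void
mark_disassociate(struct mark_entry *e)
{
    if (--e->refcnt == 0) {
        e->hw_rule = 0;          /* destroy the HW rule on the last unref */
    }
}

int
main(void)
{
    struct mark_entry e = { 0, 0 };

    mark_associate(&e);          /* pmd thread 1      */
    mark_associate(&e);          /* pmd thread 2      */
    mark_disassociate(&e);       /* still offloaded   */
    mark_disassociate(&e);       /* HW rule destroyed */
    printf("hw_rule=%d refcnt=%u\n", e.hw_rule, e.refcnt);
    return 0;
}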
Thanks,
-Harsha
This patch-set resolves this issue using a flush.

v2-v1:
- Fix for pending offload requests.
v3-v2:
- Rebase.

Travis:
v1: https://travis-ci.org/github/elibritstein/OVS/builds/747022942
v2: https://travis-ci.org/github/elibritstein/OVS/builds/748788786
v3: https://travis-ci.org/github/elibritstein/OVS/builds/751787939

GitHub Actions:
v1: https://github.com/elibritstein/OVS/actions/runs/394296553
v2: https://github.com/elibritstein/OVS/actions/runs/413379295
v3: https://github.com/elibritstein/OVS/actions/runs/448541155

Eli Britstein (4):
    dpif-netdev: Flush offload rules upon port deletion
    netdev-offload-dpdk: Keep netdev in offload object
    netdev-offload-dpdk: Refactor disassociate and flow destroy
    netdev-offload-dpdk: Implement flow flush

   lib/dpif-netdev.c         |  2 +
   lib/netdev-offload-dpdk.c | 85 +++++++++++++++++++++++++--------------
   2 files changed, 56 insertions(+), 31 deletions(-)

--
2.28.0.546.g385c171

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev