On Wed, Feb 16, 2022, at 14:15, Ilya Maximets wrote:
> On 2/4/22 17:12, Gaetan Rivet wrote:
>> A race condition has been identified during datapath destruction,
>> along with the port offload flushes issued.
>> 
>> This series addresses these race conditions, cleaning up the
>> port deletion process.
>> 
>> The last patch also cleans up the offload structure.
>> It is not strictly necessary like the first two fixes,
>> so I put it last. It can wait until after the code-freeze
>> to be integrated.
>> 
>> I tested for a few hours without ASAN enabled without seeing
>> issues. ASAN has been executed as part of the github CI:
>> https://github.com/grivet/ovs/actions/runs/1795624401
>> It is however not too relevant, as no offloads are inserted during CI.
>> 
>> The following patch was used to fix an unrelated CI issue:
>> https://patchwork.ozlabs.org/project/openvswitch/patch/20220204150445.1481457-1-i.maxim...@ovn.org/
>> 
>> I also ran datapath creation + deletion loop with ASAN on an offload
>> test setup, but the execution was excruciatingly slow and could
>> not progress much. It reached datapath deletion without panicking
>> and no crash was seen, even though I had to interrupt the test after
>> a few hours.
>> 
>> Gaetan Rivet (2):
>>   dpif-netdev: Move port flush after datapath reconfiguration
>>   dpif-netdev: Use dp_netdev reference in offload threads
>> 
>> Sriharsha Basavapatna (1):
>>   dpif-netdev: Fix a race condition in deletion of offloaded flows
>> 
>>  lib/dpif-netdev.c | 71 ++++++++++++++++++++++++++---------------------
>>  1 file changed, 40 insertions(+), 31 deletions(-)
>
>
> Thanks!  I applied the series to master and 2.17.
>
> I was thinking about backports and I think that it should be OK to
> backport patch #2 (always enqueue deletions), because it doesn't
> introduce any new issues.
>
> But I'm not sure about the first patch.  On 2.17 flush acts like
> a barrier, so no offload requests are in the offload queue after
> the port removal from PMD thread and flush from the offload thread.
> However, there is no such synchronization mechanism on 2.16.
> Flush is performed from the main thread and some offload requests
> could still be in the offload queue.  And if dp is destroyed,
> processing of those offload requests will cause use-after-free.
> In other words, while patch #1 can reduce the race window for flows
> remaining in the hardware after the port deletion, it doesn't solve
> the use-after-free problem.  So, we need some other solution.
> Backporting the barrier machinery doesn't seem like a good option.
>
> Referencing the dp on creation of the PMD thread or on creation
> of the offload item might be a solution. But this will change the
> time dp will actually be destroyed, so needs a careful consideration.
>
> We may also not fix the UAF problem on 2.16 taking into account that
> issue is happening only during deletion of the datapath, so should not
> be very severe and 2.16 is not an LTS branch.
>
> Thoughts?
>
> Best regards, Ilya Maximets.

I agree that the first patch is only correct as long as flush serves as
a synchronization point. Without the barrier, nothing ensures that no further
offload reqs exists in the queue.

Managing dp reference and implementing an offload barrier are two solutions
that have the potential to impact beyond the datapath destruction, potentially
introducing more severe bugs.

Another potential solution might be offload request cancellation, but
it is more complex than the barrier one, so even less doable on the branch.

I will keep trying to find another simple solution that could be used,
but so far I agree that the safest thing would be to leave it as-is
(or just reduce the race window with patch 1 & 2).

Best regards,
-- 
Gaetan Rivet
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to