On 2/21/22 11:42, Gaëtan Rivet wrote:
On Mon, Feb 21, 2022, at 08:44, Adrian Moreno wrote:
On 2/18/22 12:18, Gaëtan Rivet wrote:
On Thu, Feb 17, 2022, at 16:40, Adrian Moreno wrote:
On 2/17/22 16:15, Gaëtan Rivet wrote:
On Thu, Feb 17, 2022, at 15:08, wangyunjian wrote:
-----Original Message-----
From: Gaëtan Rivet [mailto:gr...@u256.net]
Sent: Thursday, February 17, 2022 9:31 PM
To: wangyunjian <wangyunj...@huawei.com>; <d...@openvswitch.org>
<d...@openvswitch.org>; Ilya Maximets <i.maxim...@ovn.org>; 贺鹏
<xnhp0...@gmail.com>; amore...@redhat.com
Cc: dingxiaoxiong <dingxiaoxi...@huawei.com>
Subject: Re: [ovs-dev] [PATCH] ofproto: fix use-after-free for "ofproto".

On Thu, Feb 17, 2022, at 08:29, wangyunjian wrote:
-----Original Message-----
From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of
wangyunjian via dev
Sent: Thursday, February 17, 2022 11:27 AM
To: Gaëtan Rivet <gr...@u256.net>; <d...@openvswitch.org>
<d...@openvswitch.org>; Ilya Maximets <i.maxim...@ovn.org>
Cc: dingxiaoxiong <dingxiaoxi...@huawei.com>
Subject: Re: [ovs-dev] [PATCH] ofproto: fix use-after-free for "ofproto".



-----Original Message-----
From: Gaëtan Rivet [mailto:gr...@u256.net]
Sent: Thursday, February 17, 2022 1:29 AM
To: wangyunjian <wangyunj...@huawei.com>; <d...@openvswitch.org>
<d...@openvswitch.org>; Ilya Maximets <i.maxim...@ovn.org>
Cc: amore...@redhat.com; dingxiaoxiong <dingxiaoxi...@huawei.com>;
贺
<xnhp0...@gmail.com>
Subject: Re: [ovs-dev] [PATCH] ofproto: fix use-after-free for "ofproto".

On Wed, Feb 16, 2022, at 14:24, wangyunjian wrote:
-----Original Message-----
From: Gaëtan Rivet [mailto:gr...@u256.net]
Sent: Wednesday, February 16, 2022 7:34 PM
To: wangyunjian <wangyunj...@huawei.com>; <d...@openvswitch.org>
<d...@openvswitch.org>; Ilya Maximets <i.maxim...@ovn.org>
Cc: dingxiaoxiong <dingxiaoxi...@huawei.com>
Subject: Re: [ovs-dev] [PATCH] ofproto: fix use-after-free for "ofproto".

On Fri, Dec 3, 2021, at 12:25, Yunjian Wang via dev wrote:
When handler threads lookup a "ofproto" and use it, main
thread maybe remove and free the "ofproto" at the same time. The
"ofproto"
has not been protected well, which can lead to an OVS crash.

This patch fixes this by making the "ofproto" lookup RCU-safe
by using cmap instead of hmap and moving remove "ofproto" call
before xlate_txn_commit().


I don't understand the point of moving the cmap_remove() call
before xlate_txn_commit().

To use of the rcu_synchronize in the xlate_txn_commit to avoid
access to the ofproto from other thread through uuid map.


Yes the reason is clear.

But my question is why is it needed? It seems that the ofproto
lifecycle was written with the assumption that it would still be
used while being
destroyed.

Can you explain why it needs to be changed?

I didn't describe the problem clearly before. The main problem is
that hmap variable is not thread safe. The all_ofproto_dpifs_by_uuid
variable uses the hmap structure, which maybe be accessed by main thread
and handler threads.

I'm ok on the part switching to using CMAP to allow concurrent reads.
That I see the reason and it is fine.

The part that I don't understand is moving the cmap_remove() call before the
RCU sync.

As far as I know, the CMAP type does not require that to safely operate.
The writer thread is allowed to call cmap_remove() while a reader is iterating 
on
the CMAP to find a node. The only precaution needed is that actual destruction
of the node (freeing) is deferred, which it is.

So I don't see the reason to move cmap_remove() before the RCU
synchronization, instead of keeping it as it is now. Could you please explain 
your
reasoning?

I consider it is more reasonable for the upcall thread cannot find the ofproto
when the ofproto will be deleted, no other special consideration.

Thanks,
Yunjian

That might be an interesting change to clean up the 'execution context' of the 
ofproto destruction.

But I think it is out of scope for this fix. It means changing the xlate object,
introduce coupling between the ofproto map and the xlate layer. This goes beyond
fixing undefined behavior due to concurrent accesses on the map.


Hi Gaëtan,

I think the root cause of the crash this fix pretends to fix is not the
concurrent access to the map (though that is also an issue, of course) but the
upcall using an ofproto-dpif object after the main thread has run its
destruct(). So I think we need to fix both.

With regards to the way of fixing it, what do you think about removing
all_ofproto_dpifs_by_uuid altogether and using xbridge_lookup_by_uuid() intead?

Thanks
--
Adrián Moreno

Hi Adrián,

Thanks for explaining, it's clear now.
Your suggestion is interesting, it seems correct.

However, it means changing xcfg->xbridges into a cmap. It requires
xlate_xbridge_remove() itself to be RCU compatible. Although there is the
RCU synchronization during xlate_txn_commit(), there is still a window between
xlate_remove_ofproto() and xlate_txn_commit() during which a thread will lookup
into this CMAP and get an xbridge reference.


I don't think we would need xcfg->xbridges to be a cmap. Concurrency of
the xcfg
object is handled by the xlate_txn_start() / xlate_txn_commit()
mechanism.
The entire xcfg is copied to new_xcfg, modifications are performed in
next_cfg
and then xcfg is replaced. Copy and replacement of the entire structure
are rcu
protected.

So, I think the hmap itself is safe to access and since
xlate_txn_commit() calls
ovsrcu_synchronize(), upcall threads won't have references to anything
in the
previous configuration.

So, AFAICS, this suggestion is just removing code, not adding it.


I believe you are right, that's nice!

However, there is a potentially important behavioral change: the xbridge hmap
uses the ofproto pointer as hash while the current map uses the uuid. So the
lookup will be slower. Hence my original thought of just adding
all_ofproto_dpifs_by_uuid to xcfg.

Why not use the ofproto uuid as key for the xbridges map otherwise?


Yes, that should work without really making xbridge_lookup() by ofproto-dpif pointer slower.


Maybe it only means having to ovsrcu_postpone(free, xbridge) in 
xlate_xbridge_remove(),
but it seems dangerous to me to open an execution context to concurrency for an 
object
that was not initially written for it, in the context of a bug fix.

In any case, I understand the purpose of the whole patch now.
Maybe moving the cmap_remove() before xlate_txn_commit(), with a proper comment
fully justifying it, is better for a fix that is meant to be backported?


I agree that we should look for a fix with low intrusiveness for
backporting.
However, for the master branch I believe we should prioritize
robustness. And
having an object (all_ofproto_dpifs_by_uuid) silently use the
synchronization
mechanism of another unrelated object (xcfg) seems to me as more error
prone. WDYT?

Sure, totally agree on both counts. I was just worried that the concurrency
requirement for xlate would be too heavy for a fix, but I had missed what you 
just explained.


--
Adrián Moreno

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to