Hi, Ilya Ilya Maximets <i.maxim...@ovn.org> 于2021年2月19日周五 下午7:19写道: > > On 2/19/21 3:12 AM, 贺鹏 wrote: > > Hi, > > > > Looks like this bug is caused by violating the fact that if a rule is > > referenced, the related ofproto should not be destroyed. > > > > If so, I have a patch that also fixes the problem, not sure if this helps. > > > > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0...@bytedance.com/ > > There is at least one more problem that is not strictly related but > in more or less the same part of the code: > https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html
So maybe before add it into *ovsrcu_postpone*, we should add refcount of ofproto also? > > > > > William Tu <u9012...@gmail.com> 于2021年2月19日周五 上午8:10写道: > >> > >> On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun <pkusunyif...@gmail.com> wrote: > >>> > >>> ovs-vswitchd could crash under these circumstances: > >>> 1. When one bridge is being destroyed, ofproto_destroy() is called and > >>> connmgr pointer of its ofproto struct is nullified. This ofproto struct is > >>> deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'. > >>> 2. Before RCU enters quiesce state to actually free this ofproto struct, > >>> revalidator thread calls udpif_revalidator(), which could handle > >>> a learn flow and calls ofproto_flow_mod_learn(), it later calls > >>> ofmonitor_report() and ofproto struct's connmgr pointer is accessed. > >>> > >>> The crash stack trace is shown below: > >>> > >>> 0 ofmonitor_report (mgr=0x0, rule=rule@entry=0x7fa4ac067c30, > >>> event=event@entry=NXFME_ADDED, > >>> reason=reason@entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, > >>> abbrev_xid=0, old_actions=old_actions@entry=0x0) > >>> at ofproto/connmgr.c:2160 > >>> 1 0x00007fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, > >>> ofm=<optimized out>, req=req@entry=0x0) > >>> at ofproto/ofproto.c:5221 > >>> 2 0x00007fa4d68036af in modify_flows_finish (req=0x0, > >>> ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0) > >>> at ofproto/ofproto.c:5823 > >>> 3 ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, > >>> ofm=ofm@entry=0x7fa4980753f0, req=req@entry=0x0) > >>> at ofproto/ofproto.c:8088 > >>> 4 0x00007fa4d680372d in ofproto_flow_mod_learn_finish > >>> (ofm=ofm@entry=0x7fa4980753f0, > >>> orig_ofproto=orig_ofproto@entry=0x0) at ofproto/ofproto.c:5439 > >>> 5 0x00007fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, > >>> keep_ref=keep_ref@entry=true, > >>> limit=<optimized out>, below_limitp=below_limitp@entry=0x0) at > >>> ofproto/ofproto.c:5499 > >>> 6 0x00007fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, > >>> stats=stats@entry=0x7fa4d2701a10, > >>> offloaded=offloaded@entry=false) at > >>> ofproto/ofproto-dpif-xlate-cache.c:127 > >>> 7 0x00007fa4d6835e3a in xlate_push_stats (xcache=<optimized out>, > >>> stats=stats@entry=0x7fa4d2701a10, > >>> offloaded=offloaded@entry=false) at > >>> ofproto/ofproto-dpif-xlate-cache.c:181 > >>> 8 0x00007fa4d6822046 in revalidate_ukey > >>> (udpif=udpif@entry=0x55d90760b240, ukey=ukey@entry=0x7fa4b0191660, > >>> stats=stats@entry=0x7fa4d2705118, > >>> odp_actions=odp_actions@entry=0x7fa4d2701b50, > >>> reval_seq=reval_seq@entry=5655486242, > >>> recircs=recircs@entry=0x7fa4d2701b40, offloaded=false) > >>> at ofproto/ofproto-dpif-upcall.c:2294 > >>> 9 0x00007fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at > >>> ofproto/ofproto-dpif-upcall.c:2683 > >>> 10 0x00007fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at > >>> ofproto/ofproto-dpif-upcall.c:936 > >>> 11 0x00007fa4d6259c9f in ovsthread_wrapper (aux_=<optimized out>) at > >>> lib/ovs-thread.c:423 > >>> 12 0x00007fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0 > >>> 13 0x00007fa4d504b96d in clone () from /usr/lib64/libc.so.6 > >>> > >>> At the time of crash, the involved ofproto was already deallocated: > >>> > >>> (gdb) print *ofproto > >>> $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {..., > >>> one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ... > >>> > >>> This patch fixes it. > >>> > >>> VMware-BZ: #2700626 > >>> Signed-off-by: Yifeng Sun <pkusunyif...@gmail.com> > >>> --- > >>> v1->v2: Add check for ofmonitor_flush, thanks William. > >>> > >> > >> LGTM, thanks. > >> Acked-by: William Tu < u9012...@gmail.com> > >> > >> CC Ilya and Ben to see if any comments. > >> > >> I feel this kind of RCU issue is hard to find out, and > >> existing tools such as addressSanitizer are usually > >> not helpful. > >> > >> William > >> > >>> ofproto/connmgr.c | 6 +++++- > >>> 1 file changed, 5 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c > >>> index 9c5c633b4171..fa8f6cd0e83a 100644 > >>> --- a/ofproto/connmgr.c > >>> +++ b/ofproto/connmgr.c > >>> @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule > >>> *rule, > >>> const struct rule_actions *old_actions) > >>> OVS_REQUIRES(ofproto_mutex) > >>> { > >>> - if (rule_is_hidden(rule)) { > >>> + if (!mgr || rule_is_hidden(rule)) { > >>> return; > >>> } > >>> > >>> @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr) > >>> { > >>> struct ofconn *ofconn; > >>> > >>> + if (!mgr) { > >>> + return; > >>> + } > >>> + > >>> LIST_FOR_EACH (ofconn, connmgr_node, &mgr->conns) { > >>> struct rconn_packet_counter *counter = ofconn->monitor_counter; > >>> > >>> -- > >>> 2.7.4 > >>> > >>> _______________________________________________ > >>> dev mailing list > >>> d...@openvswitch.org > >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >> _______________________________________________ > >> dev mailing list > >> d...@openvswitch.org > >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > > > > > -- hepeng _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev