Adding the following code right after process-node dispatch in the main loop
fixes the crash.

So I think the condition mentioned above is a rare but valid case.

When a ctrl process node is scheduled, it adds a packet (pending frame) destined
for another node, and that packet refers to an interface which is about to be deleted.

The interface is then deleted in the unix_epoll_input PRE_INPUT node, which
handles API input, and the following graph scheduling triggers various assert
failures.


      {
        /* Ctrl nodes may have added work to the pending vector too.
           Process the pending vector until there is nothing left.
           All pending vectors will be processed from input -> output. */
        for (i = 0; i < _vec_len (nm->pending_frames); i++)
          cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
        /* Reset pending vector for next iteration. */
        vec_set_len (nm->pending_frames, 0);

        if (is_main)
          {
            /* We also need a barrier here so that workers finish
               processing any packets that were handed off to them. */
            vlib_worker_thread_barrier_sync (vm);
            vlib_worker_thread_barrier_release (vm);
          }
      }
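For placement: as I understand the loop structure in src/vlib/main.c, the existing
drain of nm->pending_frames in vlib_main_or_worker_loop runs before the process-node
dispatch, so a frame enqueued by a ctrl process would otherwise sit in
nm->pending_frames until the next iteration, i.e. until after the next PRE_INPUT
pass in which the API handler can delete the interface that the frame refers to.
Draining again (and doing the barrier sync/release on main) right after process
dispatch ensures such frames are dispatched within the same iteration.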

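For reference, here is a minimal sketch of the kind of timer-driven process node
that creates such a pending frame. This is not code from this thread: the node name
"my-periodic-process", the hard-coded sw_if_index and the omitted payload
construction are made up for illustration; only the standard vlib/vnet buffer and
frame APIs are used. It also shows why a validity check inside the process node does
not close the race: the interface can still be deleted by the API handler after the
frame is enqueued but before it is dispatched.

#include <vlib/vlib.h>
#include <vnet/vnet.h>

/* Hypothetical timer-driven process node (sketch only). */
static uword
my_periodic_process_fn (vlib_main_t * vm, vlib_node_runtime_t * rt,
                        vlib_frame_t * f)
{
  vnet_main_t *vnm = vnet_get_main ();
  u32 ip4_lookup_index =
    vlib_get_node_by_name (vm, (u8 *) "ip4-lookup")->index;
  u32 sw_if_index = 14;	/* the l3 sub-interface created earlier (example) */

  while (1)
    {
      vlib_process_suspend (vm, 1.0 /* seconds */);

      /* Checking validity here does NOT close the race: the API handler
         running in the unix_epoll_input PRE_INPUT node may still delete
         the interface before the pending frame below is dispatched. */
      if (!vnet_sw_interface_is_valid (vnm, sw_if_index))
        continue;

      u32 bi;
      if (vlib_buffer_alloc (vm, &bi, 1) != 1)
        continue;

      vlib_buffer_t *b = vlib_get_buffer (vm, bi);
      /* A real node would build the IP packet here and set
         b->current_length accordingly (omitted). */
      vnet_buffer (b)->sw_if_index[VLIB_RX] = sw_if_index;
      vnet_buffer (b)->sw_if_index[VLIB_TX] = ~0;

      /* Enqueue a one-packet frame directly to ip4-lookup; it becomes a
         pending frame that is only dispatched later by the graph scheduler. */
      vlib_frame_t *to = vlib_get_frame_to_node (vm, ip4_lookup_index);
      u32 *to_next = vlib_frame_vector_args (to);
      to_next[0] = bi;
      to->n_vectors = 1;
      vlib_put_frame_to_node (vm, ip4_lookup_index, to);
    }
  return 0;
}

VLIB_REGISTER_NODE (my_periodic_process_node) = {
  .function = my_periodic_process_fn,
  .type = VLIB_NODE_TYPE_PROCESS,
  .name = "my-periodic-process",
};

In the crash above, such a frame survives the interface deletion and is only
dispatched afterwards, which is when vnet_get_sw_interface asserts.
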
Zhang Dongya via lists.fd.io <fortitude.zhang=gmail....@lists.fd.io>
wrote on Wed, Dec 14, 2022 at 11:52:

> Hi list,
>
> During the test, when the l3 sub-interface is deleted, I got a new abort in the
> interface-drop node; it seems the packet references a deleted interface.
>
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #1  0x00007face8d17859 in __GI_abort () at abort.c:79
>> #2  0x0000000000407397 in os_exit (code=1) at
>> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:440
>> #3  0x00007face922dd57 in unix_signal_handler (signum=6,
>> si=0x7faca2891170, uc=0x7faca2891040) at
>> /home/fortitude/glx/vpp/src/vlib/unix/main.c:188
>> #4  <signal handler called>
>> #5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #6  0x00007face8d17859 in __GI_abort () at abort.c:79
>> #7  0x0000000000407333 in os_panic () at
>> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:416
>> #8  0x00007face9067039 in debugger () at
>> /home/fortitude/glx/vpp/src/vppinfra/error.c:84
>> #9  0x00007face9066dfa in _clib_error (how_to_die=2, function_name=0x0,
>> line_number=0, fmt=0x7face9f7a208 "%s:%d (%s) assertion `%s' fails") at
>> /home/fortitude/glx/vpp/src/vppinfra/error.c:143
>> #10 0x00007face9b28358 in vnet_get_sw_interface (vnm=0x7facea243f38
>> <vnet_main>, sw_if_index=14) at
>> /home/fortitude/glx/vpp/src/vnet/interface_funcs.h:60
>> #11 0x00007face9b2a4ba in interface_drop_punt (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, frame=0x7facc2004a40,
>> disposition=VNET_ERROR_DISPOSITION_DROP)
>>     at /home/fortitude/glx/vpp/src/vnet/interface_output.c:1061
>> #12 0x00007face9b29a96 in interface_drop_fn_hsw (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, frame=0x7facc2004a40) at
>> /home/fortitude/glx/vpp/src/vnet/interface_output.c:1215
>> #13 0x00007face91cd50d in dispatch_node (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, type=VLIB_NODE_TYPE_INTERNAL,
>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7facc2004a40,
>>     last_time_stamp=404307411779413) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:961
>> #14 0x00007face91cdfb0 in dispatch_pending_node (vm=0x7facac8e5b00,
>> pending_frame_index=3, last_time_stamp=404307411779413) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:1120
>> #15 0x00007face91c921f in vlib_main_or_worker_loop (vm=0x7facac8e5b00,
>> is_main=0) at /home/fortitude/glx/vpp/src/vlib/main.c:1589
>> #16 0x00007face91c8947 in vlib_worker_loop (vm=0x7facac8e5b00) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:1723
>> #17 0x00007face92080a4 in vlib_worker_thread_fn (arg=0x7facaa227d00) at
>> /home/fortitude/glx/vpp/src/vlib/threads.c:1579
>> #18 0x00007face9203195 in vlib_worker_thread_bootstrap_fn
>> (arg=0x7facaa227d00) at /home/fortitude/glx/vpp/src/vlib/threads.c:418
>> #19 0x00007face9121609 in start_thread (arg=<optimized out>) at
>> pthread_create.c:477
>> #20 0x00007face8e14133 in clone () at
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>
> From the first mail, I want to know whether the following sequence can happen:
>
> 1. My process node adds a packet directly to ip4-lookup using put_frame_to_node,
> with the RX interface set to the l3 sub-interface created earlier.
> 2. My control plane agent (using govpp) deletes the l3 sub-interface (this
> should be handled in the vpp api-process node).
> 3. vpp schedules the pending nodes. Since the RX interface has been deleted, vpp
> cannot get a valid fib index, and there is no check in the following
> ip4_fib_forwarding_lookup, so it crashes with abort.
>
> I don't think an API barrier in step 2 can solve this, since the packet is
> already in the pending frame vector.
>
> Zhang Dongya via lists.fd.io <fortitude.zhang=gmail....@lists.fd.io>
> wrote on Thu, Dec 8, 2022 at 00:17:
>
>> The crash has not occurred since.
>>
>> Does this fix make sense? If it does, I will submit a patch later.
>>
>> Zhang Dongya via lists.fd.io <fortitude.zhang=gmail....@lists.fd.io> wrote
>> on Tue, Nov 29, 2022 at 22:51:
>>
>>> Hi Ben,
>>>
>>> At first I also thought it was a barrier issue; however, that turned out
>>> not to be the case.
>>>
>>> The packet whose sw_if_index[VLIB_RX] is set to the to-be-deleted interface
>>> is actually put into the ip4-lookup node by my process node, and the process
>>> node adds the packet in a timer-driven way.
>>>
>>> Since the packet is added by my process node, I think it is not affected by
>>> the worker barrier. In my case the sub-interface is deleted via the API, which
>>> is processed in the linux_epoll_input PRE_INPUT node. Consider the following
>>> sequence:
>>>
>>>
>>>    1. My process adds a packet to the ip4-lookup node, and the packet refers
>>>    to a valid sw_if_index.
>>>    2. linux_epoll_input processes an API request to delete that sw_if_index.
>>>    3. vpp schedules the ip4-lookup node, which then crashes because the
>>>    sw_if_index has been deleted and ip4-lookup cannot use
>>>    sw_if_index[VLIB_RX], which is now ~0, to get a valid fib index.
>>>
>>>
>>> There is some existing code that works this way (ikev2_send_ike and others);
>>> I don't think it is feasible to update the pending frames when the interface
>>> is deleted.
>>>
>>> Benoit Ganne (bganne) via lists.fd.io <bganne=cisco....@lists.fd.io>
>>> wrote on Tue, Nov 29, 2022 at 22:22:
>>>
>>>> Hi Zhang,
>>>>
>>>> I'd expect the interface deletion to happen under the worker barrier.
>>>> VPP workers should drain all their in-flight packets before entering the
>>>> barrier, so it should not be possible for the interface to disappear
>>>> between your node and ip4-lookup. Or am I missing something?
>>>> What I have seen happening is that you have some data structure where you
>>>> keep the interface index used in your node, and that data is not updated
>>>> when the interface is removed.
>>>> Regarding your proposal, I suspect an issue could arise when the sw_if_index
>>>> is reused: if you delete a sw_interface and then add a new one, chances are
>>>> you'll be reusing the same index, but the fib_index might be different.
>>>>
>>>> Best
>>>> ben
>>>>
>>>> > -----Original Message-----
>>>> > From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Zhang Dongya
>>>> > Sent: Tuesday, November 29, 2022 3:45
>>>> > To: vpp-dev@lists.fd.io
>>>> > Subject: Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash
>>>> >
>>>> >
>>>> > I have found a solution and it can solve the crash issue.
>>>> >
>>>> > In ip4_sw_interface_add_del, which is a callback for interface deletion, we
>>>> > may set the fib index of the removed interface to 0 (the default fib) instead
>>>> > of ~0. This matches the behavior at interface creation.
>>>> >
>>>> >
>>>> >
>>>> > Zhang Dongya via lists.fd.io <fortitude.zhang=gmail....@lists.fd.io> wrote
>>>> > on Mon, Nov 28, 2022 at 19:41:
>>>> >
>>>> >
>>>> >       Hi list,
>>>> >
>>>> >       Recently I encountered a vpp crash with my plugin enabled; after
>>>> > some investigation I found it may be related to an l3 sub-interface being
>>>> > deleted while my process node adds work to the ip4-lookup node.
>>>> >
>>>> >       Intuitively I thought it might be related to barrier usage, but I
>>>> > tried to fix it by adding some checks in my process node to guard against
>>>> > the case where the l3 sub-interface is deleted; however, the crash still
>>>> > occurs.
>>>> >
>>>> >       Finally I think it is related to a pattern like this:
>>>> >
>>>> >       1. My process node adds a packet directly to ip4-lookup using
>>>> > put_frame_to_node, with the RX interface set to the l3 sub-interface
>>>> > created earlier.
>>>> >
>>>> >       2. My control plane agent (using govpp) deletes the l3 sub-interface
>>>> > (this should be handled in the vpp api-process node).
>>>> >
>>>> >       3. vpp schedules the pending nodes. Since the RX interface has been
>>>> > deleted, vpp cannot get a valid fib index, and there is no check in the
>>>> > following ip4_fib_forwarding_lookup, so it crashes with abort.
>>>> >
>>>> >       I think vpp may schedule my process node (timeout driven) and the
>>>> > api-process node one after the other, and then schedule the pending nodes.
>>>> >
>>>> >       Should I add a check in ip4-lookup, or is there a better way to send
>>>> > packets from a ctrl process?
>>>> >
>>>> >       Thanks a lot.