Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash

2022-12-13 Thread Zhang Dongya
By adding the following code right after the process-node dispatch in the main
loop, the crash is fixed.

So I think the condition mentioned above is a rare but valid case:

A ctrl process node, when scheduled, adds a packet (pending frame) to a node,
and that packet refers to an interface which will be deleted shortly afterwards.

The interface is then deleted in the unix_epoll_input PRE_INPUT node, which
handles API input, and on the following graph-scheduling pass this triggers
various assert failures.


  {
    /* Ctrl nodes may have added work to the pending vector too.
       Process pending vector until there is nothing left.
       All pending vectors will be processed from input -> output. */
    for (i = 0; i < _vec_len (nm->pending_frames); i++)
      cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
    /* Reset pending vector for next iteration. */
    vec_set_len (nm->pending_frames, 0);

    if (is_main)
      {
        /* We also need a barrier here so that workers finish any
           packets that have been handed off to them. */
        vlib_worker_thread_barrier_sync (vm);
        vlib_worker_thread_barrier_release (vm);
      }
  }

Zhang Dongya via lists.fd.io wrote on Wednesday, December 14, 2022 at 11:52:

> Hi list,
>
> During the test, when the l3 sub-interface is deleted, I got a new abort in the
> interface drop node; it seems the packet references a deleted interface.
>
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #1  0x7face8d17859 in __GI_abort () at abort.c:79
>> #2  0x00407397 in os_exit (code=1) at
>> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:440
>> #3  0x7face922dd57 in unix_signal_handler (signum=6,
>> si=0x7faca2891170, uc=0x7faca2891040) at
>> /home/fortitude/glx/vpp/src/vlib/unix/main.c:188
>> #4  
>> #5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>> #6  0x7face8d17859 in __GI_abort () at abort.c:79
>> #7  0x00407333 in os_panic () at
>> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:416
>> #8  0x7face9067039 in debugger () at
>> /home/fortitude/glx/vpp/src/vppinfra/error.c:84
>> #9  0x7face9066dfa in _clib_error (how_to_die=2, function_name=0x0,
>> line_number=0, fmt=0x7face9f7a208 "%s:%d (%s) assertion `%s' fails") at
>> /home/fortitude/glx/vpp/src/vppinfra/error.c:143
>> #10 0x7face9b28358 in vnet_get_sw_interface (vnm=0x7facea243f38
>> , sw_if_index=14) at
>> /home/fortitude/glx/vpp/src/vnet/interface_funcs.h:60
>> #11 0x7face9b2a4ba in interface_drop_punt (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, frame=0x7facc2004a40,
>> disposition=VNET_ERROR_DISPOSITION_DROP)
>> at /home/fortitude/glx/vpp/src/vnet/interface_output.c:1061
>> #12 0x7face9b29a96 in interface_drop_fn_hsw (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, frame=0x7facc2004a40) at
>> /home/fortitude/glx/vpp/src/vnet/interface_output.c:1215
>> #13 0x7face91cd50d in dispatch_node (vm=0x7facac8e5b00,
>> node=0x7faca95c8840, type=VLIB_NODE_TYPE_INTERNAL,
>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7facc2004a40,
>> last_time_stamp=404307411779413) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:961
>> #14 0x7face91cdfb0 in dispatch_pending_node (vm=0x7facac8e5b00,
>> pending_frame_index=3, last_time_stamp=404307411779413) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:1120
>> #15 0x7face91c921f in vlib_main_or_worker_loop (vm=0x7facac8e5b00,
>> is_main=0) at /home/fortitude/glx/vpp/src/vlib/main.c:1589
>> #16 0x7face91c8947 in vlib_worker_loop (vm=0x7facac8e5b00) at
>> /home/fortitude/glx/vpp/src/vlib/main.c:1723
>> #17 0x7face92080a4 in vlib_worker_thread_fn (arg=0x7facaa227d00) at
>> /home/fortitude/glx/vpp/src/vlib/threads.c:1579
>> #18 0x7face9203195 in vlib_worker_thread_bootstrap_fn
>> (arg=0x7facaa227d00) at /home/fortitude/glx/vpp/src/vlib/threads.c:418
>> #19 0x7face9121609 in start_thread (arg=) at
>> pthread_create.c:477
>> #20 0x7face8e14133 in clone () at
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>
> From the first mail, I want to know whether the following sequence can happen or not:
>
> 1. My process node adds a pkt directly to ip4-lookup using put_frame_to_node,
> which sets the rx interface to the l3 sub-interface created before.
> 2. My control plane agent (using govpp) deletes the l3 sub-interface (this
> should be handled in the vpp api-process node).
> 3. vpp schedules the pending nodes. Since the rx interface is deleted, vpp
> can't get a valid fib index, and there is no check for this in the following
> ip4_fib_forwarding_lookup, so it crashes with abort.
>
> I don't think an API barrier in step 2 can solve this, since the pkt is
> already in the pending frame.
>
> Zhang Dongya via lists.fd.io wrote on Thursday, December 8, 2022 at 00:17:
>
>> The crash has not occurred anymore.
>>
>> Does this fix make any sense? If it does, I will submit a patch later.
>>
>> Zhang Dongya via lists.fd.io wrote on Tuesday, November 29, 2022 at 22:51:
>>
>>> Hi Ben,
>>>
>>> In the 

Re: [vpp-dev] memory growth in charon using vpp_sswan

2022-12-13 Thread Mahdi Varasteh
Hi Kai,

Thanks for your response. Yes, your understanding is correct and
`stat_segment_connect` is called only when querying SAs. This query does not
occur except in two cases: 1) when you ask for SA status, and 2) before sending
DPD messages. I'm afraid using version 5.9.6 did not change anything. May I ask
whether you use DPD on your end? (It is disabled by default; to enable it, a
`dpd_delay` value should be set in `swanctl.conf`.) If dead peer detection is
disabled and I don't query my SAs too often, the situation does not arise.
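
For reference, a minimal illustrative swanctl.conf fragment that enables DPD for
a connection could look like the following (the connection name and the 30s
interval are just examples, not taken from my setup):

    connections {
        vpp-conn {                 # example connection name
            dpd_delay = 30s        # send DPD probes; default is 0s (disabled)
            # other connection settings omitted
        }
    }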

Regards

Mahdi




Re: [vpp-dev] possible use deleted sw if index in ip4-lookup and cause crash

2022-12-13 Thread Zhang Dongya
Hi list,

During the test, when the l3 sub-interface is deleted, I got a new abort in the
interface drop node; it seems the packet references a deleted interface.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1  0x7face8d17859 in __GI_abort () at abort.c:79
> #2  0x00407397 in os_exit (code=1) at
> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:440
> #3  0x7face922dd57 in unix_signal_handler (signum=6,
> si=0x7faca2891170, uc=0x7faca2891040) at
> /home/fortitude/glx/vpp/src/vlib/unix/main.c:188
> #4  
> #5  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #6  0x7face8d17859 in __GI_abort () at abort.c:79
> #7  0x00407333 in os_panic () at
> /home/fortitude/glx/vpp/src/vpp/vnet/main.c:416
> #8  0x7face9067039 in debugger () at
> /home/fortitude/glx/vpp/src/vppinfra/error.c:84
> #9  0x7face9066dfa in _clib_error (how_to_die=2, function_name=0x0,
> line_number=0, fmt=0x7face9f7a208 "%s:%d (%s) assertion `%s' fails") at
> /home/fortitude/glx/vpp/src/vppinfra/error.c:143
> #10 0x7face9b28358 in vnet_get_sw_interface (vnm=0x7facea243f38
> , sw_if_index=14) at
> /home/fortitude/glx/vpp/src/vnet/interface_funcs.h:60
> #11 0x7face9b2a4ba in interface_drop_punt (vm=0x7facac8e5b00,
> node=0x7faca95c8840, frame=0x7facc2004a40,
> disposition=VNET_ERROR_DISPOSITION_DROP)
> at /home/fortitude/glx/vpp/src/vnet/interface_output.c:1061
> #12 0x7face9b29a96 in interface_drop_fn_hsw (vm=0x7facac8e5b00,
> node=0x7faca95c8840, frame=0x7facc2004a40) at
> /home/fortitude/glx/vpp/src/vnet/interface_output.c:1215
> #13 0x7face91cd50d in dispatch_node (vm=0x7facac8e5b00,
> node=0x7faca95c8840, type=VLIB_NODE_TYPE_INTERNAL,
> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7facc2004a40,
> last_time_stamp=404307411779413) at
> /home/fortitude/glx/vpp/src/vlib/main.c:961
> #14 0x7face91cdfb0 in dispatch_pending_node (vm=0x7facac8e5b00,
> pending_frame_index=3, last_time_stamp=404307411779413) at
> /home/fortitude/glx/vpp/src/vlib/main.c:1120
> #15 0x7face91c921f in vlib_main_or_worker_loop (vm=0x7facac8e5b00,
> is_main=0) at /home/fortitude/glx/vpp/src/vlib/main.c:1589
> #16 0x7face91c8947 in vlib_worker_loop (vm=0x7facac8e5b00) at
> /home/fortitude/glx/vpp/src/vlib/main.c:1723
> #17 0x7face92080a4 in vlib_worker_thread_fn (arg=0x7facaa227d00) at
> /home/fortitude/glx/vpp/src/vlib/threads.c:1579
> #18 0x7face9203195 in vlib_worker_thread_bootstrap_fn
> (arg=0x7facaa227d00) at /home/fortitude/glx/vpp/src/vlib/threads.c:418
> #19 0x7face9121609 in start_thread (arg=) at
> pthread_create.c:477
> #20 0x7face8e14133 in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>

From the first mail, I want to know whether the following sequence can happen or not:

1. My process node adds a pkt directly to ip4-lookup using put_frame_to_node,
which sets the rx interface to the l3 sub-interface created before (see the
sketch after this list).
2. My control plane agent (using govpp) deletes the l3 sub-interface (this
should be handled in the vpp api-process node).
3. vpp schedules the pending nodes. Since the rx interface is deleted, vpp can't
get a valid fib index, and there is no check for this in the following
ip4_fib_forwarding_lookup, so it crashes with abort.
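
For clarity, here is a minimal sketch of the kind of injection done in step 1.
It is only an illustration: the helper name, the buffer index argument and the
includes are mine, and the real process node of course builds a complete packet
before handing it over.

  #include <vlib/vlib.h>
  #include <vnet/vnet.h>   /* vnet_buffer() / VLIB_RX */

  /* Hypothetical helper called from a process node: inject one buffer
     straight into ip4-lookup with VLIB_RX pointing at a sub-interface. */
  static void
  inject_pkt_to_ip4_lookup (vlib_main_t * vm, u32 bi, u32 rx_sw_if_index)
  {
    u32 node_index = vlib_get_node_by_name (vm, (u8 *) "ip4-lookup")->index;
    vlib_buffer_t *b = vlib_get_buffer (vm, bi);
    vlib_frame_t *f;
    u32 *to_next;

    /* ip4-lookup keys off the RX interface to pick the FIB; if that
       interface is deleted before dispatch, this is the stale reference
       discussed above. */
    vnet_buffer (b)->sw_if_index[VLIB_RX] = rx_sw_if_index;

    f = vlib_get_frame_to_node (vm, node_index);
    to_next = vlib_frame_vector_args (f);
    to_next[0] = bi;
    f->n_vectors = 1;

    /* This only queues the frame on nm->pending_frames; it is dispatched
       later in the main loop, after API input may already have run. */
    vlib_put_frame_to_node (vm, node_index, f);
  }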

I don't think an API barrier in step 2 can solve this, since the pkt is
already in the pending frame.

Zhang Dongya via lists.fd.io wrote on Thursday, December 8, 2022 at 00:17:

> The crash has not occurred anymore.
>
> Does this fix make any sense? If it does, I will submit a patch later.
>
> Zhang Dongya via lists.fd.io wrote on Tuesday, November 29, 2022 at 22:51:
>
>> Hi Ben,
>>
>> In the beginning I also thought it would be a barrier issue, but it
>> turned out not to be the case.
>>
>> The pkt which has sw_if_index[VLIB_RX] set to the to-be-deleted interface
>> is actually put into the ip4-lookup node by my process node; the process
>> node adds pkts in a timer-driven way.
>>
>> Since the pkt is added by my process node, I think it is not affected by
>> the worker barrier. In my case the sub-interface is deleted via the API, which
>> is processed in the linux_epoll_input PRE_INPUT node. Let's consider the
>> following sequence:
>>
>>
>>    1. My process node adds a pkt to the ip4-lookup node, and the pkt refers
>>    to a valid sw if index.
>>    2. linux_epoll_input processes an API request to delete the above sw if
>>    index.
>>    3. vpp schedules the ip4-lookup node, which then crashes because the sw if
>>    index has been deleted and the ip4-lookup node can't use
>>    sw_if_index[VLIB_RX], which is now ~0, to get a valid fib index.
>>
>>
>> There is some code that works this way (ikev2_send_ike and others); I think
>> it's not feasible to update the pending frames when the interface is deleted.
>>
>> Benoit Ganne (bganne) via lists.fd.io wrote on Tuesday, November 29, 2022 at 22:22:
>>
>>> Hi Zhang,
>>>
>>> I'd expect the interface deletion to happen under the worker barrier.
>>> VPP workers should drain all their in-flight packets before entering the
>>> barrier, so it should not be possible for the interface to disappear
>>> 

Re: [vpp-dev] memory growth in charon using vpp_sswan

2022-12-13 Thread Ji, Kai
Hi Mahdi,

Thank you for reporting your discovery in vpp_sswan; unfortunately we haven't seen
this memory growth issue on our end.
If my understanding is correct, the stat_segment_connect function should be
called only if you want to see how many packets and bytes were processed by
each SA; the function is part of vpp itself, and there are no changes in the sswan
plugin.
Can I ask you to try a different version of sswan (5.9.5 or 5.9.6) to see whether
the issue still remains?

regards

Kai




[vpp-dev] memory growth in charon using vpp_sswan

2022-12-13 Thread Mahdi Varasteh
Hi,

I used the plugin residing in `extras/vpp_sswan` with both Strongswan versions
5.9.8 and 5.8.2. All functionality works great, but after Child SAs are
established there is constant memory growth in the charon process (going from
20MB RSS to 46MB RSS in 4 days; the growth is visible from the moment the SAs
are established and is steady. I also haven't tested whether the issue exists
with IKE SAs only, with no Child SA). There are also no complaints from ASAN.

Interestingly, if `stat_segment_disconnect` is not called (and
`stat_segment_connect` is called only once) in the `query_sa` function, the
problem is solved and there is no memory growth. I tried protecting the process
of querying SAs with a mutex, but it did not help. I would like to know whether
there is any specific reason behind this connect / fetch stats / disconnect
pattern. Couldn't we connect once and then keep the connection? I've also
checked and tested the `stat_client.c` code; it seems fine and there are no
memory issues there.
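
For illustration, a connect-once variant of the query could look roughly like the
sketch below. It is only a sketch against the VPP stat client API as I understand
it (stat_segment_connect / stat_segment_string_vector / stat_segment_ls /
stat_segment_dump / stat_segment_data_free); the socket path, the "/net/ipsec/sa"
pattern and the per-SA indexing are assumptions on my side, not necessarily what
the plugin really does:

  #include <vppinfra/vec.h>
  #include <vpp-api/client/stat_client.h>

  /* Connect once at plugin start-up instead of per query
     (socket path is the usual default, adjust as needed). */
  static int
  stats_init (void)
  {
    return stat_segment_connect ("/run/vpp/stats.sock");
  }

  /* Read per-SA packet/byte counters without disconnecting afterwards.
     Assumes the combined counters are published under "/net/ipsec/sa"
     and indexed by SA index. */
  static void
  query_sa_counters (u32 sa_index, u64 * packets, u64 * bytes)
  {
    u8 **patterns = 0;
    patterns = stat_segment_string_vector (patterns, "/net/ipsec/sa");

    u32 *dir = stat_segment_ls (patterns);
    stat_segment_data_t *res = stat_segment_dump (dir);

    *packets = *bytes = 0;
    for (int i = 0; i < vec_len (res); i++)
      if (res[i].type == STAT_DIR_TYPE_COUNTER_VECTOR_COMBINED)
        for (int k = 0; k < vec_len (res[i].combined_counter_vec); k++)
          {
            *packets += res[i].combined_counter_vec[k][sa_index].packets;
            *bytes += res[i].combined_counter_vec[k][sa_index].bytes;
          }

    /* Free the dump results but keep the stat segment connection open;
       freeing of dir/patterns is left out here for brevity. */
    stat_segment_data_free (res);
  }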

So any advice regarding where to lookup or which tools should be used to catch 
the root cause of this problem?
