Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-20 Thread Zhang Dongya
Hi,

After reviewing my code, I found that I had added a flag to the vnet_disconnect
API which makes it call session_reset instead of session_close; the reason I do
this is so that intermediate firewalls just flush their state and rebuild it if
I later reconnect (a sketch follows at the end of this message).

It seems that in the session_reset logic, for half-open sessions, the removal
of the session from the lookup hash is also missing, which may cause this issue
too.

I changed my code and will test with your patch as well; I will provide
feedback later.

I also noticed the bihash issue discussed on the list recently; I will merge
that later.
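
For reference, a minimal sketch of the change described above: a flag on the
disconnect path that resets instead of closing. The do_reset field and the
app_disconnect wrapper are illustrative only, not upstream VPP API;
session_reset, session_close and session_get_from_handle_if_valid are the
session-layer calls being wrapped.

/* Sketch only: 'do_reset' is a hypothetical flag, not part of the
 * upstream vnet_disconnect_args_t. */
typedef struct app_disconnect_args_
{
  session_handle_t handle;
  u32 app_index;
  u8 do_reset;		/* if set, abort with RST instead of FIN */
} app_disconnect_args_t;

static int
app_disconnect (app_disconnect_args_t * a)
{
  session_t *s = session_get_from_handle_if_valid (a->handle);
  if (!s)
    return VNET_API_ERROR_INVALID_VALUE;
  if (a->do_reset)
    session_reset (s);	/* abortive close: middleboxes flush their state */
  else
    session_close (s);	/* orderly close: FIN handshake */
  return 0;
}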


Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-20 Thread Florin Coras
Hi, 

That last thing is pretty interesting. It’s either the issue fixed by this 
patch [1] or sessions are somehow cleaned up multiple times. If it’s the 
latter, I’d really like to understand how that happens. 

Regards,
Florin

[1] https://gerrit.fd.io/r/c/vpp/+/38507 
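
For context, "that last thing" refers to the 'session %u hash delete rv -3'
warning reported in the message below. Assuming the usual vppinfra bihash
semantics (deleting a missing key returns -3), that warning is consistent with
a double cleanup; a paraphrased sketch of the delete path, not verbatim source:

static void
session_delete_sketch (session_t * s)
{
  /* Assumption: rv -3 bubbles up from the lookup-table (bihash) layer
   * when the key is absent, i.e. this session's 5-tuple was already
   * removed once before. */
  int rv = session_lookup_del_session (s);
  if (rv)
    clib_warning ("session %u hash delete rv %d", s->session_index, rv);
}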


Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-20 Thread Zhang Dongya
Hi,

After merging this patch and updating the test environment, the issue still
persists.

Let me clarify my client app config:
1. Register a reset callback, which calls vnet_disconnect and also triggers a
reconnect by sending an event to the ctrl process.
2. Register a connected callback, which handles connect errors by triggering a
reconnect; on success, it records the session handle and extracts the tcp
sequence for our app's usage.
3. Register a disconnect callback, which does basically the same as the reset
callback.
4. Register a cleanup callback and an accept callback, which basically keep
the session layer happy without any relevant work to do. (A sketch of this
callback wiring follows below.)

There is a ctrl process on master, which handles reconnects periodically or
when triggered by an event.

BTW, I also frequently see the warning 'session %u hash delete rv -3' from
session_delete in my environment; hope this helps the investigation.
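
As referenced above, a minimal sketch of that callback wiring for a builtin
app; the callback signatures are approximate (they differ slightly across VPP
versions) and app_schedule_reconnect () is a hypothetical helper standing in
for the event sent to the ctrl process.

/* Sketch of the callback wiring enumerated above; signatures are
 * approximate and vary across VPP versions. */
static int
app_accept_cb (session_t * s)
{
  return 0;			/* keeps the session layer happy */
}

static int
app_connected_cb (u32 app_wrk_index, u32 opaque, session_t * s,
		  session_error_t err)
{
  if (err)
    return app_schedule_reconnect ();	/* hypothetical helper */
  /* success: record session handle, extract tcp sequence for app usage */
  return 0;
}

static void
app_disconnect_cb (session_t * s)
{
  app_schedule_reconnect ();	/* same handling as reset */
}

static void
app_reset_cb (session_t * s)
{
  app_schedule_reconnect ();	/* vnet_disconnect + reconnect event */
}

static void
app_cleanup_cb (session_t * s, session_cleanup_ntf_t ntf)
{
  /* no app state to free */
}

static session_cb_vft_t app_cb_vft = {
  .session_accept_callback = app_accept_cb,
  .session_connected_callback = app_connected_cb,
  .session_disconnect_callback = app_disconnect_cb,
  .session_reset_callback = app_reset_cb,
  .session_cleanup_callback = app_cleanup_cb,
};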


Re: [vpp-dev] #vnet A bug which may cause assertion error in vnet/session

2023-03-20 Thread Florin Coras
Hi, 

First of all, could you try this [1] with the latest vpp? It's really
interesting that iperf does not exhibit this issue.

Regarding your config, some observations:
- I see you have configured 4 workers. I would then recommend using 4 rx-queues
and 5 tx-queues (main can send packets too), as opposed to 2.
- tcp defaults to cubic, so that config can be omitted.
- evt_qs_memfd_seg is now deprecated, so it can be omitted as well.
- any particular reason for "set interface rx-mode eth1 polling"? dpdk
interfaces are in polling mode by default.
- you're using the binary api socket "api-socket-name /run/vpp/api.sock". That
works, but going forward we'll slowly deprecate that api, so I'd recommend
using the app socket api. See for instance [2] for the changes needed to the
session stanza and vcl (a sketch follows after the links below).

Regards,
Florin

[1] https://gerrit.fd.io/r/c/vpp/+/38529
[2] https://wiki.fd.io/view/VPP/HostStack/LDP/iperf
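
As referenced above, a sketch of the suggested changes; the use-app-socket-api
session option and the default app socket path follow the wiki example [2],
and the queue counts apply to the dpdk stanza quoted later in this thread.
Illustrative, not authoritative:

# startup.conf (sketch)
dpdk {
  dev :03:00.0 {
    name eth1
    num-rx-queues 4	# one per worker
    num-tx-queues 5	# workers + main
  }
}
session { use-app-socket-api }

# vcl.conf (sketch): replace api-socket-name with the app socket api
vcl {
  app-socket-api /var/run/vpp/app_ns_sockets/default
}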



Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-20 Thread Florin Coras
Hi, 

Understood and yes, connect will synchronously fail if port is not available, 
so you should be able to retry it later. 

Regards, 
Florin
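
In code, that synchronous failure is just a non-zero return from
vnet_connect (), so the caller can simply reschedule. A minimal sketch, where
app_schedule_reconnect_after () and the delay are illustrative:

static void
app_try_connect (vnet_connect_args_t * a)
{
  int rv = vnet_connect (a);
  if (rv)
    /* e.g. the port is still held by a not-yet-freed half-open:
     * retry later from the ctrl process instead of failing hard */
    app_schedule_reconnect_after (1.0 /* seconds, illustrative */);
}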


Re: [vpp-dev] #vnet A bug which may cause assertion error in vnet/session

2023-03-20 Thread Chen Weihao
Thanks for your reply.
I give a more detailed backtrace and config in
https://lists.fd.io/g/vpp-dev/message/22731. My installation method is to
clone vpp from github and make build on Ubuntu 22.04 (kernel version 5.19),
and I use make run for testing and make debug for debugging. Yes, I tried to
attach the server and the client to the same vpp instance. I tried the latest
version of vpp from github yesterday, and the problem still exists.
I am looking forward to your reply.




Re: [vpp-dev] #vnet A bug which may cause assertion error in vnet/session

2023-03-20 Thread Chen Weihao
Thank you for your reply.
This is the stacktrace captured by gdb:

2: /home/chenweihao/vpp_dev/src/vnet/session/session.c:233 (session_is_valid) 
assertion `! pool_is_free (session_main.wrk[thread_index].sessions, _e)' fails

Thread 4 "vpp_wk_1" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa93f5640 (LWP 4575)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140736032888384) at 
./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736032888384)
at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140736032888384)
at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140736032888384, signo=signo@entry=6)
at ./nptl/pthread_kill.c:89
#3  0x76a42476 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/posix/raise.c:26
#4  0x76a287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0xb073 in os_panic ()
at /home/chenweihao/vpp_dev/src/vpp/vnet/main.c:417
#6  0x76f0a699 in debugger ()
at /home/chenweihao/vpp_dev/src/vppinfra/error.c:84
#7  0x76f0a450 in _clib_error (how_to_die=2, function_name=0x0,
line_number=0, fmt=0x77b88208 "%s:%d (%s) assertion `%s' fails")
at /home/chenweihao/vpp_dev/src/vppinfra/error.c:143
#8  0x775f31c7 in session_is_valid (si=4294967295,
thread_index=1 '\001')
at /home/chenweihao/vpp_dev/src/vnet/session/session.c:233
#9  0x775f177c in session_get (si=4294967295, thread_index=1)
at /home/chenweihao/vpp_dev/src/vnet/session/session.h:373
#10 0x775f3770 in ho_session_get (ho_index=4294967295)
at /home/chenweihao/vpp_dev/src/vnet/session/session.h:689
#11 0x775f37d8 in session_half_open_migrate_notify (tc=0x7fffbdba00c0)
at /home/chenweihao/vpp_dev/src/vnet/session/session.c:357
#12 0x77648600 in ct_accept_one (thread_index=2, ho_index=2)
at /home/chenweihao/vpp_dev/src/vnet/session/application_local.c:669
#13 0x77648243 in ct_accept_rpc_wrk_handler (rpc_args=0x2)
at /home/chenweihao/vpp_dev/src/vnet/session/application_local.c:760
#14 0x77620b42 in session_event_dispatch_ctrl (wrk=0x7fffbdb6fe00,
elt=0x7fffbdccaf2c)
at /home/chenweihao/vpp_dev/src/vnet/session/session_node.c:1656
#15 0x776175c2 in session_queue_node_fn (vm=0x7fffb82da700,
node=0x7fffbdbc21c0, frame=0x0)
at /home/chenweihao/vpp_dev/src/vnet/session/session_node.c:1962
#16 0x77ea3a62 in dispatch_node (vm=0x7fffb82da700,
node=0x7fffbdbc21c0, type=VLIB_NODE_TYPE_INPUT,
dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x0,
last_time_stamp=2911620275652)
at /home/chenweihao/vpp_dev/src/vlib/main.c:960
#17 0x77e9f7d1 in vlib_main_or_worker_loop (vm=0x7fffb82da700,
is_main=0) at /home/chenweihao/vpp_dev/src/vlib/main.c:1557
#18 0x77e9f1c7 in vlib_worker_loop (vm=0x7fffb82da700)
at /home/chenweihao/vpp_dev/src/vlib/main.c:1722
#19 0x77edb020 in vlib_worker_thread_fn (arg=0x7fffb9182980)
at /home/chenweihao/vpp_dev/src/vlib/threads.c:1598
#20 0x77ed62b6 in vlib_worker_thread_bootstrap_fn (arg=0x7fffb9182980)
at /home/chenweihao/vpp_dev/src/vlib/threads.c:418
#21 0x76a94b43 in start_thread (arg=)
at ./nptl/pthread_create.c:442
#22 0x76b26a00 in clone3 ()
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
The reason the assertion failed is that the c_s_index value is -1 (i.e., ~0).
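
A hypothetical reconstruction of frames #8-#11 (not verbatim VPP source) shows
where that ~0 index trips the assertion:

/* Hypothetical sketch of session_half_open_migrate_notify (), frame
 * #11 above; tc->c_s_index was never initialized, so it is still ~0. */
static void
migrate_notify_sketch (transport_connection_t * tc)
{
  session_t *ho = ho_session_get (tc->c_s_index);	/* c_s_index == ~0 */
  /* ho_session_get -> session_get (~0, thread) -> session_is_valid (),
   * whose ASSERT (! pool_is_free (...)) is the abort in the trace */
  (void) ho;
}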

The Redis version I used:
git clone https://github.com/redis/redis.git
cd redis
git checkout 6.0
make

My startup.conf:
unix {
nodaemon
interactive
full-coredump
cli-listen /run/vpp/cli.sock
gid vpp
startup-config /home/chenweihao/startup.txt
}

api-trace {
on
}

cpu {
main-core 0
workers 4
}

tcp {
cc-algo cubic
}

dpdk {
uio-driver vfio-pci
dev :03:00.0 {
name eth1
num-rx-queues 2
num-tx-queues 2
}
}

buffers {
buffers-per-numa 131072
}

session { evt_qs_memfd_seg  }

My startup.txt:
set interface ip address eth1 192.168.0.2/24
set interface state eth1 up
set interface rx-mode eth1 polling
create loopback interface
set interface ip address loop0 127.0.0.1/8
set interface state loop0 up

My vcl.conf:
vcl {
rx-fifo-size 4000
tx-fifo-size 4000
app-scope-local
app-scope-global
api-socket-name /run/vpp/api.sock
use-mq-eventfd
}
I tried iperf3 as you suggested; indeed, I did not encounter this problem.




Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

2023-03-20 Thread Zhang Dongya
Hi,

It seems the issue occurs when disconnects are called, because our network
can't guarantee that a tcp connection won't be reset even after the three-way
handshake completes (firewall issue :( ).

When we detect an app-layer timeout, we first disconnect (because we record
the session handle, this session might be a half-open session). Does the vnet
session layer guarantee that if we reconnect from the master thread while the
half-open session has not yet been released (due to asynchronous logic), the
reconnect fails? If so, we can retry the connect later.

I prefer not to register a half-open callback because I think it makes the app
complicated from a TCP programming perspective.

As for your patch, I think it should work: I can't delete the half-open
session immediately because a worker is configured, so the half-open will be
removed from the bihash when the SYN retransmission times out. I have merged
the patch and will provide feedback later.

Florin Coras  wrote on Mon, Mar 20, 2023 at 13:09:

> Hi,
>
> Inline.
>
> On Mar 19, 2023, at 6:47 PM, Zhang Dongya 
> wrote:
>
> Hi,
>
> It can be aborted both in the established state and the half-open state,
> because we do timeouts in our app layer.
>
>
> [fc] Okay! Is the issue present irrespective of the state of the session,
> or does it happen only after a disconnect in half-open state? More below.
>
>
> Regarding your question,
>
> - Yes, we have a builtin app relying on C APIs that mainly uses
> vnet_connect/disconnect to connect or disconnect sessions.
>
>
> [fc] Understood
>
> - We call these APIs in a vpp ctrl process, which should be running on the
> master thread; we never do session setup/teardown on a worker thread. (The
> environment where this issue was found is configured with 1 master + 1
> worker.)
>
>
> [fc] With vpp latest it's possible to connect from the first worker. It's an
> optimization meant to avoid 1) the worker barrier on SYNs and 2) entering
> poll mode on main (consumes less cpu)
>
> - We started developing the app on 22.06, and I keep merging upstream
> changes from latest vpp by cherry-picking. The reason for the line mismatch
> is that I added some comments to the session layer code; it should now be
> equal to the master branch.
>
>
> [fc] Ack
>
>
> When reading the code, I understand that we mainly want to clean up the half
> open from the bihash in session_stream_connect_notify; however, in SYN-SENT
> state, if I choose to close the session, the session might be closed by my
> app due to a session setup timeout (on a seconds scale). In that case, the
> session will be marked as half_open_done and the half-open session will be
> freed shortly in the ctrl thread (the 1st worker?).
>
>
> [fc] Actually, this might be the issue. We did start to provide a
> half-open session handle to apps which, if closed, does clean up the
> session, but apparently it is missing the cleanup of the session lookup
> table. Could you try this patch [1]? It might need additional work.
>
> Having said that, forcing a close/cleanup will not free the port
> synchronously. So, if you’re using fixed ports, you’ll have to wait for the
> half-open cleanup notification.
>
>
> Should I also register a half-open callback, or is there some other reason
> that leads to this failure?
>
>
> [fc] Yes, see above.
>
> Regards,
> Florin
>
> [1] https://gerrit.fd.io/r/c/vpp/+/38526
>
>
> Florin Coras  wrote on Mon, Mar 20, 2023 at 06:22:
>
>> Hi,
>>
>> When you abort the connection, is it fully established or half-open?
>> Half-opens are cleaned up by the owner thread after a timeout, but the
>> 5-tuple should be assigned to the fully established session by that point.
>> tcp_half_open_connection_cleanup does not clean up the bihash; instead,
>> session_stream_connect_notify does, once tcp connect returns either success
>> or failure.
>>
>> So a few questions:
>> - is it accurate to assume you have a builtin vpp app and rely only on C
>> apis to interact with host stack?
>> - on what thread (main or first worker) do you call vnet_connect?
>> - what api do you use to close the session?
>> - what version of vpp is this because lines don’t match vpp latest?
>>
>> Regards,
>> Florin
>>
>> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya 
>> > wrote:
>> >
>> > Hi list,
>> >
>> > recently our application constantly triggered the following abort, which
>> > interrupts our connectivity for a while:
>> >
>> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC
>> > 0x7fefd3b2000b
>> > Mar 19 16:11:26 ubuntu vnet[2565933]:
>> > /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline)
>> > assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails
>> >
>> > Our scenario is quite simple: we make 4 parallel tcp connections (using 4
>> > fixed source ports) to a remote vpp stack (fixed ip and port), and we do
>> > some keepalive in our application layer. Since we only use the vpp tcp
>> > stack to make the middleboxes happy with the connection, we do not
>> > actually use tcp's data transport.
>> >
>> > However, since the network condition is