Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

Florin Coras Thu, 23 Mar 2023 09:36:30 -0700

Hi Zhang, 

Thanks for confirming! Give me a few more days to check if there’s any other 
improvements to be made in that area.


Regards,
Florin 

> On Mar 23, 2023, at 12:00 AM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
> 
> Hi,
> 
> The new patch works as expected, no assert triggered abort anymore.
> 
> Really appreciate your help and thanks a lot.
> 
> Florin Coras <fcoras.li...@gmail.com <mailto:fcoras.li...@gmail.com>> 
> 于2023年3月22日周三 11:54写道：
>> Hi Zhang, 
>> 
>> Awesome! Thanks!
>> 
>> Regards,
>> Florin
>> 
>>> On Mar 21, 2023, at 7:41 PM, Zhang Dongya <fortitude.zh...@gmail.com 
>>> <mailto:fortitude.zh...@gmail.com>> wrote:
>>> 
>>> Hi Florin,
>>> 
>>> Thanks a lot, the previous patch and with reset disabled have been running 
>>> 1 day without issue.
>>> 
>>> I will enable reset and with your new patch, will provide feedback later.
>>> 
>>> Florin Coras <fcoras.li...@gmail.com <mailto:fcoras.li...@gmail.com>> 
>>> 于2023年3月22日周三 02:12写道：
>>>> Hi, 
>>>> 
>>>> Okay, resetting of half-opens definitely not supported. I updated the 
>>>> patch to just clean them up on forced reset, without sending a reset to 
>>>> make sure session lookup table cleanup still happens. 
>>>> 
>>>> Regards,
>>>> Florin
>>>> 
>>>>> On Mar 20, 2023, at 9:13 PM, Zhang Dongya <fortitude.zh...@gmail.com 
>>>>> <mailto:fortitude.zh...@gmail.com>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> After review my code, I found that I have add a flag to the 
>>>>> vnet_disconnect API which will call session_reset instead of 
>>>>> session_close, the reason I do this is to make intermediate firewall just 
>>>>> flush the state and reconstruct if I later reconnect.
>>>>> 
>>>>> It seems in session_reset logic, for half open session, it also missing 
>>>>> to remove the session from the lookup hash which may cause the issue too.
>>>>> 
>>>>> I change my code and will test with your patch along, will provide 
>>>>> feedback later.
>>>>> 
>>>>> I also noticed the bihash issue discussed in the list recently, I will 
>>>>> merge later.
>>>>> 
>>>>> Florin Coras <fcoras.li...@gmail.com <mailto:fcoras.li...@gmail.com>> 
>>>>> 于2023年3月21日周二 11:56写道：
>>>>>> Hi, 
>>>>>> 
>>>>>> That last thing is pretty interesting. It’s either the issue fixed by 
>>>>>> this patch [1] or sessions are somehow cleaned up multiple times. If 
>>>>>> it’s the latter, I’d really like to understand how that happens. 
>>>>>> 
>>>>>> Regards,
>>>>>> Florin
>>>>>> 
>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/38507 
>>>>>> 
>>>>>>> On Mar 20, 2023, at 6:52 PM, Zhang Dongya <fortitude.zh...@gmail.com 
>>>>>>> <mailto:fortitude.zh...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> After merge this patch and update the test environment, the issue still 
>>>>>>> persists.
>>>>>>> 
>>>>>>> Let me clear my client app config:
>>>>>>> 1. register a reset callback, which will call vnet_disconnect there and 
>>>>>>> also trigger reconnect by send event to the ctrl process.)
>>>>>>> 2. register a connected callback, which will handle connect err by 
>>>>>>> trigger reconnect, on success, it will record session handle and 
>>>>>>> extract tcp sequence for our app usage.
>>>>>>> 3. register a disconnect callback, which basically do same as reset 
>>>>>>> callback.
>>>>>>> 4. register a cleanup callback and accept callback, which basically 
>>>>>>> make the session layer happy without actually relevant work to do.
>>>>>>> 
>>>>>>> There is a ctrl process in mater, which will handle periodically 
>>>>>>> reconnect or triggered by event.
>>>>>>> 
>>>>>>> BTW, I also see frequently warning 'session %u hash delete rv -3' in 
>>>>>>> session_delete in my environment, hope this helps to investigate.
>>>>>>> 
>>>>>>> Florin Coras <fcoras.li...@gmail.com <mailto:fcoras.li...@gmail.com>> 
>>>>>>> 于2023年3月20日周一 23:29写道：
>>>>>>>> Hi, 
>>>>>>>> 
>>>>>>>> Understood and yes, connect will synchronously fail if port is not 
>>>>>>>> available, so you should be able to retry it later. 
>>>>>>>> 
>>>>>>>> Regards, 
>>>>>>>> Florin
>>>>>>>> 
>>>>>>>>> On Mar 20, 2023, at 1:58 AM, Zhang Dongya <fortitude.zh...@gmail.com 
>>>>>>>>> <mailto:fortitude.zh...@gmail.com>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> It seems the issue occurs when there are disconnect called because 
>>>>>>>>> our network can't guarantee a tcp can't be reset even when 3 ways 
>>>>>>>>> handshake is completed (firewall issue :( ).
>>>>>>>>> 
>>>>>>>>> When we find the app layer timeout, we will first disconnect (because 
>>>>>>>>> we record the session handle, this session might be a half open 
>>>>>>>>> session), does vnet session layer guarantee that if we reconnect from 
>>>>>>>>> master thread when the half open session still not be released yet 
>>>>>>>>> (due to asynchronous logic) that the reconnect fail? if then we can 
>>>>>>>>> retry connect later.
>>>>>>>>> 
>>>>>>>>> I prefer to not registered half open callback because I think it make 
>>>>>>>>> app complicated from a TCP programming prospective.
>>>>>>>>> 
>>>>>>>>> For your patch, I think it should be work because I can't delete the 
>>>>>>>>> half open session immediately because there is worker configured, so 
>>>>>>>>> the half open will be removed from bihash when syn retrans timeout. I 
>>>>>>>>> have merged the patch and will provide feedback later.
>>>>>>>>> 
>>>>>>>>> Florin Coras <fcoras.li...@gmail.com <mailto:fcoras.li...@gmail.com>> 
>>>>>>>>> 于2023年3月20日周一 13:09写道：
>>>>>>>>>> Hi, 
>>>>>>>>>> 
>>>>>>>>>> Inline.
>>>>>>>>>> 
>>>>>>>>>>> On Mar 19, 2023, at 6:47 PM, Zhang Dongya 
>>>>>>>>>>> <fortitude.zh...@gmail.com <mailto:fortitude.zh...@gmail.com>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> It can be aborted both in established state or half open state 
>>>>>>>>>>> because I will do timeout in our app layer. 
>>>>>>>>>> 
>>>>>>>>>> [fc] Okay! Is the issue present irrespective of the state of the 
>>>>>>>>>> session or does it happen only after a disconnect in hanf-open 
>>>>>>>>>> state? More lower. 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Regarding your question,
>>>>>>>>>>> 
>>>>>>>>>>> - Yes we add a builtin in app relys on C apis that  mainly use 
>>>>>>>>>>> vnet_connect/disconnect to connect or disconnect session.
>>>>>>>>>> 
>>>>>>>>>> [fc] Understood
>>>>>>>>>> 
>>>>>>>>>>> - We call these api in a vpp ctrl process which should be running 
>>>>>>>>>>> on the master thread, we never do session setup/teardown on worker 
>>>>>>>>>>> thread. (the environment that found this issue is configured with 1 
>>>>>>>>>>> master + 1 worker setup.)
>>>>>>>>>> 
>>>>>>>>>> [fc] With vpp latest it’s possible to connect from first workers. 
>>>>>>>>>> It’s an optimization meant to avoid 1) worker barrier on syns and 2) 
>>>>>>>>>> entering poll mode on main (consume less cpu)
>>>>>>>>>> 
>>>>>>>>>>> - We started to develop the app using 22.06 and I keep to merge 
>>>>>>>>>>> upstream changes to latest vpp by cherry-picking. The reason for 
>>>>>>>>>>> line mismatch is that I added some comment to the session layer 
>>>>>>>>>>> code, it should be equal to the master branch now.
>>>>>>>>>> 
>>>>>>>>>> [fc] Ack
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> When reading the code I understand that we mainly want to cleanup 
>>>>>>>>>>> half open from bihash in session_stream_connect_notify, however, in 
>>>>>>>>>>> syn-sent state if I choose to close the session, the session might 
>>>>>>>>>>> be closed by my app due to session setup timeout (in second scale), 
>>>>>>>>>>> in that case, session will be marked as half_open_done and half 
>>>>>>>>>>> open session will be freed shortly in the ctrl thread (the 1st 
>>>>>>>>>>> worker?).
>>>>>>>>>> 
>>>>>>>>>> [fc] Actually, this might be the issue. We did start to provide a 
>>>>>>>>>> half-open session handle to apps which if closed does clean up the 
>>>>>>>>>> session but apparently it is missing the cleanup of the session 
>>>>>>>>>> lookup table. Could you try this patch [1]? It might need additional 
>>>>>>>>>> work.
>>>>>>>>>> 
>>>>>>>>>> Having said that, forcing a close/cleanup will not free the port 
>>>>>>>>>> synchronously. So, if you’re using fixed ports, you’ll have to wait 
>>>>>>>>>> for the half-open cleanup notification.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Should I also registered half open callback or there are some other 
>>>>>>>>>>> reason that lead to this failure?
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> [fc] Yes, see above.
>>>>>>>>>> 
>>>>>>>>>> Regards, 
>>>>>>>>>> Florin
>>>>>>>>>> 
>>>>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/38526
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Florin Coras <fcoras.li...@gmail.com 
>>>>>>>>>>> <mailto:fcoras.li...@gmail.com>> 于2023年3月20日周一 06:22写道：
>>>>>>>>>>>> Hi, 
>>>>>>>>>>>> 
>>>>>>>>>>>> When you abort the connection, is it fully established or 
>>>>>>>>>>>> half-open? Half-opens are cleaned up by the owner thread after a 
>>>>>>>>>>>> timeout, but the 5-tuple should be assigned to the fully 
>>>>>>>>>>>> established session by that point. 
>>>>>>>>>>>> tcp_half_open_connection_cleanup does not cleanup the bihash 
>>>>>>>>>>>> instead session_stream_connect_notify does once tcp connect 
>>>>>>>>>>>> returns either success or failure. 
>>>>>>>>>>>> 
>>>>>>>>>>>> So a few questions:
>>>>>>>>>>>> - is it accurate to assume you have a builtin vpp app and rely 
>>>>>>>>>>>> only on C apis to interact with host stack?
>>>>>>>>>>>> - on what thread (main or first worker) do you call vnet_connect?
>>>>>>>>>>>> - what api do you use to close the session? 
>>>>>>>>>>>> - what version of vpp is this because lines don’t match vpp latest?
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Florin
>>>>>>>>>>>> 
>>>>>>>>>>>> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya 
>>>>>>>>>>>> > <fortitude.zh...@gmail.com <mailto:fortitude.zh...@gmail.com>> 
>>>>>>>>>>>> > wrote:
>>>>>>>>>>>> > 
>>>>>>>>>>>> > Hi list,
>>>>>>>>>>>> > 
>>>>>>>>>>>> > recently in our application, we constantly triggered such abrt 
>>>>>>>>>>>> > issue which make our connectivity interrupt for a while:
>>>>>>>>>>>> > 
>>>>>>>>>>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, 
>>>>>>>>>>>> > PC 0x7fefd3b2000b
>>>>>>>>>>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: 
>>>>>>>>>>>> > /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 
>>>>>>>>>>>> > (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc0, b[0], 
>>>>>>>>>>>> > tcp_buffer_hdr (b[0]))' fails
>>>>>>>>>>>> > 
>>>>>>>>>>>> > Our scenario is quite simple, we will make 4 parallel tcp 
>>>>>>>>>>>> > connection (use 4 fixed source ports) to a remote vpp stack 
>>>>>>>>>>>> > (fixed ip and port), and will do some keepalive in our 
>>>>>>>>>>>> > application layer, since we only use the vpp tcp stack to make 
>>>>>>>>>>>> > the middle box happy with the connection, we do not use the data 
>>>>>>>>>>>> > transport of tcp statck actually.
>>>>>>>>>>>> > 
>>>>>>>>>>>> > However, since the network condition is complex, we have to  
>>>>>>>>>>>> > always need to abrt the connection and reconnect.
>>>>>>>>>>>> > 
>>>>>>>>>>>> > I keep to merge upstream session and tcp fix however the issue 
>>>>>>>>>>>> > still not fixed, what I found now it may be in some case 
>>>>>>>>>>>> > tcp_half_open_connection_cleanup may not deleted the half open 
>>>>>>>>>>>> > session from the lookup table (bihash) and the session index is 
>>>>>>>>>>>> > realloced by other connection.
>>>>>>>>>>>> > 
>>>>>>>>>>>> > Hope the list can provide some hint about how to overcome this 
>>>>>>>>>>>> > issue, thanks a lot.
>>>>>>>>>>>> > 
>>>>>>>>>>>> > 
>>>>>>>>>>>> > 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> 
> 
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#22759): https://lists.fd.io/g/vpp-dev/message/22759
Mute This Topic: https://lists.fd.io/mt/97707823/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/1480452/21656/631435203/xyzzy 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

Reply via email to