Hi,

The new patch works as expected; no assert-triggered abort anymore.
Really appreciate your help, thanks a lot.

On Wed, Mar 22, 2023 at 11:54, Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Zhang,
>
> Awesome! Thanks!
>
> Regards,
> Florin
>
> On Mar 21, 2023, at 7:41 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>
> Hi Florin,
>
> Thanks a lot. The previous patch, with reset disabled, has been running for a day without issue.
>
> I will now enable reset with your new patch and provide feedback later.
>
> On Wed, Mar 22, 2023 at 02:12, Florin Coras <fcoras.li...@gmail.com> wrote:
>
>> Hi,
>>
>> Okay, resetting of half-opens is definitely not supported. I updated the patch to just clean them up on forced reset, without sending a reset, so that the session lookup table cleanup still happens.
>>
>> Regards,
>> Florin
>>
>> On Mar 20, 2023, at 9:13 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>
>> Hi,
>>
>> After reviewing my code, I found that I had added a flag to the vnet_disconnect API which makes it call session_reset instead of session_close. The reason I do this is so that intermediate firewalls flush their state and rebuild it when I later reconnect.
>>
>> It seems the session_reset logic is also missing the removal of a half-open session from the lookup hash, which may cause the issue too.
>>
>> I have changed my code and will test along with your patch; I will provide feedback later.
>>
>> I also noticed the bihash issue discussed on the list recently; I will merge that later.
>>
>> On Tue, Mar 21, 2023 at 11:56, Florin Coras <fcoras.li...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> That last thing is pretty interesting. It's either the issue fixed by this patch [1] or sessions are somehow cleaned up multiple times. If it's the latter, I'd really like to understand how that happens.
>>>
>>> Regards,
>>> Florin
>>>
>>> [1] https://gerrit.fd.io/r/c/vpp/+/38507
>>>
>>> On Mar 20, 2023, at 6:52 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> After merging this patch and updating the test environment, the issue still persists.
>>>
>>> To clarify my client app configuration:
>>> 1. It registers a reset callback, which calls vnet_disconnect and also triggers a reconnect by sending an event to the ctrl process.
>>> 2. It registers a connected callback, which handles connect errors by triggering a reconnect; on success it records the session handle and extracts the tcp sequence for our app's usage.
>>> 3. It registers a disconnect callback, which does basically the same as the reset callback.
>>> 4. It registers a cleanup callback and an accept callback, which basically keep the session layer happy without doing any relevant work.
>>>
>>> There is a ctrl process on the master thread, which handles periodic reconnects or reconnects triggered by an event.
>>>
>>> BTW, I also frequently see the warning 'session %u hash delete rv -3' from session_delete in my environment; hope this helps the investigation.
>>>
>>> On Mon, Mar 20, 2023 at 23:29, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Understood, and yes, connect will synchronously fail if the port is not available, so you should be able to retry it later.
>>>>
>>>> Regards,
>>>> Florin
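For reference, a minimal sketch of that synchronous-failure/retry pattern, assuming the vnet_connect()/vnet_connect_args_t builtin-app C api as used by vpp's own test apps; the glx_-prefixed type and helper are hypothetical application code, and field names can differ between vpp releases:

    #include <vnet/session/application_interface.h>

    /* Hypothetical per-connection state kept by the app (illustration only). */
    typedef struct
    {
      u32 app_index;              /* returned by vnet_application_attach */
      u32 conn_id;                /* echoed back as 'opaque' in the connected callback */
      session_endpoint_cfg_t sep; /* fixed source port + remote ip/port */
    } glx_conn_t;

    /* Hypothetical helper that arms a retry timer/event in the ctrl process. */
    static void glx_schedule_reconnect (glx_conn_t *c, f64 delay);

    static int
    glx_try_connect (glx_conn_t *c)
    {
      vnet_connect_args_t _a = {}, *a = &_a;
      int rv;

      a->app_index = c->app_index;
      a->api_context = c->conn_id;
      clib_memcpy (&a->sep_ext, &c->sep, sizeof (c->sep));

      /* vnet_connect fails synchronously if, e.g., the fixed source port is
       * still held by a session that has not been cleaned up yet. */
      if ((rv = vnet_connect (a)))
        {
          glx_schedule_reconnect (c, 1.0 /* seconds, arbitrary */);
          return rv;
        }
      /* Handshake success/failure is reported later via the connected callback. */
      return 0;
    }

The fixed source port only becomes usable again once the old session, half-open included, has actually been cleaned up, which is what the rest of this thread is about.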
>>>> On Mar 20, 2023, at 1:58 AM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It seems the issue occurs when disconnects are involved, because our network can't guarantee that a tcp connection won't be reset even after the 3-way handshake completes (firewall issue :( ).
>>>>
>>>> When we detect an app-layer timeout, we first disconnect (because we record the session handle, the session might be a half-open session). Does the vnet session layer guarantee that, if we reconnect from the master thread while the half-open session has not yet been released (due to the asynchronous logic), the reconnect fails? If so, we can retry the connect later.
>>>>
>>>> I prefer not to register the half-open callback, because I think it makes the app complicated from a TCP programming perspective.
>>>>
>>>> As for your patch, I think it should work, because I can't delete the half-open session immediately when a worker is configured, so the half-open will be removed from the bihash when the syn retransmission times out. I have merged the patch and will provide feedback later.
>>>>
>>>> On Mon, Mar 20, 2023 at 13:09, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Inline.
>>>>>
>>>>> On Mar 19, 2023, at 6:47 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> It can be aborted both in the established state and in the half-open state, because we apply timeouts at our app layer.
>>>>>
>>>>> [fc] Okay! Is the issue present irrespective of the state of the session, or does it happen only after a disconnect in half-open state? More below.
>>>>>
>>>>> Regarding your questions:
>>>>>
>>>>> - Yes, we added a builtin app that relies on the C apis, mainly vnet_connect/disconnect, to connect and disconnect sessions.
>>>>>
>>>>> [fc] Understood.
>>>>>
>>>>> - We call these apis in a vpp ctrl process, which should be running on the master thread; we never do session setup/teardown on a worker thread. (The environment where this issue was found is configured with 1 master + 1 worker.)
>>>>>
>>>>> [fc] With vpp latest it's possible to connect from the first worker. It's an optimization meant to avoid 1) worker barrier on syns and 2) entering poll mode on main (consume less cpu).
>>>>>
>>>>> - We started developing the app on 22.06 and I keep merging upstream changes up to the latest vpp by cherry-picking. The reason for the line mismatch is that I added some comments to the session layer code; it should be equal to the master branch now.
>>>>>
>>>>> [fc] Ack.
>>>>>
>>>>> Reading the code, I understand that the half-open is mainly cleaned up from the bihash in session_stream_connect_notify. However, in syn-sent state the session might be closed by my app due to a session setup timeout (on the scale of seconds); in that case the session will be marked as half_open_done and the half-open session will be freed shortly on the ctrl thread (the first worker?).
>>>>>
>>>>> [fc] Actually, this might be the issue. We did start to provide a half-open session handle to apps which, if closed, does clean up the session, but apparently it is missing the cleanup of the session lookup table. Could you try this patch [1]? It might need additional work.
>>>>>
>>>>> Having said that, forcing a close/cleanup will not free the port synchronously. So, if you're using fixed ports, you'll have to wait for the half-open cleanup notification.
>>>>>
>>>>> Should I also register the half-open callback, or is there some other reason that leads to this failure?
>>>>>
>>>>> [fc] Yes, see above.
>>>>>
>>>>> Regards,
>>>>> Florin
>>>>>
>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/38526
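For reference, the callback wiring discussed in this thread (reset, connected, disconnect, cleanup, accept) roughly follows the session_cb_vft_t / vnet_application_attach pattern used by vpp's builtin apps. A rough sketch under that assumption; the glx_* handlers are placeholders and exact callback signatures vary slightly across vpp versions:

    #include <vnet/session/application.h>
    #include <vnet/session/application_interface.h>
    #include <vnet/session/session.h>

    /* Placeholder handlers: the real ones would record the session handle,
     * extract the tcp sequence, and trigger reconnects via the ctrl process. */
    static int glx_connected_cb (u32 app_wrk_index, u32 opaque, session_t *s,
                                 session_error_t err);
    static void glx_disconnect_cb (session_t *s);
    static void glx_reset_cb (session_t *s);
    static int glx_accept_cb (session_t *s);
    static void glx_cleanup_cb (session_t *s, session_cleanup_ntf_t ntf);
    static int glx_rx_cb (session_t *s);

    static session_cb_vft_t glx_session_cb_vft = {
      .session_connected_callback = glx_connected_cb,
      .session_disconnect_callback = glx_disconnect_cb,
      .session_reset_callback = glx_reset_cb,
      .session_accept_callback = glx_accept_cb,
      .session_cleanup_callback = glx_cleanup_cb,
      .builtin_app_rx_callback = glx_rx_cb,
    };

    static int
    glx_attach (void)
    {
      vnet_app_attach_args_t _a = {}, *a = &_a;
      u64 options[APP_OPTIONS_N_OPTIONS] = {};
      int rv;

      a->api_client_index = ~0; /* builtin app, no binary api client */
      a->name = format (0, "glx-keepalive");
      a->session_cb_vft = &glx_session_cb_vft;
      a->options = options;
      a->options[APP_OPTIONS_FLAGS] = APP_OPTIONS_FLAGS_IS_BUILTIN;
      a->options[APP_OPTIONS_SEGMENT_SIZE] = 2 << 20;

      rv = vnet_application_attach (a);
      vec_free (a->name);
      return rv;
    }

The accept and cleanup handlers can essentially be no-ops here, matching the "keep the session layer happy" role described earlier in the thread.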
>>>>> On Mon, Mar 20, 2023 at 06:22, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> When you abort the connection, is it fully established or half-open? Half-opens are cleaned up by the owner thread after a timeout, but the 5-tuple should be assigned to the fully established session by that point. tcp_half_open_connection_cleanup does not clean up the bihash; instead, session_stream_connect_notify does, once tcp connect returns either success or failure.
>>>>>>
>>>>>> So a few questions:
>>>>>> - is it accurate to assume you have a builtin vpp app and rely only on C apis to interact with the host stack?
>>>>>> - on what thread (main or first worker) do you call vnet_connect?
>>>>>> - what api do you use to close the session?
>>>>>> - what version of vpp is this, because the lines don't match vpp latest?
>>>>>>
>>>>>> Regards,
>>>>>> Florin
>>>>>>
>>>>>> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi list,
>>>>>> >
>>>>>> > Recently our application has constantly been triggering the abort below, which interrupts our connectivity for a while:
>>>>>> >
>>>>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC 0x7fefd3b2000b
>>>>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails
>>>>>> >
>>>>>> > Our scenario is quite simple: we make 4 parallel tcp connections (using 4 fixed source ports) to a remote vpp stack (fixed ip and port) and do some keepalive at our application layer. Since we only use the vpp tcp stack to keep the middle boxes happy with the connection, we do not actually use the data transport of the tcp stack.
>>>>>> >
>>>>>> > However, since the network conditions are complex, we frequently have to abort the connection and reconnect.
>>>>>> >
>>>>>> > I keep merging upstream session and tcp fixes, but the issue is still not fixed. What I have found so far is that in some cases tcp_half_open_connection_cleanup may not delete the half-open session from the lookup table (bihash) while the session index is reallocated to another connection.
>>>>>> >
>>>>>> > Hope the list can provide some hints on how to overcome this issue, thanks a lot.
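Likewise, the "ctrl process that reconnects periodically or when signaled" mentioned above can be built on the standard vlib process node pattern. A rough sketch, where GLX_EVT_RECONNECT and glx_reconnect_all() are invented names:

    #include <vlib/vlib.h>

    #define GLX_EVT_RECONNECT 1 /* hypothetical event id used by the callbacks */

    /* Hypothetical helper: walks the app's connections and re-issues
     * vnet_connect for the ones that are down. */
    static void glx_reconnect_all (void);

    static uword
    glx_ctrl_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f)
    {
      uword *event_data = 0, event_type;

      while (1)
        {
          /* Wake up every few seconds, or earlier if a callback signals us. */
          vlib_process_wait_for_event_or_clock (vm, 3.0);
          event_type = vlib_process_get_events (vm, &event_data);

          switch (event_type)
            {
            case ~0:                /* clock timeout: periodic check */
            case GLX_EVT_RECONNECT: /* reset/disconnect callback asked for it */
              glx_reconnect_all ();
              break;
            default:
              break;
            }
          vec_reset_length (event_data);
        }
      return 0;
    }

    VLIB_REGISTER_NODE (glx_ctrl_process_node) = {
      .function = glx_ctrl_process,
      .type = VLIB_NODE_TYPE_PROCESS,
      .name = "glx-ctrl-process",
    };

The reset/disconnect callbacks would then only signal this process with vlib_process_signal_event (or its _mt variant when invoked from a worker) instead of reconnecting inline, keeping all connects and disconnects on one thread.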