Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

Zhang Dongya Mon, 20 Mar 2023 18:53:17 -0700

Hi,

After merge this patch and update the test environment, the issue still
persists.


Let me clear my client app config:
1. register a reset callback, which will call vnet_disconnect there and
also trigger reconnect by send event to the ctrl process.)
2. register a connected callback, which will handle connect err by trigger
reconnect, on success, it will record session handle and extract tcp
sequence for our app usage.
3. register a disconnect callback, which basically do same as reset
callback.
4. register a cleanup callback and accept callback, which basically make
the session layer happy without actually relevant work to do.

There is a ctrl process in mater, which will handle periodically reconnect
or triggered by event.

BTW, I also see frequently warning 'session %u hash delete rv -3' in
session_delete in my environment, hope this helps to investigate.

Florin Coras <fcoras.li...@gmail.com> 于2023年3月20日周一 23:29写道：

> Hi,
>
> Understood and yes, connect will synchronously fail if port is not
> available, so you should be able to retry it later.
>
> Regards,
> Florin
>
> On Mar 20, 2023, at 1:58 AM, Zhang Dongya <fortitude.zh...@gmail.com>
> wrote:
>
> Hi,
>
> It seems the issue occurs when there are disconnect called because our
> network can't guarantee a tcp can't be reset even when 3 ways handshake is
> completed (firewall issue :( ).
>
> When we find the app layer timeout, we will first disconnect (because we
> record the session handle, this session might be a half open session), does
> vnet session layer guarantee that if we reconnect from master thread when
> the half open session still not be released yet (due to asynchronous logic)
> that the reconnect fail? if then we can retry connect later.
>
> I prefer to not registered half open callback because I think it make app
> complicated from a TCP programming prospective.
>
> For your patch, I think it should be work because I can't delete the half
> open session immediately because there is worker configured, so the half
> open will be removed from bihash when syn retrans timeout. I have merged
> the patch and will provide feedback later.
>
> Florin Coras <fcoras.li...@gmail.com> 于2023年3月20日周一 13:09写道：
>
>> Hi,
>>
>> Inline.
>>
>> On Mar 19, 2023, at 6:47 PM, Zhang Dongya <fortitude.zh...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> It can be aborted both in established state or half open state because I
>> will do timeout in our app layer.
>>
>>
>> [fc] Okay! Is the issue present irrespective of the state of the session
>> or does it happen only after a disconnect in hanf-open state? More lower.
>>
>>
>> Regarding your question,
>>
>> - Yes we add a builtin in app relys on C apis that  mainly use
>> vnet_connect/disconnect to connect or disconnect session.
>>
>>
>> [fc] Understood
>>
>> - We call these api in a vpp ctrl process which should be running on the
>> master thread, we never do session setup/teardown on worker thread. (the
>> environment that found this issue is configured with 1 master + 1 worker
>> setup.)
>>
>>
>> [fc] With vpp latest it’s possible to connect from first workers. It’s an
>> optimization meant to avoid 1) worker barrier on syns and 2) entering poll
>> mode on main (consume less cpu)
>>
>> - We started to develop the app using 22.06 and I keep to merge upstream
>> changes to latest vpp by cherry-picking. The reason for line mismatch is
>> that I added some comment to the session layer code, it should be equal to
>> the master branch now.
>>
>>
>> [fc] Ack
>>
>>
>> When reading the code I understand that we mainly want to cleanup half
>> open from bihash in session_stream_connect_notify, however, in syn-sent
>> state if I choose to close the session, the session might be closed by my
>> app due to session setup timeout (in second scale), in that case, session
>> will be marked as half_open_done and half open session will be freed
>> shortly in the ctrl thread (the 1st worker?).
>>
>>
>> [fc] Actually, this might be the issue. We did start to provide a
>> half-open session handle to apps which if closed does clean up the session
>> but apparently it is missing the cleanup of the session lookup table. Could
>> you try this patch [1]? It might need additional work.
>>
>> Having said that, forcing a close/cleanup will not free the port
>> synchronously. So, if you’re using fixed ports, you’ll have to wait for the
>> half-open cleanup notification.
>>
>>
>> Should I also registered half open callback or there are some other
>> reason that lead to this failure?
>>
>>
>> [fc] Yes, see above.
>>
>> Regards,
>> Florin
>>
>> [1] https://gerrit.fd.io/r/c/vpp/+/38526
>>
>>
>> Florin Coras <fcoras.li...@gmail.com> 于2023年3月20日周一 06:22写道：
>>
>>> Hi,
>>>
>>> When you abort the connection, is it fully established or half-open?
>>> Half-opens are cleaned up by the owner thread after a timeout, but the
>>> 5-tuple should be assigned to the fully established session by that point.
>>> tcp_half_open_connection_cleanup does not cleanup the bihash instead
>>> session_stream_connect_notify does once tcp connect returns either success
>>> or failure.
>>>
>>> So a few questions:
>>> - is it accurate to assume you have a builtin vpp app and rely only on C
>>> apis to interact with host stack?
>>> - on what thread (main or first worker) do you call vnet_connect?
>>> - what api do you use to close the session?
>>> - what version of vpp is this because lines don’t match vpp latest?
>>>
>>> Regards,
>>> Florin
>>>
>>> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya <fortitude.zh...@gmail.com>
>>> wrote:
>>> >
>>> > Hi list,
>>> >
>>> > recently in our application, we constantly triggered such abrt issue
>>> which make our connectivity interrupt for a while:
>>> >
>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC
>>> 0x7fefd3b2000b
>>> > Mar 19 16:11:26 ubuntu vnet[2565933]:
>>> /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline)
>>> assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails
>>> >
>>> > Our scenario is quite simple, we will make 4 parallel tcp connection
>>> (use 4 fixed source ports) to a remote vpp stack (fixed ip and port), and
>>> will do some keepalive in our application layer, since we only use the vpp
>>> tcp stack to make the middle box happy with the connection, we do not use
>>> the data transport of tcp statck actually.
>>> >
>>> > However, since the network condition is complex, we have to  always
>>> need to abrt the connection and reconnect.
>>> >
>>> > I keep to merge upstream session and tcp fix however the issue still
>>> not fixed, what I found now it may be in some case
>>> tcp_half_open_connection_cleanup may not deleted the half open session from
>>> the lookup table (bihash) and the session index is realloced by other
>>> connection.
>>> >
>>> > Hope the list can provide some hint about how to overcome this issue,
>>> thanks a lot.
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
> 
>
>

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#22735): https://lists.fd.io/g/vpp-dev/message/22735
Mute This Topic: https://lists.fd.io/mt/97707823/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/leave/1480452/21656/631435203/xyzzy 
[arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-

Re: [vpp-dev] Sigabrt in tcp46_input_inline for tcp_lookup_is_valid

Reply via email to