Hi,

The new patch works as expected; no assert-triggered abort anymore.
Really appreciate your help, thanks a lot.

On Wed, Mar 22, 2023 at 11:54, Florin Coras <fcoras.li...@gmail.com> wrote:

> Hi Zhang,
>
> Awesome! Thanks!
>
> Regards,
> Florin
>
> On Mar 21, 2023, at 7:41 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>
> Hi Florin,
>
> Thanks a lot. The previous patch, with reset disabled, has been running for a day without issue.
>
> I will now enable reset with your new patch and provide feedback later.
>
> On Wed, Mar 22, 2023 at 02:12, Florin Coras <fcoras.li...@gmail.com> wrote:
>
>> Hi,
>>
>> Okay, resetting of half-opens is definitely not supported. I updated the patch to just clean them up on forced reset, without sending a reset, so that the session lookup table cleanup still happens.
>>
>> Regards,
>> Florin
>>
>> On Mar 20, 2023, at 9:13 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>
>> Hi,
>>
>> After reviewing my code, I found that I had added a flag to the vnet_disconnect API which makes it call session_reset instead of session_close. The reason I do this is so that intermediate firewalls flush their state and rebuild it when I later reconnect.
>>
>> It seems the session_reset logic is also missing the removal of a half-open session from the lookup hash, which may cause the issue too.
>>
>> I have changed my code and will test along with your patch; I will provide feedback later.
>>
>> I also noticed the bihash issue discussed on the list recently; I will merge that later.
>>
>> On Tue, Mar 21, 2023 at 11:56, Florin Coras <fcoras.li...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> That last thing is pretty interesting. It's either the issue fixed by this patch [1] or sessions are somehow cleaned up multiple times. If it's the latter, I'd really like to understand how that happens.
>>>
>>> Regards,
>>> Florin
>>>
>>> [1] https://gerrit.fd.io/r/c/vpp/+/38507
>>>
>>> On Mar 20, 2023, at 6:52 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> After merging this patch and updating the test environment, the issue still persists.
>>>
>>> To clarify my client app configuration:
>>> 1. It registers a reset callback, which calls vnet_disconnect and also triggers a reconnect by sending an event to the ctrl process.
>>> 2. It registers a connected callback, which handles connect errors by triggering a reconnect; on success it records the session handle and extracts the tcp sequence for our app's usage.
>>> 3. It registers a disconnect callback, which does basically the same as the reset callback.
>>> 4. It registers a cleanup callback and an accept callback, which basically keep the session layer happy without doing any relevant work.
>>>
>>> There is a ctrl process on the master thread, which handles periodic reconnects or reconnects triggered by an event.
>>>
>>> BTW, I also frequently see the warning 'session %u hash delete rv -3' from session_delete in my environment; hope this helps the investigation.
>>>
>>> On Mon, Mar 20, 2023 at 23:29, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Understood, and yes, connect will synchronously fail if the port is not available, so you should be able to retry it later.
>>>>
>>>> Regards,
>>>> Florin
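For reference, a minimal sketch of that synchronous-failure/retry pattern, assuming the vnet_connect()/vnet_connect_args_t builtin-app C api as used by vpp's own test apps; the glx_-prefixed type and helper are hypothetical application code, and field names can differ between vpp releases:

    #include <vnet/session/application_interface.h>

    /* Hypothetical per-connection state kept by the app (illustration only). */
    typedef struct
    {
      u32 app_index;              /* returned by vnet_application_attach */
      u32 conn_id;                /* echoed back as 'opaque' in the connected callback */
      session_endpoint_cfg_t sep; /* fixed source port + remote ip/port */
    } glx_conn_t;

    /* Hypothetical helper that arms a retry timer/event in the ctrl process. */
    static void glx_schedule_reconnect (glx_conn_t *c, f64 delay);

    static int
    glx_try_connect (glx_conn_t *c)
    {
      vnet_connect_args_t _a = {}, *a = &_a;
      int rv;

      a->app_index = c->app_index;
      a->api_context = c->conn_id;
      clib_memcpy (&a->sep_ext, &c->sep, sizeof (c->sep));

      /* vnet_connect fails synchronously if, e.g., the fixed source port is
       * still held by a session that has not been cleaned up yet. */
      if ((rv = vnet_connect (a)))
        {
          glx_schedule_reconnect (c, 1.0 /* seconds, arbitrary */);
          return rv;
        }
      /* Handshake success/failure is reported later via the connected callback. */
      return 0;
    }

The fixed source port only becomes usable again once the old session, half-open included, has actually been cleaned up, which is what the rest of this thread is about.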
>>>> On Mar 20, 2023, at 1:58 AM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> It seems the issue occurs when disconnects are involved, because our network can't guarantee that a tcp connection won't be reset even after the 3-way handshake completes (firewall issue :( ).
>>>>
>>>> When we detect an app-layer timeout, we first disconnect (because we record the session handle, the session might be a half-open session). Does the vnet session layer guarantee that, if we reconnect from the master thread while the half-open session has not yet been released (due to the asynchronous logic), the reconnect fails? If so, we can retry the connect later.
>>>>
>>>> I prefer not to register the half-open callback, because I think it makes the app complicated from a TCP programming perspective.
>>>>
>>>> As for your patch, I think it should work, because I can't delete the half-open session immediately when a worker is configured, so the half-open will be removed from the bihash when the syn retransmission times out. I have merged the patch and will provide feedback later.
>>>>
>>>> On Mon, Mar 20, 2023 at 13:09, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Inline.
>>>>>
>>>>> On Mar 19, 2023, at 6:47 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> It can be aborted both in the established state and in the half-open state, because we apply timeouts at our app layer.
>>>>>
>>>>> [fc] Okay! Is the issue present irrespective of the state of the session, or does it happen only after a disconnect in half-open state? More below.
>>>>>
>>>>> Regarding your questions:
>>>>>
>>>>> - Yes, we added a builtin app that relies on the C apis, mainly vnet_connect/disconnect, to connect and disconnect sessions.
>>>>>
>>>>> [fc] Understood.
>>>>>
>>>>> - We call these apis in a vpp ctrl process, which should be running on the master thread; we never do session setup/teardown on a worker thread. (The environment where this issue was found is configured with 1 master + 1 worker.)
>>>>>
>>>>> [fc] With vpp latest it's possible to connect from the first worker. It's an optimization meant to avoid 1) worker barrier on syns and 2) entering poll mode on main (consume less cpu).
>>>>>
>>>>> - We started developing the app on 22.06 and I keep merging upstream changes up to the latest vpp by cherry-picking. The reason for the line mismatch is that I added some comments to the session layer code; it should be equal to the master branch now.
>>>>>
>>>>> [fc] Ack.
>>>>>
>>>>> Reading the code, I understand that the half-open is mainly cleaned up from the bihash in session_stream_connect_notify. However, in syn-sent state the session might be closed by my app due to a session setup timeout (on the scale of seconds); in that case the session will be marked as half_open_done and the half-open session will be freed shortly on the ctrl thread (the first worker?).
>>>>>
>>>>> [fc] Actually, this might be the issue. We did start to provide a half-open session handle to apps which, if closed, does clean up the session, but apparently it is missing the cleanup of the session lookup table. Could you try this patch [1]? It might need additional work.
>>>>>
>>>>> Having said that, forcing a close/cleanup will not free the port synchronously. So, if you're using fixed ports, you'll have to wait for the half-open cleanup notification.
>>>>>
>>>>> Should I also register the half-open callback, or is there some other reason that leads to this failure?
>>>>>
>>>>> [fc] Yes, see above.
>>>>>
>>>>> Regards,
>>>>> Florin
>>>>>
>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/38526
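For reference, the callback wiring discussed in this thread (reset, connected, disconnect, cleanup, accept) roughly follows the session_cb_vft_t / vnet_application_attach pattern used by vpp's builtin apps. A rough sketch under that assumption; the glx_* handlers are placeholders and exact callback signatures vary slightly across vpp versions:

    #include <vnet/session/application.h>
    #include <vnet/session/application_interface.h>
    #include <vnet/session/session.h>

    /* Placeholder handlers: the real ones would record the session handle,
     * extract the tcp sequence, and trigger reconnects via the ctrl process. */
    static int glx_connected_cb (u32 app_wrk_index, u32 opaque, session_t *s,
                                 session_error_t err);
    static void glx_disconnect_cb (session_t *s);
    static void glx_reset_cb (session_t *s);
    static int glx_accept_cb (session_t *s);
    static void glx_cleanup_cb (session_t *s, session_cleanup_ntf_t ntf);
    static int glx_rx_cb (session_t *s);

    static session_cb_vft_t glx_session_cb_vft = {
      .session_connected_callback = glx_connected_cb,
      .session_disconnect_callback = glx_disconnect_cb,
      .session_reset_callback = glx_reset_cb,
      .session_accept_callback = glx_accept_cb,
      .session_cleanup_callback = glx_cleanup_cb,
      .builtin_app_rx_callback = glx_rx_cb,
    };

    static int
    glx_attach (void)
    {
      vnet_app_attach_args_t _a = {}, *a = &_a;
      u64 options[APP_OPTIONS_N_OPTIONS] = {};
      int rv;

      a->api_client_index = ~0; /* builtin app, no binary api client */
      a->name = format (0, "glx-keepalive");
      a->session_cb_vft = &glx_session_cb_vft;
      a->options = options;
      a->options[APP_OPTIONS_FLAGS] = APP_OPTIONS_FLAGS_IS_BUILTIN;
      a->options[APP_OPTIONS_SEGMENT_SIZE] = 2 << 20;

      rv = vnet_application_attach (a);
      vec_free (a->name);
      return rv;
    }

The accept and cleanup handlers can essentially be no-ops here, matching the "keep the session layer happy" role described earlier in the thread.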
>>>>> On Mon, Mar 20, 2023 at 06:22, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> When you abort the connection, is it fully established or half-open? Half-opens are cleaned up by the owner thread after a timeout, but the 5-tuple should be assigned to the fully established session by that point. tcp_half_open_connection_cleanup does not clean up the bihash; instead, session_stream_connect_notify does, once tcp connect returns either success or failure.
>>>>>>
>>>>>> So a few questions:
>>>>>> - is it accurate to assume you have a builtin vpp app and rely only on C apis to interact with the host stack?
>>>>>> - on what thread (main or first worker) do you call vnet_connect?
>>>>>> - what api do you use to close the session?
>>>>>> - what version of vpp is this, because the lines don't match vpp latest?
>>>>>>
>>>>>> Regards,
>>>>>> Florin
>>>>>>
>>>>>> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi list,
>>>>>> >
>>>>>> > Recently our application has constantly been triggering the abort below, which interrupts our connectivity for a while:
>>>>>> >
>>>>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC 0x7fefd3b2000b
>>>>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails
>>>>>> >
>>>>>> > Our scenario is quite simple: we make 4 parallel tcp connections (using 4 fixed source ports) to a remote vpp stack (fixed ip and port) and do some keepalive at our application layer. Since we only use the vpp tcp stack to keep the middle boxes happy with the connection, we do not actually use the data transport of the tcp stack.
>>>>>> >
>>>>>> > However, since the network conditions are complex, we frequently have to abort the connection and reconnect.
>>>>>> >
>>>>>> > I keep merging upstream session and tcp fixes, but the issue is still not fixed. What I have found so far is that in some cases tcp_half_open_connection_cleanup may not delete the half-open session from the lookup table (bihash) while the session index is reallocated to another connection.
>>>>>> >
>>>>>> > Hope the list can provide some hints on how to overcome this issue, thanks a lot.
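Likewise, the "ctrl process that reconnects periodically or when signaled" mentioned above can be built on the standard vlib process node pattern. A rough sketch, where GLX_EVT_RECONNECT and glx_reconnect_all() are invented names:

    #include <vlib/vlib.h>

    #define GLX_EVT_RECONNECT 1 /* hypothetical event id used by the callbacks */

    /* Hypothetical helper: walks the app's connections and re-issues
     * vnet_connect for the ones that are down. */
    static void glx_reconnect_all (void);

    static uword
    glx_ctrl_process (vlib_main_t *vm, vlib_node_runtime_t *rt, vlib_frame_t *f)
    {
      uword *event_data = 0, event_type;

      while (1)
        {
          /* Wake up every few seconds, or earlier if a callback signals us. */
          vlib_process_wait_for_event_or_clock (vm, 3.0);
          event_type = vlib_process_get_events (vm, &event_data);

          switch (event_type)
            {
            case ~0:                /* clock timeout: periodic check */
            case GLX_EVT_RECONNECT: /* reset/disconnect callback asked for it */
              glx_reconnect_all ();
              break;
            default:
              break;
            }
          vec_reset_length (event_data);
        }
      return 0;
    }

    VLIB_REGISTER_NODE (glx_ctrl_process_node) = {
      .function = glx_ctrl_process,
      .type = VLIB_NODE_TYPE_PROCESS,
      .name = "glx-ctrl-process",
    };

The reset/disconnect callbacks would then only signal this process with vlib_process_signal_event (or its _mt variant when invoked from a worker) instead of reconnecting inline, keeping all connects and disconnects on one thread.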