Hi Zhang,

Thanks for confirming! Give me a few more days to check whether there are any other improvements to be made in that area.

Regards,
Florin

> On Mar 23, 2023, at 12:00 AM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>
> Hi,
>
> The new patch works as expected; no assert-triggered abort anymore.
>
> I really appreciate your help, thanks a lot.
>
> Florin Coras <fcoras.li...@gmail.com> wrote on Wed, Mar 22, 2023 at 11:54:
>> Hi Zhang,
>>
>> Awesome! Thanks!
>>
>> Regards,
>> Florin
>>
>>> On Mar 21, 2023, at 7:41 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> Thanks a lot. The previous patch, with reset disabled, has been running for 1 day without issue.
>>>
>>> I will enable reset together with your new patch and will provide feedback later.
>>>
>>> Florin Coras <fcoras.li...@gmail.com> wrote on Wed, Mar 22, 2023 at 02:12:
>>>> Hi,
>>>>
>>>> Okay, resetting of half-opens is definitely not supported. I updated the patch to just clean them up on forced reset, without sending a reset, to make sure session lookup table cleanup still happens.
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>>> On Mar 20, 2023, at 9:13 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> After reviewing my code, I found that I had added a flag to the vnet_disconnect API which calls session_reset instead of session_close. The reason I did this is to make the intermediate firewall just flush its state and rebuild it if I later reconnect.
>>>>>
>>>>> It seems that the session_reset logic, for a half-open session, also fails to remove the session from the lookup hash, which may cause the issue too.
>>>>>
>>>>> I have changed my code and will test it along with your patch; I will provide feedback later.
>>>>>
>>>>> I also noticed the bihash issue discussed on the list recently; I will merge that later.
>>>>>
>>>>> Florin Coras <fcoras.li...@gmail.com> wrote on Tue, Mar 21, 2023 at 11:56:
>>>>>> Hi,
>>>>>>
>>>>>> That last thing is pretty interesting. It’s either the issue fixed by this patch [1] or sessions are somehow cleaned up multiple times. If it’s the latter, I’d really like to understand how that happens.
>>>>>>
>>>>>> Regards,
>>>>>> Florin
>>>>>>
>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/38507
>>>>>>
>>>>>>> On Mar 20, 2023, at 6:52 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> After merging this patch and updating the test environment, the issue still persists.
>>>>>>>
>>>>>>> Let me clarify my client app config:
>>>>>>> 1. Register a reset callback, which calls vnet_disconnect and also triggers a reconnect by sending an event to the ctrl process.
>>>>>>> 2. Register a connected callback, which handles a connect error by triggering a reconnect; on success, it records the session handle and extracts the tcp sequence for our app's use.
>>>>>>> 3. Register a disconnect callback, which basically does the same as the reset callback.
>>>>>>> 4. Register a cleanup callback and an accept callback, which basically keep the session layer happy without having any relevant work to do.
>>>>>>>
>>>>>>> There is a ctrl process on the master thread, which handles reconnects either periodically or when triggered by an event.
>>>>>>>
>>>>>>> BTW, I also frequently see the warning 'session %u hash delete rv -3' in session_delete in my environment; I hope this helps the investigation.
>>>>>>>
>>>>>>> Florin Coras <fcoras.li...@gmail.com> wrote on Mon, Mar 20, 2023 at 23:29:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Understood and yes, connect will synchronously fail if the port is not available, so you should be able to retry it later.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Florin
>>>>>>>>
>>>>>>>>> On Mar 20, 2023, at 1:58 AM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> It seems the issue occurs when disconnect is called, because our network can't guarantee that a tcp connection won't be reset even after the 3-way handshake has completed (firewall issue :( ).
>>>>>>>>>
>>>>>>>>> When we detect an app-layer timeout, we first disconnect (because we record the session handle, this session might be a half-open session). Does the vnet session layer guarantee that, if we reconnect from the master thread while the half-open session has not been released yet (due to asynchronous logic), the reconnect fails? If so, we can retry the connect later.
>>>>>>>>>
>>>>>>>>> I prefer not to register a half-open callback because I think it makes the app complicated from a TCP programming perspective.
>>>>>>>>>
>>>>>>>>> As for your patch, I think it should work: I can't delete the half-open session immediately because there is a worker configured, so the half-open will be removed from the bihash when the syn retransmit times out. I have merged the patch and will provide feedback later.
>>>>>>>>>
>>>>>>>>> Florin Coras <fcoras.li...@gmail.com> wrote on Mon, Mar 20, 2023 at 13:09:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Inline.
>>>>>>>>>>
>>>>>>>>>>> On Mar 19, 2023, at 6:47 PM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> It can be aborted in either established state or half-open state, because I apply a timeout in our app layer.
>>>>>>>>>>
>>>>>>>>>> [fc] Okay! Is the issue present irrespective of the state of the session or does it happen only after a disconnect in half-open state?
>>>>>>>>>> More below.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Regarding your questions:
>>>>>>>>>>>
>>>>>>>>>>> - Yes, we added a builtin app that relies on the C apis, mainly using vnet_connect/disconnect to connect or disconnect sessions.
>>>>>>>>>>
>>>>>>>>>> [fc] Understood
>>>>>>>>>>
>>>>>>>>>>> - We call these apis in a vpp ctrl process, which should be running on the master thread; we never do session setup/teardown on a worker thread. (The environment where this issue was found is configured with 1 master + 1 worker.)
>>>>>>>>>>
>>>>>>>>>> [fc] With vpp latest it’s possible to connect from first workers. It’s an optimization meant to avoid 1) worker barrier on syns and 2) entering poll mode on main (consume less cpu)
>>>>>>>>>>
>>>>>>>>>>> - We started developing the app on 22.06, and I keep merging upstream changes from latest vpp by cherry-picking. The reason for the line mismatch is that I added some comments to the session layer code; it should be equal to the master branch now.
>>>>>>>>>>
>>>>>>>>>> [fc] Ack
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> When reading the code, I understand that we mainly want to clean up the half-open from the bihash in session_stream_connect_notify. However, in syn-sent state, the session might be closed by my app due to a session setup timeout (on the scale of seconds); in that case, the session will be marked as half_open_done and the half-open session will be freed shortly afterwards in the ctrl thread (the 1st worker?).
>>>>>>>>>>
>>>>>>>>>> [fc] Actually, this might be the issue. We did start to provide a half-open session handle to apps which, if closed, does clean up the session, but apparently it is missing the cleanup of the session lookup table. Could you try this patch [1]? It might need additional work.
>>>>>>>>>>
>>>>>>>>>> Having said that, forcing a close/cleanup will not free the port synchronously. So, if you’re using fixed ports, you’ll have to wait for the half-open cleanup notification.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Should I also register a half-open callback, or is there some other reason that leads to this failure?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [fc] Yes, see above.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Florin
>>>>>>>>>>
>>>>>>>>>> [1] https://gerrit.fd.io/r/c/vpp/+/38526
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Florin Coras <fcoras.li...@gmail.com> wrote on Mon, Mar 20, 2023 at 06:22:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> When you abort the connection, is it fully established or half-open? Half-opens are cleaned up by the owner thread after a timeout, but the 5-tuple should be assigned to the fully established session by that point. tcp_half_open_connection_cleanup does not clean up the bihash; instead, session_stream_connect_notify does, once tcp connect returns either success or failure.
>>>>>>>>>>>>
>>>>>>>>>>>> So a few questions:
>>>>>>>>>>>> - is it accurate to assume you have a builtin vpp app and rely only on C apis to interact with the host stack?
>>>>>>>>>>>> - on what thread (main or first worker) do you call vnet_connect?
>>>>>>>>>>>> - what api do you use to close the session?
>>>>>>>>>>>> - what version of vpp is this, because lines don’t match vpp latest?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Florin
>>>>>>>>>>>>
>>>>>>>>>>>> > On Mar 19, 2023, at 2:08 AM, Zhang Dongya <fortitude.zh...@gmail.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi list,
>>>>>>>>>>>> >
>>>>>>>>>>>> > Recently, our application has been constantly triggering the following abort, which interrupts our connectivity for a while:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: received signal SIGABRT, PC 0x7fefd3b2000b
>>>>>>>>>>>> > Mar 19 16:11:26 ubuntu vnet[2565933]: /home/fortitude/glx/vpp/src/vnet/tcp/tcp_input.c:3004 (tcp46_input_inline) assertion `tcp_lookup_is_valid (tc0, b[0], tcp_buffer_hdr (b[0]))' fails
>>>>>>>>>>>> >
>>>>>>>>>>>> > Our scenario is quite simple: we make 4 parallel tcp connections (using 4 fixed source ports) to a remote vpp stack (fixed ip and port) and do some keepalive in our application layer. Since we only use the vpp tcp stack to keep the middlebox happy with the connection, we do not actually use the data transport of the tcp stack.
>>>>>>>>>>>> >
>>>>>>>>>>>> > However, since the network conditions are complex, we always have to abort the connection and reconnect.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I keep merging upstream session and tcp fixes, but the issue is still not fixed. What I have found so far is that in some cases tcp_half_open_connection_cleanup may not delete the half-open session from the lookup table (bihash), and the session index is then reallocated to another connection.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I hope the list can provide some hints on how to overcome this issue. Thanks a lot.
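
For readers following the diagnosis in this thread, here is a minimal, self-contained C sketch of the failure mode being discussed: a half-open session is freed without its lookup-table entry being removed, the session index is reused by a new connection, and a later lookup for the old 5-tuple resolves to a session that no longer owns it. Everything in it (the `toy_` names, the fixed-size arrays, the key layout) is hypothetical and only a toy model; it is not VPP code and does not use the real bihash, session or tcp APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX 8
#define TOY_INVALID UINT32_MAX

/* Hypothetical 5-tuple key; not the layout VPP uses. */
typedef struct {
  uint32_t src_ip, dst_ip;
  uint16_t src_port, dst_port;
  uint8_t proto;
} toy_key_t;

typedef struct { toy_key_t key; int in_use; } toy_session_t;
typedef struct { toy_key_t key; uint32_t session_index; int valid; } toy_entry_t;

static toy_session_t toy_sessions[TOY_MAX]; /* stand-in for a session pool */
static toy_entry_t toy_table[TOY_MAX];      /* stand-in for the lookup hash */

static int toy_key_equal(const toy_key_t *a, const toy_key_t *b) {
  return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
         a->src_port == b->src_port && a->dst_port == b->dst_port &&
         a->proto == b->proto;
}

/* Return the session index stored for key k, or TOY_INVALID if none. */
static uint32_t toy_lookup(const toy_key_t *k) {
  for (uint32_t i = 0; i < TOY_MAX; i++)
    if (toy_table[i].valid && toy_key_equal(&toy_table[i].key, k))
      return toy_table[i].session_index;
  return TOY_INVALID;
}

/* Allocate the lowest free session index (so freed indices get reused)
 * and add a lookup entry pointing at it. */
static uint32_t toy_alloc(const toy_key_t *k) {
  uint32_t si = TOY_INVALID, ti = TOY_INVALID;
  for (uint32_t i = 0; i < TOY_MAX; i++) {
    if (si == TOY_INVALID && !toy_sessions[i].in_use) si = i;
    if (ti == TOY_INVALID && !toy_table[i].valid) ti = i;
  }
  assert(si != TOY_INVALID && ti != TOY_INVALID);
  toy_sessions[si].key = *k;
  toy_sessions[si].in_use = 1;
  toy_table[ti].key = *k;
  toy_table[ti].session_index = si;
  toy_table[ti].valid = 1;
  return si;
}

/* Free a session. With remove_from_lookup == 0 the lookup entry is left
 * behind, which models the missing cleanup suspected in the thread. */
static void toy_free(uint32_t si, int remove_from_lookup) {
  if (remove_from_lookup)
    for (uint32_t i = 0; i < TOY_MAX; i++)
      if (toy_table[i].valid &&
          toy_key_equal(&toy_table[i].key, &toy_sessions[si].key))
        toy_table[i].valid = 0;
  toy_sessions[si].in_use = 0;
}

/* Returns 1 if a packet for the first connection's 5-tuple resolves to no
 * session or to a session that owns that 5-tuple, 0 if it resolves to a
 * mismatched session (the condition the assertion guards against). */
int toy_scenario(int cleanup_removes_lookup) {
  toy_key_t a = { 1, 2, 5000, 80, 6 };
  toy_key_t b = { 1, 2, 5001, 80, 6 };
  memset(toy_sessions, 0, sizeof(toy_sessions));
  memset(toy_table, 0, sizeof(toy_table));

  uint32_t half_open = toy_alloc(&a);        /* connect, still half-open */
  toy_free(half_open, cleanup_removes_lookup); /* app aborts it */
  (void) toy_alloc(&b);                      /* new connect reuses the index */

  uint32_t hit = toy_lookup(&a);
  if (hit == TOY_INVALID)
    return 1;
  return toy_sessions[hit].in_use && toy_key_equal(&toy_sessions[hit].key, &a);
}
```

In this toy model, `toy_scenario(0)` returns 0 because the stale entry for the first 5-tuple points at an index now owned by the second connection, while `toy_scenario(1)` returns 1 because the entry was removed together with the session. That mirrors, under the stated assumptions, why the patches in this thread make the forced close and reset paths also clean up the session lookup table.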
View/Reply Online (#22759): https://lists.fd.io/g/vpp-dev/message/22759