hzhaop commented on PR #777:
URL: https://github.com/apache/skywalking-java/pull/777#issuecomment-3442077021

     The scenario you mentioned, where the agent quickly reconnects after a 
server reboot, typically occurs when the server shuts down cleanly, allowing 
TCP connections to terminate properly.
   
     However, the problem we encountered primarily arises in unstable network 
environments, leading to TCP connections entering a half-open state. In such 
situations:
   
      1. The server-side connection is terminated, but the client still 
believes the connection is alive. This causes the client's send-Q to 
continuously accumulate data, and the agent remains unaware that the connection 
has become invalid, thus not triggering an automatic reconnection.
   
      2. The role of gRPC keepalive: The purpose of introducing gRPC keepalive 
is precisely to actively detect these half-open connections. By periodically 
sending heartbeats, the agent can promptly discover connections that are 
actually dead but still perceived as alive by the client, thereby forcing their 
closure and initiating the reconnection process.
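
     As a rough illustration of the detection idea in point 2, the sketch 
below models a keepalive watchdog using only the JDK. It is not gRPC or 
SkyWalking code: the class `KeepaliveWatchdog`, its methods, and the 120s/30s 
values in the usage note are hypothetical. In grpc-java the corresponding knobs 
are `keepAliveTime` and `keepAliveTimeout` on the channel builder, and the 
heartbeat is an HTTP/2 PING.

```java
import java.time.Duration;

// Hypothetical model of a keepalive watchdog. Not gRPC or SkyWalking code.
final class KeepaliveWatchdog {
    enum Action { NONE, SEND_PING, CLOSE_AND_RECONNECT }

    private final long intervalNanos;   // idle time before a ping is sent
    private final long timeoutNanos;    // how long to wait for the ping ack
    private long lastActivityNanos;     // last time the peer was heard from
    private long pingSentNanos;         // when the outstanding ping was sent
    private boolean pingOutstanding;    // true while a ping awaits its ack

    KeepaliveWatchdog(Duration interval, Duration timeout, long nowNanos) {
        this.intervalNanos = interval.toNanos();
        this.timeoutNanos = timeout.toNanos();
        this.lastActivityNanos = nowNanos;
    }

    // Call whenever data or a ping ack arrives from the peer.
    void onPeerActivity(long nowNanos) {
        lastActivityNanos = nowNanos;
        pingOutstanding = false;
    }

    // Call periodically; returns what the caller should do at this instant.
    Action tick(long nowNanos) {
        if (pingOutstanding) {
            // A half-open connection never acks, so the timeout eventually fires.
            return nowNanos - pingSentNanos >= timeoutNanos
                    ? Action.CLOSE_AND_RECONNECT
                    : Action.NONE;
        }
        if (nowNanos - lastActivityNanos >= intervalNanos) {
            pingOutstanding = true;
            pingSentNanos = nowNanos;
            return Action.SEND_PING;
        }
        return Action.NONE;
    }
}
```

     With an interval of 120s and a timeout of 30s, a peer that goes silent at 
t=0 gets a ping at t=120s and is declared dead at t=150s, instead of the client 
queuing data indefinitely.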
   
     Regarding your point, "If nothing changed, there is no point to create a 
new channel":
   
      * Change in connection state: Even if the target backend address remains 
unchanged, the internal state of the previous connection is corrupted due to its 
half-open status. In this scenario, simply reusing the old channel is 
ineffective as it cannot recover.
   
      * Necessity of forced reconnection: We observed that after keepalive 
detected a connection failure, if the agent subsequently selected the same 
backend, the original reconnection logic would not immediately force the 
establishment of a new channel. Instead, it would wait for a long period 
(approximately one hour) before attempting to reconnect. The reconnection 
logic therefore needs to change so that, upon detecting a connection failure, 
the old channel is forcibly closed and a new `channel` is established, 
regardless of whether the same backend is selected. This is crucial for timely 
connection recovery and for preventing prolonged service interruptions.
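
     The difference between the two behaviors can be sketched as a pair of 
decision functions. This is a simplified model, not the actual code in this PR 
or in the agent: `ReconnectPolicy` and both method names are hypothetical, the 
backend addresses are placeholders, and the original logic's delayed retry 
(about one hour) is reduced here to "does not rebuild".

```java
import java.util.Objects;

// Hypothetical decision helpers; a simplified model of the behavior described above.
final class ReconnectPolicy {
    private ReconnectPolicy() {}

    // Original behavior as described: rebuild the channel only when the
    // selected backend differs from the one currently in use.
    static boolean shouldRebuildOriginal(String currentBackend, String selectedBackend) {
        return !Objects.equals(currentBackend, selectedBackend);
    }

    // Proposed behavior: a detected connection failure always forces a new
    // channel, even if the same backend is selected again.
    static boolean shouldRebuildRevised(String currentBackend,
                                        String selectedBackend,
                                        boolean connectionFailed) {
        return connectionFailed || !Objects.equals(currentBackend, selectedBackend);
    }

    public static void main(String[] args) {
        String backend = "oap-1:11800"; // placeholder address
        // Same backend selected after keepalive reported a failure:
        System.out.println(shouldRebuildOriginal(backend, backend));      // false
        System.out.println(shouldRebuildRevised(backend, backend, true)); // true
    }
}
```

     In the half-open case the selected backend is often the same one, so only 
the second function closes the stale channel right away.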
   

