hzhaop commented on PR #777:
URL: https://github.com/apache/skywalking-java/pull/777#issuecomment-3498721552

   Thanks for the feedback @wu-sheng! I've revised the implementation based on 
your suggestions:
   
   ## Changes Made
   
   ### 1. Restored original reconnection logic + Added TRANSIENT_FAILURE 
monitoring
   
   **I agree with your point** - the original logic is correct: only force 
reconnect when selecting a **different server**. When the **same server** is 
selected, we should rely on gRPC's auto-reconnect.
   
   **What I've added:**
   - Monitor `ConnectivityState.TRANSIENT_FAILURE` state continuously
   - If channel stays in TRANSIENT_FAILURE for too long, force rebuild channel
   - Dual safeguard: force reconnect if either `reconnectCount` OR 
`transientFailureCount` exceeds threshold
   
   This solves the half-open connection issue without disrupting gRPC's 
auto-reconnect mechanism.
   
   **Key code:**
   ```java
   if (index == selectedIdx) {
       reconnectCount++;
       boolean forceReconnect = reconnectCount > threshold
                             || transientFailureCount > threshold;
       if (forceReconnect) {
           createNewChannel(...);  // Force rebuild
       } else if (managedChannel.isConnected(false)) {
           markAsConnected();  // Let gRPC auto-reconnect work
       }
   }
   ```
   
   Also added `keepAliveWithoutCalls(true)` to detect half-open connections.
   
   ### 2. Fixed race condition between reportError() and run()
   
   **Issue from heap dump analysis:**
   My production heap dump showed `reconnect = false` while listeners were in 
DISCONNECT state, proving the race condition exists. **This causes SkyWalking 
Agent to permanently stop uploading data.**
   
   **Fix:**
   Wrapped all state changes in `synchronized(statusLock)` to make flag updates 
and notifications atomic.
   
   ```java
   synchronized (statusLock) {
       reconnect = true;
       notify(GRPCChannelStatus.DISCONNECT);
   }
   ```
   
   ## Additional Changes
   - Adjusted keepalive default from 60s to 120s (reduces overhead)
   - Added `getState()` method to GRPCChannel for monitoring
   
   Let me know if this approach looks good!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to