hzhaop commented on PR #777:
URL: https://github.com/apache/skywalking-java/pull/777#issuecomment-3510207871
> A key question before checking codes, as you said, `TRANSIENT_FAILURE`
represents `Next Call goes to broken connection, and gRPC returns this.` Why
doesn't treat this as a network error directly? You created another method and
configurations to verify this status, and also trigger the reconnection like
other network issue. It seems no different to me. Did I miss anything?
You're absolutely right! After reviewing the code more carefully, I realized
that monitoring TRANSIENT_FAILURE state is indeed redundant.
Here's why:
1. When the channel is in TRANSIENT_FAILURE state, any RPC call will
immediately throw an UNAVAILABLE exception
2. UNAVAILABLE is already in the `isNetworkError()` list, which triggers
`reportError()` and reconnection
3. So the TRANSIENT_FAILURE monitoring provides no additional benefit
I've removed the `transientFailureCount` mechanism and
`checkChannelStateAndTriggerReconnectIfNeeded()` method in the latest commits.
## The Real Problem and Solution
After analyzing production logs, I found the actual issue was with the
**original reconnection logic**:
**Original code problem:**
```java
} else if (managedChannel.isConnected(++reconnectCount > 5)) {
reconnectCount = 0; // Reset counter when isConnected() returns true
reconnect = false;
}
```
When `isConnected()` returns true (even if it's a false positive due to
half-open connection), the counter gets reset. This means:
- If the connection is in a half-open state, `isConnected()` may return true
- `reconnectCount` gets reset to 0 every time
- The counter never reaches the threshold
- Channel never gets rebuilt, even though the connection is actually broken
**Current solution:**
```java
if (reconnectCount > Config.Agent.FORCE_RECONNECTION_PERIOD) {
// Force rebuild channel
createNewChannel(...);
} else if (managedChannel.isConnected(false)) {
// Trust the connection but DON'T reset reconnectCount
notifyConnected();
}
```
Now `reconnectCount` only resets in `createNewChannel()` after actual
channel rebuild. This ensures:
- Even if `isConnected()` returns false positives, the counter keeps
accumulating
- After threshold is exceeded, channel is forcibly rebuilt
- Half-open connections are resolved within predictable timeframe (threshold
× check_interval)
This matches the original intent of `FORCE_RECONNECTION_PERIOD`
configuration.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]