devjue opened a new issue, #16344:
URL: https://github.com/apache/dubbo/issues/16344

   ### Pre-check
   
   - [x] I am sure that all the content I provide is in English.
   
   
   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/dubbo/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Apache Dubbo Component
   
   Java SDK (apache/dubbo)
   
   ### Dubbo Version
   
   Dubbo 3.3 (also reproducible on 3.2.x), JDK 8/17, Linux
   
   ### Steps to reproduce this issue
   
   1. A Dubbo Triple Consumer talks to any HTTP/2 server (gateway / reverse 
proxy / direct Provider) that periodically emits `GOAWAY(errorCode=0, 
lastStreamId=MAX_INT)`. Common triggers: `max_requests_per_connection`, 
`idle_timeout`, `max_connection_duration`, hot-restart drain, overload manager.
   2. Send sustained traffic (e.g. 1 QPS) over the connection.
   3. Every GOAWAY produces exactly one failed request, then traffic recovers 
within ~1s.
   
   Captured call chain on the Consumer side:
   
   ```
   SslHandler.decode
     -> DefaultHttp2FrameReader.readGoAwayFrame
     -> Http2FrameCodec.onHttp2Frame
     -> Http2MultiplexHandler
     -> TripleGoAwayHandler.channelRead(DefaultHttp2GoAwayFrame)
     -> NettyConnectionHandler.onGoAway
     -> AbstractNettyConnectionClient.onGoaway(channel)   // sets channel ref 
to null
   ```
   
   ### What you expected to happen
   
   A graceful GOAWAY (`errorCode=0`, `lastStreamId=MAX_INT`) means the server 
still accepts new streams on the old channel, so it should be transparent to 
callers — no request failure window.
   
   Actual behaviour: `AbstractNettyConnectionClient#onGoaway` immediately nulls 
the channel reference, then `NettyConnectionHandler#onGoAway` schedules a 
reconnect. In the gap, the application logs:
   
   ```
   WARN  SingleRouterChain  - filter result is empty
   WARN  AbstractDirectory  - validInvokers: 0, invokersToReconnect: 1
   ERROR ...                - No provider available for service ...
   ```
   
   `AbstractClusterInvoker#checkInvokers` throws `RpcException` **before** 
entering FailoverCluster's retry loop, so configuring `retries=N` does **not** 
mitigate the issue.
   
   Root causes:
   
   1. The channel is nulled immediately instead of being kept alive during a 
graceful migration; with `lastStreamId=MAX_INT` the old channel could still 
serve.
   2. `NettyConnectionClient#initBootstrap`'s `closeFuture` listener calls 
`clearNettyChannel()` unconditionally — when a new channel is later swapped in 
and the old one closes, the listener wipes the new reference (~1.3s downtime 
observed in tests).
   3. Scheduling `doConnect` on the Netty I/O `EventLoop` is unsafe: 
`awaitUninterruptibly(connectTimeout)` blocks the I/O thread, freezing every 
other channel sharing that EventLoop during TLS handshake.
   
   ### Anything else
   
   Happens **every time** a GOAWAY is received — exactly one failed request per 
GOAWAY. Affects any Triple Consumer behind any HTTP/2 server that emits 
graceful GOAWAY. Because Dubbo connections are typically shared across many 
`@DubboReference` instances, a single GOAWAY can fan out to multiple downstream 
calls.
   
   I have a working patch on top of the `3.3` branch (5 files, +178 / -18) 
implementing graceful migration:
   
   1. `NettyConnectionHandler#onGoAway`: keep old channel alive; schedule new 
connect on `connectivityExecutor` after `GRACEFUL_RECONNECT_DELAY_MS = 200ms`; 
one retry on failure, then fall back to 
`AbstractNettyConnectionClient#onGoaway`.
   2. `NettyConnectionClient#initBootstrap`: change `closeFuture` listener from 
`clearNettyChannel()` to `compareAndClearNettyChannel(ch)` so the old channel's 
close listener cannot wipe a freshly swapped-in new channel.
   3. `AbstractNettyConnectionClient`: add `compareAndClearNettyChannel`, plus 
package-private `getConnectivityExecutor()` / `getReconnectDuration()` 
accessors.
   4. `NettyChannel#getChannelIfPresent`: lookup-only API used by 
`channelInactive` to avoid allocating a transient `NettyChannel` purely for 
logging.
   5. `TripleGoAwayHandler`: log `errorCode` and `lastStreamId` so the GOAWAY 
trigger is observable.
   
   Validated against a local Triple demo: 26 GOAWAYs, 100% migration success, 
zero `ERROR` / `No provider`. Channel rotation chain (first 18 migrations):
   
   ```
   :45956 -> :45110 -> :53222 -> :58366 -> :56372 -> :55436 -> :55188 ->
   :52926 -> :50682 -> :48682 -> :47626 -> :44014 -> :43116 -> :45044 ->
   :48602 -> :51880 -> :50992 -> :51672
   ```
   
   Happy to submit a PR against `3.3` and backport to `3.2` if maintainers 
agree with the design.
   
   ### Do you have a (mini) reproduction demo?
   
   - [ ] Yes, I have a minimal reproduction demo to help resolve this issue 
more effectively!
   
   ### Are you willing to submit a pull request to fix on your own?
   
   - [x] Yes I am willing to submit a pull request on my own!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to