[PR] [fix][client] Retry for unknown exceptions when creating a producer or consumer [pulsar]

via GitHub Mon, 04 Aug 2025 05:42:24 -0700


BewareMyPower opened a new pull request, #24599:
URL: https://github.com/apache/pulsar/pull/24599

### Motivation

There are several methods that get a connection from the `ConnectionPool`.
1. `ConnectionPool#getConnection(ServiceNameResolver)`
It's only used in `BinaryProtoLookupService`. The callbacks are all
executed in `PulsarClientImpl#lookupExecutorProvider`: a single thread executor
whose thread name starts with `pulsar-client-lookup`.
2. `ConnectionPool#getConnection(InetSocketAddress)`
It's called by the 1st method directly. Besides, it's only called by
`BinaryProtoLookupService#findBroker`, which also uses
`PulsarClientImpl#lookupExecutorProvider` to execute the callback.
3. `ConnectionPool#getConnection(InetSocketAddress logicalAddress,
InetSocketAddress physicalAddress, int randomKey)`
It's called by the 2nd method directly. Besides, it's only called by the
4th method.
4. `PulsarClientImpl#getConnection(InetSocketAddress, InetSocketAddress,
int)`
It's called in `ConnectionHandler#grabCnx` to establish a connection
between broker and client (producer, consumer or reader). The callback calls
`connectionOpened` or `handleConnectionError` without switching to another
executor.
5. Other methods in `PulsarClientImpl`
Including:
- `getConnectionToServiceUrl`
- `getConnection(String, int)`
- `getConnection(String, String)`
They all call the 4th method in the callback of
`LookupService#getBroker` and only used in `grabCnx`.

To solve a race condition caused by the fact that socket is closed in
Netty's I/O thread while `connectionOpened` that sends the command is executed
in another thread, #23499 completes the future of
`ConnectionPool#getConnection` in Netty's' I/O thread as well. However, this
adds an additional thread switching for all usages in method 1 above, which is
not necessary.

I found this issue when I found a producer creation was blocked forever due
to a deadlock in `sendAsync`'s callback, which is executed in Netty's I/O
thread. When checking the heap dump, I found `ClientCnx#pendingRequests` was
empty, which means `client.getCnxPool().getConnection(socketAddress)` never
complete, see
https://github.com/apache/pulsar/blob/1e57827b33395a2124593d95fa7641ee5e125d9b/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BinaryProtoLookupService.java#L218

Though even if without the change, other response processing will still be
blocked because the I/O thread is blocked. It's very confusing when reviewing
the heap dump:
- `BinaryProtoLookupService#partitionedMetadataInProgress` is not empty
- `ClientCnx#pendingRequests` is empty (the only connection in the pool)

Actually, the root cause of the issue described in #23499 is that the
`StacklessClosedChannelException` is treated as an exception cannot be retried.
However, all network exceptions should be retried. Hence, this PR proposed a
different solution to retry for such errors. Technically, only a few known
exceptions should be treated as not retriable, e.g. `AuthorizationException`.
Other known or unknown exceptions should be retried.

It's not guaranteed that `writeAndFlush` will always succeed. For example,
if the code reaches here:
https://github.com/apache/pulsar/blob/d2728253c666f7d4bd4f111356f3b97663603b6a/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L1062

The future of `ClientCnx#sendRequestWithId` could fail with an exception
that is not `PulsarClientException`.

### Modifications

Revert the change in #23499 and retry for retriable exception even if it's
not `PulsarClientException`. Improve `SimpleProduceConsumeIoTest` to cover
consumer creation as well. Since this test only covers a very limited case, add
`testUnknownRpcExceptionFor*` tests that inject failure on the 1st
`writeAndFlush` in `connectionOpened`.

### Documentation

- [ ] `doc` 
- [ ] `doc-required` 
- [x] `doc-not-needed` 
- [ ] `doc-complete`

### Matching PR in forked repository

PR in forked repository:

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [fix][client] Retry for unknown exceptions when creating a producer or consumer [pulsar]

Reply via email to