BewareMyPower opened a new pull request, #24599:
URL: https://github.com/apache/pulsar/pull/24599
### Motivation
There are several methods that get a connection from the `ConnectionPool`.
1. `ConnectionPool#getConnection(ServiceNameResolver)`
It's only used in `BinaryProtoLookupService`. The callbacks are all
executed in `PulsarClientImpl#lookupExecutorProvider`: a single thread executor
whose thread name starts with `pulsar-client-lookup`.
2. `ConnectionPool#getConnection(InetSocketAddress)`
It's called by the 1st method directly. Besides, it's only called by
`BinaryProtoLookupService#findBroker`, which also uses
`PulsarClientImpl#lookupExecutorProvider` to execute the callback.
3. `ConnectionPool#getConnection(InetSocketAddress logicalAddress,
InetSocketAddress physicalAddress, int randomKey)`
It's called by the 2nd method directly. Besides, it's only called by the
4th method.
4. `PulsarClientImpl#getConnection(InetSocketAddress, InetSocketAddress,
int)`
It's called in `ConnectionHandler#grabCnx` to establish a connection
between broker and client (producer, consumer or reader). The callback calls
`connectionOpened` or `handleConnectionError` without switching to another
executor.
5. Other methods in `PulsarClientImpl`
Including:
- `getConnectionToServiceUrl`
- `getConnection(String, int)`
- `getConnection(String, String)`
They all call the 4th method in the callback of
`LookupService#getBroker` and only used in `grabCnx`.
To solve a race condition caused by the fact that socket is closed in
Netty's I/O thread while `connectionOpened` that sends the command is executed
in another thread, #23499 completes the future of
`ConnectionPool#getConnection` in Netty's' I/O thread as well. However, this
adds an additional thread switching for all usages in method 1 above, which is
not necessary.
I found this issue when I found a producer creation was blocked forever due
to a deadlock in `sendAsync`'s callback, which is executed in Netty's I/O
thread. When checking the heap dump, I found `ClientCnx#pendingRequests` was
empty, which means `client.getCnxPool().getConnection(socketAddress)` never
complete, see
https://github.com/apache/pulsar/blob/1e57827b33395a2124593d95fa7641ee5e125d9b/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BinaryProtoLookupService.java#L218
Though even if without the change, other response processing will still be
blocked because the I/O thread is blocked. It's very confusing when reviewing
the heap dump:
- `BinaryProtoLookupService#partitionedMetadataInProgress` is not empty
- `ClientCnx#pendingRequests` is empty (the only connection in the pool)
<img width="877" height="253" alt="image"
src="https://github.com/user-attachments/assets/ca3cbbb8-9110-4697-b3c9-72c4b61c19d1"
/>
<img width="678" height="336" alt="image"
src="https://github.com/user-attachments/assets/be1dc242-d6e3-41ad-92df-8336500653db"
/>
Actually, the root cause of the issue described in #23499 is that the
`StacklessClosedChannelException` is treated as an exception cannot be retried.
However, all network exceptions should be retried. Hence, this PR proposed a
different solution to retry for such errors. Technically, only a few known
exceptions should be treated as not retriable, e.g. `AuthorizationException`.
Other known or unknown exceptions should be retried.
It's not guaranteed that `writeAndFlush` will always succeed. For example,
if the code reaches here:
https://github.com/apache/pulsar/blob/d2728253c666f7d4bd4f111356f3b97663603b6a/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L1062
The future of `ClientCnx#sendRequestWithId` could fail with an exception
that is not `PulsarClientException`.
### Modifications
Revert the change in #23499 and retry for retriable exception even if it's
not `PulsarClientException`. Improve `SimpleProduceConsumeIoTest` to cover
consumer creation as well. Since this test only covers a very limited case, add
`testUnknownRpcExceptionFor*` tests that inject failure on the 1st
`writeAndFlush` in `connectionOpened`.
### Documentation
<!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
- [ ] `doc` <!-- Your PR contains doc changes. -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update
later -->
- [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->
### Matching PR in forked repository
PR in forked repository:
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]