BewareMyPower opened a new pull request, #24599:
URL: https://github.com/apache/pulsar/pull/24599

   ### Motivation
   
   There are several methods that get a connection from the `ConnectionPool`.
   1. `ConnectionPool#getConnection(ServiceNameResolver)`
       It's only used in `BinaryProtoLookupService`. The callbacks are all 
executed in `PulsarClientImpl#lookupExecutorProvider`: a single thread executor 
whose thread name starts with `pulsar-client-lookup`.
   2. `ConnectionPool#getConnection(InetSocketAddress)`
       It's called by the 1st method directly. Besides, it's only called by 
`BinaryProtoLookupService#findBroker`, which also uses 
`PulsarClientImpl#lookupExecutorProvider` to execute the callback.
   3. `ConnectionPool#getConnection(InetSocketAddress logicalAddress, 
InetSocketAddress physicalAddress, int randomKey)`
       It's called by the 2nd method directly. Besides, it's only called by the 
4th method.
   4. `PulsarClientImpl#getConnection(InetSocketAddress, InetSocketAddress, 
int)`
       It's called in `ConnectionHandler#grabCnx` to establish a connection 
between broker and client (producer, consumer or reader). The callback calls 
`connectionOpened` or `handleConnectionError` without switching to another 
executor.
   5. Other methods in `PulsarClientImpl`
       Including:
       - `getConnectionToServiceUrl`
       - `getConnection(String, int)`
       - `getConnection(String, String)`
       They all call the 4th method in the callback of 
`LookupService#getBroker` and only used in `grabCnx`.
   
   To solve a race condition caused by the fact that socket is closed in 
Netty's I/O thread while `connectionOpened` that sends the command is executed 
in another thread, #23499 completes the future of 
`ConnectionPool#getConnection` in Netty's' I/O thread as well. However, this 
adds an additional thread switching for all usages in method 1 above, which is 
not necessary.
   
   I found this issue when I found a producer creation was blocked forever due 
to a deadlock in `sendAsync`'s callback, which is executed in Netty's I/O 
thread. When checking the heap dump, I found `ClientCnx#pendingRequests` was 
empty, which means `client.getCnxPool().getConnection(socketAddress)` never 
complete, see 
https://github.com/apache/pulsar/blob/1e57827b33395a2124593d95fa7641ee5e125d9b/pulsar-client/src/main/java/org/apache/pulsar/client/impl/BinaryProtoLookupService.java#L218
   
   Though even if without the change, other response processing will still be 
blocked because the I/O thread is blocked. It's very confusing when reviewing 
the heap dump:
   - `BinaryProtoLookupService#partitionedMetadataInProgress` is not empty
   - `ClientCnx#pendingRequests` is empty (the only connection in the pool)
   
   <img width="877" height="253" alt="image" 
src="https://github.com/user-attachments/assets/ca3cbbb8-9110-4697-b3c9-72c4b61c19d1";
 />
   
   <img width="678" height="336" alt="image" 
src="https://github.com/user-attachments/assets/be1dc242-d6e3-41ad-92df-8336500653db";
 />
   
   Actually, the root cause of the issue described in #23499 is that the 
`StacklessClosedChannelException` is treated as an exception cannot be retried. 
However, all network exceptions should be retried. Hence, this PR proposed a 
different solution to retry for such errors. Technically, only a few known 
exceptions should be treated as not retriable, e.g. `AuthorizationException`. 
Other known or unknown exceptions should be retried.
   
   It's not guaranteed that `writeAndFlush` will always succeed. For example, 
if the code reaches here: 
https://github.com/apache/pulsar/blob/d2728253c666f7d4bd4f111356f3b97663603b6a/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L1062
   
   The future of `ClientCnx#sendRequestWithId` could fail with an exception 
that is not `PulsarClientException`.
   
   ### Modifications
   
   Revert the change in #23499 and retry for retriable exception even if it's 
not `PulsarClientException`. Improve `SimpleProduceConsumeIoTest` to cover 
consumer creation as well. Since this test only covers a very limited case, add 
`testUnknownRpcExceptionFor*` tests that inject failure on the 1st 
`writeAndFlush` in `connectionOpened`.
   
   ### Documentation
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [ ] `doc` <!-- Your PR contains doc changes. -->
   - [ ] `doc-required` <!-- Your PR changes impact docs and you will update 
later -->
   - [x] `doc-not-needed` <!-- Your PR changes do not impact docs -->
   - [ ] `doc-complete` <!-- Docs have been already added -->
   
   ### Matching PR in forked repository
   
   PR in forked repository:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to