nodece opened a new issue, #23432:
URL: https://github.com/apache/pulsar/issues/23432

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Motivation
   
   #23378 CI failed, which reports the deadlock issue in the pulsar client.
   
   The binary lookup uses the pulsar client's internal executor.
   
   ```
   2024-10-09T14:07:01.8281866Z "broker-client-shared-internal-executor-203-1" 
#398 [4208] prio=5 os_prio=0 cpu=5.08ms elapsed=3422.76s tid=0x00007efce4013ce0 
nid=4208 waiting on condition  [0x00007efca57fd000]
   2024-10-09T14:07:01.8282015Z    java.lang.Thread.State: WAITING (parking)
   2024-10-09T14:07:01.8282244Z         at 
jdk.internal.misc.Unsafe.park([email protected]/Native Method)
   2024-10-09T14:07:01.8282658Z         - parking to wait for  
<0x000010000c5f6c58> (a java.util.concurrent.CompletableFuture$Signaller)
   2024-10-09T14:07:01.8283361Z         at 
java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:221)
   2024-10-09T14:07:01.8283798Z         at 
java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1864)
   2024-10-09T14:07:01.8284421Z         at 
java.util.concurrent.ForkJoinPool.unmanagedBlock([email protected]/ForkJoinPool.java:3780)
   2024-10-09T14:07:01.8284853Z         at 
java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3725)
   2024-10-09T14:07:01.8285257Z         at 
java.util.concurrent.CompletableFuture.waitingGet([email protected]/CompletableFuture.java:1898)
   2024-10-09T14:07:01.8285614Z         at 
java.util.concurrent.CompletableFuture.get([email protected]/CompletableFuture.java:2072)
   2024-10-09T14:07:01.8286326Z         at 
org.apache.pulsar.client.impl.schema.reader.AbstractMultiVersionReader.getSchemaInfoByVersion(AbstractMultiVersionReader.java:118)
   2024-10-09T14:07:01.8286927Z         at 
org.apache.pulsar.client.impl.schema.reader.MultiVersionAvroReader.loadReader(MultiVersionAvroReader.java:45)
   2024-10-09T14:07:01.8287435Z         at 
org.apache.pulsar.client.impl.schema.reader.AbstractMultiVersionReader$1.load(AbstractMultiVersionReader.java:51)
   2024-10-09T14:07:01.8287942Z         at 
org.apache.pulsar.client.impl.schema.reader.AbstractMultiVersionReader$1.load(AbstractMultiVersionReader.java:48)
   2024-10-09T14:07:01.8288307Z         at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3571)
   2024-10-09T14:07:01.8288582Z         at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2313)
   2024-10-09T14:07:01.8288898Z         at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2190)
   2024-10-09T14:07:01.8289291Z         - locked <0x000010000c5f4f80> (a 
com.google.common.cache.LocalCache$StrongAccessEntry)
   2024-10-09T14:07:01.8289762Z         at 
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2080)
   2024-10-09T14:07:01.8289989Z         at 
com.google.common.cache.LocalCache.get(LocalCache.java:4012)
   2024-10-09T14:07:01.8290251Z         at 
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4035)
   2024-10-09T14:07:01.8290554Z         at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5013)
   2024-10-09T14:07:01.8291133Z         at 
org.apache.pulsar.client.impl.schema.reader.AbstractMultiVersionReader.getSchemaReader(AbstractMultiVersionReader.java:82)
   2024-10-09T14:07:01.8291642Z         at 
org.apache.pulsar.client.impl.schema.reader.AbstractMultiVersionReader.read(AbstractMultiVersionReader.java:73)
   2024-10-09T14:07:01.8292052Z         at 
org.apache.pulsar.client.impl.schema.AbstractStructSchema.decode(AbstractStructSchema.java:90)
   2024-10-09T14:07:01.8292382Z         at 
org.apache.pulsar.client.impl.MessageImpl.decodeBySchema(MessageImpl.java:512)
   2024-10-09T14:07:01.8292665Z         at 
org.apache.pulsar.client.impl.MessageImpl.decode(MessageImpl.java:493)
   2024-10-09T14:07:01.8292959Z         at 
org.apache.pulsar.client.impl.MessageImpl.getValue(MessageImpl.java:478)
   2024-10-09T14:07:01.8293681Z         at 
org.apache.pulsar.broker.service.SystemTopicBasedTopicPoliciesService.refreshTopicPoliciesCache(SystemTopicBasedTopicPoliciesService.java:523)
   2024-10-09T14:07:01.8294463Z         at 
org.apache.pulsar.broker.service.SystemTopicBasedTopicPoliciesService.lambda$readMorePoliciesAsync$24(SystemTopicBasedTopicPoliciesService.java:500)
   2024-10-09T14:07:01.8294991Z         at 
org.apache.pulsar.broker.service.SystemTopicBasedTopicPoliciesService$$Lambda/0x00007efd04b54f00.accept(Unknown
 Source)
   2024-10-09T14:07:01.8295396Z         at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire([email protected]/CompletableFuture.java:718)
   2024-10-09T14:07:01.8295796Z         at 
java.util.concurrent.CompletableFuture.postComplete([email protected]/CompletableFuture.java:510)
   2024-10-09T14:07:01.8296257Z         at 
java.util.concurrent.CompletableFuture.complete([email protected]/CompletableFuture.java:2179)
   2024-10-09T14:07:01.8296677Z         at 
org.apache.pulsar.client.impl.ConsumerBase.lambda$completePendingReceive$0(ConsumerBase.java:333)
   2024-10-09T14:07:01.8297062Z         at 
org.apache.pulsar.client.impl.ConsumerBase$$Lambda/0x00007efd04bb29a8.run(Unknown
 Source)
   2024-10-09T14:07:01.8297599Z         at 
java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1144)
   2024-10-09T14:07:01.8298124Z         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:642)
   2024-10-09T14:07:01.8298487Z         at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
   2024-10-09T14:07:01.8298698Z         at 
java.lang.Thread.runWith([email protected]/Thread.java:1596)
   2024-10-09T14:07:01.8298883Z         at 
java.lang.Thread.run([email protected]/Thread.java:1583)
   2024-10-09T14:07:01.8298896Z 
   2024-10-09T14:07:01.8299012Z    Locked ownable synchronizers:
   2024-10-09T14:07:01.8299347Z         - <0x000010000c50c540> (a 
java.util.concurrent.ThreadPoolExecutor$Worker)
   ```
   
   We are using synchronous calls in asynchronous callback.
   
   ### Solution
   
   ## Solution 1:
   1. Add a set of async methods:
      - `org.apache.pulsar.client.api.schema.SchemaReader#read`
      - `org.apache.pulsar.client.api.Message#getValue`
   2. Check the call stack, and then remove synchronous calls in asynchronous 
callback.
   
   This solution will change the method signature, which breaks the public 
interface(SchemaReader and Message).
   
   ## Solution 2:
   
   ```
   private void readMorePoliciesAsync(SystemTopicClient.Reader<PulsarEvent> 
reader) {
           if (closed.get()) {
               
cleanCacheAndCloseReader(reader.getSystemTopic().getTopicName().getNamespaceObject(),
 false);
               return;
           }
           reader.readNextAsync()
                   .thenAccept(msg -> {
                       refreshTopicPoliciesCache(msg);
                       notifyListener(msg);
                   })
   ```
   
   It looks like refreshTopicPoliciesCache has a deadlock, and the callback 
thread is broker-client-shared-internal-executor.
   We can use the `pulsarService.getExecutor()` to run the callback, this can 
avoid the deadlock.
   
   ## Solution 3:
   
   Don't use the pulsar client's internal executor in the 
`org.apache.pulsar.client.impl.BinaryProtoLookupService`.
   
   ### Alternatives
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to