codelipenghui opened a new issue #10703:
URL: https://github.com/apache/pulsar/issues/10703


   **Describe the bug**
   All the broker can't complete the managed ledger creation, the managed 
ledger will in the `None` state for 1min and after 1min to reconnect to the 
topic, the managed ledger been removed due to `too long pending time`
   
   Here is the logs:
   
   ```
   May 25 15:24:35 ip-10-0-0-179 pulsar[39322]: 15:24:35.579 [pulsar-io-4-7] 
INFO  org.apache.pulsar.broker.service.ServerCnx - 
[/10.0.0.179:46878][persistent://public/default/test-partition-0] Creating 
producer. producerId=0
   May 25 15:24:35 ip-10-0-0-179 pulsar[39322]: 15:24:35.580 
[pulsar-ordered-OrderedExecutor-3-0] WARN  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - 
[public/default/persistent/test-partition-0] Managed ledger has been pending in 
initialize state more than 1899091 milliseconds, remove it from cache to retry 
...
   May 25 15:24:35 ip-10-0-0-179 pulsar[39322]: 15:24:35.580 
[pulsar-ordered-OrderedExecutor-3-0] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger 
public/default/persistent/test-partition-0
   May 25 15:24:35 ip-10-0-0-179 pulsar[39322]: 15:24:35.581 
[bookkeeper-ml-scheduler-OrderedScheduler-6-0] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[public/default/persistent/test-partition-0] Creating ledger, metadata: 
{component=[109, 97, 110, 97, 103, 101, 100, 45, 108, 101, 100, 103, 101, 114], 
pulsar/managed-ledger=[112, 117, 98, 108, 105, 99, 47, 100, 101, 102, 97, 117, 
108, 116, 47, 112, 101, 114, 115, 105, 115, 116, 101, 110, 116, 47, 116, 101, 
115, 116, 45, 112, 97, 114, 116, 105, 116, 105, 111, 110, 45, 48], 
application=[112, 117, 108, 115, 97, 114]} - metadata ops timeout : 60 seconds
   May 25 15:24:35 ip-10-0-0-179 pulsar[39322]: 15:24:35.581 
[bookkeeper-ml-scheduler-OrderedScheduler-6-0] WARN  
org.apache.bookkeeper.client.BookieWatcherImpl - New ensemble: 
[10.0.0.179:3181, 10.0.0.104:3181] is not adhering to Placement Policy. 
quarantinedBookies: []
   May 25 15:24:35 ip-10-0-0-179 pulsar[39322]: 15:24:35.586 [main-EventThread] 
INFO  org.apache.bookkeeper.client.LedgerCreateOp - Ensemble: [10.0.0.179:3181, 
10.0.0.104:3181] for ledger: 42
   May 25 15:24:38 ip-10-0-0-179 pulsar[39322]: 15:24:38.762 [pulsar-web-40-5] 
INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 - - 
[25/May/2021:15:24:38 +0000] "GET /metrics HTTP/1.1" 302 0 "-" 
"Prometheus/2.21.0" 0
   May 25 15:24:38 ip-10-0-0-179 pulsar[39322]: 15:24:38.764 
[prometheus-stats-41-1] INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 
- - [25/May/2021:15:24:38 +0000] "GET /metrics/ HTTP/1.1" 200 43130 
"http://10.0.0.179:8080/metrics"; "Prometheus/2.21.0" 2
   May 25 15:24:43 ip-10-0-0-179 pulsar[39322]: 15:24:43.866 [pulsar-io-4-6] 
INFO  org.apache.pulsar.broker.service.ServerCnx - Closed connection from 
/127.0.0.1:59372
   May 25 15:24:43 ip-10-0-0-179 pulsar[39322]: 15:24:43.866 [pulsar-io-4-7] 
INFO  org.apache.pulsar.broker.service.ServerCnx - Closed connection from 
/10.0.0.179:46878
   May 25 15:24:48 ip-10-0-0-179 pulsar[39322]: 15:24:48.762 [pulsar-web-40-5] 
INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 - - 
[25/May/2021:15:24:48 +0000] "GET /metrics HTTP/1.1" 302 0 "-" 
"Prometheus/2.21.0" 0
   May 25 15:24:48 ip-10-0-0-179 pulsar[39322]: 15:24:48.764 
[prometheus-stats-41-1] INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 
- - [25/May/2021:15:24:48 +0000] "GET /metrics/ HTTP/1.1" 200 43128 
"http://10.0.0.179:8080/metrics"; "Prometheus/2.21.0" 2
   May 25 15:24:58 ip-10-0-0-179 pulsar[39322]: 15:24:58.761 [pulsar-web-40-5] 
INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 - - 
[25/May/2021:15:24:58 +0000] "GET /metrics HTTP/1.1" 302 0 "-" 
"Prometheus/2.21.0" 0
   May 25 15:24:58 ip-10-0-0-179 pulsar[39322]: 15:24:58.764 
[prometheus-stats-41-1] INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 
- - [25/May/2021:15:24:58 +0000] "GET /metrics/ HTTP/1.1" 200 43126 
"http://10.0.0.179:8080/metrics"; "Prometheus/2.21.0" 2
   May 25 15:25:01 ip-10-0-0-179 CRON[51146]: (root) CMD (command -v debian-sa1 
> /dev/null && debian-sa1 1 1)
   May 25 15:25:08 ip-10-0-0-179 pulsar[39322]: 15:25:08.761 [pulsar-web-40-5] 
INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 - - 
[25/May/2021:15:25:08 +0000] "GET /metrics HTTP/1.1" 302 0 "-" 
"Prometheus/2.21.0" 0
   May 25 15:25:08 ip-10-0-0-179 pulsar[39322]: 15:25:08.764 
[prometheus-stats-41-1] INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 
- - [25/May/2021:15:25:08 +0000] "GET /metrics/ HTTP/1.1" 200 43128 
"http://10.0.0.179:8080/metrics"; "Prometheus/2.21.0" 2
   May 25 15:25:18 ip-10-0-0-179 pulsar[39322]: 15:25:18.761 [pulsar-web-40-5] 
INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 - - 
[25/May/2021:15:25:18 +0000] "GET /metrics HTTP/1.1" 302 0 "-" 
"Prometheus/2.21.0" 0
   May 25 15:25:18 ip-10-0-0-179 pulsar[39322]: 15:25:18.764 
[prometheus-stats-41-1] INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 
- - [25/May/2021:15:25:18 +0000] "GET /metrics/ HTTP/1.1" 200 43128 
"http://10.0.0.179:8080/metrics"; "Prometheus/2.21.0" 2
   May 25 15:25:28 ip-10-0-0-179 pulsar[39322]: 15:25:28.762 [pulsar-web-40-5] 
INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 - - 
[25/May/2021:15:25:28 +0000] "GET /metrics HTTP/1.1" 302 0 "-" 
"Prometheus/2.21.0" 1
   May 25 15:25:28 ip-10-0-0-179 pulsar[39322]: 15:25:28.764 
[prometheus-stats-41-1] INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 
- - [25/May/2021:15:25:28 +0000] "GET /metrics/ HTTP/1.1" 200 43158 
"http://10.0.0.179:8080/metrics"; "Prometheus/2.21.0" 2
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: 15:25:35.580 [pulsar-io-4-8] 
ERROR org.apache.pulsar.broker.service.ServerCnx - [/10.0.0.179:46878] Failed 
to create topic persistent://public/default/test-partition-0, producerId=0
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: 
java.util.concurrent.CompletionException: 
org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to 
load topic within timeout
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
 ~[?:?]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632)
 ~[?:?]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
org.apache.pulsar.common.util.FutureUtil.lambda$addTimeoutHandling$1(FutureUtil.java:129)
 ~[org.apache.pulsar-pulsar-common-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) 
[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) 
[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
 [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
 [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) 
[io.netty-netty-transport-native-epoll-4.1.63.Final-linux-x86_64.jar:4.1.63.Final]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
 [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
[io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
java.lang.Thread.run(Thread.java:829) [?:?]
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: Caused by: 
org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to 
load topic within timeout
   May 25 15:25:35 ip-10-0-0-179 pulsar[39322]: #011at 
org.apache.pulsar.broker.service.BrokerService.futureWithDeadline(...)(Unknown 
Source) ~[org.apache.pulsar-pulsar-broker-2.8.0-SNAPSHOT.jar:2.8.0-SNAPSHOT]
   May 25 15:25:38 ip-10-0-0-179 pulsar[39322]: 15:25:38.761 [pulsar-web-40-5] 
INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 - - 
[25/May/2021:15:25:38 +0000] "GET /metrics HTTP/1.1" 302 0 "-" 
"Prometheus/2.21.0" 0
   May 25 15:25:38 ip-10-0-0-179 pulsar[39322]: 15:25:38.764 
[prometheus-stats-41-1] INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 
- - [25/May/2021:15:25:38 +0000] "GET /metrics/ HTTP/1.1" 200 43157 
"http://10.0.0.179:8080/metrics"; "Prometheus/2.21.0" 2
   May 25 15:25:40 ip-10-0-0-179 systemd[1]: Started Session 115 of user ubuntu.
   May 25 15:25:48 ip-10-0-0-179 pulsar[39322]: 15:25:48.761 [pulsar-web-40-5] 
INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 - - 
[25/May/2021:15:25:48 +0000] "GET /metrics HTTP/1.1" 302 0 "-" 
"Prometheus/2.21.0" 0
   May 25 15:25:48 ip-10-0-0-179 pulsar[39322]: 15:25:48.764 
[prometheus-stats-41-1] INFO  org.eclipse.jetty.server.RequestLog - 10.0.0.156 
- - [25/May/2021:15:25:48 +0000] "GET /metrics/ HTTP/1.1" 200 43157 
"http://10.0.0.179:8080/metrics"; "Prometheus/2.21.0" 2
   May 25 15:25:53 ip-10-0-0-179 pulsar[39322]: 15:25:53.534 [pulsar-io-4-10] 
INFO  org.apache.pulsar.broker.service.ServerCnx - New connection from 
/127.0.0.1:59376
   May 25 15:25:53 ip-10-0-0-179 pulsar[39322]: 15:25:53.686 [pulsar-io-4-11] 
INFO  org.apache.pulsar.broker.service.ServerCnx - New connection from 
/10.0.0.179:46882
   May 25 15:25:53 ip-10-0-0-179 pulsar[39322]: 15:25:53.690 [pulsar-io-4-11] 
INFO  org.apache.pulsar.broker.service.ServerCnx - 
[/10.0.0.179:46882][persistent://public/default/test-partition-0] Creating 
producer. producerId=0
   May 25 15:25:53 ip-10-0-0-179 pulsar[39322]: 15:25:53.690 
[pulsar-ordered-OrderedExecutor-3-0] WARN  
org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - 
[public/default/persistent/test-partition-0] Managed ledger has been pending in 
initialize state more than 78110 milliseconds, remove it from cache to retry ...
   May 25 15:25:53 ip-10-0-0-179 pulsar[39322]: 15:25:53.690 
[pulsar-ordered-OrderedExecutor-3-0] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger 
public/default/persistent/test-partition-0
   May 25 15:25:53 ip-10-0-0-179 pulsar[39322]: 15:25:53.691 
[bookkeeper-ml-scheduler-OrderedScheduler-6-0] INFO  
org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[public/default/persistent/test-partition-0] Creating ledger, metadata: 
{component=[109, 97, 110, 97, 103, 101, 100, 45, 108, 101, 100, 103, 101, 114], 
pulsar/managed-ledger=[112, 117, 98, 108, 105, 99, 47, 100, 101, 102, 97, 117, 
108, 116, 47, 112, 101, 114, 115, 105, 115, 116, 101, 110, 116, 47, 116, 101, 
115, 116, 45, 112, 97, 114, 116, 105, 116, 105, 111, 110, 45, 48], 
application=[112, 117, 108, 115, 97, 114]} - metadata ops timeout : 60 seconds
   May 25 15:25:53 ip-10-0-0-179 pulsar[39322]: 15:25:53.692 
[bookkeeper-ml-scheduler-OrderedScheduler-6-0] WARN  
org.apache.bookkeeper.client.BookieWatcherImpl - New ensemble: 
[10.0.0.179:3181, 10.0.0.104:3181] is not adhering to Placement Policy. 
quarantinedBookies: []
   May 25 15:25:53 ip-10-0-0-179 pulsar[39322]: 15:25:53.696 [main-EventThread] 
INFO  org.apache.bookkeeper.client.LedgerCreateOp - Ensemble: [10.0.0.179:3181, 
10.0.0.104:3181] for ledger: 43
   ```
   
   This will affect all topics of the broker and all the 3 brokers are in this 
state.
   I have make a thread dump and heap dump:
   
   
   
   
[heap.hprof.zip](https://github.com/apache/pulsar/files/6540692/heap.hprof.zip)
   [jstack_v2.log](https://github.com/apache/pulsar/files/6540699/jstack_v2.log)
   
   
   **To Reproduce**
   It's difficult to reproduce
   
   **Additional context**
   current master branch
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to