fracasula commented on issue #7682: URL: https://github.com/apache/pulsar/issues/7682#issuecomment-681831805
@codelipenghui @gaoran10 We created a clone of whole production environment with the offloaded bucket too and even with 2.6.1 we're perfectly able to replicate, in fact the consumers are still getting stuck. We're able to unlock them only by manually killing the pod on Kubernetes. I deployed 3 consumers with a key shared subscription listening to our `SpaceEvents` topic. Once all three are running I reset the consumers offset for that subscription to by doing `bin/pulsar-admin topics reset-cursor SpaceEvents -s cloud-pulsar-tester -t 99999w`. After the reset I can see that the 3 consumers started reading messages. The `msgRateOut` showed in the `bin/pulsar-admin topics stats SpaceEvents` also shows that all 3 are reading messages. At some point, when there are around 181k messages left in the backlog, the consumers get stuck. Golang µservices logs report warnings only, coming from the Pulsar library itself (which uses logrus), they are not coming from our code: ``` { remote_addr: { ForceQuery: false Host: "pulsar-proxy.pulsar:6650" RawPath: "" User: null Fragment: "" Opaque: "" RawQuery: "" Path: "" Scheme: "pulsar" RawFragment: "" } level: "warn" local_addr: { Port: 44814 IP: "10.56.0.18" Zone: "" } } { remote_addr: { RawPath: "" Scheme: "pulsar" Path: "" Opaque: "" RawQuery: "" Host: "pulsar-proxy.pulsar:6650" RawFragment: "" Fragment: "" User: null ForceQuery: false } level: "warn" error: "write tcp 10.56.0.18:44814->10.0.1.68:6650: use of closed network connection" local_addr: { Port: 44814 Zone: "" IP: "10.56.0.18" } } { remote_addr: { RawQuery: "" Opaque: "" ForceQuery: false RawFragment: "" Scheme: "pulsar" Fragment: "" RawPath: "" Path: "" Host: "pulsar-proxy.pulsar:6650" User: null } local_addr: { Zone: "" Port: 42612 IP: "10.56.0.21" } level: "warn" } { level: "warn" remote_addr: { User: null Path: "" Scheme: "pulsar" RawQuery: "" Fragment: "" Opaque: "" Host: "pulsar-proxy.pulsar:6650" ForceQuery: false RawPath: "" RawFragment: "" } local_addr: { Zone: "" IP: "10.56.0.21" Port: 42612 } error: "write tcp 10.56.0.21:42612->10.0.1.68:6650: use of closed network connection" } { local_addr: { Zone: "" Port: 51340 IP: "10.56.0.17" } remote_addr: { User: null ForceQuery: false RawPath: "" RawFragment: "" RawQuery: "" Path: "" Host: "pulsar-proxy.pulsar:6650" Opaque: "" Fragment: "" Scheme: "pulsar" } level: "warn" } ``` Pulsar logs report the following (by `grep`ping by `exception`). ### Broker ``` java.lang.NullPointerException: null org.apache.bookkeeper.mledger.ManagedLedgerException: Unknown exception 08:25:35.597 [bookkeeper-ml-workers-OrderedExecutor-3-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://public/default/SpaceEvents / cloud-pulsar-tester] Error reading entries at 1430:0 : Unknown exception, Read Type Replay - Retrying to read in 15.0 seconds java.lang.NullPointerException: null at org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1776) ~[org.apache.pulsar-pulsar-common-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:87) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252] java.lang.NullPointerException: null at org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:125) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252] java.lang.NullPointerException: null at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252] 08:27:09.703 [bookkeeper-ml-workers-OrderedExecutor-3-0] WARN org.apache.bookkeeper.mledger.impl.OpReadEntry - [public/default/persistent/SpaceEvents][cloud-pulsar-tester] read failed from ledger at position:2193:0 : Unknown exception 08:27:09.703 [bookkeeper-ml-workers-OrderedExecutor-3-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://public/default/SpaceEvents / cloud-pulsar-tester] Error reading entries at 2193:0 : Unknown exception, Read Type Normal - Retrying to read in 15.0 seconds java.lang.NullPointerException: null at org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1776) ~[org.apache.pulsar-pulsar-common-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:87) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252] java.lang.NullPointerException: null at org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:125) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252] java.lang.NullPointerException: null at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252] ``` ### Proxy ``` 08:27:14.083 [pulsar-proxy-io-2-3] WARN io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer 08:27:14.090 [pulsar-proxy-io-2-3] WARN org.apache.pulsar.proxy.server.ProxyConnection - [/10.56.0.21:42612] Got exception NativeIoException : readAddress(..) failed: Connection reset by peer null 08:27:38.090 [pulsar-proxy-io-2-8] WARN io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer 08:27:38.090 [pulsar-proxy-io-2-8] WARN org.apache.pulsar.proxy.server.ProxyConnection - [/10.56.0.17:51340] Got exception NativeIoException : readAddress(..) failed: Connection reset by peer null ``` When inspecting the subscription, via the `pulsar-admin` cli tool, we see no consumers at all despite having 3 services running (and using your Pulsar Golang library - the native one, no cgo): `$ bin/pulsar-admin topics stats SpaceEvents | jq '.subscriptions["cloud-pulsar-tester"]' -c | jq` ``` { "msgRateOut": 0, "msgThroughputOut": 0, "bytesOutCounter": 0, "msgOutCounter": 0, "msgRateRedeliver": 0, "chuckedMessageRate": 0, "msgBacklog": 181626, "msgBacklogNoDelayed": 181626, "blockedSubscriptionOnUnackedMsgs": false, "msgDelayed": 0, "unackedMessages": 0, "type": "Key_Shared", "msgRateExpired": 0, "lastExpireTimestamp": 0, "lastConsumedFlowTimestamp": 1598516857278, "lastConsumedTimestamp": 0, "lastAckedTimestamp": 0, "consumers": [], "isDurable": true, "isReplicated": false } ``` If we inspect the Golang µservices logs with a lower debug level we see logs coming from the underlying Pulsar Golang library as shown below (please note that the following logs are regularly printed also after the consumers get stuck): ``` {"level":"debug","severity":"DEBUG","time":"2020-08-27T09:14:12Z","topic":"SpaceEvents"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} {"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"} ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org