2020-09-23 09:18:16 UTC - alex kurtser: Hello We experience the consumers stuck under very strange conditions. It’s still hard to catch . It always happens with KEY_SHARED subscriptions with multiple consumers. What we figured out that the trigger for the problem is the adding additional consumers. It can be occurred as part of upgrade ( for example the new upgraded consumers being added and after this the old consumers being removed from the topic) .
Sometimes , because of network glitch, consumer is disconnecting and after few seconds reconnecting. We have 12 consumers on this particular topic but it’s enough that only one consumer will disconnect and reconnect in order to stuck the whole partition for every consumers *while other partitions still works fine.* Again, the common trigger is some change with consumers, preferably the adding of new consumer/consumers It doesn’t happen always. My estimation that it happens approximately in half of cases. Of course, no errors on the brokers side expecting the standard messages about connecting new consumers. The problem is that in order to recover from this situation we need to do one of the following : 1. Pulsar-admin topics unload topicname 2. Or we need completely disconnect all consumers and reconnect ( it isn’t recovering if even one consumers remaining connected.. we need to disconnect all of them at once) Any ideas ? ---- 2020-09-23 10:07:56 UTC - QW: @QW has joined the channel ---- 2020-09-23 11:59:56 UTC - Gautam Lodhiya: Hello, I am facing some issue of frequent consumer disconnection and reconnection, due to which the consumer processing some message couldn’t ack and it gets re-enqueued directly on consumer disconnection. Can we have some event notified before it gets re-enqueued or some delay before getting re-enqueued? ---- 2020-09-23 12:15:57 UTC - Sydni Horner: @Sydni Horner has joined the channel ---- 2020-09-23 12:19:35 UTC - Jim M.: @Sydni Horner FYI ---- 2020-09-23 12:34:29 UTC - chris: you should be able to add this annotation and it will create an internal load balancer (non public IP) ```<http://cloud.google.com/load-balancer-type|cloud.google.com/load-balancer-type>: "Internal"``` <https://cloud.google.com/kubernetes-engine/docs/how-to/internal-load-balancing> ---- 2020-09-23 12:54:18 UTC - Brent Evans: I'm going to see if I can write something to recreate the issue / system load as it only happens intermittently in our deployed environment. I'll get back when I've got something, thanks for the help so far ---- 2020-09-23 13:01:47 UTC - slouie: Hi @Sijie Guo, This issue hasn’t presented itself for the last 3 months since these posts, but just returned in force. I’m still investigating what we can do here and making sense of these stats gists. But wanted to share the output of those two commands you recommended to run. <https://gist.github.com/David-Miller-Bose/5d23a1437a6e9e56c595d2eff8b35964> If you have any advice or guidance on where else to look its greatly appreciated. We found that only one of our subscriptions require us to clear the backlog. We are looking into whats different about this one subscriber. ```Product-State-Subscription``` If I found out more and get to a fix, I’ll post back and hopefully leave some breadcrumbs for others. ---- 2020-09-23 13:49:49 UTC - Shawn: @Shawn has joined the channel ---- 2020-09-23 15:14:34 UTC - Sijie Guo: ```"Product-State-Subscription" : { "markDeletePosition" : "311744:132685", "readPosition" : "311762:71967", "waitingReadOp" : true, "pendingReadOps" : 0, "messagesConsumedCounter" : 1272570493, "cursorLedger" : 311678, "cursorLedgerLastEntry" : 8879, "individuallyDeletedMessages" : "[(311744:132686..311744:133765], (311744:133766..311762:71936], (311762:71938..311762:71947], (311762:71949..311762:71951], (311762:71952..311762:71954]]", "lastLedgerSwitchTimestamp" : "2020-09-22T15:17:51.399Z", "state" : "Open", "numberOfEntriesSinceFirstNotAckedMessage" : 865102, "totalNonContiguousDeletedMessagesRange" : 5, "properties" : { } },``` ---- 2020-09-23 15:15:16 UTC - Sijie Guo: from the topic stats output, it seems that there are a lot of unacked messages (see `individuallyDeletedMessages`). Did you enable ackTimeout? ---- 2020-09-23 15:16:32 UTC - slouie: yes, it’s `120s` We also leverage negative acks. ---- 2020-09-23 15:20:31 UTC - Sijie Guo: Interesting. Which version of Pulsar broker are you using? What is the subscription type? ---- 2020-09-23 15:20:53 UTC - David Miller: @David Miller has joined the channel ---- 2020-09-23 15:21:59 UTC - Sijie Guo: You can use OAuth2 authentication. <http://pulsar.apache.org/docs/en/security-oauth2/> ---- 2020-09-23 15:22:15 UTC - slouie: version `2.3.1.` the subscription type is `Shared` ---- 2020-09-23 15:25:01 UTC - Sijie Guo: I think Penghui suggested some commands to run to capture the topics stats and stats-internal in the email thread. Once you provide those information, Penghui can help you address the problem. ---- 2020-09-23 15:26:05 UTC - Sijie Guo: It seems that the broker timed out at reading the policy from zookeeper. So it can’t get the policies data from zookeeper and it will use the default one. ---- 2020-09-23 15:26:10 UTC - alex kurtser: Hi. Yes, i spoke with him after this message :slightly_smiling_face: I will try to reproduce and captur the memory ---- 2020-09-23 15:27:29 UTC - Sijie Guo: The message isn’t “re-enqueue”. If a message is not acked, the message will be dispatched again to the consumer. So I think the main question will be - why the consumer are disconnected. ---- 2020-09-23 15:27:52 UTC - Sijie Guo: Cool ---- 2020-09-23 15:31:26 UTC - Sijie Guo: I see. Just want to check one thing - have you tried to `topic unload` to unload a topic? If you unload a topic, did backlog move? ---- 2020-09-23 15:33:35 UTC - Ruwen Reddig: @Ruwen Reddig has joined the channel ---- 2020-09-23 15:45:43 UTC - slouie: I haven’t tried that yet. Our backlog is hovering around 60 messages. We’ve been doing a clear-backlog to clear the storage. This would redeliver those unack’d messages correct? Just want to understand the difference and how it could resolve this issue ---- 2020-09-23 16:26:45 UTC - Shawn: just to clarify some context (i work with slouie): That topic has anywhere from 500-1000 messages a second published. The consumers are keeping up with that, sending acks (and occasional nacks). the backlog at any given moment is around 20-60, however periodically the storage starts growing until the limit is reached (around 23gb), and publishes fail. the backlog itself never starts growing. ---- 2020-09-23 16:28:00 UTC - Sijie Guo: clear-backlog doesn’t trigger redeliveries ---- 2020-09-23 16:28:11 UTC - Sijie Guo: It basically skips the unacked messages ---- 2020-09-23 16:29:34 UTC - Sijie Guo: In order to debug this issue, it would be good to run `topic stats-internal` twice between a couple of minutes. Then we can compare the outputs to see if cursor is moving or not. ---- 2020-09-23 16:45:20 UTC - Sijie Guo: ApacheCon@Home is happening next week (Sep 29th ~ Oct 1st). There is a dedicated track for Pulsar/BookKeeper. It has 17 talks spanning multiple timezones. You can find the talks in <https://www.apachecon.com/acah2020/tracks/> and register the event at <https://hopin.to/events/apachecon-home>. star-struck : Frank Kelly, Axel Sirota ---- 2020-09-23 17:44:59 UTC - David Miller: Another a co-worker of @slouie here. I have a script collecting the output of the stats-internal every minute now. Hopefully the storage will trend back up beyond a GC window here shortly so that we know we have some samples that capture the issue. ---- 2020-09-23 18:15:16 UTC - Ebere Abanonu: Will it be recorded and made available? ---- 2020-09-23 18:41:34 UTC - David Miller: @Sijie Guo so we encountered the issue again. This time we have stats-internal data every minute before and after the issue. We also just attempted the unload on the topic and it did clear the storage associated with the topic. ---- 2020-09-23 18:42:28 UTC - David Miller: I’ll put together a few samples in a gist and will publish them in a few minutes. ---- 2020-09-23 18:42:58 UTC - Sijie Guo: Cool, Thank you! /cc @Penghui Li in the loop ---- 2020-09-23 18:43:34 UTC - Sijie Guo: It should be. Although I am not the organizer of ApacheCon, I have to double check with the organizers. ---- 2020-09-23 19:00:24 UTC - Evan Furman: Hi guys! We are not seeing Bookies listed under the cluster in Pulsar Manager. Have added the following params to the bookkeeper.conf. ```# The flag enables/disables starting the admin http server. Default value is 'false'. httpServerEnabled=true # The http server port to listen on. Default value is 8080. # Use `8000` as the port to keep it consistent with prometheus stats provider httpServerPort=8000 # The http server class httpServerClass=org.apache.bookkeeper.http.vertx.VertxHttpServer``` cc: @Tim Haak ---- 2020-09-23 19:02:05 UTC - Tim Haak: the call `/pulsar-manager/api/v1/bookies/pulsar-cluster`returns ```{"enableBookieHttp":false}``` ---- 2020-09-23 19:03:59 UTC - Addison Higham: hrm... I am not personally familiar with pulsar manager. Have you created an issue on the github repo? I will get it to the right person ---- 2020-09-23 19:06:08 UTC - Sijie Guo: Did you configure Pulsar Manager to talk to bookkeeper cluster? ---- 2020-09-23 19:06:29 UTC - Evan Furman: How would I do that? ---- 2020-09-23 19:11:21 UTC - Addison Higham: ```bookie.host=<http://localhost:8050> bookie.enable=false``` Is what needs configured ---- 2020-09-23 19:11:45 UTC - Sijie Guo: Yes. @Addison Higham you are faster than me :slightly_smiling_face: ---- 2020-09-23 19:12:11 UTC - Evan Furman: we are running the docker container — do these need to be passed in as env vars? ---- 2020-09-23 19:12:28 UTC - Addison Higham: I at least know where the config file is :smile: ---- 2020-09-23 19:13:50 UTC - Sijie Guo: @Evan Furman I am not sure it was exposed as environment variables yet. an alertnative way is to start the docker and pointing it with an application.properties file. So you can manage all the configs in the properties file. ---- 2020-09-23 19:18:27 UTC - Evan Furman: so bookie.host needs to list out all the bookies in the cluster? ---- 2020-09-23 19:19:47 UTC - Addison Higham: specifically, if you just built your own image (or used a volume) and place your custom properties file at `/pulsar-manager/pulsar-manager/application.properties` you can then set any arbitrary properties. The environment variables just get mapped to a few properties as seen here <https://github.com/apache/pulsar-manager/blob/master/docker/supervisord.conf#L31> and will still override the values in that property file ---- 2020-09-23 19:28:54 UTC - Evan Furman: ok, will give it a shot ---- 2020-09-23 20:04:14 UTC - Rounak Jaggi: Is there a way to dynamically update "tokenPublicKey" in proxy & broker config without a service restart? Also how's everyone managing this in thr enterprise? ---- 2020-09-23 20:05:09 UTC - Frank Kelly: For us a Kubernetes secret update should do the trick ---- 2020-09-23 20:05:28 UTC - Frank Kelly: We hope! ---- 2020-09-23 20:33:10 UTC - Rounak Jaggi: We are using ec2 instance. ---- 2020-09-23 20:33:48 UTC - Rounak Jaggi: @Addison Higham @Sijie Guo ---- 2020-09-23 20:42:27 UTC - Frank Kelly: It does appear that the Public Key is loaded on startup and then cached <https://github.com/apache/pulsar/blob/master/pulsar-broker-common/src/main/java/org/apache/pulsar/broker/authentication/AuthenticationService.java#L57> So my K8S secret update hopes appear to be dashed :disappointed: ---- 2020-09-23 20:54:08 UTC - Rounak Jaggi: can pulsar support multiple "tokenPublicKey" public key in proxy & broker? ---- 2020-09-23 20:54:15 UTC - Rounak Jaggi: can pulsar support multiple "tokenPublicKey" public key in proxy & broker? ---- 2020-09-23 20:56:32 UTC - Frank Kelly: I'm no expert but my reading of the code is no - however it *should* be relatively straightforward - if you create an issue here <https://github.com/apache/pulsar/issues> we might get some help and at least get it on the roadmap +1 : Sijie Guo ---- 2020-09-23 21:01:02 UTC - Sijie Guo: Yeah. A github issue will be helpful ---- 2020-09-23 23:36:15 UTC - Seun: Thanks @chris ---- 2020-09-24 01:58:53 UTC - David Miller: Sorry for the delay. Here’s another gist that has several samples of the stats-internal for the topic. <https://gist.github.com/David-Miller-Bose/4ad8d460499ec4eae091f16ab239075d> ---- 2020-09-24 01:59:53 UTC - David Miller: Here’s an image of the metrics we have that show the graph during the samples included. ---- 2020-09-24 08:48:17 UTC - Gilles Barbier: @Devin G. Bost sure, please tell me if you make progress or have a different experience ---- 2020-09-24 08:57:08 UTC - amir hossain Delvarani: @amir hossain Delvarani has joined the channel ----
