2020-09-23 09:18:16 UTC - alex kurtser: Hello

We experience the consumers stuck under very strange conditions. It’s still 
hard to catch . It always happens with KEY_SHARED subscriptions  with multiple 
consumers.
What we figured out that the trigger for the problem is the adding additional 
consumers. It can be occurred as part of upgrade ( for example the new upgraded 
consumers being added and after this the old consumers being removed from the 
topic) .

Sometimes , because of network glitch, consumer is disconnecting and after few 
seconds reconnecting. We have 12 consumers on this particular topic but it’s 
enough that only one consumer will disconnect and reconnect in order to stuck 
the whole partition for every consumers *while other partitions still works 
fine.*

Again, the common trigger is some change with consumers, preferably the adding 
of new consumer/consumers

 It doesn’t happen always. My estimation that it happens approximately in half 
of cases.

Of course, no errors on the brokers side expecting the standard messages about 
connecting new consumers.

The problem is that in order to recover from this situation we need to do one 
of the following :

1. Pulsar-admin topics unload topicname
2. Or we need completely disconnect all consumers and reconnect ( it isn’t 
recovering if even one consumers remaining connected.. we need to disconnect 
all of them at once)
Any ideas ?
----
2020-09-23 10:07:56 UTC - QW: @QW has joined the channel
----
2020-09-23 11:59:56 UTC - Gautam Lodhiya: Hello,

I am facing some issue of frequent consumer disconnection and reconnection, due 
to which the consumer processing some message couldn’t ack and it gets 
re-enqueued directly on consumer disconnection.

Can we have some event notified before it gets re-enqueued or some delay before 
getting re-enqueued?
----
2020-09-23 12:15:57 UTC - Sydni Horner: @Sydni Horner has joined the channel
----
2020-09-23 12:19:35 UTC - Jim M.: @Sydni Horner FYI
----
2020-09-23 12:34:29 UTC - chris: you should be able to add this annotation and 
it will create an internal load balancer (non public IP)
```<http://cloud.google.com/load-balancer-type|cloud.google.com/load-balancer-type>:
 "Internal"```
<https://cloud.google.com/kubernetes-engine/docs/how-to/internal-load-balancing>
----
2020-09-23 12:54:18 UTC - Brent Evans: I'm going to see if I can write 
something to recreate the issue / system load as it only happens intermittently 
in our deployed environment. I'll get back when I've got something, thanks for 
the help so far
----
2020-09-23 13:01:47 UTC - slouie: Hi @Sijie Guo,

This issue hasn’t presented itself for the last 3 months since these posts, but 
just returned in force. I’m still investigating what we can do here and making 
sense of these stats gists. But wanted to share the output of those two 
commands you recommended to run. 
<https://gist.github.com/David-Miller-Bose/5d23a1437a6e9e56c595d2eff8b35964>

If you have any advice or guidance on where else to look its greatly 
appreciated. We found that only one of our subscriptions require us to clear 
the backlog. We are looking into whats different about this one subscriber.
```Product-State-Subscription```
If I found out more and get to a fix, I’ll post back and hopefully leave some 
breadcrumbs for others.
----
2020-09-23 13:49:49 UTC - Shawn: @Shawn has joined the channel
----
2020-09-23 15:14:34 UTC - Sijie Guo: ```"Product-State-Subscription" : {
      "markDeletePosition" : "311744:132685",
      "readPosition" : "311762:71967",
      "waitingReadOp" : true,
      "pendingReadOps" : 0,
      "messagesConsumedCounter" : 1272570493,
      "cursorLedger" : 311678,
      "cursorLedgerLastEntry" : 8879,
      "individuallyDeletedMessages" : "[(311744:132686..311744:133765], 
(311744:133766..311762:71936], (311762:71938..311762:71947], 
(311762:71949..311762:71951], (311762:71952..311762:71954]]",
      "lastLedgerSwitchTimestamp" : "2020-09-22T15:17:51.399Z",
      "state" : "Open",
      "numberOfEntriesSinceFirstNotAckedMessage" : 865102,
      "totalNonContiguousDeletedMessagesRange" : 5,
      "properties" : { }
    },```
----
2020-09-23 15:15:16 UTC - Sijie Guo: from the topic stats output, it seems that 
there are a lot of unacked messages (see `individuallyDeletedMessages`). Did 
you enable ackTimeout?
----
2020-09-23 15:16:32 UTC - slouie: yes, it’s `120s`  We also leverage negative 
acks.
----
2020-09-23 15:20:31 UTC - Sijie Guo: Interesting. Which version of Pulsar 
broker are you using? What is the subscription type?
----
2020-09-23 15:20:53 UTC - David Miller: @David Miller has joined the channel
----
2020-09-23 15:21:59 UTC - Sijie Guo: You can use OAuth2 authentication. 
<http://pulsar.apache.org/docs/en/security-oauth2/>
----
2020-09-23 15:22:15 UTC - slouie: version `2.3.1.` the subscription type is 
`Shared`
----
2020-09-23 15:25:01 UTC - Sijie Guo: I think Penghui suggested some commands to 
run to capture the topics stats and stats-internal in the email thread. Once 
you provide those information, Penghui can help you address the problem.
----
2020-09-23 15:26:05 UTC - Sijie Guo: It seems that the broker timed out at 
reading the policy from zookeeper. So it can’t get the policies data from 
zookeeper and it will use the default one.
----
2020-09-23 15:26:10 UTC - alex kurtser: Hi.
Yes, i spoke with him after this message :slightly_smiling_face:
I will try to reproduce and captur the memory
----
2020-09-23 15:27:29 UTC - Sijie Guo: The message isn’t “re-enqueue”. If a 
message is not acked, the message will be dispatched again to the consumer. So 
I think the main question will be - why the consumer are disconnected.
----
2020-09-23 15:27:52 UTC - Sijie Guo: Cool
----
2020-09-23 15:31:26 UTC - Sijie Guo: I see. Just want to check one thing - have 
you tried to `topic unload` to unload a topic? If you unload a topic, did 
backlog move?
----
2020-09-23 15:33:35 UTC - Ruwen Reddig: @Ruwen Reddig has joined the channel
----
2020-09-23 15:45:43 UTC - slouie: I haven’t tried that yet. Our backlog is 
hovering around 60 messages. We’ve been doing a clear-backlog to clear the 
storage. This would redeliver those unack’d messages correct? Just want to 
understand the difference and how it could resolve this issue
----
2020-09-23 16:26:45 UTC - Shawn: just to clarify some context (i work with 
slouie):
That topic has anywhere from 500-1000 messages a second published. The 
consumers are keeping up with that, sending acks (and occasional nacks). the 
backlog at any given moment is around 20-60, however periodically the storage 
starts growing until the limit is reached (around 23gb), and publishes fail. 
the backlog itself never starts growing.
----
2020-09-23 16:28:00 UTC - Sijie Guo: clear-backlog doesn’t trigger redeliveries
----
2020-09-23 16:28:11 UTC - Sijie Guo: It basically skips the unacked messages
----
2020-09-23 16:29:34 UTC - Sijie Guo: In order to debug this issue, it would be 
good to run `topic stats-internal`  twice between a couple of minutes. Then we 
can compare the outputs to see if cursor is moving or not.
----
2020-09-23 16:45:20 UTC - Sijie Guo: ApacheCon@Home is happening next week (Sep 
29th ~ Oct 1st). There is a dedicated track for Pulsar/BookKeeper. It has 17 
talks spanning multiple timezones. You can find the talks in 
<https://www.apachecon.com/acah2020/tracks/> and register the event at 
<https://hopin.to/events/apachecon-home>.
star-struck : Frank Kelly, Axel Sirota
----
2020-09-23 17:44:59 UTC - David Miller: Another a co-worker of @slouie here.  I 
have a script collecting the output of the stats-internal every minute now.  
Hopefully the storage will trend back up beyond a GC window here shortly so 
that we know we have some samples that capture the issue.
----
2020-09-23 18:15:16 UTC - Ebere Abanonu: Will it be recorded and made available?
----
2020-09-23 18:41:34 UTC - David Miller: @Sijie Guo so we encountered the issue 
again.  This time we have stats-internal data every minute before and after the 
issue.  We also just attempted the unload on the topic and it did clear the 
storage associated with the topic.
----
2020-09-23 18:42:28 UTC - David Miller: I’ll put together a few samples in a 
gist and will publish them in a few minutes.
----
2020-09-23 18:42:58 UTC - Sijie Guo: Cool, Thank you! /cc @Penghui Li in the 
loop
----
2020-09-23 18:43:34 UTC - Sijie Guo: It should be. Although I am not the 
organizer of ApacheCon, I have to double check with the organizers.
----
2020-09-23 19:00:24 UTC - Evan Furman: Hi guys! We are not seeing Bookies 
listed under the cluster in Pulsar Manager. Have added the following params to 
the bookkeeper.conf.
```# The flag enables/disables starting the admin http server. Default value is 
'false'.
httpServerEnabled=true

# The http server port to listen on. Default value is 8080.
# Use `8000` as the port to keep it consistent with prometheus stats provider
httpServerPort=8000

# The http server class
httpServerClass=org.apache.bookkeeper.http.vertx.VertxHttpServer```
cc: @Tim Haak
----
2020-09-23 19:02:05 UTC - Tim Haak: the call 
`/pulsar-manager/api/v1/bookies/pulsar-cluster`returns
```{"enableBookieHttp":false}```

----
2020-09-23 19:03:59 UTC - Addison Higham: hrm... I am not personally familiar 
with pulsar manager. Have you created an issue on the github repo? I will get 
it to the right person
----
2020-09-23 19:06:08 UTC - Sijie Guo: Did you configure Pulsar Manager to talk 
to bookkeeper cluster?
----
2020-09-23 19:06:29 UTC - Evan Furman: How would I do that?
----
2020-09-23 19:11:21 UTC - Addison Higham: ```bookie.host=<http://localhost:8050>
bookie.enable=false```
Is what needs configured
----
2020-09-23 19:11:45 UTC - Sijie Guo: Yes. @Addison Higham you are faster than 
me :slightly_smiling_face:
----
2020-09-23 19:12:11 UTC - Evan Furman: we are running the docker container — do 
these need to be passed in as env vars?
----
2020-09-23 19:12:28 UTC - Addison Higham: I at least know where the config file 
is :smile:
----
2020-09-23 19:13:50 UTC - Sijie Guo: @Evan Furman I am not sure it was exposed 
as environment variables yet. an alertnative way is to start the docker and 
pointing it with an application.properties file. So you can manage all the 
configs in the properties file.
----
2020-09-23 19:18:27 UTC - Evan Furman: so bookie.host needs to list out all the 
bookies in the cluster?
----
2020-09-23 19:19:47 UTC - Addison Higham: specifically, if you just built your 
own  image (or used a volume) and place your custom properties file at 
`/pulsar-manager/pulsar-manager/application.properties` you can then set any 
arbitrary properties. The environment variables just get mapped to a few 
properties as seen here 
<https://github.com/apache/pulsar-manager/blob/master/docker/supervisord.conf#L31>
 and will still override the values in that property file
----
2020-09-23 19:28:54 UTC - Evan Furman: ok, will give it a shot
----
2020-09-23 20:04:14 UTC - Rounak Jaggi: Is there a way to dynamically update 
"tokenPublicKey" in proxy &amp; broker config without a service restart? Also 
how's everyone managing this in thr enterprise?
----
2020-09-23 20:05:09 UTC - Frank Kelly: For us a Kubernetes secret update should 
do the trick
----
2020-09-23 20:05:28 UTC - Frank Kelly: We hope!
----
2020-09-23 20:33:10 UTC - Rounak Jaggi: We are using ec2 instance.
----
2020-09-23 20:33:48 UTC - Rounak Jaggi: @Addison Higham @Sijie Guo
----
2020-09-23 20:42:27 UTC - Frank Kelly: It does appear that the Public Key is 
loaded on startup and then cached 
<https://github.com/apache/pulsar/blob/master/pulsar-broker-common/src/main/java/org/apache/pulsar/broker/authentication/AuthenticationService.java#L57>
So my K8S secret update hopes appear to be dashed :disappointed:
----
2020-09-23 20:54:08 UTC - Rounak Jaggi: can pulsar support multiple 
"tokenPublicKey" public key in proxy &amp; broker?
----
2020-09-23 20:54:15 UTC - Rounak Jaggi: can pulsar support multiple 
"tokenPublicKey" public key in proxy &amp; broker?
----
2020-09-23 20:56:32 UTC - Frank Kelly: I'm no expert but my reading of the code 
is no - however it *should* be relatively straightforward - if you create an 
issue here <https://github.com/apache/pulsar/issues> we might get some help and 
at least get it on the roadmap
+1 : Sijie Guo
----
2020-09-23 21:01:02 UTC - Sijie Guo: Yeah. A github issue will be helpful
----
2020-09-23 23:36:15 UTC - Seun: Thanks @chris
----
2020-09-24 01:58:53 UTC - David Miller: Sorry for the delay.  Here’s another 
gist that has several samples of the stats-internal for the topic.  
<https://gist.github.com/David-Miller-Bose/4ad8d460499ec4eae091f16ab239075d>
----
2020-09-24 01:59:53 UTC - David Miller: Here’s an image of the metrics we have 
that show the graph during the samples included.
----
2020-09-24 08:48:17 UTC - Gilles Barbier: @Devin G. Bost sure, please tell me 
if you make progress or have a different experience
----
2020-09-24 08:57:08 UTC - amir hossain Delvarani: @amir hossain Delvarani has 
joined the channel
----

Reply via email to