Slack digest for #general - 2020-09-24

Apache Pulsar Slack Thu, 24 Sep 2020 02:11:39 -0700

2020-09-23 09:18:16 UTC - alex kurtser: Hello

We experience the consumers stuck under very strange conditions. It’s still 
hard to catch . It always happens with KEY_SHARED subscriptions  with multiple 
consumers.
What we figured out that the trigger for the problem is the adding additional 
consumers. It can be occurred as part of upgrade ( for example the new upgraded 
consumers being added and after this the old consumers being removed from the 
topic) .

Sometimes , because of network glitch, consumer is disconnecting and after few
seconds reconnecting. We have 12 consumers on this particular topic but it’s
enough that only one consumer will disconnect and reconnect in order to stuck
the whole partition for every consumers *while other partitions still works
fine.*

Again, the common trigger is some change with consumers, preferably the adding
of new consumer/consumers

It doesn’t happen always. My estimation that it happens approximately in half
of cases.

Of course, no errors on the brokers side expecting the standard messages about
connecting new consumers.

The problem is that in order to recover from this situation we need to do one
of the following :

1. Pulsar-admin topics unload topicname
2. Or we need completely disconnect all consumers and reconnect ( it isn’t
recovering if even one consumers remaining connected.. we need to disconnect
all of them at once)
Any ideas ?
----
2020-09-23 10:07:56 UTC - QW: @QW has joined the channel
----
2020-09-23 11:59:56 UTC - Gautam Lodhiya: Hello,

I am facing some issue of frequent consumer disconnection and reconnection, due
to which the consumer processing some message couldn’t ack and it gets
re-enqueued directly on consumer disconnection.

Can we have some event notified before it gets re-enqueued or some delay before
getting re-enqueued?
----
2020-09-23 12:15:57 UTC - Sydni Horner: @Sydni Horner has joined the channel
----
2020-09-23 12:19:35 UTC - Jim M.: @Sydni Horner FYI
----
2020-09-23 12:34:29 UTC - chris: you should be able to add this annotation and
it will create an internal load balancer (non public IP)
```<http://cloud.google.com/load-balancer-type|cloud.google.com/load-balancer-type>:
"Internal"```
<https://cloud.google.com/kubernetes-engine/docs/how-to/internal-load-balancing>
----
2020-09-23 12:54:18 UTC - Brent Evans: I'm going to see if I can write
something to recreate the issue / system load as it only happens intermittently
in our deployed environment. I'll get back when I've got something, thanks for
the help so far
----
2020-09-23 13:01:47 UTC - slouie: Hi @Sijie Guo,

This issue hasn’t presented itself for the last 3 months since these posts, but
just returned in force. I’m still investigating what we can do here and making
sense of these stats gists. But wanted to share the output of those two
commands you recommended to run.
<https://gist.github.com/David-Miller-Bose/5d23a1437a6e9e56c595d2eff8b35964>

If you have any advice or guidance on where else to look its greatly
appreciated. We found that only one of our subscriptions require us to clear
the backlog. We are looking into whats different about this one subscriber.
```Product-State-Subscription```
If I found out more and get to a fix, I’ll post back and hopefully leave some
breadcrumbs for others.
----
2020-09-23 13:49:49 UTC - Shawn: @Shawn has joined the channel
----
2020-09-23 15:14:34 UTC - Sijie Guo: ```"Product-State-Subscription" : {
"markDeletePosition" : "311744:132685",
"readPosition" : "311762:71967",
"waitingReadOp" : true,
"pendingReadOps" : 0,
"messagesConsumedCounter" : 1272570493,
"cursorLedger" : 311678,
"cursorLedgerLastEntry" : 8879,
"individuallyDeletedMessages" : "[(311744:132686..311744:133765],
(311744:133766..311762:71936], (311762:71938..311762:71947],
(311762:71949..311762:71951], (311762:71952..311762:71954]]",
"lastLedgerSwitchTimestamp" : "2020-09-22T15:17:51.399Z",
"state" : "Open",
"numberOfEntriesSinceFirstNotAckedMessage" : 865102,
"totalNonContiguousDeletedMessagesRange" : 5,
"properties" : { }
},```
----
2020-09-23 15:15:16 UTC - Sijie Guo: from the topic stats output, it seems that
there are a lot of unacked messages (see `individuallyDeletedMessages`). Did
you enable ackTimeout?
----
2020-09-23 15:16:32 UTC - slouie: yes, it’s `120s` We also leverage negative
acks.
----
2020-09-23 15:20:31 UTC - Sijie Guo: Interesting. Which version of Pulsar
broker are you using? What is the subscription type?
----
2020-09-23 15:20:53 UTC - David Miller: @David Miller has joined the channel
----
2020-09-23 15:21:59 UTC - Sijie Guo: You can use OAuth2 authentication.
<http://pulsar.apache.org/docs/en/security-oauth2/>
----
2020-09-23 15:22:15 UTC - slouie: version `2.3.1.` the subscription type is
`Shared`
----
2020-09-23 15:25:01 UTC - Sijie Guo: I think Penghui suggested some commands to
run to capture the topics stats and stats-internal in the email thread. Once
you provide those information, Penghui can help you address the problem.
----
2020-09-23 15:26:05 UTC - Sijie Guo: It seems that the broker timed out at
reading the policy from zookeeper. So it can’t get the policies data from
zookeeper and it will use the default one.
----
2020-09-23 15:26:10 UTC - alex kurtser: Hi.
Yes, i spoke with him after this message :slightly_smiling_face:
I will try to reproduce and captur the memory
----
2020-09-23 15:27:29 UTC - Sijie Guo: The message isn’t “re-enqueue”. If a
message is not acked, the message will be dispatched again to the consumer. So
I think the main question will be - why the consumer are disconnected.
----
2020-09-23 15:27:52 UTC - Sijie Guo: Cool
----
2020-09-23 15:31:26 UTC - Sijie Guo: I see. Just want to check one thing - have
you tried to `topic unload` to unload a topic? If you unload a topic, did
backlog move?
----
2020-09-23 15:33:35 UTC - Ruwen Reddig: @Ruwen Reddig has joined the channel
----
2020-09-23 15:45:43 UTC - slouie: I haven’t tried that yet. Our backlog is
hovering around 60 messages. We’ve been doing a clear-backlog to clear the
storage. This would redeliver those unack’d messages correct? Just want to
understand the difference and how it could resolve this issue
----
2020-09-23 16:26:45 UTC - Shawn: just to clarify some context (i work with
slouie):
That topic has anywhere from 500-1000 messages a second published. The
consumers are keeping up with that, sending acks (and occasional nacks). the
backlog at any given moment is around 20-60, however periodically the storage
starts growing until the limit is reached (around 23gb), and publishes fail.
the backlog itself never starts growing.
----
2020-09-23 16:28:00 UTC - Sijie Guo: clear-backlog doesn’t trigger redeliveries
----
2020-09-23 16:28:11 UTC - Sijie Guo: It basically skips the unacked messages
----
2020-09-23 16:29:34 UTC - Sijie Guo: In order to debug this issue, it would be
good to run `topic stats-internal` twice between a couple of minutes. Then we
can compare the outputs to see if cursor is moving or not.
----
2020-09-23 16:45:20 UTC - Sijie Guo: ApacheCon@Home is happening next week (Sep
29th ~ Oct 1st). There is a dedicated track for Pulsar/BookKeeper. It has 17
talks spanning multiple timezones. You can find the talks in
<https://www.apachecon.com/acah2020/tracks/> and register the event at
<https://hopin.to/events/apachecon-home>.
star-struck : Frank Kelly, Axel Sirota
----
2020-09-23 17:44:59 UTC - David Miller: Another a co-worker of @slouie here. I
have a script collecting the output of the stats-internal every minute now.
Hopefully the storage will trend back up beyond a GC window here shortly so
that we know we have some samples that capture the issue.
----
2020-09-23 18:15:16 UTC - Ebere Abanonu: Will it be recorded and made available?
----
2020-09-23 18:41:34 UTC - David Miller: @Sijie Guo so we encountered the issue
again. This time we have stats-internal data every minute before and after the
issue. We also just attempted the unload on the topic and it did clear the
storage associated with the topic.
----
2020-09-23 18:42:28 UTC - David Miller: I’ll put together a few samples in a
gist and will publish them in a few minutes.
----
2020-09-23 18:42:58 UTC - Sijie Guo: Cool, Thank you! /cc @Penghui Li in the
loop
----
2020-09-23 18:43:34 UTC - Sijie Guo: It should be. Although I am not the
organizer of ApacheCon, I have to double check with the organizers.
----
2020-09-23 19:00:24 UTC - Evan Furman: Hi guys! We are not seeing Bookies
listed under the cluster in Pulsar Manager. Have added the following params to
the bookkeeper.conf.
```# The flag enables/disables starting the admin http server. Default value is
'false'.
httpServerEnabled=true

# The http server port to listen on. Default value is 8080.
# Use `8000` as the port to keep it consistent with prometheus stats provider
httpServerPort=8000

# The http server class
httpServerClass=org.apache.bookkeeper.http.vertx.VertxHttpServer```
cc: @Tim Haak
----
2020-09-23 19:02:05 UTC - Tim Haak: the call
`/pulsar-manager/api/v1/bookies/pulsar-cluster`returns
```{"enableBookieHttp":false}```

----
2020-09-23 19:03:59 UTC - Addison Higham: hrm... I am not personally familiar
with pulsar manager. Have you created an issue on the github repo? I will get
it to the right person
----
2020-09-23 19:06:08 UTC - Sijie Guo: Did you configure Pulsar Manager to talk
to bookkeeper cluster?
----
2020-09-23 19:06:29 UTC - Evan Furman: How would I do that?
----
2020-09-23 19:11:21 UTC - Addison Higham: ```bookie.host=<http://localhost:8050>
bookie.enable=false```
Is what needs configured
----
2020-09-23 19:11:45 UTC - Sijie Guo: Yes. @Addison Higham you are faster than
me :slightly_smiling_face:
----
2020-09-23 19:12:11 UTC - Evan Furman: we are running the docker container — do
these need to be passed in as env vars?
----
2020-09-23 19:12:28 UTC - Addison Higham: I at least know where the config file
is :smile:
----
2020-09-23 19:13:50 UTC - Sijie Guo: @Evan Furman I am not sure it was exposed
as environment variables yet. an alertnative way is to start the docker and
pointing it with an application.properties file. So you can manage all the
configs in the properties file.
----
2020-09-23 19:18:27 UTC - Evan Furman: so bookie.host needs to list out all the
bookies in the cluster?
----
2020-09-23 19:19:47 UTC - Addison Higham: specifically, if you just built your
own image (or used a volume) and place your custom properties file at
`/pulsar-manager/pulsar-manager/application.properties` you can then set any
arbitrary properties. The environment variables just get mapped to a few
properties as seen here
<https://github.com/apache/pulsar-manager/blob/master/docker/supervisord.conf#L31>
and will still override the values in that property file
----
2020-09-23 19:28:54 UTC - Evan Furman: ok, will give it a shot
----
2020-09-23 20:04:14 UTC - Rounak Jaggi: Is there a way to dynamically update
"tokenPublicKey" in proxy &amp; broker config without a service restart? Also
how's everyone managing this in thr enterprise?
----
2020-09-23 20:05:09 UTC - Frank Kelly: For us a Kubernetes secret update should
do the trick
----
2020-09-23 20:05:28 UTC - Frank Kelly: We hope!
----
2020-09-23 20:33:10 UTC - Rounak Jaggi: We are using ec2 instance.
----
2020-09-23 20:33:48 UTC - Rounak Jaggi: @Addison Higham @Sijie Guo
----
2020-09-23 20:42:27 UTC - Frank Kelly: It does appear that the Public Key is
loaded on startup and then cached
<https://github.com/apache/pulsar/blob/master/pulsar-broker-common/src/main/java/org/apache/pulsar/broker/authentication/AuthenticationService.java#L57>
So my K8S secret update hopes appear to be dashed :disappointed:
----
2020-09-23 20:54:08 UTC - Rounak Jaggi: can pulsar support multiple
"tokenPublicKey" public key in proxy &amp; broker?
----
2020-09-23 20:54:15 UTC - Rounak Jaggi: can pulsar support multiple
"tokenPublicKey" public key in proxy &amp; broker?
----
2020-09-23 20:56:32 UTC - Frank Kelly: I'm no expert but my reading of the code
is no - however it *should* be relatively straightforward - if you create an
issue here <https://github.com/apache/pulsar/issues> we might get some help and
at least get it on the roadmap
+1 : Sijie Guo
----
2020-09-23 21:01:02 UTC - Sijie Guo: Yeah. A github issue will be helpful
----
2020-09-23 23:36:15 UTC - Seun: Thanks @chris
----
2020-09-24 01:58:53 UTC - David Miller: Sorry for the delay. Here’s another
gist that has several samples of the stats-internal for the topic.
<https://gist.github.com/David-Miller-Bose/4ad8d460499ec4eae091f16ab239075d>
----
2020-09-24 01:59:53 UTC - David Miller: Here’s an image of the metrics we have
that show the graph during the samples included.
----
2020-09-24 08:48:17 UTC - Gilles Barbier: @Devin G. Bost sure, please tell me
if you make progress or have a different experience
----
2020-09-24 08:57:08 UTC - amir hossain Delvarani: @amir hossain Delvarani has
joined the channel
----

Slack digest for #general - 2020-09-24

Reply via email to